Admin Ops & Metrics
The admin ops system provides operational visibility through health checks, queue monitoring, latency metrics, and build information.
Architecture
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   /admin/ops    │────▶│  AdminOpsService │────▶│  Queue Snapshot │
│   /health       │     │                  │     │  Prometheus     │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                 │
                                 ▼
                        ┌──────────────────┐
                        │      Health      │
                        │    Indicators    │
                        └──────────────────┘
Sources:
- apps/teetime/teetime-backend/src/admin/admin-ops.controller.ts
- apps/teetime/teetime-backend/src/app/health/health.controller.ts
Ops Status
Endpoint
GET /admin/ops/status
// Response
{
"queues": [
{ "name": "tee-time-queue", "waiting": 5, "active": 2, "delayed": 0, "failed": 1 }
],
"apiLatencyMsP50": 45,
"apiLatencyMsP95": 120,
"replicaLagSeconds": 0.5,
"timestamp": "2025-12-15T08:00:00Z"
}
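The response shape can be captured in a small type. The sketch below is illustrative only: the names `OpsStatus`, `QueueSnapshot`, and `pendingJobs` are assumptions, and only the field names come from the sample response above.

```typescript
// Hypothetical type names; the fields mirror the sample /admin/ops/status response.
interface QueueSnapshot {
  name: string;
  waiting: number;
  active: number;
  delayed: number;
  failed: number;
}

interface OpsStatus {
  queues: QueueSnapshot[];
  apiLatencyMsP50: number;
  apiLatencyMsP95: number;
  replicaLagSeconds: number;
  timestamp: string; // ISO 8601, as in the sample
}

// Jobs that have not finished yet, summed over all queues
// (waiting + active + delayed; failed jobs are not counted).
function pendingJobs(status: OpsStatus): number {
  return status.queues.reduce(
    (sum, q) => sum + q.waiting + q.active + q.delayed,
    0,
  );
}

const sampleStatus: OpsStatus = {
  queues: [
    { name: "tee-time-queue", waiting: 5, active: 2, delayed: 0, failed: 1 },
  ],
  apiLatencyMsP50: 45,
  apiLatencyMsP95: 120,
  replicaLagSeconds: 0.5,
  timestamp: "2025-12-15T08:00:00Z",
};

console.log(pendingJobs(sampleStatus)); // 7
```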
Queue Metrics
| Metric | Description |
|---|---|
| waiting | Jobs waiting to be processed |
| active | Jobs currently being processed |
| delayed | Jobs scheduled to run later |
| failed | Jobs that failed |
Latency Metrics
Fetched from Prometheus:
# P50 latency
histogram_quantile(0.5, sum(rate(http_server_duration_seconds_bucket[5m])) by (le))
# P95 latency
histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket[5m])) by (le))
# Replica lag
pg_last_wal_receive_lsn_lag_seconds
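As a rough sketch of how such a query might be issued and read, the helpers below build a URL for the Prometheus instant-query endpoint (`/api/v1/query`) and pull the first sample out of a vector result. The function names are hypothetical, and the seconds-to-milliseconds conversion is an assumption based on the `_seconds` metric name and the `apiLatencyMs*` response fields; the service's actual client code may differ.

```typescript
// Subset of the Prometheus instant-query response used here (vector results).
interface PromQueryResponse {
  status: "success" | "error";
  data?: {
    resultType: string;
    result: Array<{ metric: Record<string, string>; value: [number, string] }>;
  };
}

// Build the instant-query URL for a PromQL expression.
// Trailing slashes on the base URL are dropped so the path is not doubled.
function buildQueryUrl(baseUrl: string, promql: string): string {
  const base = baseUrl.replace(/\/+$/, "");
  return `${base}/api/v1/query?query=${encodeURIComponent(promql)}`;
}

// Read the first sample of a vector result and convert seconds to milliseconds.
// Returns null when the query failed, returned no series, or was not numeric.
function firstValueMs(resp: PromQueryResponse): number | null {
  const first = resp.data?.result[0];
  if (resp.status !== "success" || !first) return null;
  const seconds = Number(first.value[1]);
  return Number.isFinite(seconds) ? Math.round(seconds * 1000) : null;
}

const p50Query =
  "histogram_quantile(0.5, sum(rate(http_server_duration_seconds_bucket[5m])) by (le))";
const p50Url = buildQueryUrl("http://prometheus:9090/", p50Query);
```

A caller would fetch `p50Url`, parse the JSON body, and pass it to `firstValueMs`; a `null` result is one way to signal that the metric is unavailable.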
Build Info
Endpoint
GET /admin/build-info
// Response
{
"version": "2.1.0",
"commitSha": "abc123def456",
"environment": "production",
"builtAt": "2025-01-10T14:30:00Z"
}
Configuration Priority
- Environment variables (highest)
- Build info file (BUILD_INFO_FILE)
- Defaults (lowest)
| Env Var | Field |
|---|---|
| BUILD_VERSION | version |
| BUILD_COMMIT_SHA | commitSha |
| NODE_ENV | environment |
| BUILD_DATE | builtAt |
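The priority order can be sketched as a per-field fallback chain. This is a minimal illustration, not the service's code: `resolveBuildInfo` is a hypothetical name, the file is assumed to be already parsed into an object, and the default values are placeholders because the real defaults are not stated here.

```typescript
interface BuildInfo {
  version: string;
  commitSha: string;
  environment: string;
  builtAt: string;
}

// Placeholder defaults (assumption): the actual fallback values are not documented.
const BUILD_DEFAULTS: BuildInfo = {
  version: "0.0.0",
  commitSha: "unknown",
  environment: "development",
  builtAt: "unknown",
};

// Resolve each field as: environment variable, then build info file, then default.
function resolveBuildInfo(
  env: Record<string, string | undefined>,
  fileInfo: Partial<BuildInfo> = {},
): BuildInfo {
  const pick = (
    envValue: string | undefined,
    fileValue: string | undefined,
    fallback: string,
  ): string => envValue || fileValue || fallback;

  return {
    version: pick(env["BUILD_VERSION"], fileInfo.version, BUILD_DEFAULTS.version),
    commitSha: pick(env["BUILD_COMMIT_SHA"], fileInfo.commitSha, BUILD_DEFAULTS.commitSha),
    environment: pick(env["NODE_ENV"], fileInfo.environment, BUILD_DEFAULTS.environment),
    builtAt: pick(env["BUILD_DATE"], fileInfo.builtAt, BUILD_DEFAULTS.builtAt),
  };
}

const resolvedBuild = resolveBuildInfo(
  { BUILD_VERSION: "2.1.0" },
  { version: "1.9.0", commitSha: "abc123def456" },
);
// version comes from the env var, commitSha from the file, the rest from defaults.
```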
Health Checks
Endpoints
| Endpoint | Purpose |
|---|---|
| GET /health | Aggregate readiness |
| GET /health/readiness | Full readiness check |
| GET /health/liveness | Lightweight alive check |
| GET /health/providers | External provider health |
| GET /health/config | Environment validation |
| GET /health/external | External services (S3) |
| GET /health/panels | All panels combined |
Health Indicators
| Indicator | What It Checks |
|---|---|
| StorageHealthIndicator | Storage database ping |
| TeeSheetDbHealthIndicator | Tee-sheet database (5s timeout) |
| QueueHealthIndicator | Redis PING/PONG |
| RedisReadyIndicator | Redis cache connectivity |
| McaHealthIndicator | MCA API endpoint |
| ProviderHealthIndicator | GolfNow, Lightspeed, ForeUp |
| S3HealthIndicator | S3 bucket HeadBucket |
| EnvHealthIndicator | Required env vars |
Response Format
{
"status": "ok",
"details": [
{ "storage": { "status": "up", "latencyMs": 42 } },
{ "teeSheetDb": { "status": "up", "latencyMs": 55 } },
{ "queue": { "status": "up", "latencyMs": 23 } }
]
}
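One plausible way to derive the top-level `status` from the per-indicator entries is shown below. It is a sketch under assumptions: the `"down"` and `"error"` values and the `aggregateHealth` name do not appear in this document, which only shows the healthy case.

```typescript
// "down" and "error" are assumed counterparts to the documented "up" and "ok".
interface IndicatorResult {
  status: "up" | "down";
  latencyMs?: number;
}

// Each detail entry maps one indicator key (e.g. "storage") to its result.
type HealthDetail = Record<string, IndicatorResult>;

interface HealthResponse {
  status: "ok" | "error";
  details: HealthDetail[];
}

// The aggregate is "ok" only when every indicator in every entry reports "up".
function aggregateHealth(details: HealthDetail[]): HealthResponse {
  const allUp = details.every((detail) =>
    Object.values(detail).every((result) => result.status === "up"),
  );
  return { status: allUp ? "ok" : "error", details };
}

const healthyReport = aggregateHealth([
  { storage: { status: "up", latencyMs: 42 } },
  { teeSheetDb: { status: "up", latencyMs: 55 } },
  { queue: { status: "up", latencyMs: 23 } },
]);
```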
Prometheus Metrics
Endpoint
GET /metrics
// Returns Prometheus-compatible metrics format
Infrastructure Gauges
| Metric | Description |
|---|---|
| storage_up | Storage DB connectivity (1/0) |
| tee_sheet_db_up | Tee-sheet DB connectivity |
| queue_up | Queue/Redis connectivity |
| service_ready | Overall service readiness |
| provider_up | Provider availability |
| s3_up | S3 bucket connectivity |
| weather_up | Weather service availability |
| geocoding_up | Geocoding service availability |
Queue Gauges
queue_jobs{queue="tee-time-queue", status="waiting"} 5
queue_jobs{queue="tee-time-queue", status="active"} 2
queue_jobs{queue="tee-time-queue", status="delayed"} 0
queue_jobs{queue="tee-time-queue", status="failed"} 1
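To illustrate how a queue snapshot maps onto these gauge lines, here is a small renderer. In practice a metrics client library would normally handle registration and formatting; `renderQueueGauges` and `QueueCounts` are hypothetical names used only for this sketch.

```typescript
interface QueueCounts {
  name: string;
  waiting: number;
  active: number;
  delayed: number;
  failed: number;
}

// Escape a label value for the Prometheus text format
// (backslash, double quote, and newline must be escaped).
function escapeLabel(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Render one queue_jobs sample per queue and job status.
function renderQueueGauges(queues: QueueCounts[]): string {
  const statuses = ["waiting", "active", "delayed", "failed"] as const;
  const lines: string[] = [];
  for (const q of queues) {
    for (const status of statuses) {
      lines.push(
        `queue_jobs{queue="${escapeLabel(q.name)}", status="${status}"} ${q[status]}`,
      );
    }
  }
  return lines.join("\n");
}

const gaugeText = renderQueueGauges([
  { name: "tee-time-queue", waiting: 5, active: 2, delayed: 0, failed: 1 },
]);
```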
Metrics Refresh
Background services refresh metrics periodically:
| Service | Interval | Config |
|---|---|---|
| StorageMetricsService | 60s | METRICS_PING_INTERVAL_MS |
| QueueMetricsService | 60s | METRICS_QUEUE_INTERVAL_MS |
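A refresh interval of this kind is usually read from the environment with a fallback. The helper below is a sketch, assuming that unset, non-numeric, or non-positive values fall back to the 60s default; `resolveIntervalMs` is a hypothetical name and the service may treat invalid values differently.

```typescript
// Parse an interval in milliseconds, falling back to the default (60s)
// when the variable is unset, blank, non-numeric, or not positive.
function resolveIntervalMs(raw: string | undefined, defaultMs = 60000): number {
  if (raw === undefined || raw.trim() === "") return defaultMs;
  const parsed = Number(raw);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : defaultMs;
}

// Example: a refresh loop could then be scheduled with
//   setInterval(refreshQueueMetrics, resolveIntervalMs(process.env.METRICS_QUEUE_INTERVAL_MS));
// where refreshQueueMetrics is whatever function updates the gauges.
```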
Environment Validation
The config health check validates:
Required:
TEETIME_DATABASE_URL
Optional Groups (consistency rules):
| Group | Variables | Rule |
|---|---|---|
| Redis | TEE_CACHE_REDIS_URL | — |
| OpenMeteo | OPENMETEO_BASE_URL, OPENMETEO_APIKEY | Both required if either set |
| Geocoding | GEOCODING_BASE_URL, OPENMETEO_APIKEY | Both required if either set |
| Twilio | TWILIO_ACCOUNT_SID, TWILIO_AUTH_TOKEN | Both required if either set |
| Booking Policy | BOOKING_ADVANCE_WINDOW_DAYS | Must be numeric |
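The rules in the table can be expressed as a small validator. This is a sketch, not the `EnvHealthIndicator` implementation: the function name and error messages are made up, and the variable pairs are copied from the table as written, including the `OPENMETEO_APIKEY` entry that appears in both the OpenMeteo and Geocoding groups.

```typescript
type EnvMap = Record<string, string | undefined>;

const REQUIRED_VARS = ["TEETIME_DATABASE_URL"];

// "Both required if either set" groups, copied from the table above.
const PAIRED_GROUPS: Record<string, [string, string]> = {
  OpenMeteo: ["OPENMETEO_BASE_URL", "OPENMETEO_APIKEY"],
  Geocoding: ["GEOCODING_BASE_URL", "OPENMETEO_APIKEY"],
  Twilio: ["TWILIO_ACCOUNT_SID", "TWILIO_AUTH_TOKEN"],
};

const NUMERIC_VARS = ["BOOKING_ADVANCE_WINDOW_DAYS"];

// Return a list of human-readable problems; an empty list means the env is valid.
function validateEnv(env: EnvMap): string[] {
  const errors: string[] = [];
  const isSet = (key: string): boolean => (env[key] ?? "").trim() !== "";

  for (const key of REQUIRED_VARS) {
    if (!isSet(key)) errors.push(`${key} is required`);
  }
  for (const [group, [a, b]] of Object.entries(PAIRED_GROUPS)) {
    // Exactly one of the pair being set violates the consistency rule.
    if (isSet(a) !== isSet(b)) errors.push(`${group}: ${a} and ${b} must be set together`);
  }
  for (const key of NUMERIC_VARS) {
    if (isSet(key) && !Number.isFinite(Number(env[key]))) errors.push(`${key} must be numeric`);
  }
  return errors;
}
```

Returning a list rather than throwing on the first problem lets a `/health/config` style endpoint report every misconfiguration in one response.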
Configuration
Prometheus
| Env Var | Description |
|---|---|
| PROMETHEUS_BASE_URL | Prometheus server URL |
| API_LATENCY_P50_QUERY | Custom P50 query |
| API_LATENCY_P95_QUERY | Custom P95 query |
| REPLICA_LAG_QUERY | Custom replica lag query |
Queue Monitoring
| Env Var | Description |
|---|---|
| OPS_STATUS_QUEUE_NAMES | Comma-separated queue names |
| QUEUE_NAMES | Fallback queue names |
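The fallback between the two variables might look like the following. It is a sketch: `resolveQueueNames` is a hypothetical name, and it assumes `QUEUE_NAMES` uses the same comma-separated format and that blank entries are ignored.

```typescript
// Prefer OPS_STATUS_QUEUE_NAMES, fall back to QUEUE_NAMES, and split on commas.
function resolveQueueNames(env: Record<string, string | undefined>): string[] {
  const raw = env["OPS_STATUS_QUEUE_NAMES"] || env["QUEUE_NAMES"] || "";
  return raw
    .split(",")
    .map((queueName) => queueName.trim())
    .filter((queueName) => queueName.length > 0);
}

const monitoredQueues = resolveQueueNames({
  QUEUE_NAMES: "tee-time-queue, notifications",
});
// monitoredQueues is ["tee-time-queue", "notifications"]
```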
Health Checks
| Env Var | Description |
|---|---|
| MCA_PING_ENDPOINTS | Additional MCA endpoints to check |
| STORAGE_BUCKET | S3 bucket name |
| S3_REGION | AWS region |
Authentication
All admin ops endpoints require:
- JwtAuthGuard
- AudienceGuard('teetime-admin')
Health and metrics endpoints are typically left unauthenticated so that monitoring systems can poll them.
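The core of an audience check is a comparison against the token's `aud` claim, which per the JWT specification may be a single string or an array of strings. The function below sketches only that comparison on already-verified claims; it is not the project's `AudienceGuard`, and `hasAudience` is a hypothetical name.

```typescript
// Claims are assumed to come from a token whose signature was already verified.
interface JwtClaims {
  aud?: string | string[];
}

// True when the required audience matches the `aud` claim
// (exact match for a string, membership for an array).
function hasAudience(claims: JwtClaims, required: string): boolean {
  const aud = claims.aud;
  if (aud === undefined) return false;
  return Array.isArray(aud) ? aud.indexOf(required) !== -1 : aud === required;
}
```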
Tenant Resolution
The ops service resolves tenant ID from multiple sources:
- User's tenantIds array
- JWT claims
- x-tenant-id header
- Custom claim keys
GET /admin/tenant/profile
// Response
{
"tenantId": "tenant-123",
"displayName": "Golf Club Inc",
"description": "Premium golf management"
}
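A resolver over the sources listed above could be sketched as a chain of lookups. Several details here are assumptions rather than documented behavior: that the list is in priority order, that the first entry of `tenantIds` is used, and that the standard claim is named `tenantId`. All type and function names are hypothetical.

```typescript
interface TenantSources {
  user?: { tenantIds?: string[] };
  claims?: Record<string, unknown>;
  headers?: Record<string, string | undefined>;
  customClaimKeys?: string[];
}

// Try each source in turn and return the first non-empty tenant ID found.
function resolveTenantId(src: TenantSources): string | undefined {
  // 1. First entry of the user's tenantIds array (assumed).
  const fromUser = src.user?.tenantIds?.[0];
  if (fromUser) return fromUser;

  // 2. A tenantId claim in the JWT (claim name assumed).
  const claim = src.claims?.["tenantId"];
  if (typeof claim === "string" && claim !== "") return claim;

  // 3. The x-tenant-id request header.
  const header = src.headers?.["x-tenant-id"];
  if (header) return header;

  // 4. Any configured custom claim key.
  for (const key of src.customClaimKeys ?? []) {
    const value = src.claims?.[key];
    if (typeof value === "string" && value !== "") return value;
  }
  return undefined;
}
```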
Troubleshooting
Health Check Failing
- Check database connectivity
- Verify Redis is running
- Check external service URLs
- Review timeout settings (default: 5s for DB)
Metrics Not Updating
- Verify PROMETHEUS_BASE_URL is set
- Check Prometheus is accessible
- Review query syntax
- Check metrics refresh interval
Queue Depth High
- Check worker processes are running
- Review failed job logs
- Check Redis memory usage
- Consider scaling workers