Admin Ops & Metrics

The admin ops system provides operational visibility through health checks, queue monitoring, latency metrics, and build information.

Architecture

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│ /admin/ops      │────▶│ AdminOpsService  │────▶│ Queue Snapshot  │
│ /health         │     │                  │     │ Prometheus      │
└─────────────────┘     └────────┬─────────┘     └─────────────────┘
                                 │
                                 ▼
                        ┌──────────────────┐
                        │ Health           │
                        │ Indicators       │
                        └──────────────────┘

Sources:

  • apps/teetime/teetime-backend/src/admin/admin-ops.controller.ts
  • apps/teetime/teetime-backend/src/app/health/health.controller.ts

Ops Status

Endpoint

GET /admin/ops/status

// Response
{
  "queues": [
    { "name": "tee-time-queue", "waiting": 5, "active": 2, "delayed": 0, "failed": 1 }
  ],
  "apiLatencyMsP50": 45,
  "apiLatencyMsP95": 120,
  "replicaLagSeconds": 0.5,
  "timestamp": "2025-12-15T08:00:00Z"
}
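
As a rough illustration, a client consuming this payload could type it and derive a few summary values. The interfaces below are inferred from the sample response only, and the helpers `pendingJobs` and `queuesWithFailures` are hypothetical names, not part of the service.

```typescript
// Shape inferred from the sample response above (an assumption, not the
// service's published contract).
interface QueueSnapshot {
  name: string;
  waiting: number;
  active: number;
  delayed: number;
  failed: number;
}

interface OpsStatus {
  queues: QueueSnapshot[];
  apiLatencyMsP50: number;
  apiLatencyMsP95: number;
  replicaLagSeconds: number;
  timestamp: string; // ISO 8601 timestamp
}

// Hypothetical helper: jobs that have not finished yet.
function pendingJobs(q: QueueSnapshot): number {
  return q.waiting + q.active + q.delayed;
}

// Hypothetical helper: names of queues with more than `maxFailed` failed jobs.
function queuesWithFailures(status: OpsStatus, maxFailed: number = 0): string[] {
  return status.queues.filter((q) => q.failed > maxFailed).map((q) => q.name);
}

const sampleStatus: OpsStatus = {
  queues: [{ name: "tee-time-queue", waiting: 5, active: 2, delayed: 0, failed: 1 }],
  apiLatencyMsP50: 45,
  apiLatencyMsP95: 120,
  replicaLagSeconds: 0.5,
  timestamp: "2025-12-15T08:00:00Z",
};

console.log(pendingJobs(sampleStatus.queues[0])); // 7
console.log(queuesWithFailures(sampleStatus)); // [ 'tee-time-queue' ]
```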

Queue Metrics

Metric  | Description
--------|----------------------------------
waiting | Jobs waiting to be processed
active  | Jobs currently being processed
delayed | Jobs scheduled for a future time
failed  | Jobs that failed

Latency Metrics

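Fetched from Prometheus using the following queries, which can be overridden via API_LATENCY_P50_QUERY, API_LATENCY_P95_QUERY, and REPLICA_LAG_QUERY: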

// P50 latency
histogram_quantile(0.5, sum(rate(http_server_duration_seconds_bucket[5m])) by (le))

// P95 latency
histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket[5m])) by (le))

// Replica lag
pg_last_wal_receive_lsn_lag_seconds
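
A minimal sketch of how such a query might be sent to Prometheus and its result read back, using the standard instant-query endpoint (`GET /api/v1/query`). The helper names are hypothetical. The conversion from seconds to milliseconds is an assumption, based only on the `Ms` suffix of the response fields and the `_seconds` suffix of the histogram.

```typescript
// Response shape of a Prometheus instant query returning a vector.
interface PromInstantResponse {
  status: "success" | "error";
  data?: {
    resultType: string;
    result: Array<{ metric: Record<string, string>; value: [number, string] }>;
  };
}

// Builds the instant-query URL. Note: the absolute path replaces any path
// prefix present in baseUrl.
function buildQueryUrl(baseUrl: string, promql: string): string {
  const url = new URL("/api/v1/query", baseUrl);
  url.searchParams.set("query", promql);
  return url.toString();
}

// Returns the first sample as a number, or null when the query failed,
// returned no series, or produced a non-finite value such as "NaN".
function firstSampleValue(res: PromInstantResponse): number | null {
  if (res.status !== "success" || !res.data || res.data.result.length === 0) {
    return null;
  }
  const value = Number(res.data.result[0].value[1]);
  return isFinite(value) ? value : null;
}

// Assumed conversion: histogram values are in seconds, response fields in ms.
function secondsToMs(seconds: number): number {
  return Math.round(seconds * 1000);
}

const p50Query =
  "histogram_quantile(0.5, sum(rate(http_server_duration_seconds_bucket[5m])) by (le))";
console.log(buildQueryUrl("http://prometheus:9090", p50Query));
```

A caller would fetch the URL, parse the JSON body, and pass it to `firstSampleValue`; a `null` result can then be surfaced as "no data" rather than as zero latency.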

Build Info

Endpoint

GET /admin/build-info

// Response
{
  "version": "2.1.0",
  "commitSha": "abc123def456",
  "environment": "production",
  "builtAt": "2025-01-10T14:30:00Z"
}

Configuration Priority

  1. Environment variables (highest)
  2. Build info file (BUILD_INFO_FILE)
  3. Defaults (lowest)

Env Var          | Field
-----------------|------------
BUILD_VERSION    | version
BUILD_COMMIT_SHA | commitSha
NODE_ENV         | environment
BUILD_DATE       | builtAt
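
The priority order above could be sketched as a per-field fallback chain. This is an illustration only: `resolveBuildInfo` is a hypothetical name, the default values are invented placeholders, and the build info file is assumed to have been read and parsed already.

```typescript
interface BuildInfo {
  version: string;
  commitSha: string;
  environment: string;
  builtAt: string;
}

// Placeholder defaults (assumed; the real defaults are not documented here).
const DEFAULT_BUILD_INFO: BuildInfo = {
  version: "0.0.0",
  commitSha: "unknown",
  environment: "development",
  builtAt: "unknown",
};

// Resolves each field as: environment variable, then build info file, then
// default. Empty strings are treated as unset.
function resolveBuildInfo(
  env: Record<string, string | undefined>,
  fileInfo: Partial<BuildInfo> = {},
): BuildInfo {
  const fromEnv: Record<keyof BuildInfo, string | undefined> = {
    version: env.BUILD_VERSION,
    commitSha: env.BUILD_COMMIT_SHA,
    environment: env.NODE_ENV,
    builtAt: env.BUILD_DATE,
  };
  const pick = (key: keyof BuildInfo): string =>
    fromEnv[key] || fileInfo[key] || DEFAULT_BUILD_INFO[key];
  return {
    version: pick("version"),
    commitSha: pick("commitSha"),
    environment: pick("environment"),
    builtAt: pick("builtAt"),
  };
}

const resolved = resolveBuildInfo(
  { BUILD_VERSION: "2.1.0" },
  { version: "1.0.0", commitSha: "abc123def456" },
);
console.log(resolved.version); // "2.1.0" (env var wins over the file)
console.log(resolved.commitSha); // "abc123def456" (taken from the file)
```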

Health Checks

Endpoints

Endpoint              | Purpose
----------------------|-------------------------
GET /health           | Aggregate readiness
GET /health/readiness | Full readiness check
GET /health/liveness  | Lightweight alive check
GET /health/providers | External provider health
GET /health/config    | Environment validation
GET /health/external  | External services (S3)
GET /health/panels    | All panels combined

Health Indicators

Indicator                 | What It Checks
--------------------------|--------------------------------
StorageHealthIndicator    | Storage database ping
TeeSheetDbHealthIndicator | Tee-sheet database (5s timeout)
QueueHealthIndicator      | Redis PING/PONG
RedisReadyIndicator       | Redis cache connectivity
McaHealthIndicator        | MCA API endpoint
ProviderHealthIndicator   | GolfNow, Lightspeed, ForeUp
S3HealthIndicator         | S3 bucket HeadBucket
EnvHealthIndicator        | Required env vars

Response Format

{
  "status": "ok",
  "details": [
    { "storage": { "status": "up", "latencyMs": 42 } },
    { "teeSheetDb": { "status": "up", "latencyMs": 55 } },
    { "queue": { "status": "up", "latencyMs": 23 } }
  ]
}
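
A sketch of how individual indicator results might be folded into this response shape. The `"down"` and `"error"` status values are assumptions (the sample only shows `"up"` and `"ok"`), and `aggregateHealth` and `failingIndicators` are hypothetical helper names.

```typescript
// One indicator result; "down" is an assumed counterpart to "up".
type IndicatorResult = { status: "up" | "down"; latencyMs?: number };

// Each entry in "details" maps one indicator name to its result.
type HealthDetail = Record<string, IndicatorResult>;

interface HealthResponse {
  status: "ok" | "error"; // "error" is an assumed failure value
  details: HealthDetail[];
}

// Names of all indicators that are not "up".
function failingIndicators(details: HealthDetail[]): string[] {
  const failing: string[] = [];
  for (const detail of details) {
    for (const name of Object.keys(detail)) {
      if (detail[name].status !== "up") failing.push(name);
    }
  }
  return failing;
}

// Overall status is "ok" only when every indicator is "up".
function aggregateHealth(details: HealthDetail[]): HealthResponse {
  const status = failingIndicators(details).length === 0 ? "ok" : "error";
  return { status, details };
}

const healthy = aggregateHealth([
  { storage: { status: "up", latencyMs: 42 } },
  { queue: { status: "up", latencyMs: 23 } },
]);
console.log(healthy.status); // "ok"
```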

Prometheus Metrics

Endpoint

GET /metrics
// Returns metrics in the Prometheus text exposition format

Infrastructure Gauges

Metric          | Description
----------------|--------------------------------
storage_up      | Storage DB connectivity (1/0)
tee_sheet_db_up | Tee-sheet DB connectivity
queue_up        | Queue/Redis connectivity
service_ready   | Overall service readiness
provider_up     | Provider availability
s3_up           | S3 bucket connectivity
weather_up      | Weather service availability
geocoding_up    | Geocoding service availability

Queue Gauges

queue_jobs{queue="tee-time-queue", status="waiting"} 5
queue_jobs{queue="tee-time-queue", status="active"} 2
queue_jobs{queue="tee-time-queue", status="delayed"} 0
queue_jobs{queue="tee-time-queue", status="failed"} 1
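
For illustration, gauge lines like these could be produced from a queue snapshot as follows. This is a hand-rolled sketch of the Prometheus text exposition format, not the service's actual exporter (which may well use a metrics library); `renderQueueGauges` is a hypothetical name.

```typescript
interface QueueCounts {
  name: string;
  waiting: number;
  active: number;
  delayed: number;
  failed: number;
}

// Label values must escape backslash, double quote, and line feed.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Renders one queue_jobs sample per queue and status, preceded by a TYPE line.
function renderQueueGauges(queues: QueueCounts[]): string {
  const statuses = ["waiting", "active", "delayed", "failed"] as const;
  const lines: string[] = ["# TYPE queue_jobs gauge"];
  for (const q of queues) {
    for (const status of statuses) {
      const labels = `queue="${escapeLabelValue(q.name)}",status="${status}"`;
      lines.push(`queue_jobs{${labels}} ${q[status]}`);
    }
  }
  return lines.join("\n") + "\n";
}

const gaugeText = renderQueueGauges([
  { name: "tee-time-queue", waiting: 5, active: 2, delayed: 0, failed: 1 },
]);
console.log(gaugeText);
```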

Metrics Refresh

Background services refresh metrics periodically:

Service               | Interval | Config
----------------------|----------|---------------------------
StorageMetricsService | 60s      | METRICS_PING_INTERVAL_MS
QueueMetricsService   | 60s      | METRICS_QUEUE_INTERVAL_MS
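
A small sketch of how an interval setting might be read, assuming the 60s values in the table are defaults that apply when the variable is missing or invalid. `resolveIntervalMs` is a hypothetical helper, and the fallback behavior is an assumption.

```typescript
const DEFAULT_INTERVAL_MS = 60000; // 60s, matching the table above

// Parses a millisecond interval from an env value. Falls back to the default
// when the value is missing, blank, non-numeric, or not positive.
function resolveIntervalMs(
  raw: string | undefined,
  fallbackMs: number = DEFAULT_INTERVAL_MS,
): number {
  if (raw === undefined || raw.trim() === "") return fallbackMs;
  const parsed = Number(raw);
  return isFinite(parsed) && parsed > 0 ? Math.floor(parsed) : fallbackMs;
}

console.log(resolveIntervalMs(undefined)); // 60000
console.log(resolveIntervalMs("15000")); // 15000
console.log(resolveIntervalMs("abc")); // 60000
```

The result could then be handed to a timer, for example `setInterval(refresh, resolveIntervalMs(process.env.METRICS_QUEUE_INTERVAL_MS))`, where `refresh` stands in for whatever the metrics service does on each tick.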

Environment Validation

The config health check validates:

Required:

  • TEETIME_DATABASE_URL

Optional Groups (consistency rules):

Group          | Variables                            | Rule
---------------|--------------------------------------|----------------------------
Redis          | TEE_CACHE_REDIS_URL                  |
OpenMeteo      | OPENMETEO_BASE_URL, OPENMETEO_APIKEY | Both required if either set
Geocoding      | GEOCODING_BASE_URL, OPENMETEO_APIKEY | Both required if either set
Twilio         | TWILIO_ACCOUNT_SID, TWILIO_AUTH_TOKEN | Both required if either set
Booking Policy | BOOKING_ADVANCE_WINDOW_DAYS          | Must be numeric
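
The rules above could be expressed roughly as follows. This sketch covers the required variable, two of the paired groups, and the numeric rule; `validateEnv` and the error message wording are hypothetical, not taken from the service.

```typescript
type Env = Record<string, string | undefined>;

// A variable counts as set only when it has a non-blank value.
function isSet(env: Env, key: string): boolean {
  const value = env[key];
  return value !== undefined && value.trim() !== "";
}

const REQUIRED_VARS = ["TEETIME_DATABASE_URL"];

// Groups where both variables are required if either is set (subset shown).
const PAIRED_GROUPS: Array<{ group: string; vars: [string, string] }> = [
  { group: "OpenMeteo", vars: ["OPENMETEO_BASE_URL", "OPENMETEO_APIKEY"] },
  { group: "Twilio", vars: ["TWILIO_ACCOUNT_SID", "TWILIO_AUTH_TOKEN"] },
];

// Returns a list of human-readable problems; an empty list means valid.
function validateEnv(env: Env): string[] {
  const errors: string[] = [];
  for (const key of REQUIRED_VARS) {
    if (!isSet(env, key)) errors.push(`${key} is required`);
  }
  for (const { group, vars } of PAIRED_GROUPS) {
    const [a, b] = vars;
    if (isSet(env, a) !== isSet(env, b)) {
      errors.push(`${group}: ${a} and ${b} must be set together`);
    }
  }
  const daysKey = "BOOKING_ADVANCE_WINDOW_DAYS";
  if (isSet(env, daysKey) && !isFinite(Number(env[daysKey]))) {
    errors.push(`${daysKey} must be numeric`);
  }
  return errors;
}

console.log(validateEnv({ TEETIME_DATABASE_URL: "postgres://db", TWILIO_ACCOUNT_SID: "AC1" }));
// [ 'Twilio: TWILIO_ACCOUNT_SID and TWILIO_AUTH_TOKEN must be set together' ]
```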

Configuration

Prometheus

Env Var               | Description
----------------------|-------------------------
PROMETHEUS_BASE_URL   | Prometheus server URL
API_LATENCY_P50_QUERY | Custom P50 query
API_LATENCY_P95_QUERY | Custom P95 query
REPLICA_LAG_QUERY     | Custom replica lag query

Queue Monitoring

Env Var                | Description
-----------------------|----------------------------
OPS_STATUS_QUEUE_NAMES | Comma-separated queue names
QUEUE_NAMES            | Fallback queue names
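
A minimal sketch of the fallback between these two variables, assuming QUEUE_NAMES is also comma-separated and is consulted only when OPS_STATUS_QUEUE_NAMES yields no names. `resolveQueueNames` is a hypothetical helper.

```typescript
// Splits a comma-separated list, trimming whitespace and dropping empties.
function parseNames(raw: string | undefined): string[] {
  return (raw ?? "")
    .split(",")
    .map((name) => name.trim())
    .filter((name) => name.length > 0);
}

// Prefers OPS_STATUS_QUEUE_NAMES; falls back to QUEUE_NAMES when it is empty.
function resolveQueueNames(env: Record<string, string | undefined>): string[] {
  const primary = parseNames(env.OPS_STATUS_QUEUE_NAMES);
  return primary.length > 0 ? primary : parseNames(env.QUEUE_NAMES);
}

console.log(resolveQueueNames({ OPS_STATUS_QUEUE_NAMES: "tee-time-queue, notifications" }));
// [ 'tee-time-queue', 'notifications' ]
console.log(resolveQueueNames({ QUEUE_NAMES: "tee-time-queue" }));
// [ 'tee-time-queue' ]
```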

Health Checks

Env Var            | Description
-------------------|-----------------------------------
MCA_PING_ENDPOINTS | Additional MCA endpoints to check
STORAGE_BUCKET     | S3 bucket name
S3_REGION          | AWS region

Authentication

All admin ops endpoints require:

  • JwtAuthGuard
  • AudienceGuard('teetime-admin')

Health and metrics endpoints are typically left unauthenticated so that monitoring systems can reach them.
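
The implementation of AudienceGuard is not shown here. As a rough idea of what an audience check involves, the sketch below tests a decoded token's `aud` claim, which per the JWT specification may be either a single string or an array of strings. `hasAudience` is a hypothetical helper, not the guard's real code.

```typescript
// Minimal view of decoded JWT claims; only "aud" matters for this check.
interface JwtClaims {
  aud?: string | string[];
}

// True when the required audience appears in the token's "aud" claim.
function hasAudience(claims: JwtClaims, required: string): boolean {
  const aud = claims.aud;
  if (aud === undefined) return false;
  return Array.isArray(aud) ? aud.indexOf(required) !== -1 : aud === required;
}

console.log(hasAudience({ aud: "teetime-admin" }, "teetime-admin")); // true
console.log(hasAudience({ aud: ["teetime-web", "teetime-admin"] }, "teetime-admin")); // true
console.log(hasAudience({ aud: "teetime-web" }, "teetime-admin")); // false
```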

Tenant Resolution

The ops service resolves the tenant ID from multiple sources:

  1. User's tenantIds array
  2. JWT claims
  3. x-tenant-id header
  4. Custom claim keys

GET /admin/tenant/profile

// Response
{
  "tenantId": "tenant-123",
  "displayName": "Golf Club Inc",
  "description": "Premium golf management"
}
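
The tenant sources listed above could be checked in sequence along these lines. Several details are assumptions made for the sketch: that the list order is a priority order, that the first entry of `tenantIds` is used, and that the standard claim is named `tenantId`. `resolveTenantId` and `TenantContext` are hypothetical names.

```typescript
interface TenantContext {
  user?: { tenantIds?: string[] };
  claims?: Record<string, unknown>;
  headers?: Record<string, string | undefined>;
}

// Accepts only non-empty strings as tenant IDs.
function asTenantId(value: unknown): string | undefined {
  return typeof value === "string" && value !== "" ? value : undefined;
}

// Checks each source in the listed order and returns the first match.
function resolveTenantId(
  ctx: TenantContext,
  customClaimKeys: string[] = [],
): string | null {
  const claims = ctx.claims ?? {};

  const fromUser = asTenantId(ctx.user?.tenantIds?.[0]); // 1. user's tenantIds
  if (fromUser) return fromUser;

  const fromClaim = asTenantId(claims["tenantId"]); // 2. JWT claim (assumed name)
  if (fromClaim) return fromClaim;

  const fromHeader = asTenantId(ctx.headers?.["x-tenant-id"]); // 3. header
  if (fromHeader) return fromHeader;

  for (const key of customClaimKeys) { // 4. custom claim keys
    const fromCustom = asTenantId(claims[key]);
    if (fromCustom) return fromCustom;
  }
  return null;
}

console.log(resolveTenantId({ headers: { "x-tenant-id": "tenant-123" } })); // "tenant-123"
```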

Troubleshooting

Health Check Failing

  1. Check database connectivity
  2. Verify Redis is running
  3. Check external service URLs
  4. Review timeout settings (default: 5s for DB)

Metrics Not Updating

  1. Verify PROMETHEUS_BASE_URL is set
  2. Check Prometheus is accessible
  3. Review query syntax
  4. Check the metrics refresh intervals (METRICS_PING_INTERVAL_MS, METRICS_QUEUE_INTERVAL_MS)

Queue Depth High

  1. Check worker processes are running
  2. Review failed job logs
  3. Check Redis memory usage
  4. Consider scaling workers