System Health Live

Two surfaces:

  1. /api/health — synchronous structural checks. 12 named checks, persisted to HealthCheckResult on every call. Hit by the hourly cron + post-deploy.
  2. /admin/observability — admin dashboard. Shows the latest stored health result + activity / business / traffic metrics + Test Coverage + Feature Health.

Rebuilt 2026-04-27. The previous version was checking the legacy Restaurant table (Eats archived 2026-04-17) and silent on every system shipped after January 2026. The rebuild added 6 new checks, removed 1 stale check, switched the Order metric to StoreOrder, and added Test Coverage + Feature Health panels to the dashboard.

1. The 12 health checks

GET /api/health?key=<HEALTH_CHECK_SECRET> runs all checks in parallel.

| Check | Verifies | Failure modes |
| --- | --- | --- |
| database | Prisma reachable; SELECT 1 succeeds | DB outage |
| critical_pages | Homepage + main routes (shop / eats / reels / requests / profile) return 200 | Build broken; route handler crashed |
| auth_integrity | Consumer auth round-trip works; reel ownership intact | JWT secret mismatch; cookie scope drift |
| reel_data_integrity | 5 most-recent active reels have valid data shape | Schema drift |
| route_integrity | Product routes correct | Missing dynamic route file |
| store_orders | Active live store + ≥1 product + ≥1 StoreOrder in last 7d | Catalog empty; funnel broken |
| whatsapp_links | WhatsApp button data is valid for at least one consumer | Phone number missing on consumer |
| bidding_system | No PriceOffers stuck past expiry; lazy-expire cron working | Bidding state-machine drift |
| admin_audit_log | AdminAction is being written when admins act | Audit helper bypassed; admin actions not flowing through /actions route |
| push_tokens | Active push token count per app (consumer / seller / doctor); warns on zero | Token registration broken; users haven't enabled push |
| moderation_queue | Open ticket count + warns on tickets open >7 days | Admin not keeping up; no admin assigned |
| broadcast_system | Latest sent broadcast's delivery ratio; fails if <30% delivered on >5 recipients | Push infra down; FCM/APNs token expiry |

⚡ = added 2026-04-27 in the v2 rebuild.

The legacy checkOrderSystem (which checked the Restaurant table) was removed.
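
As a rough illustration of running the checks in parallel, the sketch below uses Promise.allSettled with a fallback name array so that a check that throws still appears under its own name. The names runChecks, CheckResult, and Check are hypothetical, and the result shape is an assumption rather than the route's actual types.

```typescript
type CheckStatus = "pass" | "warn" | "fail";

interface CheckResult {
  name: string;
  status: CheckStatus;
  message: string;
}

type Check = () => Promise<CheckResult>;

// Hypothetical helper: runs every check concurrently and never rejects.
// names[i] is the fallback label for checks[i] if that check throws.
async function runChecks(checks: Check[], names: string[]): Promise<CheckResult[]> {
  // The async wrapper turns a synchronous throw into a rejection as well.
  const settled = await Promise.allSettled(checks.map(async (check) => check()));

  return settled.map((outcome, i): CheckResult => {
    if (outcome.status === "fulfilled") return outcome.value;
    const reason =
      outcome.reason instanceof Error ? outcome.reason.message : String(outcome.reason);
    // Assumption: a thrown check is reported as a fail, not dropped.
    return { name: names[i] ?? `check_${i}`, status: "fail", message: `Check threw: ${reason}` };
  });
}
```

Because allSettled waits for every promise, one crashing check cannot hide the results of the others.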

2. Status semantics

  • pass — check ran and result is healthy
  • warn — check ran, found something worth flagging but not breaking (e.g. Abdul Flower Works is live but has no products)
  • fail — check ran and found a real problem (e.g. Stuck offers: 6 past expiry)

Overall status:

  • Any fail → critical (HTTP 503)
  • Any warn (no fails) → degraded (HTTP 200)
  • All pass → healthy (HTTP 200)
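
The roll-up rules above can be sketched as a small pure function. The name rollUp and the return shape are hypothetical; treating an empty check list as healthy is an assumption, since the source does not cover that case.

```typescript
type CheckStatus = "pass" | "warn" | "fail";
type OverallStatus = "healthy" | "degraded" | "critical";

interface Rollup {
  status: OverallStatus;
  httpStatus: number;
}

// Hypothetical helper: maps per-check statuses to the overall status and HTTP code.
function rollUp(statuses: CheckStatus[]): Rollup {
  // A single fail outranks any number of warns.
  if (statuses.some((s) => s === "fail")) return { status: "critical", httpStatus: 503 };
  if (statuses.some((s) => s === "warn")) return { status: "degraded", httpStatus: 200 };
  // Assumption: an empty list also lands here and counts as healthy.
  return { status: "healthy", httpStatus: 200 };
}
```

Note that degraded still returns HTTP 200, so a monitor that only looks at the status code will not see warns; it has to read the status field in the body.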

3. Where the data goes

Every /api/health call writes a row to HealthCheckResult:

model HealthCheckResult {
  id          Int      @id @default(autoincrement())
  status      String   // healthy | degraded | critical
  totalChecks Int
  passed      Int
  failed      Int
  duration    Int      // total ms
  results     Json     // detailed per-check results
  triggeredBy String   // scheduled | manual | post_deploy
  alertSent   Boolean
  createdAt   DateTime @default(now())
}

The hourly cron in .github/workflows/health-check.yml triggers it at :17 past every hour. The dashboard's "Health History" graph reads recent rows from this table.
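
To make the column meanings concrete, here is a minimal sketch of how one run's results might be summarised into the fields of a HealthCheckResult row. The function toHealthRow and its types are hypothetical. Two points are assumptions: that warn results count as neither passed nor failed, and that alertSent is true whenever the run is not healthy.

```typescript
type CheckStatus = "pass" | "warn" | "fail";
type Trigger = "scheduled" | "manual" | "post_deploy";

interface CheckResult {
  name: string;
  status: CheckStatus;
}

// Mirrors the model's columns except id and createdAt, which have defaults.
interface HealthRow {
  status: "healthy" | "degraded" | "critical";
  totalChecks: number;
  passed: number;
  failed: number;
  duration: number;
  results: CheckResult[];
  triggeredBy: Trigger;
  alertSent: boolean;
}

function toHealthRow(results: CheckResult[], triggeredBy: Trigger, durationMs: number): HealthRow {
  const passed = results.filter((r) => r.status === "pass").length;
  const failed = results.filter((r) => r.status === "fail").length;
  // Assumption: warns are whatever is left over; they are not stored as a column.
  const warned = results.length - passed - failed;
  const status = failed > 0 ? "critical" : warned > 0 ? "degraded" : "healthy";

  return {
    status,
    totalChecks: results.length,
    passed,
    failed,
    duration: Math.round(durationMs), // the column is Int, so round to whole ms
    results,
    triggeredBy,
    // Assumption: an alert is sent for every degraded or critical run.
    alertSent: status !== "healthy",
  };
}
```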

4. Admin dashboard sections

/admin/observability (admin-gated). Top to bottom:

  1. Health status — latest result + 24h/7d/30d history graph
  2. Errors — frontend ClientError rows by type + recent 20 unresolved
  3. Activity — total UserEvents + active user count + events by type
  4. Business — orders count, revenue (₹), reels created, reel views, requests, new consumers (current range). Switched from legacy Order table to StoreOrder 2026-04-27.
  5. Page Views Heatmap — top entity types viewed
  6. Traffic — unique visitors (fingerprint) + sessions + top viewed pages
  7. System Totals — consumers, sellers, products, restaurants, reels (all-time)
  8. Test Coverage ⚡ — one tile per app: backend (vitest), mobile (consumer, jest-expo), mobile-seller (jest-expo), mobile-doctor (jest-expo). Each tile shows test file count + it()-case count + framework + path. Apps with zero tests get an amber warning.
  9. Feature Health ⚡ — bidding state breakdown (queued / accepted / converted / expired), moderation depth + staleness, push tokens per app, broadcast delivery counts (last 30d), admin actions by category (last 30d) + impersonation count.

⚡ = added 2026-04-27.

5. Test inventory helper

src/lib/test-inventory.ts — synchronous file-walker (~10ms) used by the dashboard's Test Coverage panel.

import { getTestInventory, getTestTotals } from "@/lib/test-inventory";

const inv = getTestInventory(repoRoot);
// → [{ app: "backend (web + API)", framework: "vitest", rootDir: "tests",
// testFiles: 66, testCases: 2478 }, ... 3 more apps]
const totals = getTestTotals(inv);
// → { files: 100, cases: 2700, apps: 4 }

Counts .test.ts and .test.tsx files, plus the it( and test( declarations inside them. This is NOT a replacement for actually running the suites; the hourly cron and every CI deploy do that (and they store results in HealthCheckResult). The static counter only shows shape, so we notice if a worktree is broken or test files vanish.
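
The counting rule can be illustrated with two regular expressions. This is a naive sketch, not the contents of test-inventory.ts: isTestFile and countTestCases are hypothetical names, and the pattern will also count matches inside comments or calls such as someRegex.test(.

```typescript
// Matches file names ending in .test.ts or .test.tsx.
const TEST_FILE = /\.test\.tsx?$/;

// Matches it( and test( at a word boundary, so submit( is not counted.
const TEST_DECL = /\b(?:it|test)\(/g;

function isTestFile(path: string): boolean {
  return TEST_FILE.test(path);
}

function countTestCases(source: string): number {
  // String.match with a global regex returns every match, or null when there is none.
  return (source.match(TEST_DECL) ?? []).length;
}
```

A text match like this is cheap enough for a dashboard request, which is the trade-off the panel makes: fast and approximate rather than exact.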

6. Pinning contract — why the dashboard stays current

tests/system-health-coverage.test.ts (25 cases) is the regression net:

  • Every shipped major system MUST have a checkXyz() function declared AND a call site in the Promise.allSettled list.
  • The status-name fallback array MUST contain entries for the 6 new checks (so a thrown check gets the right name in the response).
  • Legacy checkOrderSystem MUST be gone.
  • test-inventory.ts walks all 4 apps + counts are consistent + backend has ≥50 test files (sanity floor).
  • API response includes tests: block + features: block.
  • Dashboard page renders Test Coverage and Feature Health sections.

If a future PR adds a new system without adding a health check, this test fails. If someone deletes a check by accident, this test fails. The dashboard cannot drift silently again.
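
One way such a pinning test could work is to read the route's source as text and assert on it. The sketch below is an assumption about the approach, not the actual test file: missingChecks is a hypothetical helper, and it assumes checks are declared with the function keyword.

```typescript
// Returns the names from `required` that lack a declaration or a call site.
function missingChecks(routeSource: string, required: string[]): string[] {
  return required.filter((fn) => {
    const declared = new RegExp(`function\\s+${fn}\\s*\\(`).test(routeSource);
    const mentions = routeSource.match(new RegExp(`\\b${fn}\\s*\\(`, "g")) ?? [];
    // The declaration matches the call pattern once, so a real call site needs 2+ matches.
    return !declared || mentions.length < 2;
  });
}

// The legacy check must not appear anywhere in the source.
function hasLegacyCheck(routeSource: string): boolean {
  return /\bcheckOrderSystem\b/.test(routeSource);
}
```

A test would then expect missingChecks to return an empty array and hasLegacyCheck to return false, so the failure message names exactly which check is missing.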

7. Hourly cron + alerts

.github/workflows/health-check.yml runs at :17 past every hour:

  1. Hits https://ka26.shop/api/health?key=<secret> with 3 retries
  2. If status is degraded or critical, the route writes a SellerNotification for the admin (in-app bell icon) + logs to Cloud Run stderr (visible in Cloud Logging)
  3. Runs tests/e2e-smoke.test.ts, tests/e2e-signup-smoke.test.ts, tests/e2e-critical-funnels-smoke.test.ts against production
  4. Checks SSL cert expiry
  5. Emails on failure

8. Common signals + interpretation

| Signal | What it means | What to do |
| --- | --- | --- |
| store_orders: warn "store X has no products" | A live store with zero products = empty buyer experience | Reach out to seller; help them list at least one product |
| push_tokens: warn "Zero active for doctor" | Doctor app pushes silently fail | Check doctor login flow; verify FCM/APNs token registration |
| bidding_system: fail "Stuck offers" | Lazy-expire path is broken | Verify POST /api/products/[id]/offers still has the updateMany block at the top |
| moderation_queue: warn "tickets open >7d" | Admin isn't keeping up | Triage the queue at /admin/moderation |
| broadcast_system: fail "<30% delivered" | Push infra down OR token validity expired | Check Expo Push status; rotate tokens if needed |
| admin_audit_log: warn "0 actions in 30d (1 admin)" | Either admin isn't acting OR something bypasses the helper | Verify destructive flows actually go through /api/admin/entities/[type]/[id]/actions |
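
When interpreting a broadcast_system signal, it helps to see the threshold as code. This sketch encodes the stated rule (fail below 30% delivered, only when there are more than 5 recipients). The name broadcastStatus is hypothetical, and returning pass for small sends is an assumption; the real check may warn or skip instead.

```typescript
type CheckStatus = "pass" | "warn" | "fail";

function broadcastStatus(delivered: number, recipients: number): CheckStatus {
  // Sends to 5 or fewer recipients are too small to judge (assumption: treat as pass).
  if (recipients <= 5) return "pass";
  // Strictly below 30% fails; exactly 30% passes.
  return delivered / recipients < 0.3 ? "fail" : "pass";
}
```

For example, 1 delivered out of 10 fails, while 0 delivered out of 5 still passes because the send is under the recipient floor.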