System Health Live

Two surfaces:

/api/health — synchronous structural checks. 12 named checks, persisted to HealthCheckResult on every call. Hit by the hourly cron + post-deploy.
/admin/observability — admin dashboard. Shows the latest stored health result + activity / business / traffic metrics + Test Coverage + Feature Health.

Rebuilt 2026-04-27. The previous version still checked the legacy Restaurant table (Eats was archived 2026-04-17) and was silent on every system shipped after January 2026. The rebuild added 6 new checks, removed 1 stale check, switched the Order metric to StoreOrder, and added Test Coverage + Feature Health panels to the dashboard.

1. The 12 health checks

GET /api/health?key=<HEALTH_CHECK_SECRET> runs all checks in parallel.

Check	Verifies	Failure modes
`database`	Prisma reachable; `SELECT 1` succeeds	DB outage
`critical_pages`	Homepage + main routes (shop / eats / reels / requests / profile) return 200	Build broken; route handler crashed
`auth_integrity`	Consumer auth round-trip works; reel ownership intact	JWT secret mismatch; cookie scope drift
`reel_data_integrity`	5 most-recent active reels have valid data shape	Schema drift
`route_integrity`	Product routes correct	Missing dynamic route file
`store_orders` ⚡	Active live store + ≥1 product + ≥1 StoreOrder in last 7d	Catalog empty; funnel broken
`whatsapp_links`	WhatsApp button data is valid for at least one consumer	Phone number missing on consumer
`bidding_system` ⚡	No PriceOffers stuck past expiry; lazy-expire cron working	Bidding state-machine drift
`admin_audit_log` ⚡	AdminAction is being written when admins act	Audit helper bypassed; admin actions not flowing through `/actions` route
`push_tokens` ⚡	Active push token count per app (consumer / seller / doctor); warns on zero	Token registration broken; users haven't enabled push
`moderation_queue` ⚡	Open ticket count + warns on tickets open >7 days	Admin not keeping up; no admin assigned
`broadcast_system` ⚡	Latest sent broadcast's delivery ratio; fails if <30% delivered on >5 recipients	Push infra down; FCM/APNs token expiry

⚡ = added 2026-04-27 in the v2 rebuild.

The legacy checkOrderSystem (which checked the Restaurant table) was removed.
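The `broadcast_system` row is the one check in the table with explicit numeric thresholds, so it is easy to sketch. The snippet below is an illustration only: `evaluateBroadcast`, `BroadcastStats`, and its field names are hypothetical, and treating small broadcasts (5 recipients or fewer) as a pass is an assumption, since the table only says when the check fails.

```typescript
type CheckStatus = "pass" | "warn" | "fail";

// Hypothetical shape: these field names are assumptions, not the real broadcast model.
interface BroadcastStats {
  recipients: number;
  delivered: number;
}

// Thresholds taken from the broadcast_system row in the table above.
const MIN_DELIVERY_RATIO = 0.3;
const MIN_RECIPIENTS = 5;

function evaluateBroadcast(stats: BroadcastStats): CheckStatus {
  // Assumption: broadcasts to 5 or fewer recipients are too small to judge, so they pass.
  if (stats.recipients <= MIN_RECIPIENTS) return "pass";
  const ratio = stats.delivered / stats.recipients;
  // Fail only when strictly fewer than 30% of recipients were reached.
  return ratio < MIN_DELIVERY_RATIO ? "fail" : "pass";
}
```

Keeping the thresholds as named constants makes it obvious which numbers to change if the 30% / 5-recipient rule is ever tuned.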

2. Status semantics

pass — check ran and result is healthy
warn — check ran, found something worth flagging but not breaking (e.g. Abdul Flower Works is live but has no products)
fail — check ran and found a real problem (e.g. Stuck offers: 6 past expiry)

Overall status:

Any fail → critical (HTTP 503)
Any warn (no fails) → degraded (HTTP 200)
All pass → healthy (HTTP 200)
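The roll-up rules above can be sketched as a small pure function. This is a minimal illustration, not the route's actual code: the name `summarize` and the return shape are hypothetical, and returning `healthy` for an empty list is an assumption.

```typescript
type CheckStatus = "pass" | "warn" | "fail";
type OverallStatus = "healthy" | "degraded" | "critical";

interface Summary {
  status: OverallStatus;
  httpCode: number;
}

function summarize(statuses: CheckStatus[]): Summary {
  // Any fail wins over everything else and maps to HTTP 503.
  if (statuses.some((s) => s === "fail")) {
    return { status: "critical", httpCode: 503 };
  }
  // Warns without fails still return HTTP 200, but flag the run as degraded.
  if (statuses.some((s) => s === "warn")) {
    return { status: "degraded", httpCode: 200 };
  }
  // Assumption: an empty list is treated the same as all-pass.
  return { status: "healthy", httpCode: 200 };
}
```

Because `degraded` still returns 200, an uptime monitor that only looks at the HTTP code will not notice warnings; it has to read the status in the response body.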

3. Where the data goes

Every /api/health call writes a row to HealthCheckResult:

model HealthCheckResult {
  id          Int      @id @default(autoincrement())
  status      String   // healthy | degraded | critical
  totalChecks Int
  passed      Int
  failed      Int
  duration    Int      // total ms
  results     Json     // detailed per-check results
  triggeredBy String   // scheduled | manual | post_deploy
  alertSent   Boolean
  createdAt   DateTime @default(now())
}

The hourly cron in .github/workflows/health-check.yml triggers it at :17 past every hour. The dashboard's "Health History" graph reads recent rows from this table.
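As a rough illustration of how a set of per-check results could map onto the HealthCheckResult columns, here is a sketch. `buildHealthRow`, `CheckResult`, and `HealthRow` are hypothetical names, the per-check shape inside `results` is an assumption, and so is the choice to count warns in neither `passed` nor `failed`; the schema above does not say how warns are tallied.

```typescript
type CheckStatus = "pass" | "warn" | "fail";
type OverallStatus = "healthy" | "degraded" | "critical";
type TriggeredBy = "scheduled" | "manual" | "post_deploy";

// Hypothetical per-check shape; the real contents of `results` are not shown in this doc.
interface CheckResult {
  name: string;
  status: CheckStatus;
  message: string;
}

// Mirrors the HealthCheckResult columns, minus id and createdAt (database defaults).
interface HealthRow {
  status: OverallStatus;
  totalChecks: number;
  passed: number;
  failed: number;
  duration: number;
  results: CheckResult[];
  triggeredBy: TriggeredBy;
  alertSent: boolean;
}

function buildHealthRow(
  results: CheckResult[],
  durationMs: number,
  triggeredBy: TriggeredBy,
  alertSent: boolean,
): HealthRow {
  const passed = results.filter((r) => r.status === "pass").length;
  const failed = results.filter((r) => r.status === "fail").length;
  // Assumption: warns are whatever is left over and land in neither counter.
  const warned = results.length - passed - failed;
  const status: OverallStatus =
    failed > 0 ? "critical" : warned > 0 ? "degraded" : "healthy";
  return {
    status,
    totalChecks: results.length,
    passed,
    failed,
    duration: durationMs,
    results,
    triggeredBy,
    alertSent,
  };
}
```

An object of this shape could then be handed to whatever persistence call the route uses; that call is deliberately left out here.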

4. Admin dashboard sections

/admin/observability (admin-gated). Top to bottom:

Health status — latest result + 24h/7d/30d history graph
Errors — frontend ClientError rows by type + recent 20 unresolved
Activity — total UserEvents + active user count + events by type
Business — orders count, revenue (₹), reels created, reel views, requests, new consumers (current range). Switched from legacy Order table to StoreOrder 2026-04-27.
Page Views Heatmap — top entity types viewed
Traffic — unique visitors (fingerprint) + sessions + top viewed pages
System Totals — consumers, sellers, products, restaurants, reels (all-time)
Test Coverage ⚡ — one tile per app: backend (vitest), mobile (consumer, jest-expo), mobile-seller (jest-expo), mobile-doctor (jest-expo). Each tile shows test file count + it()-case count + framework + path. Apps with zero tests get an amber warning.
Feature Health ⚡ — bidding state breakdown (queued / accepted / converted / expired), moderation depth + staleness, push tokens per app, broadcast delivery counts (last 30d), admin actions by category (last 30d) + impersonation count.

⚡ = added 2026-04-27.

5. Test inventory helper

src/lib/test-inventory.ts — synchronous file-walker (~10ms) used by the dashboard's Test Coverage panel.

import { getTestInventory, getTestTotals } from "@/lib/test-inventory";

const inv = getTestInventory(repoRoot);
// → [{ app: "backend (web + API)", framework: "vitest", rootDir: "tests",
//     testFiles: 66, testCases: 2478 }, ... 3 more apps]
const totals = getTestTotals(inv);
// → { files: 100, cases: 2700, apps: 4 }

Counts .test.ts and .test.tsx files, and counts it( and test( declarations within them. This is NOT a replacement for actually running the suites: the hourly cron and every CI deploy do that (and they store results in HealthCheckResult). The static counter only shows the shape of the test suite, so we notice if a worktree is broken or test files vanish.
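The counting rule described above can be approximated with two regular expressions. This is a sketch under assumptions, not the contents of test-inventory.ts: `isTestFile` and `countTestCases` are hypothetical helper names, and the real helper may match differently.

```typescript
// Matches file names ending in .test.ts or .test.tsx.
const TEST_FILE = /\.test\.tsx?$/;

// Matches `it(` or `test(` at a word boundary, so `submit(` is not counted.
const TEST_CASE = /\b(?:it|test)\s*\(/g;

function isTestFile(path: string): boolean {
  return TEST_FILE.test(path);
}

function countTestCases(source: string): number {
  // String.prototype.match with a global regex returns every match, or null if there are none.
  const matches = source.match(TEST_CASE);
  return matches === null ? 0 : matches.length;
}
```

A text match like this also counts `it(` inside comments or strings and misses variants such as `it.each(`, which is acceptable for a shape indicator but is another reason not to treat the numbers as a test run.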

6. Pinning contract — why the dashboard stays current

tests/system-health-coverage.test.ts (25 cases) is the regression net:

Every shipped major system MUST have a checkXyz() function declared AND a call site in the Promise.allSettled list.
The status-name fallback array MUST contain entries for the 6 new checks (so a thrown check gets the right name in the response).
Legacy checkOrderSystem MUST be gone.
test-inventory.ts walks all 4 apps + counts are consistent + backend has ≥50 test files (sanity floor).
API response includes tests: block + features: block.
Dashboard page renders Test Coverage and Feature Health sections.

If a future PR adds a new system without adding a health check, this test fails. If someone deletes a check by accident, this test fails. The dashboard cannot drift silently again.
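To show why the fallback name array matters, here is a minimal sketch of the Promise.allSettled pattern the contract refers to. `runChecks`, `CheckResult`, and the `check_<index>` default are hypothetical; the real route's signatures are not shown in this doc.

```typescript
type CheckStatus = "pass" | "warn" | "fail";

// Hypothetical per-check result shape.
interface CheckResult {
  name: string;
  status: CheckStatus;
  message: string;
}

async function runChecks(
  checks: Array<() => Promise<CheckResult>>,
  fallbackNames: string[],
): Promise<CheckResult[]> {
  // The async wrapper turns a synchronous throw into a rejection, so one bad check
  // cannot abort the whole batch.
  const settled = await Promise.allSettled(checks.map(async (check) => check()));

  return settled.map((outcome, i): CheckResult => {
    if (outcome.status === "fulfilled") return outcome.value;
    // A thrown check never produced its own name, so look it up by position.
    const reason =
      outcome.reason instanceof Error ? outcome.reason.message : String(outcome.reason);
    return {
      name: fallbackNames[i] ?? `check_${i}`,
      status: "fail",
      message: `check threw: ${reason}`,
    };
  });
}
```

The lookup is positional, so the fallback array has to stay in the same order as the call list. In this sketch, a check added without a matching fallback entry would surface as `check_<index>` when it throws, which is the kind of drift the pinning test is meant to catch.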

7. Hourly cron + alerts

.github/workflows/health-check.yml runs at :17 past every hour:

Hits https://ka26.shop/api/health?key=<secret> with 3 retries
If status is degraded or critical, the route writes a SellerNotification for the admin (in-app bell icon) + logs to Cloud Run stderr (visible in Cloud Logging)
Runs tests/e2e-smoke.test.ts, tests/e2e-signup-smoke.test.ts, tests/e2e-critical-funnels-smoke.test.ts against production
Checks SSL cert expiry
Emails on failure
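The retry step above follows a common pattern, sketched here in TypeScript for illustration. However the workflow file actually implements it, the logic is the same. `withRetries` is a hypothetical helper; `maxAttempts` counts total attempts, and whether "3 retries" in the workflow means three or four total attempts is not stated here.

```typescript
// Runs `attempt` up to `maxAttempts` times and returns the first successful result.
// If every attempt fails, the last error is rethrown.
async function withRetries<T>(
  attempt: () => Promise<T>,
  maxAttempts: number,
): Promise<T> {
  let lastError: unknown = new Error("maxAttempts must be at least 1");
  for (let i = 0; i < maxAttempts; i++) {
    try {
      return await attempt();
    } catch (err) {
      // Remember the failure and try again; this sketch does not wait between attempts.
      lastError = err;
    }
  }
  throw lastError;
}
```

In a real probe the `attempt` callback would wrap the HTTP request to /api/health, and a short delay between attempts would usually be added so that a brief outage does not exhaust all retries at once.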

8. Common signals + interpretation

Signal	What it means	What to do
`store_orders: warn` "store X has no products"	A live store with zero products = empty buyer experience	Reach out to seller; help them list at least one product
`push_tokens: warn` "Zero active for doctor"	Doctor app pushes silently fail	Check doctor login flow; verify FCM/APNs token registration
`bidding_system: fail` "Stuck offers"	Lazy-expire path is broken	Verify `POST /api/products/[id]/offers` still has the `updateMany` block at the top
`moderation_queue: warn` "tickets open >7d"	Admin isn't keeping up	Triage the queue at `/admin/moderation`
`broadcast_system: fail` "<30% delivered"	Push infra down OR token validity expired	Check Expo Push status; rotate tokens if needed
`admin_audit_log: warn` "0 actions in 30d (1 admin)"	Either admin isn't acting OR something bypasses the helper	Verify destructive flows actually go through `/api/admin/entities/[type]/[id]/actions`

9. Related

Admin Panel — feeds AdminAction writes that the audit log check verifies
Bidding — bidding_system check catches lazy-expire regressions
Operations / Cloudflare — DNS health is separate (browser → Google anycast → Cloud Run)
CHANGELOG: 2026-04-27 System Health v2
