System Health Live
Two surfaces:
- `/api/health`: synchronous structural checks. 12 named checks, persisted to `HealthCheckResult` on every call. Hit by the hourly cron + post-deploy.
- `/admin/observability`: admin dashboard. Shows the latest stored health result + activity / business / traffic metrics + Test Coverage + Feature Health.
Rebuilt 2026-04-27. The previous version was still checking the legacy Restaurant table (Eats archived 2026-04-17) and was silent on every system shipped after January 2026. The rebuild added 6 new checks, removed 1 stale check, switched the Order metric to StoreOrder, and added Test Coverage + Feature Health panels to the dashboard.
1. The 12 health checks
GET /api/health?key=<HEALTH_CHECK_SECRET> runs all checks in parallel.
| Check | Verifies | Failure modes |
|---|---|---|
| `database` | Prisma reachable; SELECT 1 succeeds | DB outage |
| `critical_pages` | Homepage + main routes (shop / eats / reels / requests / profile) return 200 | Build broken; route handler crashed |
| `auth_integrity` | Consumer auth round-trip works; reel ownership intact | JWT secret mismatch; cookie scope drift |
| `reel_data_integrity` | 5 most-recent active reels have valid data shape | Schema drift |
| `route_integrity` | Product routes correct | Missing dynamic route file |
| `store_orders` ⚡ | Active live store + ≥1 product + ≥1 StoreOrder in last 7d | Catalog empty; funnel broken |
| `whatsapp_links` | WhatsApp button data is valid for at least one consumer | Phone number missing on consumer |
| `bidding_system` ⚡ | No PriceOffers stuck past expiry; lazy-expire cron working | Bidding state-machine drift |
| `admin_audit_log` ⚡ | AdminAction is being written when admins act | Audit helper bypassed; admin actions not flowing through /actions route |
| `push_tokens` ⚡ | Active push token count per app (consumer / seller / doctor); warns on zero | Token registration broken; users haven't enabled push |
| `moderation_queue` ⚡ | Open ticket count + warns on tickets open >7 days | Admin not keeping up; no admin assigned |
| `broadcast_system` ⚡ | Latest sent broadcast's delivery ratio; fails if <30% delivered on >5 recipients | Push infra down; FCM/APNs token expiry |
⚡ = added 2026-04-27 in the v2 rebuild.
The legacy checkOrderSystem (which checked the Restaurant table) was removed.
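A minimal sketch of how running the checks in parallel could look, assuming each check is an async function that returns a name, a status, and a message. `CheckResult`, `runChecks`, and the `fallbackNames` parameter are illustrative names, not the route's actual code.

```typescript
// Hypothetical shapes; the real route's types are not shown in this document.
type CheckStatus = "pass" | "warn" | "fail";

interface CheckResult {
  name: string;
  status: CheckStatus;
  message: string;
}

type Check = () => Promise<CheckResult>;

// Run every check concurrently. A check that throws is reported as "fail"
// under the name found at the same index of the fallback array, so the
// response still says which check blew up.
async function runChecks(
  checks: Check[],
  fallbackNames: string[],
): Promise<CheckResult[]> {
  const settled = await Promise.allSettled(checks.map((check) => check()));
  return settled.map((outcome, i) =>
    outcome.status === "fulfilled"
      ? outcome.value
      : {
          name: fallbackNames[i] ?? `check_${i}`,
          status: "fail" as const,
          message:
            outcome.reason instanceof Error
              ? outcome.reason.message
              : String(outcome.reason),
        },
  );
}
```

`Promise.allSettled` never rejects, so one crashing check cannot hide the results of the others.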
2. Status semantics
- `pass`: check ran and result is healthy
- `warn`: check ran, found something worth flagging but not breaking (e.g. `Abdul Flower Works is live but has no products`)
- `fail`: check ran and found a real problem (e.g. `Stuck offers: 6 past expiry`)
Overall status:
- Any `fail` → `critical` (HTTP 503)
- Any `warn` (no fails) → `degraded` (HTTP 200)
- All `pass` → `healthy` (HTTP 200)
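The roll-up rules above can be sketched as a small pure function. `rollUp` and its return shape are hypothetical names chosen for illustration; only the status values and HTTP codes come from the rules listed here.

```typescript
type CheckStatus = "pass" | "warn" | "fail";
type OverallStatus = "healthy" | "degraded" | "critical";

// Fold per-check statuses into the overall status and HTTP code.
// Precedence: any fail beats any warn, which beats all-pass.
// Assumption: an empty list is treated as healthy.
function rollUp(statuses: CheckStatus[]): {
  overall: OverallStatus;
  httpStatus: number;
} {
  if (statuses.some((s) => s === "fail")) {
    return { overall: "critical", httpStatus: 503 };
  }
  if (statuses.some((s) => s === "warn")) {
    return { overall: "degraded", httpStatus: 200 };
  }
  return { overall: "healthy", httpStatus: 200 };
}
```

Note that `degraded` still returns HTTP 200, so a monitor that only looks at the HTTP code will see warnings as success; it has to read the body's status to notice them.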
3. Where the data goes
Every /api/health call writes a row to HealthCheckResult:
```prisma
model HealthCheckResult {
  id          Int      @id @default(autoincrement())
  status      String   // healthy | degraded | critical
  totalChecks Int
  passed      Int
  failed      Int
  duration    Int      // total ms
  results     Json     // detailed per-check results
  triggeredBy String   // scheduled | manual | post_deploy
  alertSent   Boolean
  createdAt   DateTime @default(now())
}
```
The hourly cron in .github/workflows/health-check.yml triggers it at :17 past every hour. The dashboard's "Health History" graph reads recent rows from this table.
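As a rough illustration of how one run maps onto a `HealthCheckResult` row, the sketch below builds the row payload from per-check results. The field names follow the model above; `buildHealthRow`, `HealthRowInput`, and the `CheckResult` shape are assumptions, as is the choice to count only `pass` in `passed` and only `fail` in `failed` (so warnings are the remainder).

```typescript
type CheckStatus = "pass" | "warn" | "fail";
type TriggeredBy = "scheduled" | "manual" | "post_deploy";

// Hypothetical per-check result shape.
interface CheckResult {
  name: string;
  status: CheckStatus;
  message: string;
}

// Mirrors the model's columns, minus id and createdAt (filled by the database).
interface HealthRowInput {
  status: "healthy" | "degraded" | "critical";
  totalChecks: number;
  passed: number;
  failed: number;
  duration: number; // total ms
  results: CheckResult[];
  triggeredBy: TriggeredBy;
  alertSent: boolean;
}

function buildHealthRow(
  results: CheckResult[],
  durationMs: number,
  triggeredBy: TriggeredBy,
  alertSent: boolean,
): HealthRowInput {
  const passed = results.filter((r) => r.status === "pass").length;
  const failed = results.filter((r) => r.status === "fail").length;
  // Assumption: whatever is neither pass nor fail is a warning.
  const warned = results.length - passed - failed;
  const status = failed > 0 ? "critical" : warned > 0 ? "degraded" : "healthy";
  return {
    status,
    totalChecks: results.length,
    passed,
    failed,
    duration: durationMs,
    results,
    triggeredBy,
    alertSent,
  };
}
```

The returned object could then be handed to whatever persistence call the route uses; that call is deliberately left out because it is not shown in this document.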
4. Admin dashboard sections
/admin/observability (admin-gated). Top to bottom:
- Health status: latest result + 24h/7d/30d history graph
- Errors: frontend `ClientError` rows by type + recent 20 unresolved
- Activity: total UserEvents + active user count + events by type
- Business: orders count, revenue (₹), reels created, reel views, requests, new consumers (current range). Switched from legacy `Order` table to `StoreOrder` 2026-04-27.
- Page Views Heatmap: top entity types viewed
- Traffic: unique visitors (fingerprint) + sessions + top viewed pages
- System Totals: consumers, sellers, products, restaurants, reels (all-time)
- Test Coverage ⚡: one tile per app (`backend (vitest)`, `mobile (consumer, jest-expo)`, `mobile-seller (jest-expo)`, `mobile-doctor (jest-expo)`). Each tile shows test file count + `it()`-case count + framework + path. Apps with zero tests get an amber warning.
- Feature Health ⚡: bidding state breakdown (queued / accepted / converted / expired), moderation depth + staleness, push tokens per app, broadcast delivery counts (last 30d), admin actions by category (last 30d) + impersonation count.
⚡ = added 2026-04-27.
5. Test inventory helper
src/lib/test-inventory.ts — synchronous file-walker (~10ms) used by the dashboard's Test Coverage panel.
```typescript
import { getTestInventory, getTestTotals } from "@/lib/test-inventory";

const inv = getTestInventory(repoRoot);
// → [{ app: "backend (web + API)", framework: "vitest", rootDir: "tests",
//      testFiles: 66, testCases: 2478 }, ... 3 more apps]

const totals = getTestTotals(inv);
// → { files: 100, cases: 2700, apps: 4 }
```
Counts .test.ts and .test.tsx files; counts it( and test( declarations within them. NOT a replacement for actually running the suites — that's what the hourly cron + every CI deploy do (and they store results in HealthCheckResult). This static counter just shows shape so we notice if a worktree is broken or test files vanish.
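The counting rule can be sketched with two small helpers. This is not the code in `src/lib/test-inventory.ts`; `isTestFile` and `countTestCases` are hypothetical names, and the regex is one plausible way to count `it(` and `test(` declarations without also counting method calls such as `regex.test(...)`.

```typescript
// True for *.test.ts and *.test.tsx, the two extensions the inventory counts.
function isTestFile(fileName: string): boolean {
  return /\.test\.tsx?$/.test(fileName);
}

// Count `it(` / `test(` declarations in a file's source text.
// The leading group requires line start or a character that is neither a
// word character nor a dot, so `submit(` and `re.test(` are ignored.
// This is a static text scan: it also counts declarations inside comments.
function countTestCases(source: string): number {
  const declarations = source.match(/(^|[^.\w])(?:it|test)\s*\(/gm);
  return declarations ? declarations.length : 0;
}
```

Because it only reads text, a scan like this stays fast and cannot tell whether a test passes, which matches the "shape, not results" role described above.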
6. Pinning contract — why the dashboard stays current
tests/system-health-coverage.test.ts (25 cases) is the regression net:
- Every shipped major system MUST have a `checkXyz()` function declared AND a call site in the `Promise.allSettled` list.
- The status-name fallback array MUST contain entries for the 6 new checks (so a thrown check gets the right name in the response).
- Legacy `checkOrderSystem` MUST be gone.
- `test-inventory.ts` walks all 4 apps + counts are consistent + backend has ≥50 test files (sanity floor).
- API response includes `tests:` block + `features:` block.
- Dashboard page renders Test Coverage and Feature Health sections.
If a future PR adds a new system without adding a health check, this test fails. If someone deletes a check by accident, this test fails. The dashboard cannot drift silently again.
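One way a pinning assertion of this kind could work is a static scan of the route's source text. The helper below is a sketch under that assumption; `findUnwiredChecks` is a hypothetical name, and the actual test file may verify the contract differently.

```typescript
// Return the check functions that are either not declared in the route
// source or declared but never called. Names are assumed to be plain
// identifiers, so they are interpolated into the regex without escaping.
function findUnwiredChecks(routeSource: string, checkFns: string[]): string[] {
  return checkFns.filter((fn) => {
    const declared = new RegExp(`function\\s+${fn}\\s*\\(`).test(routeSource);
    // Every `fn(` occurrence counts; the declaration is one of them, so a
    // wired check needs at least two.
    const matches = routeSource.match(new RegExp(`\\b${fn}\\s*\\(`, "g"));
    const occurrences = matches ? matches.length : 0;
    return !declared || occurrences < 2;
  });
}
```

A regression test could read the route file, call this helper with the expected check names, and assert that the returned list is empty.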
7. Hourly cron + alerts
.github/workflows/health-check.yml runs at :17 past every hour:
- Hits `https://ka26.shop/api/health?key=<secret>` with 3 retries
- If status is `degraded` or `critical`, the route writes a `SellerNotification` for the admin (in-app bell icon) + logs to Cloud Run stderr (visible in Cloud Logging)
- Runs `tests/e2e-smoke.test.ts`, `tests/e2e-signup-smoke.test.ts`, `tests/e2e-critical-funnels-smoke.test.ts` against production
- Checks SSL cert expiry
- Emails on failure
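The retry step can be illustrated with a generic helper. The workflow's real retry logic is not shown in this document, so `withRetries`, its parameters, and the reading of "3 retries" as three attempts in total are all assumptions.

```typescript
// Call `attempt` up to `maxAttempts` times, waiting `delayMs` between
// failures. Resolves with the first success; rethrows the last error if
// every attempt fails.
async function withRetries<T>(
  attempt: () => Promise<T>,
  maxAttempts = 3,
  delayMs = 1000,
): Promise<T> {
  let lastError: unknown;
  for (let i = 1; i <= maxAttempts; i++) {
    try {
      return await attempt();
    } catch (err) {
      lastError = err;
      if (i < maxAttempts) {
        // Fixed pause between attempts; no backoff in this sketch.
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

A caller would pass a function that requests the health endpoint and throws on a network error, so transient failures are retried before the run is reported as failed.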
8. Common signals + interpretation
| Signal | What it means | What to do |
|---|---|---|
| `store_orders`: warn "store X has no products" | A live store with zero products = empty buyer experience | Reach out to seller; help them list at least one product |
| `push_tokens`: warn "Zero active for doctor" | Doctor app pushes silently fail | Check doctor login flow; verify FCM/APNs token registration |
| `bidding_system`: fail "Stuck offers" | Lazy-expire path is broken | Verify `POST /api/products/[id]/offers` still has the `updateMany` block at the top |
| `moderation_queue`: warn "tickets open >7d" | Admin isn't keeping up | Triage the queue at /admin/moderation |
| `broadcast_system`: fail "<30% delivered" | Push infra down OR token validity expired | Check Expo Push status; rotate tokens if needed |
| `admin_audit_log`: warn "0 actions in 30d (1 admin)" | Either admin isn't acting OR something bypasses the helper | Verify destructive flows actually go through `/api/admin/entities/[type]/[id]/actions` |
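To make the `broadcast_system` failure signal concrete, here is a sketch of the threshold as stated in the checks table (fail when under 30% delivered on more than 5 recipients). `classifyBroadcast` is a hypothetical name, and any `warn` conditions the real check may have are not modelled.

```typescript
// Classify the latest broadcast by delivery ratio.
// Small broadcasts (5 recipients or fewer) never fail, so one or two
// undelivered pushes cannot turn the whole system critical.
function classifyBroadcast(
  delivered: number,
  recipients: number,
): "pass" | "fail" {
  const MIN_RECIPIENTS = 5;
  const MIN_DELIVERY_RATIO = 0.3;
  if (
    recipients > MIN_RECIPIENTS &&
    delivered / recipients < MIN_DELIVERY_RATIO
  ) {
    return "fail";
  }
  return "pass";
}
```

For example, 2 of 10 delivered fails the check, while 0 of 5 delivered passes because the recipient count is at or below the floor.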
9. Related
- Admin Panel: feeds `AdminAction` writes that the audit log check verifies
- Bidding: `bidding_system` check catches lazy-expire regressions
- Operations / Cloudflare: DNS health is separate (browser → Google anycast → Cloud Run)
- CHANGELOG: 2026-04-27 System Health v2