
Monitoring & Observability

How we detect problems before users complain, where alerts go, and what to do when one fires.

For the full setup runbook see Operations → Monitoring. This page is the high-level "what's monitoring what" map.


The 4-layer monitoring stack

Defense in depth — each layer catches what the others miss.

| Layer | Tool | Probe | Catches | Alert latency |
|---|---|---|---|---|
| 1. Synthetic uptime | UptimeRobot | Every 5 min, external | Total site-down, DNS dead, SSL expired | ~2 min |
| 2. Hourly application smoke | GitHub Actions cron | Every hour at :17 | Page 500s, API contract drift, signup flow broken | ~1 hour |
| 3. Real-time error tracking | Sentry (web + mobile) | Every uncaught error in production | JS errors, mobile crashes, slow transactions | ~5 min |
| 4. Application health endpoint | /api/health?key=... | Self-check on demand or via cron | DB down, route integrity, missing critical data | Immediate when probed |

Layer 1 — UptimeRobot (synthetic uptime)

What it does

Pings our domains from outside our infrastructure every 5 minutes. Catches "the entire stack is down" — including outages where Sentry can't fire (because the app couldn't even start).

Monitors

| Friendly name | Type | URL | Interval |
|---|---|---|---|
| KA26 Production | HTTP(s) | https://ka26.shop/api/health | 5 min |
| KA26 Landing | HTTP(s) | https://ka-26.com | 5 min |
| KA26 SSL Cert (shop) | SSL/TLS | ka26.shop:443 | 1 day |
| KA26 SSL Cert (landing) | SSL/TLS | ka-26.com:443 | 1 day |

Alerts

  • Email to siddugkattimani@gmail.com
  • SMS (the free tier includes a limited number per month; this is the most important channel to keep enabled)
  • WhatsApp is paid tier — skip unless we upgrade

Status

  • 🟡 Account creation pending as of 2026-04-18 — see Monitoring setup Layer 1 section

Layer 2 — Hourly application smoke

What it does

Runs the full smoke test suite against production at 17 minutes past every hour. Catches "API responds but returns wrong data" or "a specific funnel broke after a deploy."

Workflow

  • File: .github/workflows/health-check.yml
  • Schedule: cron 17 * * * *
  • Steps:
    1. Hit GET https://ka26.shop/api/health?key=ka26-health-2026 (3 retries with 10s spacing)
    2. Run tests/e2e-smoke.test.ts (84 page-existence + status-code checks)
    3. Run tests/e2e-signup-smoke.test.ts (1 real consumer registration with throwaway email)
    4. Run tests/e2e-critical-funnels-smoke.test.ts (6 tests: login bad-creds, categories, stores, products, UPI config, register-then-login chain)
    5. SSL cert expiry check for ka26.shop, ka-26.com, docs.ka-26.com (warn at 30 days, fail at 7 days)
    6. On any failure → email alert via Gmail SMTP
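
For reference, step 1's retry loop behaves roughly like the sketch below. This is a minimal TypeScript illustration assuming a Node 18+ runner with global fetch; the constant names are ours, not the actual workflow script's.

```ts
// Illustrative only, not the actual workflow script.
const HEALTH_URL = "https://ka26.shop/api/health?key=ka26-health-2026";
const MAX_ATTEMPTS = 3;        // 3 retries per the workflow step
const RETRY_DELAY_MS = 10_000; // 10s spacing

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function checkHealth(): Promise<void> {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
      const res = await fetch(HEALTH_URL);
      if (res.ok) {
        console.log(`Health check passed on attempt ${attempt}`);
        return;
      }
      console.error(`Attempt ${attempt}: HTTP ${res.status}`);
    } catch (err) {
      console.error(`Attempt ${attempt}: request failed`, err);
    }
    if (attempt < MAX_ATTEMPTS) await sleep(RETRY_DELAY_MS);
  }
  process.exit(1); // non-zero exit fails the step and triggers the email alert
}

checkHealth();
```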

Alerts

  • Email to ALERT_TO GitHub secret (currently siddugkattimani@gmail.com)
  • Sender: noreply@ka-26.com via Gmail SMTP using SMTP_USER + SMTP_PASS GitHub secrets
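
A hedged sketch of that email step, assuming the workflow shells out to a Node script using nodemailer (the real step may use a prebuilt GitHub Action instead; subject and body wording here are illustrative):

```ts
import nodemailer from "nodemailer";

// Gmail SMTP transport, credentials injected from GitHub secrets
const transporter = nodemailer.createTransport({
  host: "smtp.gmail.com",
  port: 465,
  secure: true,
  auth: {
    user: process.env.SMTP_USER, // GitHub secret
    pass: process.env.SMTP_PASS, // GitHub secret (Gmail app password)
  },
});

await transporter.sendMail({
  from: "noreply@ka-26.com",
  to: process.env.ALERT_TO, // GitHub secret
  subject: "KA26 hourly health check FAILED",
  text: `See the failed run: ${process.env.GITHUB_SERVER_URL}/${process.env.GITHUB_REPOSITORY}/actions/runs/${process.env.GITHUB_RUN_ID}`,
});
```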

Why "hourly" not "every 5 min"?

  • Free GitHub Actions minutes are limited
  • Most regressions are deployment-time bugs, not transient outages (UptimeRobot covers transient at 5 min)
  • 1-hour detection latency was the explicit decision after the 6-day "[object Object]" outage — anything faster would burn cron minutes for marginal benefit

Status

  • ✅ Live and green as of 2026-04-18 deploy

Layer 3 — Sentry (error tracking)

What it does

Catches every uncaught exception, unhandled promise rejection, React error boundary crash, and slow transaction in production. Web + mobile have separate Sentry projects.

Projects

| Project | Platform | Receives events from | DSN location |
|---|---|---|---|
| ka26-marketplace | Next.js (web + API) | Browser JS errors + server-side errors | Cloud Run env: SENTRY_DSN + NEXT_PUBLIC_SENTRY_DSN + SENTRY_ENV=production |
| ka26-mobile | React Native | Mobile JS errors + native crashes | mobile/app.json → expo.extra.sentryDsn |

What you'll see in Sentry

  • Issues feed: every unique error type, grouped by stack trace
  • Performance: slow transactions, N+1 query patterns
  • Release health: crash-free user % (target 99.5%+)
  • Email alert: within ~5 min of a NEW error type appearing

PII scrubbing

  • Mobile init in mobile/app/_layout.tsx strips event.request?.data and event.request?.cookies so OTPs, tokens, and passwords NEVER reach Sentry
  • Web equivalent in sentry.client.config.ts
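
The web hook looks roughly like this. A sketch of the idea, not a copy of sentry.client.config.ts:

```ts
import * as Sentry from "@sentry/nextjs";

Sentry.init({
  dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
  environment: process.env.SENTRY_ENV,
  beforeSend(event) {
    // Drop request bodies and cookies so OTPs, tokens, and passwords
    // never leave the client.
    if (event.request) {
      delete event.request.data;
      delete event.request.cookies;
    }
    return event;
  },
});
```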

Alerts

  • Sentry default: email on new issue type
  • Configure in the Sentry dashboard under Alerts; Slack/PagerDuty can be added post-launch

Status

  • ✅ Live as of 2026-04-17. Both projects receiving production events.

Layer 4 — /api/health endpoint

What it does

A self-check endpoint that runs 7 internal probes when called. Returns:

  • 200 { status: "ok" } — everything healthy
  • 200 { status: "degraded" } — non-critical issues (warns)
  • 500 { status: "error" } — critical, page+restart

What it probes

  1. Database — SELECT 1 round-trip to Postgres
  2. Critical pages — fetches /, /shop, /reels, /requests, /profile
  3. Auth integrity — verifies admin user exists with correct ID
  4. Reel data integrity — 5 most recent reels have valid data
  5. Route integrity — product detail routes resolve correctly
  6. Order system — at least one active store + restaurant exists
  7. WhatsApp links — admin user's WhatsApp number is non-empty

Auth

Requires query param ?key=ka26-health-2026 (prevents random scanners from hitting it)
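
Putting the pieces together, the endpoint's shape is roughly the sketch below. Probe internals, helper names, and the HEALTH_KEY env var are illustrative; the key itself is the one shown above.

```ts
import { NextRequest, NextResponse } from "next/server";

type Probe = { name: string; critical: boolean; run: () => Promise<void> };

// Each probe throws on failure. Bodies are placeholders, not the real checks.
async function checkDatabase(): Promise<void> {
  /* SELECT 1 round-trip to Postgres */
}
async function checkCriticalPages(): Promise<void> {
  /* fetch /, /shop, /reels, /requests, /profile */
}

export async function GET(req: NextRequest) {
  // Reject callers that don't know the shared key
  if (req.nextUrl.searchParams.get("key") !== process.env.HEALTH_KEY) {
    return NextResponse.json({ error: "unauthorized" }, { status: 401 });
  }

  const probes: Probe[] = [
    { name: "database", critical: true, run: checkDatabase },
    { name: "critical-pages", critical: true, run: checkCriticalPages },
    // ...the remaining 5 probes
  ];

  const failures: { name: string; critical: boolean }[] = [];
  for (const probe of probes) {
    try {
      await probe.run();
    } catch {
      failures.push({ name: probe.name, critical: probe.critical });
    }
  }

  if (failures.some((f) => f.critical)) {
    return NextResponse.json({ status: "error", failures }, { status: 500 });
  }
  if (failures.length > 0) {
    return NextResponse.json({ status: "degraded", failures }, { status: 200 });
  }
  return NextResponse.json({ status: "ok" });
}
```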

Used by

  • UptimeRobot Layer 1 (every 5 min)
  • Hourly cron Layer 2 (every hour)
  • Manual testing: curl 'https://ka26.shop/api/health?key=ka26-health-2026' | python3 -m json.tool

Application logs

Where they live

  • Web/API logs: GCP Cloud Logging → filter by service ka26-marketplace
  • Direct query: gcloud run services logs read ka26-marketplace --region us-central1 --limit 50
  • GCP console: https://console.cloud.google.com/logs/query — filter by resource.type="cloud_run_revision" AND resource.labels.service_name="ka26-marketplace"

What to look for when debugging

  • [Push] Expo push failed — push delivery error
  • [Push] Expo error for token X — bad token (auto-deactivated if DeviceNotRegistered)
  • [Email] failed after 3 attempts — SMTP issue, check EmailLog table for the row
  • [Delivery] Auto-assign failed for store order N — delivery engine couldn't find a rider
  • [ReelEarnings] Confirm failed — earnings flow broke for an order
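
The [Push] lines above come from the push-delivery path. A hedged sketch of the DeviceNotRegistered handling, assuming expo-server-sdk; the deactivation helper is hypothetical:

```ts
import { Expo, ExpoPushMessage } from "expo-server-sdk";

const expo = new Expo();

async function sendPush(messages: ExpoPushMessage[]): Promise<void> {
  for (const chunk of expo.chunkPushNotifications(messages)) {
    try {
      // Tickets come back in the same order as the messages in the chunk
      const tickets = await expo.sendPushNotificationsAsync(chunk);
      tickets.forEach((ticket, i) => {
        if (ticket.status === "error") {
          console.error(`[Push] Expo error for token ${chunk[i].to}: ${ticket.message}`);
          if (ticket.details?.error === "DeviceNotRegistered") {
            // deactivateToken(chunk[i].to): hypothetical helper that
            // flips the token's row to inactive in the DB
          }
        }
      });
    } catch (err) {
      console.error("[Push] Expo push failed", err);
    }
  }
}
```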

Email audit log

Every outbound email is logged to the EmailLog Postgres table with to, subject, status, error, and attempt columns. Example query:

SELECT * FROM "EmailLog"
WHERE status = 'failed' AND "createdAt" > NOW() - INTERVAL '24 hours'
ORDER BY "createdAt" DESC;

What to do when an alert fires

"UptimeRobot says ka26.shop is down"

  1. Hit https://ka26.shop manually — does it load?
  2. If yes → false positive (network blip)
  3. If no → check Cloud Run console: https://console.cloud.google.com/run
  4. Look at the latest revision's logs for errors
  5. If a recent deploy looks bad → roll back:
    gcloud run revisions list --service ka26-marketplace --region us-central1 --project school-mgmt-saas
    gcloud run services update-traffic ka26-marketplace --region us-central1 --project school-mgmt-saas \
      --to-revisions ka26-marketplace-00XXX-yyy=100

"Hourly cron failed"

  1. Open the GitHub Actions run: https://github.com/sidgk/ka26-marketplace/actions/workflows/health-check.yml
  2. Click the failed run → see which step broke (smoke vs. signup vs. SSL)
  3. If smoke test failure → check Cloud Run logs for the exact endpoint that 500'd
  4. If SSL warning under 30 days → rotate the cert (Cloud Run-managed certs auto-renew, so this should never fire)
  5. If SSL fail under 7 days → URGENT, manually renew

"Sentry alert: new error type"

  1. Click into the issue in Sentry dashboard
  2. See: stack trace, breadcrumbs (recent user actions), browser/device info
  3. Reproduce locally if possible
  4. Fix → push → confirm fix via "Resolve in next release" in Sentry

"All 4 fire at once"

  • Production is in serious trouble
  • Roll back Cloud Run to the previous revision IMMEDIATELY
  • Then debug

"User reports something broken but no alerts fired"

  • Likely a UX bug rather than a server error (exactly the bug class of the [object Object] outage)
  • Check Sentry for client-side errors filtered to that user's session if possible
  • Add a contract test for that bug class so it's caught automatically next time (a sketch follows this list)
  • See Testing & QA for which test class to use

Monitoring philosophy

The 4 layers are defense in depth:

  • UptimeRobot catches what Sentry can't (full outage → no JS to error-report)
  • Sentry catches what UptimeRobot can't (200 OK page that's actually broken inside)
  • Hourly cron catches what both miss (specific user flow regressions, the [object Object] bug class)
  • /api/health self-check catches what's broken internally even when externally everything looks fine

If any single layer were perfect, we wouldn't need the others. Together they catch ~95% of issues before users see them.


Post-launch monitoring upgrades (deferred)

Worth adding once we have real users:

  • Real User Monitoring (RUM) — Sentry has this; would show real users' page-load times in Gadag
  • Funnel analytics — once payments are live, track signup → first order conversion
  • Database query monitoring — Cloud SQL has slow query logging; turn on if we hit performance issues
  • Synthetic transaction monitoring — UptimeRobot's transaction monitor (paid tier) — actually clicks through signup, browse, checkout end-to-end every X minutes
  • Status page — host at status.ka-26.com (UptimeRobot free tier provides this)