Monitoring & Observability
How we detect problems before users complain, where alerts go, and what to do when one fires.
For the full setup runbook see Operations → Monitoring. This page is the high-level "what's monitoring what" map.
The 4-layer monitoring stack
Defense in depth — each layer catches what the others miss.
| Layer | Tool | Probe | Catches | Alert latency |
|---|---|---|---|---|
| 1. Synthetic uptime | UptimeRobot | Every 5 min, external | Total site-down, DNS dead, SSL expired | ~2 min |
| 2. Hourly application smoke | GitHub Actions cron | Every hour at :17 | Page 500s, API contract drift, signup flow broken | ~1 hour |
| 3. Real-time error tracking | Sentry (web + mobile) | Every uncaught error in production | JS errors, mobile crashes, slow transactions | ~5 min |
| 4. Application health endpoint | /api/health?key=... | Self-check on demand or via cron | DB down, route integrity, missing critical data | immediate when probed |
Layer 1 — UptimeRobot (synthetic uptime)
What it does
Pings our domains from outside our infrastructure every 5 minutes. Catches "the entire stack is down" — including outages where Sentry can't fire (because the app couldn't even start).
Monitors
| Friendly name | Type | URL | Interval |
|---|---|---|---|
| KA26 Production | HTTP(s) | https://ka26.shop/api/health | 5 min |
| KA26 Landing | HTTP(s) | https://ka-26.com | 5 min |
| KA26 SSL Cert (shop) | SSL/TLS | ka26.shop:443 | 1 day |
| KA26 SSL Cert (landing) | SSL/TLS | ka-26.com:443 | 1 day |
Alerts
- Email to siddugkattimani@gmail.com
- SMS (the free tier allows only a limited number per month; this is the most important channel to keep enabled)
- WhatsApp is paid tier — skip unless we upgrade
Status
- 🟡 Account creation pending as of 2026-04-18 — see the Layer 1 section of the Operations → Monitoring setup runbook
Layer 2 — Hourly application smoke
What it does
Runs the full smoke test suite against production every hour at :17 past the hour. Catches "API responds but returns wrong data" or "specific funnel broke after a deploy."
Workflow
- File: `.github/workflows/health-check.yml`
- Schedule: cron `17 * * * *`
- Steps:
  - Hit `GET https://ka26.shop/api/health?key=ka26-health-2026` (3 retries with 10s spacing)
  - Run `tests/e2e-smoke.test.ts` (84 page-existence + status-code checks; a sketch follows this list)
  - Run `tests/e2e-signup-smoke.test.ts` (1 real consumer registration with throwaway email)
  - Run `tests/e2e-critical-funnels-smoke.test.ts` (6 tests: login bad-creds, categories, stores, products, UPI config, register-then-login chain)
  - SSL cert expiry check for ka26.shop, ka-26.com, docs.ka-26.com (warn at 30 days, fail at 7 days)
- On any failure → email alert via Gmail SMTP
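For orientation, here is a minimal sketch of what one of the page-existence checks looks like in spirit. It assumes a Vitest-style runner and a hypothetical `SMOKE_BASE_URL` variable; the real assertions live in `tests/e2e-smoke.test.ts` and may differ.

```ts
// Illustrative only: the production checks live in tests/e2e-smoke.test.ts.
import { describe, it, expect } from "vitest";

const BASE_URL = process.env.SMOKE_BASE_URL ?? "https://ka26.shop"; // hypothetical env var
const PAGES = ["/", "/shop", "/reels", "/requests", "/profile"];

describe("page existence", () => {
  for (const path of PAGES) {
    it(`GET ${path} returns a non-5xx status`, async () => {
      const res = await fetch(`${BASE_URL}${path}`, { redirect: "follow" });
      // A 500 here is exactly the failure class the hourly cron is meant to surface.
      expect(res.status).toBeLessThan(500);
    });
  }
});
```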
Alerts
- Email to the address in the `ALERT_TO` GitHub secret (currently siddugkattimani@gmail.com)
- Sender: `noreply@ka-26.com` via Gmail SMTP using the `SMTP_USER` + `SMTP_PASS` GitHub secrets
Why "hourly" not "every 5 min"?
- Free GitHub Actions minutes are limited
- Most regressions are deployment-time bugs, not transient outages (UptimeRobot covers transient at 5 min)
- 1-hour detection latency was the explicit decision after the 6-day "[object Object]" outage — anything faster would burn cron minutes for marginal benefit
Status
- ✅ Live and green as of 2026-04-18 deploy
Layer 3 — Sentry (error tracking)
What it does
Catches every uncaught exception, unhandled promise rejection, React error boundary crash, and slow transaction in production. Web + mobile have separate Sentry projects.
Projects
| Project | Platform | Receives events from | DSN location |
|---|---|---|---|
| ka26-marketplace | Next.js (web + API) | Browser JS errors + server-side errors | Cloud Run env: SENTRY_DSN + NEXT_PUBLIC_SENTRY_DSN + SENTRY_ENV=production |
| ka26-mobile | React Native | Mobile JS errors + native crashes | mobile/app.json → expo.extra.sentryDsn |
What you'll see in Sentry
- Issues feed: every unique error type, grouped by stack trace
- Performance: slow transactions, N+1 query patterns
- Release health: crash-free user % (target 99.5%+)
- Email alert: within ~5 min of a NEW error type appearing
PII scrubbing
- Mobile init in `mobile/app/_layout.tsx` strips `event.request?.data` and `event.request?.cookies` so OTPs, tokens, and passwords NEVER reach Sentry
- Web equivalent in `sentry.client.config.ts`
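A minimal sketch of the scrubbing pattern, assuming the standard `@sentry/react-native` `beforeSend` hook; the actual init code in `mobile/app/_layout.tsx` and `sentry.client.config.ts` may differ in detail:

```ts
import * as Sentry from "@sentry/react-native";

// Placeholder: in the app the DSN comes from expo.extra.sentryDsn (see the table above).
const SENTRY_DSN = "<dsn>";

Sentry.init({
  dsn: SENTRY_DSN,
  beforeSend(event) {
    // Drop request bodies and cookies so OTPs, tokens, and passwords never reach Sentry.
    if (event.request) {
      delete event.request.data;
      delete event.request.cookies;
    }
    return event;
  },
});
```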
Alerts
- Sentry default: email on new issue type
- Configure in Sentry dashboard → Alerts → can add Slack/PagerDuty post-launch
Status
- ✅ Live as of 2026-04-17. Both projects receiving production events.
Layer 4 — /api/health endpoint
What it does
A self-check endpoint that runs 7 internal probes when called. Returns:
- `200 { status: "ok" }` — everything healthy
- `200 { status: "degraded" }` — non-critical issues (warns)
- `500 { status: "error" }` — critical, page + restart
What it probes
- Database — `SELECT 1` round-trip to Postgres
- Critical pages — fetches `/`, `/shop`, `/reels`, `/requests`, `/profile`
- Auth integrity — verifies admin user exists with correct ID
- Reel data integrity — 5 most recent reels have valid data
- Route integrity — product detail routes resolve correctly
- Order system — at least one active store + restaurant exists
- WhatsApp links — admin user's WhatsApp number is non-empty
Auth
Requires query param ?key=ka26-health-2026 (prevents random scanners from hitting it)
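To make the shape concrete, here is a rough sketch of how such an endpoint can be structured. It assumes a Next.js App Router route handler, a Prisma client at a hypothetical `@/lib/prisma` path, and a hypothetical `HEALTH_CHECK_KEY` env var; the real route runs all 7 probes and may differ.

```ts
// Sketch of app/api/health/route.ts: illustrative structure, not the actual implementation.
import { NextResponse } from "next/server";
import { prisma } from "@/lib/prisma"; // hypothetical import path

export async function GET(req: Request) {
  // Reject callers that don't present the shared key, so random scanners get nothing useful.
  const key = new URL(req.url).searchParams.get("key");
  if (key !== process.env.HEALTH_CHECK_KEY) {
    return NextResponse.json({ status: "error", reason: "unauthorized" }, { status: 401 });
  }

  const errors: string[] = [];   // critical failures -> 500 "error"
  const warnings: string[] = []; // non-critical issues -> 200 "degraded"

  // Probe 1: database round-trip.
  try {
    await prisma.$queryRaw`SELECT 1`;
  } catch {
    errors.push("database unreachable");
  }

  // ...the remaining probes (critical pages, auth, reels, routes, orders, WhatsApp)
  // push into errors or warnings in the same way.

  if (errors.length > 0) {
    return NextResponse.json({ status: "error", errors }, { status: 500 });
  }
  if (warnings.length > 0) {
    return NextResponse.json({ status: "degraded", warnings });
  }
  return NextResponse.json({ status: "ok" });
}
```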
Used by
- UptimeRobot Layer 1 (every 5 min)
- Hourly cron Layer 2 (every hour)
- Manual testing: `curl 'https://ka26.shop/api/health?key=ka26-health-2026' | python3 -m json.tool`
Application logs
Where they live
- Web/API logs: GCP Cloud Logging → filter by service `ka26-marketplace`
- Direct query: `gcloud run services logs read ka26-marketplace --region us-central1 --limit 50`
- GCP console: https://console.cloud.google.com/logs/query — filter by `resource.type="cloud_run_revision"` AND `resource.labels.service_name="ka26-marketplace"`
What to look for when debugging
- `[Push] Expo push failed` — push delivery error
- `[Push] Expo error for token X` — bad token (auto-deactivated if `DeviceNotRegistered`)
- `[Email] failed after 3 attempts` — SMTP issue, check `EmailLog` table for the row
- `[Delivery] Auto-assign failed for store order N` — delivery engine couldn't find a rider
- `[ReelEarnings] Confirm failed` — earnings flow broke for an order
Email audit log
Every outbound email is logged to the `EmailLog` Postgres table with `to`, `subject`, `status`, `error`, and `attempt`. Query:
SELECT * FROM "EmailLog"
WHERE status = 'failed' AND "createdAt" > NOW() - INTERVAL '24 hours'
ORDER BY "createdAt" DESC;
What to do when an alert fires
"UptimeRobot says ka26.shop is down"
- Hit https://ka26.shop manually — does it load?
- If yes → false positive (network blip)
- If no → check Cloud Run console: https://console.cloud.google.com/run
- Look at the latest revision's logs for errors
- If a recent deploy looks bad → roll back:
  - `gcloud run revisions list --service ka26-marketplace --region us-central1 --project school-mgmt-saas`
  - `gcloud run services update-traffic ka26-marketplace --region us-central1 --project school-mgmt-saas --to-revisions ka26-marketplace-00XXX-yyy=100`
"Hourly cron failed"
- Open the GitHub Actions run: https://github.com/sidgk/ka26-marketplace/actions/workflows/health-check.yml
- Click the failed run → see which step broke (smoke vs. signup vs. SSL)
- If smoke test failure → check Cloud Run logs for the exact endpoint that 500'd
- If SSL warning under 30 days → rotate the cert (Cloud Run-managed certs auto-renew, so this should never fire)
- If SSL fail under 7 days → URGENT, manually renew
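If you need to verify expiry by hand, a quick Node/TypeScript check is sketched below. This is a convenience snippet, not the workflow's actual implementation; it assumes Node 18+ and relies on `new Date()` parsing the certificate's `valid_to` string.

```ts
// Manual SSL expiry check (sketch). Run with ts-node or compile first.
import tls from "node:tls";

function checkCertExpiry(host: string): void {
  const socket = tls.connect({ host, port: 443, servername: host }, () => {
    const cert = socket.getPeerCertificate();
    const msLeft = new Date(cert.valid_to).getTime() - Date.now();
    const daysLeft = Math.floor(msLeft / 86_400_000);
    console.log(`${host}: certificate expires in ${daysLeft} days (${cert.valid_to})`);
    socket.end();
  });
  socket.on("error", (err) => console.error(`${host}: TLS check failed: ${err.message}`));
}

for (const host of ["ka26.shop", "ka-26.com", "docs.ka-26.com"]) {
  checkCertExpiry(host);
}
```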
"Sentry alert: new error type"
- Click into the issue in Sentry dashboard
- See: stack trace, breadcrumbs (recent user actions), browser/device info
- Reproduce locally if possible
- Fix → push → confirm fix via "Resolve in next release" in Sentry
"All 4 fire at once"
- Production is in serious trouble
- Roll back Cloud Run to the previous revision IMMEDIATELY
- Then debug
"User reports something broken but no alerts fired"
- Likely a UX bug, not a server error (exactly the bug class the "[object Object]" outage was)
- Check Sentry for client-side errors filtered to that user's session if possible
- Add a contract test for that bug class so next time it would be caught
- See Testing & QA for which test class to use
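A contract test (mentioned in the list above) asserts the shape of an API response rather than just its status code, which is what catches a 200 response carrying the wrong payload. A minimal sketch, assuming a Vitest-style runner; the endpoint path and field names here are illustrative, not the actual suite:

```ts
// Illustrative contract test: the endpoint path and expected fields are assumptions.
import { describe, it, expect } from "vitest";

const BASE_URL = process.env.SMOKE_BASE_URL ?? "https://ka26.shop"; // hypothetical env var

describe("categories API contract", () => {
  it("returns an array of categories with string names", async () => {
    const res = await fetch(`${BASE_URL}/api/categories`);
    expect(res.status).toBe(200);

    const body = await res.json();
    // Pin the shape, not just the status code: a 200 with the wrong payload
    // (e.g. an error object the UI would render as "[object Object]") should fail here.
    expect(Array.isArray(body)).toBe(true);
    for (const category of body) {
      expect(typeof category.name).toBe("string");
    }
  });
});
```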
Monitoring philosophy
The 4 layers are defense in depth:
- UptimeRobot catches what Sentry can't (full outage → no JS to error-report)
- Sentry catches what UptimeRobot can't (200 OK page that's actually broken inside)
- Hourly cron catches what both miss (specific user flow regressions, the [object Object] bug class)
- `/api/health` self-check catches what's broken internally even when externally everything looks fine
If any single layer were perfect, we wouldn't need the others. Together they catch ~95% of issues before users see them.
Post-launch monitoring upgrades (deferred)
Worth adding once we have real users:
- Real User Monitoring (RUM) — Sentry has this; would show real users' page-load times in Gadag
- Funnel analytics — once payments are live, track signup → first order conversion
- Database query monitoring — Cloud SQL has slow query logging; turn on if we hit performance issues
- Synthetic transaction monitoring — UptimeRobot's transaction monitor (paid tier) — actually clicks through signup, browse, checkout end-to-end every X minutes
- Status page — host at status.ka-26.com (UptimeRobot free tier provides this)
Related docs
- Operations → Monitoring — full setup runbook (UptimeRobot account creation, Sentry project setup, etc.)
- Operations → Incident playbook — what to do when things break
- Testing & QA — test classes, including the smoke tests that feed Layer 2
- Email infrastructure — Email log + alerting plumbing