
Monitoring & Observability

How we detect problems before users complain, where alerts go, and what to do when one fires.

For the full setup runbook see Operations → Monitoring. This page is the high-level "what's monitoring what" map.


The 4-layer monitoring stack

Defense in depth — each layer catches what the others miss.

| Layer | Tool | Probe | Catches | Alert latency |
|---|---|---|---|---|
| 1. Synthetic uptime | UptimeRobot | Every 5 min, external | Total site-down, DNS dead, SSL expired | ~2 min |
| 2. Hourly application smoke | GitHub Actions cron | Every hour at :17 | Page 500s, API contract drift, signup flow broken | ~1 hour |
| 3. Real-time error tracking | Sentry (web + mobile) | Every uncaught error in production | JS errors, mobile crashes, slow transactions | ~5 min |
| 4. Application health endpoint | /api/health?key=... | Self-check on demand or via cron | DB down, route integrity, missing critical data | Immediate when probed |

Layer 1 — UptimeRobot (synthetic uptime)

What it does

Pings our domains from outside our infrastructure every 5 minutes. Catches "the entire stack is down" — including outages where Sentry can't fire (because the app couldn't even start).

Monitors

| Friendly name | Type | URL | Interval |
|---|---|---|---|
| KA26 Production | HTTP(s) | https://ka26.shop/api/health | 5 min |
| KA26 Landing | HTTP(s) | https://ka-26.com | 5 min |
| KA26 SSL Cert (shop) | SSL/TLS | ka26.shop:443 | 1 day |
| KA26 SSL Cert (landing) | SSL/TLS | ka-26.com:443 | 1 day |

Alerts

  • Email to siddugkattimani@gmail.com
  • SMS (the free tier includes a limited number per month; this is the most important channel to keep enabled)
  • WhatsApp is paid tier — skip unless we upgrade

Status

  • 🟡 Account creation pending as of 2026-04-18 — see Monitoring setup Layer 1 section

Layer 2 — Hourly application smoke

What it does

Runs the full smoke test suite against production at 17 minutes past every hour. Catches "API responds but returns wrong data" or "a specific funnel broke after a deploy."

Workflow

  • File: .github/workflows/health-check.yml
  • Schedule: cron 17 * * * *
  • Steps:
    1. Hit GET https://ka26.shop/api/health?key=ka26-health-2026 (3 retries with 10s spacing)
    2. Run tests/e2e-smoke.test.ts (84 page-existence + status-code checks)
    3. Run tests/e2e-signup-smoke.test.ts (1 real consumer registration with throwaway email)
    4. Run tests/e2e-critical-funnels-smoke.test.ts (6 tests: login bad-creds, categories, stores, products, UPI config, register-then-login chain)
    5. SSL cert expiry check for ka26.shop, ka-26.com, docs.ka-26.com (warn at 30 days, fail at 7 days)
    6. On any failure → email alert via Gmail SMTP
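
For reference, step 1's retry loop behaves roughly like the sketch below. This is a minimal TypeScript illustration assuming a Node 18+ runner with global fetch; the constant names are ours, not the actual workflow script's.

```ts
// Illustrative only, not the actual workflow script.
const HEALTH_URL = "https://ka26.shop/api/health?key=ka26-health-2026";
const MAX_ATTEMPTS = 3;        // 3 retries per the workflow step
const RETRY_DELAY_MS = 10_000; // 10s spacing

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function checkHealth(): Promise<void> {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
      const res = await fetch(HEALTH_URL);
      if (res.ok) {
        console.log(`Health check passed on attempt ${attempt}`);
        return;
      }
      console.error(`Attempt ${attempt}: HTTP ${res.status}`);
    } catch (err) {
      console.error(`Attempt ${attempt}: request failed`, err);
    }
    if (attempt < MAX_ATTEMPTS) await sleep(RETRY_DELAY_MS);
  }
  process.exit(1); // non-zero exit fails the step and triggers the email alert
}

checkHealth();
```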

Alerts

  • Email to ALERT_TO GitHub secret (currently siddugkattimani@gmail.com)
  • Sender: noreply@ka-26.com via Gmail SMTP using SMTP_USER + SMTP_PASS GitHub secrets
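
A hedged sketch of that email step, assuming the workflow shells out to a Node script using nodemailer (the real step may use a prebuilt GitHub Action instead; subject and body wording here are illustrative):

```ts
import nodemailer from "nodemailer";

// Gmail SMTP transport, credentials injected from GitHub secrets
const transporter = nodemailer.createTransport({
  host: "smtp.gmail.com",
  port: 465,
  secure: true,
  auth: {
    user: process.env.SMTP_USER, // GitHub secret
    pass: process.env.SMTP_PASS, // GitHub secret (Gmail app password)
  },
});

await transporter.sendMail({
  from: "noreply@ka-26.com",
  to: process.env.ALERT_TO, // GitHub secret
  subject: "KA26 hourly health check FAILED",
  text: `See the failed run: ${process.env.GITHUB_SERVER_URL}/${process.env.GITHUB_REPOSITORY}/actions/runs/${process.env.GITHUB_RUN_ID}`,
});
```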

Why "hourly" not "every 5 min"?

  • Free GitHub Actions minutes are limited
  • Most regressions are deployment-time bugs, not transient outages (UptimeRobot covers transient at 5 min)
  • 1-hour detection latency was the explicit decision after the 6-day "[object Object]" outage — anything faster would burn cron minutes for marginal benefit

Status

  • ✅ Live and green as of 2026-04-18 deploy

Layer 3 — Sentry (error tracking)

What it does

Catches every uncaught exception, unhandled promise rejection, React error boundary crash, and slow transaction in production. Web + mobile have separate Sentry projects.

Projects

| Project | Platform | Receives events from | DSN location |
|---|---|---|---|
| ka26-marketplace | Next.js (web + API) | Browser JS errors + server-side errors | Cloud Run env: SENTRY_DSN + NEXT_PUBLIC_SENTRY_DSN + SENTRY_ENV=production |
| ka26-mobile | React Native | Mobile JS errors + native crashes | mobile/app.json → expo.extra.sentryDsn |

What you'll see in Sentry

  • Issues feed: every unique error type, grouped by stack trace
  • Performance: slow transactions, N+1 query patterns
  • Release health: crash-free user % (target 99.5%+)
  • Email alert: within ~5 min of a NEW error type appearing

PII scrubbing

  • Mobile init in mobile/app/_layout.tsx strips event.request?.data and event.request?.cookies so OTPs, tokens, and passwords NEVER reach Sentry
  • Web equivalent in sentry.client.config.ts
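
The web hook looks roughly like this. A sketch of the idea, not a copy of sentry.client.config.ts:

```ts
import * as Sentry from "@sentry/nextjs";

Sentry.init({
  dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
  environment: process.env.SENTRY_ENV,
  beforeSend(event) {
    // Drop request bodies and cookies so OTPs, tokens, and passwords
    // never leave the client.
    if (event.request) {
      delete event.request.data;
      delete event.request.cookies;
    }
    return event;
  },
});
```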

Alerts

  • Sentry default: email on new issue type
  • Configure in the Sentry dashboard under Alerts; Slack/PagerDuty can be added post-launch

Status

  • ✅ Live as of 2026-04-17. Both projects receiving production events.

Layer 4 — /api/health endpoint

What it does

A self-check endpoint that runs 7 internal probes when called. Returns:

  • 200 { status: "ok" } — everything healthy
  • 200 { status: "degraded" } — non-critical issues (warns)
  • 500 { status: "error" } — critical, page+restart

What it probes

  1. Database — SELECT 1 round-trip to Postgres
  2. Critical pages — fetches /, /shop, /reels, /requests, /profile
  3. Auth integrity — verifies admin user exists with correct ID
  4. Reel data integrity — 5 most recent reels have valid data
  5. Route integrity — product detail routes resolve correctly
  6. Order system — at least one active store + restaurant exists
  7. WhatsApp links — admin user's WhatsApp number is non-empty

Auth

Requires query param ?key=ka26-health-2026 (prevents random scanners from hitting it)
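
Putting the pieces together, the endpoint's shape is roughly the sketch below. Probe internals, helper names, and the HEALTH_KEY env var are illustrative; the key itself is the one shown above.

```ts
import { NextRequest, NextResponse } from "next/server";

type Probe = { name: string; critical: boolean; run: () => Promise<void> };

// Each probe throws on failure. Bodies are placeholders, not the real checks.
async function checkDatabase(): Promise<void> {
  /* SELECT 1 round-trip to Postgres */
}
async function checkCriticalPages(): Promise<void> {
  /* fetch /, /shop, /reels, /requests, /profile */
}

export async function GET(req: NextRequest) {
  // Reject callers that don't know the shared key
  if (req.nextUrl.searchParams.get("key") !== process.env.HEALTH_KEY) {
    return NextResponse.json({ error: "unauthorized" }, { status: 401 });
  }

  const probes: Probe[] = [
    { name: "database", critical: true, run: checkDatabase },
    { name: "critical-pages", critical: true, run: checkCriticalPages },
    // ...the remaining 5 probes
  ];

  const failures: { name: string; critical: boolean }[] = [];
  for (const probe of probes) {
    try {
      await probe.run();
    } catch {
      failures.push({ name: probe.name, critical: probe.critical });
    }
  }

  if (failures.some((f) => f.critical)) {
    return NextResponse.json({ status: "error", failures }, { status: 500 });
  }
  if (failures.length > 0) {
    return NextResponse.json({ status: "degraded", failures }, { status: 200 });
  }
  return NextResponse.json({ status: "ok" });
}
```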

Used by

  • UptimeRobot Layer 1 (every 5 min)
  • Hourly cron Layer 2 (every hour)
  • Manual testing: curl 'https://ka26.shop/api/health?key=ka26-health-2026' | python3 -m json.tool

Application logs

Where they live

  • Web/API logs: GCP Cloud Logging → filter by service ka26-marketplace
  • Direct query: gcloud run services logs read ka26-marketplace --region us-central1 --limit 50
  • GCP console: https://console.cloud.google.com/logs/query — filter by resource.type="cloud_run_revision" AND resource.labels.service_name="ka26-marketplace"

What to look for when debugging

  • [Push] Expo push failed — push delivery error
  • [Push] Expo error for token X — bad token (auto-deactivated if DeviceNotRegistered)
  • [Email] failed after 3 attempts — SMTP issue, check EmailLog table for the row
  • [Delivery] Auto-assign failed for store order N — delivery engine couldn't find a rider
  • [ReelEarnings] Confirm failed — earnings flow broke for an order
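
The [Push] lines above come from the push-delivery path. A hedged sketch of the DeviceNotRegistered handling, assuming expo-server-sdk; the deactivation helper is hypothetical:

```ts
import { Expo, ExpoPushMessage } from "expo-server-sdk";

const expo = new Expo();

async function sendPush(messages: ExpoPushMessage[]): Promise<void> {
  for (const chunk of expo.chunkPushNotifications(messages)) {
    try {
      // Tickets come back in the same order as the messages in the chunk
      const tickets = await expo.sendPushNotificationsAsync(chunk);
      tickets.forEach((ticket, i) => {
        if (ticket.status === "error") {
          console.error(`[Push] Expo error for token ${chunk[i].to}: ${ticket.message}`);
          if (ticket.details?.error === "DeviceNotRegistered") {
            // deactivateToken(chunk[i].to): hypothetical helper that
            // flips the token's row to inactive in the DB
          }
        }
      });
    } catch (err) {
      console.error("[Push] Expo push failed", err);
    }
  }
}
```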

Email audit log

Every outbound email is logged to the EmailLog Postgres table with to, subject, status, error, and attempt columns. Example query:

SELECT * FROM "EmailLog"
WHERE status = 'failed' AND "createdAt" > NOW() - INTERVAL '24 hours'
ORDER BY "createdAt" DESC;

What to do when an alert fires

"UptimeRobot says ka26.shop is down"

  1. Hit https://ka26.shop manually — does it load?
  2. If yes → false positive (network blip)
  3. If no → check Cloud Run console: https://console.cloud.google.com/run
  4. Look at the latest revision's logs for errors
  5. If a recent deploy looks bad → roll back:
    gcloud run revisions list --service ka26-marketplace --region us-central1 --project school-mgmt-saas
    gcloud run services update-traffic ka26-marketplace --region us-central1 --project school-mgmt-saas \
      --to-revisions ka26-marketplace-00XXX-yyy=100

"Hourly cron failed"

  1. Open the GitHub Actions run: https://github.com/sidgk/ka26-marketplace/actions/workflows/health-check.yml
  2. Click the failed run → see which step broke (smoke vs. signup vs. SSL)
  3. If smoke test failure → check Cloud Run logs for the exact endpoint that 500'd
  4. If SSL warning under 30 days → rotate the cert (Cloud Run-managed certs auto-renew, so this should never fire)
  5. If SSL fail under 7 days → URGENT, manually renew

"Sentry alert: new error type"

  1. Click into the issue in Sentry dashboard
  2. See: stack trace, breadcrumbs (recent user actions), browser/device info
  3. Reproduce locally if possible
  4. Fix → push → confirm fix via "Resolve in next release" in Sentry

"All 4 fire at once"

  • Production is in serious trouble
  • Roll back Cloud Run to the previous revision IMMEDIATELY
  • Then debug

"User reports something broken but no alerts fired"

  • Likely a UX bug rather than a server error (exactly the bug class of the [object Object] outage)
  • Check Sentry for client-side errors filtered to that user's session if possible
  • Add a contract test for that bug class so it's caught automatically next time (a sketch follows this list)
  • See Testing & QA for which test class to use

Monitoring philosophy

The 4 layers are defense in depth:

  • UptimeRobot catches what Sentry can't (full outage → no JS to error-report)
  • Sentry catches what UptimeRobot can't (200 OK page that's actually broken inside)
  • Hourly cron catches what both miss (specific user flow regressions, the [object Object] bug class)
  • /api/health self-check catches what's broken internally even when externally everything looks fine

If any single layer were perfect, we wouldn't need the others. Together they catch ~95% of issues before users see them.


Post-launch monitoring upgrades (deferred)

Worth adding once we have real users:

  • Real User Monitoring (RUM) — Sentry has this; would show real users' page-load times in Gadag
  • Funnel analytics — once payments are live, track signup → first order conversion
  • Database query monitoring — Cloud SQL has slow query logging; turn on if we hit performance issues
  • Synthetic transaction monitoring — UptimeRobot's transaction monitor (paid tier) — actually clicks through signup, browse, checkout end-to-end every X minutes
  • Status page — host at status.ka-26.com (UptimeRobot free tier provides this)