# Incident Playbook

When something is broken in production. Read this BEFORE the incident, not during.
## Severity ladder
| Level | Symptom | Response time | Who decides |
|---|---|---|---|
| P0 — Outage | Site fully down (UptimeRobot fires SMS + email), payments broken, data loss | < 5 min | Anyone seeing it |
| P1 — Degradation | Major feature broken (login, checkout, push notifications) but site loads | < 30 min | Whoever notices first |
| P2 — Bug | Specific feature broken for a subset of users | Same business day | Triage in standup |
| P3 — Polish | Minor UI issue, nothing blocking | Next sprint | Triage in standup |
## P0 / P1 — Live response

### Step 1: Acknowledge (60 seconds)
- See the alert (UptimeRobot SMS, Sentry, hourly cron failure email)
- Reply to the alert thread / WhatsApp: "I'm on it"
- Open the Sentry issues feed (once DSN set)
### Step 2: Assess (3 minutes)

- Open `/api/health?key=ka26-health-2026` in a browser → see which of the 7 checks failed (terminal version sketched after this list)
- Check Cloud Run logs: `gcloud logging read --project=school-mgmt-saas --freshness=10m 'resource.labels.service_name="ka26-marketplace"' --limit=50`
- Check recent deploys: `gcloud run revisions list --service ka26-marketplace --region us-central1 --limit 5`
- Check recent commits: `gh run list -R sidgk/ka26-marketplace --limit 5`
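If a browser isn't handy, the same health check works from a terminal. A minimal sketch, assuming the endpoint returns JSON (drop the `jq` pipe if it isn't installed):

```bash
# Hit the health endpoint and pretty-print whichever of the 7 checks failed.
# Assumes a JSON response; drop the jq pipe if jq isn't installed.
curl -s "https://ka26.shop/api/health?key=ka26-health-2026" | jq .
```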
### Step 3: Decide
| Symptom | Action |
|---|---|
| Last deploy was minutes ago + things broke | Roll back (see below) |
| All revisions are sick | Database / external dep issue — check Cloud SQL console |
| Single feature broken | Patch + deploy — don't roll back |
| Database connection refused | Cloud SQL instance might be down — check console |
| 503 from Cloud Run | Container failed to start — check logs |
| 500 from API routes | Application error — check Sentry once DSN set |
### Step 4: Mitigate
Rollback to a known-good Cloud Run revision:
```bash
# List recent revisions
gcloud run revisions list --service ka26-marketplace --region us-central1 --limit 10

# Roll all traffic to a known-good one (e.g. ka26-marketplace-00229-abc)
gcloud run services update-traffic ka26-marketplace \
  --region us-central1 --project school-mgmt-saas \
  --to-revisions ka26-marketplace-00229-abc=100

# Verify
curl -s https://ka26.shop/api/health
```
Takes ~30s. The next request hits the rolled-back code.
Disable a feature via flag (faster than rollback for one-off bad features):
```bash
gcloud run services update ka26-marketplace --region us-central1 \
  --update-env-vars MY_FEATURE_ENABLED=false
```
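Flipping the flag creates a new revision, like any env-var change. Once the fix ships, re-enable or drop the flag; a sketch using the same placeholder flag name:

```bash
# Re-enable the feature once the fix is deployed...
gcloud run services update ka26-marketplace --region us-central1 \
  --update-env-vars MY_FEATURE_ENABLED=true

# ...or remove the flag from the service entirely
gcloud run services update ka26-marketplace --region us-central1 \
  --remove-env-vars MY_FEATURE_ENABLED
```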
Scale up Cloud Run (rare — for traffic spikes):
```bash
gcloud run services update ka26-marketplace --region us-central1 \
  --max-instances=20
```
### Step 5: Communicate
- Post in WhatsApp: "Issue identified: [X]. ETA to fix: [Y]"
- If it's a paying-customer-affecting outage, draft a status page note
- Update the incident thread every 15 min until resolved
### Step 6: Post-incident (within 48h)
Write a post-mortem:
- What broke (1 paragraph)
- Why it broke (root cause — not "human error", but the system that allowed the human error)
- How long it lasted (timeline)
- What we'll change (concrete action items with owners + deadlines)
Add the post-mortem as a CHANGELOG entry under `## [DATE] Incident: <short title>`.
## Common scenarios

### "Site is slow"

- Check Cloud SQL for slow queries (`SELECT query, mean_time, calls FROM pg_stat_statements ORDER BY mean_time DESC LIMIT 20`) (connection sketch below)
- Look for missing indexes — `EXPLAIN ANALYZE` on the slow query
- Check Cloud Run CPU + memory in the GCP console → if pegged, scale up `--cpu` or `--memory`
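One way to get a psql session for the queries above; a sketch, where `INSTANCE_NAME` is a placeholder filled in from the list command:

```bash
# Find the Cloud SQL instance, then open a psql session to run the
# pg_stat_statements / EXPLAIN ANALYZE queries above. INSTANCE_NAME is a placeholder.
gcloud sql instances list --project school-mgmt-saas
gcloud sql connect INSTANCE_NAME --user=postgres --project school-mgmt-saas
```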
"Login is broken"
- Check
/api/auth/consumer-logindirectly with curl - Check JWT_SECRET hasn't been rotated mid-flight (would invalidate all sessions)
- Check Cloud SQL is reachable
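A minimal curl sketch for the login check; the request body fields here are assumptions and need to match the real consumer-login contract:

```bash
# Probe the login route directly. The JSON fields are hypothetical placeholders;
# adjust them to the real consumer-login request shape.
curl -si -X POST https://ka26.shop/api/auth/consumer-login \
  -H "Content-Type: application/json" \
  -d '{"phone": "9999999999", "password": "not-a-real-password"}'
# 5xx → application/DB problem; 401 with a clean error body → the route itself is up.
```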
"I can't deploy"
- See
gh run view <run-id> --log-failed - Common: tsconfig exclude missing for a new mobile-only file → see Deploy
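To reproduce the usual failure locally before pushing a fix (a sketch, assuming the deploy build type-checks with `tsc`):

```bash
# Find the failing run, read only its failed step logs, then reproduce locally.
gh run list -R sidgk/ka26-marketplace --limit 5
gh run view <run-id> --log-failed -R sidgk/ka26-marketplace
# Should fail with the same TypeScript error the CI build hit.
npx tsc --noEmit
```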
"Payment confirmed but no order created"
- The order is created in
/api/payments/callback(PhonePe webhook) — check Cloud Run logs for the callback - If webhook didn't arrive: check PhonePe dashboard for delivery status
- Cleanup cron at
/api/payments/cleanupauto-cancels stuck payments after 5 min
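A log-filter sketch for the callback route; it assumes Cloud Run request logs carry the URL in `httpRequest.requestUrl` (fall back to a `textPayload` filter if they don't):

```bash
# Recent requests to the PhonePe callback route on the marketplace service.
gcloud logging read --project=school-mgmt-saas --freshness=30m --limit=50 \
  'resource.labels.service_name="ka26-marketplace" AND httpRequest.requestUrl:"/api/payments/callback"'
```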
"Sentry isn't catching errors"
- Check
NEXT_PUBLIC_SENTRY_DSNandSENTRY_DSNare both set on Cloud Run - Sentry is no-op safe — empty DSN = silently disabled; no error, just no data
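To confirm both DSNs are actually set on the deployed service, a quick sketch:

```bash
# Dump the service config and show the Sentry env vars (name + value lines).
gcloud run services describe ka26-marketplace --region us-central1 \
  --project school-mgmt-saas --format=yaml | grep -A1 SENTRY
```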
## What NEVER to do during an incident
- ❌ Push a "quick fix" to production without code review (push to a branch, deploy that revision via gcloud, then PR)
- ❌ Run a destructive SQL statement on prod without a transaction + tested rollback (safe pattern sketched below)
- ❌ Change an env var via the GCP console UI (ephemeral on next deploy — use the gcloud command)
- ❌ Restart Cloud Run "to clear cache" (Cloud Run is stateless; restarting just adds a cold-start)
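For reference, the safe destructive-SQL pattern from the second rule, sketched as a psql script; the connection string, table, and filter are all hypothetical placeholders:

```bash
# Sketch only: placeholders throughout. Everything before COMMIT can still be rolled back.
psql "$PROD_DATABASE_URL" <<'SQL'
BEGIN;
DELETE FROM example_table WHERE created_at < now() - interval '30 days';
-- Check the row count psql reports; if it looks wrong, replace COMMIT with ROLLBACK.
COMMIT;
SQL
```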
## Contact escalation
- Level 1: Whoever's on the on-call rotation
- Level 2: CTO / engineering lead
- Level 3: Founder
There's no formal on-call rotation pre-launch — it's siddu. Set one up in the first month post-launch.