# Incident Playbook

When something is broken in production. Read this BEFORE the incident, not during.
## Severity ladder
| Level | Symptom | Response time | Who decides |
|---|---|---|---|
| P0 — Outage | Site fully down (UptimeRobot fires SMS + email), payments broken, data loss | < 5 min | Anyone seeing it |
| P1 — Degradation | Major feature broken (login, checkout, push notifications) but site loads | < 30 min | Whoever notices first |
| P2 — Bug | Specific feature broken for a subset of users | Same business day | Triage in standup |
| P3 — Polish | Minor UI issue, nothing blocking | Next sprint | Triage in standup |
## P0 / P1 — Live response

### Step 1: Acknowledge (60 seconds)
- See the alert (UptimeRobot SMS, Sentry, hourly cron failure email)
- Reply to the alert thread / WhatsApp: "I'm on it"
- Open the Sentry issues feed (once DSN set)
### Step 2: Assess (3 minutes)

- Open `/api/health?key=ka26-health-2026` in a browser → see which of the 7 checks failed (terminal version sketched after this list)
- Check Cloud Run logs: `gcloud logging read --project=school-mgmt-saas --freshness=10m 'resource.labels.service_name="ka26-marketplace"' --limit=50`
- Check recent deploys: `gcloud run revisions list --service ka26-marketplace --region us-central1 --limit 5`
- Check recent commits: `gh run list -R sidgk/ka26-marketplace --limit 5`
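If a browser isn't handy, the same health check works from a terminal. A minimal sketch, assuming the endpoint returns JSON (drop the `jq` pipe if it isn't installed):

```bash
# Hit the health endpoint and pretty-print whichever of the 7 checks failed.
# Assumes a JSON response; drop the jq pipe if jq isn't installed.
curl -s "https://ka26.shop/api/health?key=ka26-health-2026" | jq .
```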
### Step 3: Decide
| Symptom | Action |
|---|---|
| Last deploy was minutes ago + things broke | Roll back (see below) |
| All revisions are sick | Database / external dep issue — check Cloud SQL console |
| Single feature broken | Patch + deploy — don't roll back |
| Database connection refused | Cloud SQL instance might be down — check console |
| 503 from Cloud Run | Container failed to start — check logs |
| 500 from API routes | Application error — check Sentry once DSN set |
### Step 4: Mitigate
Rollback to a known-good Cloud Run revision:
```bash
# List recent revisions
gcloud run revisions list --service ka26-marketplace --region us-central1 --limit 10

# Roll all traffic to a known-good one (e.g. ka26-marketplace-00229-abc)
gcloud run services update-traffic ka26-marketplace \
  --region us-central1 --project school-mgmt-saas \
  --to-revisions ka26-marketplace-00229-abc=100

# Verify
curl -s https://ka26.shop/api/health
```
Takes ~30s. The next request hits the rolled-back code.
Disable a feature via flag (faster than rollback for one-off bad features):
```bash
gcloud run services update ka26-marketplace --region us-central1 \
  --update-env-vars MY_FEATURE_ENABLED=false
```
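Flipping the flag creates a new revision, like any env-var change. Once the fix ships, re-enable or drop the flag; a sketch using the same placeholder flag name:

```bash
# Re-enable the feature once the fix is deployed...
gcloud run services update ka26-marketplace --region us-central1 \
  --update-env-vars MY_FEATURE_ENABLED=true

# ...or remove the flag from the service entirely
gcloud run services update ka26-marketplace --region us-central1 \
  --remove-env-vars MY_FEATURE_ENABLED
```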
Scale up Cloud Run (rare — for traffic spikes):
```bash
gcloud run services update ka26-marketplace --region us-central1 \
  --max-instances=20
```
### Step 5: Communicate
- Post in WhatsApp: "Issue identified: [X]. ETA to fix: [Y]"
- If it's a paying-customer-affecting outage, draft a status page note
- Update the incident thread every 15 min until resolved
### Step 6: Post-incident (within 48h)
Write a post-mortem:
- What broke (1 paragraph)
- Why it broke (root cause — not "human error", but the system that allowed the human error)
- How long it lasted (timeline)
- What we'll change (concrete action items with owners + deadlines)
Add the post-mortem as a CHANGELOG entry under `## [DATE] Incident: <short title>`.
## Common scenarios

### "Site is slow"

- Check Cloud SQL for slow queries (`SELECT query, mean_time, calls FROM pg_stat_statements ORDER BY mean_time DESC LIMIT 20`) (connection sketch below)
- Look for missing indexes — `EXPLAIN ANALYZE` on the slow query
- Check Cloud Run CPU + memory in the GCP console → if pegged, scale up `--cpu` or `--memory`
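One way to get a psql session for the queries above; a sketch, where `INSTANCE_NAME` is a placeholder filled in from the list command:

```bash
# Find the Cloud SQL instance, then open a psql session to run the
# pg_stat_statements / EXPLAIN ANALYZE queries above. INSTANCE_NAME is a placeholder.
gcloud sql instances list --project school-mgmt-saas
gcloud sql connect INSTANCE_NAME --user=postgres --project school-mgmt-saas
```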
"Login is broken"
- Check
/api/auth/consumer-logindirectly with curl - Check JWT_SECRET hasn't been rotated mid-flight (would invalidate all sessions)
- Check Cloud SQL is reachable
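A minimal curl sketch for the login check; the request body fields here are assumptions and need to match the real consumer-login contract:

```bash
# Probe the login route directly. The JSON fields are hypothetical placeholders;
# adjust them to the real consumer-login request shape.
curl -si -X POST https://ka26.shop/api/auth/consumer-login \
  -H "Content-Type: application/json" \
  -d '{"phone": "9999999999", "password": "not-a-real-password"}'
# 5xx → application/DB problem; 401 with a clean error body → the route itself is up.
```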
"I can't deploy"
- See
gh run view <run-id> --log-failed - Common: tsconfig exclude missing for a new mobile-only file → see Deploy
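To reproduce the usual failure locally before pushing a fix (a sketch, assuming the deploy build type-checks with `tsc`):

```bash
# Find the failing run, read only its failed step logs, then reproduce locally.
gh run list -R sidgk/ka26-marketplace --limit 5
gh run view <run-id> --log-failed -R sidgk/ka26-marketplace
# Should fail with the same TypeScript error the CI build hit.
npx tsc --noEmit
```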
"Payment confirmed but no order created"
- The order is created in
/api/payments/callback(PhonePe webhook) — check Cloud Run logs for the callback - If webhook didn't arrive: check PhonePe dashboard for delivery status
- Cleanup cron at
/api/payments/cleanupauto-cancels stuck payments after 5 min
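A log-filter sketch for the callback route; it assumes Cloud Run request logs carry the URL in `httpRequest.requestUrl` (fall back to a `textPayload` filter if they don't):

```bash
# Recent requests to the PhonePe callback route on the marketplace service.
gcloud logging read --project=school-mgmt-saas --freshness=30m --limit=50 \
  'resource.labels.service_name="ka26-marketplace" AND httpRequest.requestUrl:"/api/payments/callback"'
```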
"Sentry isn't catching errors"
- Check
NEXT_PUBLIC_SENTRY_DSNandSENTRY_DSNare both set on Cloud Run - Sentry is no-op safe — empty DSN = silently disabled; no error, just no data
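To confirm both DSNs are actually set on the deployed service, a quick sketch:

```bash
# Dump the service config and show the Sentry env vars (name + value lines).
gcloud run services describe ka26-marketplace --region us-central1 \
  --project school-mgmt-saas --format=yaml | grep -A1 SENTRY
```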
## What NEVER to do during an incident
- ❌ Push a "quick fix" to production without code review (push to a branch, deploy that revision via gcloud, then PR)
- ❌ Run a destructive SQL statement on prod without a transaction + tested rollback (safe pattern sketched below)
- ❌ Change an env var via the GCP console UI (ephemeral on next deploy — use the gcloud command)
- ❌ Restart Cloud Run "to clear cache" (Cloud Run is stateless; restarting just adds a cold-start)
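For reference, the safe destructive-SQL pattern from the second rule, sketched as a psql script; the connection string, table, and filter are all hypothetical placeholders:

```bash
# Sketch only: placeholders throughout. Everything before COMMIT can still be rolled back.
psql "$PROD_DATABASE_URL" <<'SQL'
BEGIN;
DELETE FROM example_table WHERE created_at < now() - interval '30 days';
-- Check the row count psql reports; if it looks wrong, replace COMMIT with ROLLBACK.
COMMIT;
SQL
```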
## Contact escalation
- Level 1: Whoever's on the on-call rotation
- Level 2: CTO / engineering lead
- Level 3: Founder
There's no formal on-call rotation pre-launch — it's siddu. Set one up in the first month post-launch.