Incident Playbook

When something is broken in production. Read this BEFORE the incident, not during.

Severity ladder

| Level | Symptom | Response time | Who decides |
| --- | --- | --- | --- |
| P0 — Outage | Site fully down (UptimeRobot fires SMS + email), payments broken, data loss | < 5 min | Anyone seeing it |
| P1 — Degradation | Major feature broken (login, checkout, push notifications) but site loads | < 30 min | Whoever notices first |
| P2 — Bug | Specific feature broken for a subset of users | Same business day | Triage in standup |
| P3 — Polish | Minor UI issue, nothing blocking | Next sprint | Triage in standup |

P0 / P1 — Live response

Step 1: Acknowledge (60 seconds)

  • See the alert (UptimeRobot SMS, Sentry, hourly cron failure email)
  • Reply to the alert thread / WhatsApp: "I'm on it"
  • Open the Sentry issues feed (once DSN set)

Step 2: Assess (3 minutes)

  • Open /api/health?key=ka26-health-2026 in a browser → see which of the 7 checks failed
  • Check Cloud Run logs: gcloud logging read --project=school-mgmt-saas --freshness=10m 'resource.labels.service_name="ka26-marketplace"' --limit=50
  • Check recent deploys: gcloud run revisions list --service ka26-marketplace --region us-central1 --limit 5
  • Check recent CI runs and their commits: gh run list -R sidgk/ka26-marketplace --limit 5
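
The four checks above can be pasted into one throwaway script. A minimal sketch, reusing the project, service, and health key from this page:

#!/usr/bin/env bash
# triage.sh: the whole assess step in one pass
set -uo pipefail   # no -e: keep going even if one check fails

echo "== health =="
curl -s "https://ka26.shop/api/health?key=ka26-health-2026"; echo

echo "== recent logs =="
gcloud logging read --project=school-mgmt-saas --freshness=10m \
  'resource.labels.service_name="ka26-marketplace"' --limit=50

echo "== recent revisions =="
gcloud run revisions list --service ka26-marketplace --region us-central1 --limit 5

echo "== recent CI runs =="
gh run list -R sidgk/ka26-marketplace --limit 5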

Step 3: Decide

| Symptom | Action |
| --- | --- |
| Last deploy was minutes ago + things broke | Roll back (see below) |
| All revisions are sick | Database / external dep issue — check Cloud SQL console |
| Single feature broken | Patch + deploy — don't roll back |
| Database connection refused | Cloud SQL instance might be down — check console |
| 503 from Cloud Run | Container failed to start — check logs |
| 500 from API routes | Application error — check Sentry once DSN set |
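
For the two database rows, confirm whether the Cloud SQL instance itself is down before digging into app code:

# Any instance not in RUNNABLE state is the problem
gcloud sql instances list --project school-mgmt-saas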

Step 4: Mitigate

Rollback to a known-good Cloud Run revision:

# List recent revisions
gcloud run revisions list --service ka26-marketplace --region us-central1 --limit 10

# Roll all traffic to a known-good one (e.g. ka26-marketplace-00229-abc)
gcloud run services update-traffic ka26-marketplace \
--region us-central1 --project school-mgmt-saas \
--to-revisions ka26-marketplace-00229-abc=100

# Verify
curl -s https://ka26.shop/api/health

Takes ~30s. The next request hits the rolled-back code.
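
One gotcha: pinning traffic to a named revision sticks. Later deploys get 0% of traffic until you point the service back at latest, so once the fix has shipped:

gcloud run services update-traffic ka26-marketplace \
--region us-central1 --project school-mgmt-saas \
--to-latest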

Disable a feature via flag (faster than rollback for one-off bad features):

gcloud run services update ka26-marketplace --region us-central1 \
--update-env-vars MY_FEATURE_ENABLED=false
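
This creates and deploys a new revision (env vars are baked into revisions), so allow one cold start. To confirm the flag landed:

gcloud run services describe ka26-marketplace --region us-central1 \
--format=yaml | grep -A1 MY_FEATURE_ENABLED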

Scale up Cloud Run (rare — for traffic spikes):

gcloud run services update ka26-marketplace --region us-central1 \
--max-instances=20

Step 5: Communicate

  • Post in WhatsApp: "Issue identified: [X]. ETA to fix: [Y]"
  • If it's a paying-customer-affecting outage, draft a status page note
  • Update the incident thread every 15 min until resolved

Step 6: Post-incident (within 48h)

Write a post-mortem:

  • What broke (1 paragraph)
  • Why it broke (root cause — not "human error", but the system that allowed the human error)
  • How long it lasted (timeline)
  • What we'll change (concrete action items with owners + deadlines)

Add the post-mortem as a CHANGELOG entry under ## [DATE] Incident: <short title>.
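
A skeleton for that entry (date, title, and details are illustrative):

## [2026-02-03] Incident: checkout 500s after deploy

- What broke: checkout POSTs returned 500 for every user
- Why: <root cause, one sentence: the system, not the person>
- Timeline: 14:02 first alert → 14:09 rollback → 14:21 verified healthy
- Action items: <concrete change> (owner: <name>, due: <date>)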

Common scenarios

"Site is slow"

  • Check Cloud SQL for slow queries (SELECT query, mean_time, calls FROM pg_stat_statements ORDER BY mean_time DESC LIMIT 20; on Postgres 13+ the column is mean_exec_time)
  • Look for missing indexes — EXPLAIN ANALYZE on the slow query
  • Check Cloud Run CPU + memory in the GCP console → if pegged, scale up --cpu or --memory
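
If the console shows pegged CPU or memory, the bump is one command (values illustrative, not the current settings):

gcloud run services update ka26-marketplace --region us-central1 \
--cpu=2 --memory=1Gi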

"Login is broken"

  • Check /api/auth/consumer-login directly with curl
  • Check JWT_SECRET hasn't been rotated mid-flight (would invalidate all sessions)
  • Check Cloud SQL is reachable
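
A direct probe from a terminal (the request body shape here is an assumption; mirror whatever the client actually sends):

curl -si -X POST https://ka26.shop/api/auth/consumer-login \
-H 'Content-Type: application/json' \
-d '{"phone": "9999999999", "password": "wrong"}'
# 401 = auth logic alive, credentials rejected; 500 = server-side failure, check logs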

"I can't deploy"

  • See gh run view <run-id> --log-failed
  • Common cause: a new mobile-only file missing from the tsconfig exclude list → see Deploy

"Payment confirmed but no order created"

  • The order is created in /api/payments/callback (PhonePe webhook) — check Cloud Run logs for the callback
  • If webhook didn't arrive: check PhonePe dashboard for delivery status
  • Cleanup cron at /api/payments/cleanup auto-cancels stuck payments after 5 min
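
To isolate the callback traffic in the logs (same filter syntax as the assess step; matching on the URL assumes Cloud Run request logging):

gcloud logging read --project=school-mgmt-saas --freshness=60m \
'resource.labels.service_name="ka26-marketplace" AND httpRequest.requestUrl:"/api/payments/callback"' \
--limit=20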

"Sentry isn't catching errors"

  • Check NEXT_PUBLIC_SENTRY_DSN and SENTRY_DSN are both set on Cloud Run
  • Sentry is no-op safe — empty DSN = silently disabled; no error, just no data
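
Both can be verified from a terminal instead of the console:

gcloud run services describe ka26-marketplace --region us-central1 \
--format=yaml | grep SENTRY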

What NEVER to do during an incident

  • ❌ Push a "quick fix" to production without code review (push to a branch, deploy that revision via gcloud, then PR; see the sketch after this list)
  • ❌ Run a destructive SQL on prod without a transaction + tested rollback
  • ❌ Change an env var via the GCP console UI (ephemeral on next deploy — use gcloud command)
  • ❌ Restart Cloud Run "to clear cache" (Cloud Run is stateless; restarting just adds a cold-start)
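
On the first point: gcloud run deploy can ship a revision that receives zero traffic, so a hotfix is testable at a tagged URL before any user sees it (the image path below is an assumption; use whatever CI builds):

gcloud run deploy ka26-marketplace --region us-central1 \
--image us-central1-docker.pkg.dev/school-mgmt-saas/ka26/app:hotfix \
--no-traffic --tag hotfix
# Test at the tagged URL gcloud prints, then shift traffic via update-traffic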

Contact escalation

  • Level 1: Whoever's on the on-call rotation
  • Level 2: CTO / engineering lead
  • Level 3: Founder

There's no formal on-call rotation pre-launch — it's siddu. Set one up in the first month post-launch.