Backups & Recovery
Backups are useless until tested. This page is the contract: what we have, where it lives, how to restore.
What gets backed up
| Asset | Where | Retention | Verified? |
|---|---|---|---|
| Postgres (Cloud SQL) | Cloud SQL automated backups + 7 days of write-ahead logs (PITR) | 30 days | ⚠️ Should be tested monthly |
| GCS images / videos | Versioned bucket + lifecycle rules | Forever (paid storage) | ✅ Implicit (object versioning) |
| App code | GitHub | Forever | ✅ |
| Secrets | GCP Secret Manager + 1Password export | Forever | ⚠️ 1Password export should be quarterly |
| Env vars | Cloud Run revision config | 5 most recent revisions | ✅ Implicit (revision history) |
Postgres — Cloud SQL backup
Verify backups are running
gcloud sql backups list --instance=school-db --project=school-mgmt-saas --limit=5
If empty / stale → enable in console: Cloud SQL → school-db → Backups → Edit settings
- Automated backups: ON
- Backup window: 02:00–06:00 IST (off-peak)
- Backup retention: 30 days
- Enable point-in-time recovery: ON (for Postgres this turns on write-ahead log archiving, which PITR replays)
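The manual check above can be scripted so that a stale backup is noticed without opening the console. The following is a minimal sketch, not a tested implementation: it assumes GNU `date` (for `-d`), and the 26-hour default threshold is an assumed value (one daily backup plus slack), not something defined on this page.

```shell
# Print the age, in whole hours, of an RFC 3339 timestamp ($1)
# relative to "now" given as epoch seconds ($2). Requires GNU date.
backup_age_hours() {
  ts_epoch=$(date -u -d "$1" +%s) || return 2
  echo $(( ($2 - ts_epoch) / 3600 ))
}

# Return 0 if a recent successful backup exists, 1 if it is missing or stale.
check_latest_backup() {
  max_age_hours=${1:-26}   # assumed threshold, adjust to taste
  latest=$(gcloud sql backups list \
    --instance=school-db --project=school-mgmt-saas \
    --filter="status=SUCCESSFUL" --sort-by="~windowStartTime" \
    --limit=1 --format="value(windowStartTime)")
  if [ -z "$latest" ]; then
    echo "STALE: no successful backup found" >&2
    return 1
  fi
  age=$(backup_age_hours "$latest" "$(date -u +%s)") || return 2
  if [ "$age" -gt "$max_age_hours" ]; then
    echo "STALE: last successful backup was ${age}h ago ($latest)" >&2
    return 1
  fi
  echo "OK: last successful backup was ${age}h ago ($latest)"
}
```

Running `check_latest_backup 26` from a scheduled job and alerting on a non-zero exit code would be one way to get a basic backup-failure signal.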
Restore options
Option A: Restore to a new instance (safest, doesn't affect prod):
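# school-db-restored must already exist; the restore overwrites its contents
gcloud sql backups restore <BACKUP_ID> \
--backup-instance=school-db \
--restore-instance=school-db-restored \
--project=school-mgmt-saas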
Then dump the affected tables from the restored instance, delete the corrupted rows in prod, and re-insert the good copies from that dump.
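Option B: Point-in-time recovery (rewinds the whole DB): GCP Console → Cloud SQL → school-db → Clones → "Clone with point-in-time recovery". PITR always creates a new instance at the chosen timestamp, so prod itself is untouched until the app is pointed at the clone. Only cut over to the clone if the entire DB is corrupted: it rewinds ALL data, not just the broken table.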
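The console flow above has a CLI equivalent in `gcloud sql instances clone` with `--point-in-time`. A hedged sketch follows: the target name `school-db-pitr` and the example timestamp are hypothetical, and the timestamp has to fall inside the retained log window.

```shell
# Clone school-db into a NEW instance as of a given UTC timestamp.
# $1: target instance name (hypothetical example: school-db-pitr)
# $2: RFC 3339 UTC timestamp, e.g. 2026-04-17T09:30:00Z
pitr_clone() {
  target=$1
  timestamp=$2
  # Reject anything that is not a plain UTC timestamp before calling gcloud.
  case $timestamp in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]:[0-9][0-9]:[0-9][0-9]Z) ;;
    *) echo "timestamp must look like 2026-04-17T09:30:00Z" >&2; return 2 ;;
  esac
  gcloud sql instances clone school-db "$target" \
    --project=school-mgmt-saas \
    --point-in-time="$timestamp"
}
```

For example, `pitr_clone school-db-pitr 2026-04-17T09:30:00Z` would request a clone as of 09:30 UTC on that day.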
Option C: pg_restore from a manual dump:
# Take a dump (do this monthly as a cold backup). Read the password from Secret Manager or 1Password; do not hardcode it.
PGPASSWORD=... pg_dump -h 34.123.40.64 -U ka26user -d ka26 -F c -f ka26-$(date +%Y%m%d).dump
# Restore (to a fresh DB, NOT prod directly)
PGPASSWORD=... pg_restore -h ... -U ... -d ka26-fresh ka26-20260417.dump
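A dump is only useful as a cold backup if the archive is readable. One cheap check, sketched below, is to ask `pg_restore --list` for the archive's table of contents, which does not need a database connection. This is a sanity check under that assumption, not a substitute for a real restore.

```shell
# Return 0 if $1 is a non-empty archive with at least one TOC entry.
verify_dump() {
  dump=$1
  if [ ! -s "$dump" ]; then
    echo "missing or empty: $dump" >&2
    return 1
  fi
  # --list prints the archive's table of contents without touching a database.
  toc=$(pg_restore --list "$dump") || {
    echo "unreadable archive: $dump" >&2
    return 1
  }
  # Header lines in the listing start with ';'. Count the remaining entries.
  entries=$(printf '%s\n' "$toc" | grep -v '^;' | grep -c . || true)
  echo "$dump: $entries TOC entries"
  [ "$entries" -gt 0 ]
}
```

Usage would look like `verify_dump ka26-20260417.dump` right after taking the dump.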
GCS — image/video backups
GCS bucket ka26-uploads (and video bucket) should have:
- Versioning enabled — accidentally-overwritten files can be restored
- Lifecycle rule: delete noncurrent (superseded) object versions after 30 days; live objects are kept
- Soft delete (Cloud Storage feature, available 2024+) — recovers deleted objects within 7 days
Verify:
gsutil versioning get gs://ka26-uploads
gsutil lifecycle get gs://ka26-uploads
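If the verify commands show versioning off or no lifecycle rule, both can be applied from the CLI. The sketch below rests on one assumption: the 30-day rule is read as "delete noncurrent versions 30 days after they are superseded", expressed with the `daysSinceNoncurrentTime` condition, so live objects are left alone.

```shell
# Write a lifecycle config that deletes noncurrent versions after 30 days.
write_lifecycle_config() {
  cat > "$1" <<'EOF'
{
  "rule": [
    {
      "action": { "type": "Delete" },
      "condition": { "daysSinceNoncurrentTime": 30 }
    }
  ]
}
EOF
}

# Enable versioning and apply the lifecycle config to one bucket.
# $1: bucket URL, e.g. gs://ka26-uploads
apply_bucket_protection() {
  bucket=$1
  config=$(mktemp) || return 1
  write_lifecycle_config "$config"
  status=0
  gsutil versioning set on "$bucket" &&
    gsutil lifecycle set "$config" "$bucket" || status=$?
  rm -f "$config"
  return "$status"
}
```

`apply_bucket_protection gs://ka26-uploads` would cover the image bucket; the video bucket would need the same call with its own URL.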
Secrets — Secret Manager
Secret Manager versions everything by default. To rotate a secret:
echo -n "new-value" | gcloud secrets versions add my-secret --data-file=-
The old version stays around until you retire it explicitly: gcloud secrets versions disable is reversible, gcloud secrets versions destroy is permanent. This means that even if someone overwrites a secret, the previous version is still recoverable; versions do not expire by default.
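Recovering an overwritten secret amounts to reading an older version and publishing its payload again as the newest version. A minimal sketch, assuming the version number is whatever `gcloud secrets versions list my-secret` shows for the last good value:

```shell
# Re-publish version $2 of secret $1 as a new latest version.
rollback_secret() {
  secret=$1
  version=$2
  # Stage the payload in a private temp file so a failed read
  # never results in an empty version being added.
  payload=$(mktemp) || return 1
  status=0
  if gcloud secrets versions access "$version" --secret="$secret" > "$payload"; then
    gcloud secrets versions add "$secret" --data-file="$payload" || status=$?
  else
    echo "could not read version $version of $secret" >&2
    status=1
  fi
  rm -f "$payload"
  return "$status"
}
```

For example, `rollback_secret my-secret 3` would make the contents of version 3 the latest version again while leaving the bad version in the history.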
Quarterly task: export all secrets to a 1Password vault entry as a flat dump (in case GCP is compromised entirely):
# Run in a subshell with a tight umask so the dump is readable only by you
(
umask 077
for s in $(gcloud secrets list --format="value(name)"); do
echo "=== $s ==="
gcloud secrets versions access latest --secret="$s"
echo
done > /tmp/secrets-dump.txt
)
# Manually paste into 1Password "GCP Secrets Backup" item
rm /tmp/secrets-dump.txt
Recovery time objectives
| Scenario | RTO (target) | RPO (target) |
|---|---|---|
| Single revision rollback | 30 sec | 0 |
| Cloud Run service down | 5 min | 0 |
| Single table data corruption | 1 hour | < 5 min (PITR) |
| Entire DB corrupted | 2 hours | < 5 min (PITR) |
| GCP region outage | 4 hours | < 1 hour |
| Lost GCP project access | 1 day | 0 (everything recoverable from 1Password + GitHub) |
What we DON'T have (gaps to close post-launch)
- ❌ Tested restore — backups exist but we've never actually run a restore. Schedule a test restore in month 1.
- ❌ Cross-region backup — single point of failure if us-central1 is wiped. GCS auto-replicates; Cloud SQL needs a manual scheduled cross-region replica.
- ❌ Backup of EmailLog — currently in main DB; lose the DB → lose audit log
- ❌ Backup notifications — no alert if a backup fails. Add Cloud Monitoring alert post-launch.
How to drill (annual)
- Spin up a fresh Cloud SQL instance from a 7-day-old backup
- Dump a known-changed table (e.g., yesterday's StoreOrder rows)
- Verify row counts + a sample of rows match
- Tear down the test instance
- Document time-taken in this page
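The row-count step of the drill can be sketched as a small comparison helper. `PROD_URL` and `RESTORED_URL` are hypothetical connection strings, and the optional filter is left to the caller because a restored backup can only match prod for rows that existed when the backup was taken.

```shell
# Print the row count of table $2 in the database at connection string $1.
# $3 is an optional WHERE condition supplied by the caller.
table_count() {
  # -A (unaligned) and -t (tuples only) make psql print just the number.
  psql -At -c "SELECT count(*) FROM \"$2\" WHERE ${3:-true}" "$1"
}

# Return 0 if prod and the restored instance agree on the count.
compare_counts() {
  table=$1
  condition=${2:-true}
  prod=$(table_count "$PROD_URL" "$table" "$condition") || return 2
  restored=$(table_count "$RESTORED_URL" "$table" "$condition") || return 2
  echo "$table: prod=$prod restored=$restored"
  [ "$prod" -eq "$restored" ]
}
```

With both variables exported, `compare_counts StoreOrder` checks the whole table; passing a condition on a timestamp column (the column name depends on the schema) narrows the check to rows older than the backup.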
If we never drill, we don't have backups — we have hopes.