Disaster Recovery and Rollback
Business Objective: Panic-Proof Business Continuity
This guide teaches you how to achieve instant recovery from disasters using Infrastream's "Recovery as Code" approach. By the end of this guide, you'll be able to:
✅ Roll back bad deployments in seconds using Git revert
✅ Recover from data corruption by restoring from latest automated backups
✅ Respond to ransomware attacks by restoring to a known-clean state
✅ Prevent panicked mistakes through two-step code review process
✅ Maintain complete incident audit trails for compliance and post-mortems
Why "Panic-Proof"? During a live incident, stressed teams make mistakes. Infrastream's Recovery as Code prevents rushed UI clicks by requiring code changes with peer review—forcing deliberate action even under pressure.
Core Principle: Recovery as Code
Infrastream intentionally does not provide a "Restore" button in any UI. Instead, recovery is performed through declarative manifest changes, ensuring:
✅ Two-step verification: Code review prevents panicked mistakes
✅ Audit trail: Every recovery action is permanently recorded in Git
✅ Reproducibility: Recovery procedures are version-controlled and testable
✅ No panic decisions: Force deliberate action during high-stress incidents
How This Works in Practice
| Recovery Type | Traditional Approach | Infrastream Approach |
|---|---|---|
| Bad deployment | SSH into servers, manual rollback | Git revert → auto-redeploy |
| Database corruption | UI restore button (risky!) | Restore from clean automated backup |
| Ransomware | Pray backup works, click restore | Restore via gcloud → import clean DB → update app |
| Audit trail | Maybe logs exist? | Git history is the audit trail |
Recovery Scenario 1: Rolling Back a Bad Deployment
The Problem
You deployed version v2.1.0 of your application, and it's causing errors. You need to immediately roll back to the previous working version v2.0.5.
Solution: Git Revert
Since your application deployment is defined in a manifest, you can roll back by reverting the Git commit that introduced the bad version.
Step-by-Step Recovery
1. Identify the Bad Commit
cd organizational-unit/find-my-venue/environment/integration/project/fmv-uae
git log application/fmv-customer-app.yaml
Output:
commit abc123... (HEAD -> main)
Author: Developer <dev@company.com>
Date:   Mon Feb 3 14:30:00 2025

    Update customer app to v2.1.0

commit def456...
Author: Developer <dev@company.com>
Date:   Fri Feb 1 10:15:00 2025

    Update customer app to v2.0.5
2. Revert the Commit
git revert abc123
This creates a new commit that undoes the changes from abc123, preserving history.
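Before pushing, it's worth a quick sanity check that the revert restored the previous version (a minimal sketch; the grep pattern assumes the image tag appears verbatim in the manifest):
# Inspect the revert commit and confirm the manifest points at v2.0.5 again
git show --stat HEAD
grep "v2.0.5" application/fmv-customer-app.yaml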
3. Submit Emergency PR
# Push the revert to a branch and open a pull request for review
git switch -c emergency/rollback-customer-app
git push origin emergency/rollback-customer-app
# Create a PR titled: "EMERGENCY: Rollback customer app to v2.0.5"
4. Fast-Track Approval
Request emergency approval from the on-call engineer, or use your organization's emergency change process.
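If your repository is hosted on GitHub, steps 3 and 4 can be combined with the GitHub CLI (a sketch; the label and reviewer handle are assumptions about your setup):
# Open the emergency PR and request on-call review in one step
gh pr create \
  --title "EMERGENCY: Rollback customer app to v2.0.5" \
  --body "Reverts abc123; v2.1.0 is erroring in production." \
  --label emergency \
  --reviewer oncall-engineer   # hypothetical GitHub handle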
5. Merge and Deploy
Once merged, Infrastream automatically:
- Pulls the previous container image (v2.0.5)
- Deploys it to Cloud Run/GKE
- Routes traffic to the stable version
Recovery Time: Typically 2-5 minutes from merge to live traffic restoration.
Recovery Scenario 2: Restoring from Automated Backup
The Problem
A critical failure or accidental deletion has occurred, and you need to restore your database to the latest known-good state.
Solution: Provision a New Database from Backup
You'll create a new database restored from your latest backup, validate it, then cut over your application.
Step-by-Step Recovery
1. Restore via Cloud Console or CLI
Use the gcloud CLI to restore the most recent backup:
# Create a new AlloyDB cluster from the latest backup
gcloud alloydb clusters restore main-recovered \
--source-cluster=main \
--source-cluster-region=us-central1 \
--region=us-central1 \
--project=customer-portal-prod
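If you first need to confirm which backup is most recent, list them (a sketch using gcloud's standard list flags):
# List available backups, newest first
gcloud alloydb backups list \
  --region=us-central1 \
  --project=customer-portal-prod \
  --sort-by=~createTime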
2. Import Recovered Database to Infrastream
Once the database restoration is complete, create an Infrastream manifest to manage the recovered database:
File: database/main-recovered.yaml
apiVersion: lowops.manifests.v1
kind: Database
metadata:
  name: main-recovered
  project: customer-portal
  environment: production
spec:
  description: Emergency recovery database.
  cpuCount: 4
  clusterSize: 3
3. Submit PR and Merge
Once merged, Infrastream will begin managing the recovered instance, allowing you to validate data and cut over traffic as described in Scenario 3.
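A quick data validation before cutover might look like the following (a sketch; the host, credentials, and orders table are hypothetical placeholders for your own smoke tests):
# Smoke-test the recovered database before routing traffic to it
psql "host=<recovered-instance-ip> dbname=app user=app_readonly" \
  -c "SELECT count(*) FROM orders;" \
  -c "SELECT max(created_at) FROM orders;"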
Recovery Scenario 3: Ransomware Attack Response
The Problem
At 03:00, you detect ransomware encryption of your production database. Logs indicate the attack began at 02:45.
Response Strategy
- Isolate: Immediately block all network egress
- Assess: Determine last known good timestamp
- Recover: Provision clean database from pre-attack backup
- Investigate: Analyze how attackers gained access
Step-by-Step Recovery
1. Emergency Isolation
Update the project manifest to block all egress:
# project/customer-portal.yaml
spec:
  allowedEgress: []  # Block all outbound traffic immediately
2. Restore Clean Database (Two-Phase Process)
Phase 1: Restore via Cloud Console/CLI
Restore from the latest clean automated backup:
# Restore via gcloud CLI using the latest clean backup
gcloud alloydb clusters restore main-clean \
--source-cluster=main \
--source-cluster-region=us-central1 \
--region=us-central1 \
--project=customer-portal-prod
Phase 2: Import to Infrastream
Once restoration completes, create a manifest to import and manage the clean database:
apiVersion: lowops.manifests.v1
kind: Database
metadata:
  name: main-clean  # Matches the restored cluster name
  project: customer-portal
  environment: production
spec:
  description: |
    SECURITY INCIDENT: Ransomware recovery
    This manifest imports the clean, pre-attack database.
  cpuCount: 4
  clusterSize: 3
3. Deploy Application to Clean Database
Update application deployment:
spec:
  container:
    env:
      - name: DATABASE_NAME
        value: main-clean
4. Restore Network Access (Selectively)
Once the application is validated, restore the necessary egress:
spec:
  allowedEgress:
    - api.stripe.com    # Payment processing
    - api.sendgrid.com  # Email service
    # Add only verified endpoints
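After the change merges, verify from a workload inside the project that only the allow-listed endpoints are reachable (a sketch; run from any VM or container in the project):
# Allow-listed endpoint should respond; anything else should time out
curl -sS --max-time 5 https://api.stripe.com > /dev/null && echo "stripe reachable"
curl -sS --max-time 5 https://example.com > /dev/null || echo "other egress blocked (expected)"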
5. Incident Analysis
The old database is preserved in its attacked state for forensic analysis. The security team can analyze it without risk to production.
Recovery Time: Typically 15-30 minutes from detection to clean service restoration.
Recovery Scenario 4: Rolling Back Infrastructure Changes
The Problem
You changed the database from cpuCount: 2 to cpuCount: 8, expecting better performance. Instead, connection pooling issues are causing failures. You need to revert.
Solution
# Create a rollback branch
git switch -c rollback-db-cpu-increase
# Find the commit that increased CPU
git log database/main.yaml
# Revert it
git revert <commit-hash>
# Submit emergency PR
git push origin rollback-db-cpu-increase
Infrastream will automatically scale the database back down to 2 CPUs.
Backup Strategy Reference
Infrastream automatically manages backups for all stateful resources:
AlloyDB Databases
spec:
  backupConfig:
    quantityBasedRetention: 30  # Keep the last 30 automated backups
    # Automated backups occur daily
Capabilities:
- Automated daily backups: No manual intervention required
- Cost-optimized storage: Backups stored in low-cost regional storage
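To confirm the retention policy is applied on the live cluster, you can inspect it with gcloud (a sketch; the output shape may vary by gcloud version):
# Show the automated backup policy for the cluster
gcloud alloydb clusters describe main \
  --region=us-central1 \
  --project=customer-portal-prod \
  --format="yaml(automatedBackupPolicy)"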
Virtual Machines with Persistent Disks
spec:
  configuration:
    volumeMounts:
      /mnt/data:
        diskConfig:
          sizeGb: 200
          type: pd-ssd
          # Snapshots created automatically every 24 hours
          # Retained for 14 days
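You can list the snapshots created for a disk with standard gcloud commands (a sketch; the disk-name filter is an assumption about your naming):
# List recent snapshots for the data disk, newest first
gcloud compute snapshots list \
  --project=customer-portal-prod \
  --filter="sourceDisk ~ kurrent-db" \
  --sort-by=~creationTimestamp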
Recovery Time Objectives (RTO)
| Scenario | Typical RTO | Factors |
|---|---|---|
| Application Rollback | 2-5 minutes | Container image pull time |
| Database Restore (Small) | 5-10 minutes | Database provisioning from backup |
| Database Restore (Large) | 15-30 minutes | Database provisioning from backup |
| VM Disk Restore | 10-15 minutes | Snapshot restoration + VM boot |
| Full Environment Rebuild | 30-60 minutes | Complete infrastructure recreation |
Best Practices
1. Test Your Recovery Procedures
Disaster Recovery Drill (Quarterly):
- Perform a test restoration via the Cloud Console in your staging environment
- Restore the production database to a known-good state
- Create an Infrastream manifest to manage the test recovery instance
- Validate the drill:
  - Recovery completes successfully
  - Data integrity is maintained
  - The application can connect
  - Performance is acceptable
- Document results and decommission the test instance (a scripted sketch of the drill follows)
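A minimal scripted sketch of such a drill, assuming the cluster and region names used earlier in this guide and a hypothetical staging project:
#!/usr/bin/env bash
set -euo pipefail

# 1. Restore the latest backup into a throwaway drill cluster
gcloud alloydb clusters restore "drill-$(date +%Y%m%d)" \
  --source-cluster=main \
  --source-cluster-region=us-central1 \
  --region=us-central1 \
  --project=customer-portal-staging   # hypothetical staging project

# 2. Record the completion time for your RTO report
echo "Drill restore completed at $(date -u +%FT%TZ)"

# 3. Validation and decommissioning follow the checklist above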
2. Document Recovery Runbooks
Create an incident response runbook:
# Database Corruption Runbook
1. Identify corruption timestamp from logs
2. Calculate recovery point: corruption time - 1 minute
3. Perform PITR via the Cloud Console to restore at the recovery point (or via the CLI; see the sketch below)
4. Create Infrastream manifest for recovered instance
5. Validate data before cutover
6. Update application configuration to use the recovered database
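Step 3 of this runbook can also be performed from the CLI: AlloyDB supports restoring a new cluster at a specific timestamp (a sketch; the timestamp is illustrative):
# Restore a new cluster at the calculated recovery point (UTC)
gcloud alloydb clusters restore main-pitr \
  --source-cluster=main \
  --point-in-time=2025-02-03T02:44:00Z \
  --region=us-central1 \
  --project=customer-portal-prod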
3. Maintain Communication During Incidents
The Git PR serves as the incident timeline:
PR #1234: EMERGENCY - Ransomware Recovery
- 03:05: Ransomware detected by SOC
- 03:07: Network egress blocked
- 03:12: Clean database provisioning started
- 03:23: Database provisioned and validated
- 03:26: Application cut over to clean database
- 03:30: Service restored, attack contained
4. Preserve Evidence
Never delete the compromised resource immediately:
# Mark as decommissioned but keep for investigation
metadata:
  name: main-compromised  # Rename to preserve
spec:
  description: |
    SECURITY INCIDENT - DO NOT DELETE
    Compromised database preserved for forensic analysis
    Incident: SEC-2025-089
Advanced: VM Disk Snapshots
For critical stateful VMs, Infrastream automatically creates periodic disk snapshots. Snapshot schedules are managed via GCP Resource Policies natively provisioned by the Go engine.
To restore a VM from snapshot:
- Identify the snapshot in the Cloud Console
- Create a new disk from the snapshot (see the gcloud sketch below)
- Update your VM manifest to attach the restored disk:
apiVersion: lowops.manifests.v1
kind: VirtualMachine
metadata:
  name: kurrent-db
spec:
  configuration:
    volumeMounts:
      /mnt/data:
        diskConfig:
          sizeGb: 200
          type: pd-ssd
          # Disk will be created from the latest snapshot automatically
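If you perform steps 1 and 2 manually, a gcloud sketch (snapshot and disk names are placeholders):
# Find the most recent snapshot of the data disk
gcloud compute snapshots list \
  --filter="sourceDisk ~ kurrent-db" \
  --sort-by=~creationTimestamp --limit=1

# Create a new disk from that snapshot
gcloud compute disks create kurrent-db-restored \
  --source-snapshot=<snapshot-name> \
  --zone=us-central1-a \
  --project=customer-portal-prod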
Note: Custom snapshot frequency and retention are configured at the organizational level by the engine's centralized policy runners, not in individual VM manifests.
Recovery from Complete Project Deletion
The Problem
A project was accidentally deleted (extremely rare due to Infrastream's safeguards).
Solution
Since all infrastructure is defined in Git, you can recreate the entire project:
# The manifests still exist in Git
cd organizational-unit/payments/environment/production/project/astrapay-prod
# Re-merge the project manifest
git revert <deletion-commit>
git push origin main
Infrastream will recreate the entire project including:
- GCP Project
- Networks and firewall rules
- All databases (from latest automated backup)
- All applications
- All storage buckets
Critical Data Recovery: Databases and storage buckets are restored from the most recent automated backup.
Monitoring and Alerting
Configure notification channels for disaster recovery alerts:
apiVersion: lowops.manifests.v1
kind: Alerting
metadata:
  name: emergency-alerts
  project: customer-portal
  environment: production
spec:
  notifications:
    slack:
      channel: "#ops-incidents"
    email:
      recipients:
        - oncall@company.com
        - dba-team@company.com
Note: Alert policies (e.g., backup failures, high query rates, database CPU) are provisioned directly by the platform's core runners. The Alerting manifest defines where alerts are sent, not the alert conditions themselves.
Infrastream provides Smart Alerts by default for:
- Database backup failures
- Disk space > 80%
- CPU utilization > 90%
- Memory pressure
- Replication lag
Troubleshooting
Problem: Database Restoration Fails
Error: Cannot restore from selected backup
Possible Causes:
- The backup is outside the retention window (more than 30 days old)
- The source database never existed (e.g., the cluster name is misspelled)
Solution:
- Confirm the backup is within the retention period
- Check the source database's name and status
Security Considerations
Recovery Access Control
Only administrators can perform recovery operations:
spec:
permissions:
administrators:
groups:
- incident-response-team
- platform-leads
Audit Recovery Actions
All recovery actions are audited:
- Git commit history (who initiated)
- PR review history (who approved)
- Cloud Audit Logs (when executed)
- Infrastream pipeline logs (detailed actions)
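For post-mortems, the Git side of that trail can be extracted with plain git commands (a minimal sketch showing who changed a manifest, when, and why):
# Full change history for the database manifest, including renames
git log --follow --format="%h %an %ad %s" -- database/main.yaml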
Next Steps
- Set up backup monitoring and alerts
- Schedule quarterly disaster recovery drills
- Document your organization's recovery runbooks
- Review Managing Secrets for secure credential recovery