
Disaster Recovery and Rollback

Business Objective: Panic-Proof Business Continuity

This guide teaches you how to achieve instant recovery from disasters using Infrastream's "Recovery as Code" approach. By the end of this guide, you'll be able to:

Roll back bad deployments in seconds using Git revert
Recover from data corruption by restoring from the latest automated backup
Respond to ransomware attacks by restoring to a known-clean state
Prevent panicked mistakes through a two-step code review process
Maintain complete incident audit trails for compliance and post-mortems

Why "Panic-Proof"? During a live incident, stressed teams make mistakes. Infrastream's Recovery as Code prevents rushed UI clicks by requiring code changes with peer review—forcing deliberate action even under pressure.


Core Principle: Recovery as Code

Infrastream intentionally does not provide a "Restore" button in any UI. Instead, recovery is performed through declarative manifest changes, ensuring:

Two-step verification: Code review prevents panicked mistakes
Audit trail: Every recovery action is permanently recorded in Git
Reproducibility: Recovery procedures are version-controlled and testable
No panic decisions: Force deliberate action during high-stress incidents
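
As a concrete illustration, a rollback under this model is nothing more than a reviewed change to the application manifest. A sketch in diff form; the image registry path below is hypothetical, and only the tag changes:

# application/fmv-customer-app.yaml (diff shown for illustration)
 spec:
   container:
-    image: registry.example.com/fmv/customer-app:v2.1.0
+    image: registry.example.com/fmv/customer-app:v2.0.5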

How This Works in Practice

| Recovery Type | Traditional Approach | Infrastream Approach |
|---|---|---|
| Bad deployment | SSH into servers, manual rollback | Git revert → auto-redeploy |
| Database corruption | UI restore button (risky!) | Restore from clean automated backup |
| Ransomware | Pray backup works, click restore | Restore via gcloud → import clean DB → update app |
| Audit trail | Maybe logs exist? | Git history is the audit trail |

Recovery Scenario 1: Rolling Back a Bad Deployment

The Problem

You deployed version v2.1.0 of your application, and it's causing errors. You need to immediately roll back to the previous working version v2.0.5.

Solution: Git Revert

Since your application deployment is defined in a manifest, you can roll back by reverting the Git commit that introduced the bad version.

Step-by-Step Recovery

1. Identify the Bad Commit

cd organizational-unit/find-my-venue/environment/integration/project/fmv-uae
git log application/fmv-customer-app.yaml

Output:

commit abc123... (HEAD -> main)
Author: Developer <dev@company.com>
Date:   Mon Feb 3 14:30:00 2025

    Update customer app to v2.1.0

commit def456...
Author: Developer <dev@company.com>
Date:   Sat Feb 1 10:15:00 2025

    Update customer app to v2.0.5

2. Revert the Commit

git revert abc123

This creates a new commit that undoes the changes from abc123, preserving history.
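
Before pushing, you can confirm the revert touches only the application manifest (standard git, no Infrastream-specific assumptions):

git show --stat HEAD
# Expect a single changed file: application/fmv-customer-app.yaml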

3. Submit Emergency PR

# Push the revert on a branch (branch name is illustrative)
git push origin HEAD:rollback-customer-app
# Create PR titled: "EMERGENCY: Rollback customer app to v2.0.5"
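
If your repository is hosted on GitHub, the PR can also be opened from the command line; a sketch using the gh CLI (adapt the title and base branch to your setup):

gh pr create \
  --base main \
  --title "EMERGENCY: Rollback customer app to v2.0.5" \
  --body "Reverts abc123; restores v2.0.5 after elevated error rates."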

4. Fast-Track Approval

Request emergency approval from the on-call engineer, or use your organization's emergency change process.

5. Merge and Deploy

Once merged, Infrastream automatically:

  1. Pulls the previous container image (v2.0.5)
  2. Deploys it to Cloud Run/GKE
  3. Routes traffic to the stable version
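
You can confirm the cutover from the CLI. A sketch assuming the app runs on Cloud Run under the hypothetical service name fmv-customer-app:

# Show which revision is currently serving traffic
gcloud run services describe fmv-customer-app \
  --region=us-central1 \
  --format="yaml(status.traffic)"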

Recovery Time: Typically 2-5 minutes from merge to live traffic restoration.



Recovery Scenario 2: Restoring from Automated Backup

The Problem

A critical failure or accidental deletion has occurred, and you need to restore your database to the latest known-good state.

Solution: Provision a New Database from Backup

You'll create a new database restored from your latest backup, validate it, then cut over your application.

Step-by-Step Recovery

1. Restore via Cloud Console or CLI

Use the gcloud CLI to restore the most recent backup:

# List recent automated backups to find the latest one
gcloud alloydb backups list \
  --region=us-central1 \
  --project=customer-portal-prod

# Create a new AlloyDB cluster from the chosen backup
gcloud alloydb clusters restore main-recovered \
  --backup=<BACKUP_ID> \
  --region=us-central1 \
  --project=customer-portal-prod

2. Import Recovered Database to Infrastream

Once the database restoration is complete, create an Infrastream manifest to manage the recovered database:

File: database/main-recovered.yaml

apiVersion: lowops.manifests.v1
kind: Database
metadata:
  name: main-recovered
  project: customer-portal
  environment: production
spec:
  description: Emergency recovery database.
  cpuCount: 4
  clusterSize: 3

3. Submit PR and Merge

Once merged, Infrastream will begin managing the recovered instance, allowing you to validate data and cut over traffic as described in Scenario 3.
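
Before cutting over, validate the recovered data. A minimal sketch assuming PostgreSQL access and a hypothetical orders table:

# Spot-check row counts and data recency on the recovered instance
psql "host=<RECOVERED_DB_IP> dbname=app user=app" \
  -c "SELECT count(*), max(created_at) FROM orders;"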


Recovery Scenario 3: Ransomware Attack Response

The Problem

At 03:00, you detect ransomware encryption of your production database. Logs indicate the attack began at 02:45.

Response Strategy

  1. Isolate: Immediately block all network egress
  2. Assess: Determine last known good timestamp
  3. Recover: Provision clean database from pre-attack backup
  4. Investigate: Analyze how attackers gained access

Step-by-Step Recovery

1. Emergency Isolation

Update the project manifest to block all egress:

# project/customer-portal.yaml
spec:
  allowedEgress: [] # Block all outbound traffic immediately

2. Restore Clean Database (Two-Phase Process)

Phase 1: Restore via Cloud Console/CLI

Restore to the latest clean automated backup:

# Restore a new cluster from the latest clean (pre-attack) backup
gcloud alloydb clusters restore main-clean \
  --backup=<PRE_ATTACK_BACKUP_ID> \
  --region=us-central1 \
  --project=customer-portal-prod

Phase 2: Import to Infrastream

Once restoration completes, create a manifest to import and manage the clean database:

apiVersion: lowops.manifests.v1
kind: Database
metadata:
  name: main-clean # Matches the restored cluster name
  project: customer-portal
  environment: production
spec:
  description: |
    SECURITY INCIDENT: Ransomware recovery
    This manifest imports the clean, pre-attack database.
  cpuCount: 4
  clusterSize: 3

3. Deploy Application to Clean Database

Update the application deployment manifest:

spec:
  container:
    env:
      - name: DATABASE_NAME
        value: main-clean

4. Restore Network Access (Selectively)

Once the application is validated, restore only the necessary egress:

spec:
  allowedEgress:
    - api.stripe.com # Payment processing
    - api.sendgrid.com # Email service
    # Add only verified endpoints

5. Incident Analysis

The old database is preserved in its attacked state for forensic analysis, so the security team can examine it without risk to production.

Recovery Time: Typically 15-30 minutes from detection to clean service restoration.


Recovery Scenario 4: Rolling Back Infrastructure Changes

The Problem

You changed the database from cpuCount: 2 to cpuCount: 8, expecting better performance. Instead, connection pooling issues are causing failures. You need to revert.

Solution

# Find the commit that increased CPU
git log database/main.yaml

# Revert it
git revert <commit-hash>

# Push the revert on a branch and open an emergency PR
git push origin HEAD:rollback-db-cpu-increase

Infrastream will automatically scale the database back down to 2 CPUs.
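
Once the pipeline has run, you can verify the scale-down directly against AlloyDB (the instance name primary is hypothetical):

gcloud alloydb instances describe primary \
  --cluster=main \
  --region=us-central1 \
  --format="value(machineConfig.cpuCount)"
# Expected output: 2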


Backup Strategy Reference

Infrastream automatically manages backups for all stateful resources:

AlloyDB Databases

spec:
  backupConfig:
    quantityBasedRetention: 30 # Keep last 30 automated backups
    # Automated backups occur daily

Capabilities:

  • Automated daily backups: No manual intervention required
  • Cost-optimized storage: Backups stored in low-cost regional storage

Virtual Machines with Persistent Disks

spec:
  configuration:
    volumeMounts:
      /mnt/data:
        diskConfig:
          sizeGb: 200
          type: pd-ssd
          # Snapshots created automatically every 24 hours
          # Retained for 14 days

Recovery Time Objectives (RTO)

| Scenario | Typical RTO | Factors |
|---|---|---|
| Application Rollback | 2-5 minutes | Container image pull time |
| Database Restore (Small) | 5-10 minutes | Database provisioning from backup |
| Database Restore (Large) | 15-30 minutes | Database provisioning from backup |
| VM Disk Restore | 10-15 minutes | Snapshot restoration + VM boot |
| Full Environment Rebuild | 30-60 minutes | Complete infrastructure recreation |

Best Practices

1. Test Your Recovery Procedures

Disaster Recovery Drill (Quarterly):

  1. In your staging environment, perform a test restoration via the Console
  2. Restore a copy of the production database to a known-good state
  3. Create Infrastream manifest to manage the test recovery instance
  4. Validate the drill:
    • Recovery completes successfully
    • Data integrity is maintained
    • Application can connect
    • Performance is acceptable
  5. Document results and decommission test instance

2. Document Recovery Runbooks

Create an incident response runbook:

# Database Corruption Runbook

1. Identify corruption timestamp from logs
2. Calculate recovery point: corruption time - 1 minute
3. Perform PITR via Cloud Console to restore at recovery point
4. Create Infrastream manifest for recovered instance
5. Validate data before cutover
6. Update application configuration to use recovered database
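
Step 3 can also be performed with gcloud instead of the Console; a sketch using AlloyDB point-in-time recovery (the timestamp is illustrative):

# Restore a new cluster to one minute before the corruption
gcloud alloydb clusters restore main-pitr \
  --source-cluster=main \
  --point-in-time=2025-02-03T02:44:00Z \
  --region=us-central1 \
  --project=customer-portal-prod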

3. Maintain Communication During Incidents

The Git PR serves as the incident timeline:

PR #1234: EMERGENCY - Ransomware Recovery

- 03:05: Ransomware detected by SOC
- 03:07: Network egress blocked
- 03:12: Clean database provisioning started
- 03:23: Database provisioned and validated
- 03:26: Application cut over to clean database
- 03:30: Service restored, attack contained

4. Preserve Evidence

Never delete the compromised resource immediately:

# Mark as decommissioned but keep for investigation
metadata:
  name: main-compromised # Rename to preserve
spec:
  description: |
    SECURITY INCIDENT - DO NOT DELETE
    Compromised database preserved for forensic analysis
    Incident: SEC-2025-089

Advanced: VM Disk Snapshots

For critical stateful VMs, Infrastream automatically creates periodic disk snapshots. Snapshot schedules are managed via GCP Resource Policies natively provisioned by the Go engine.

To restore a VM from snapshot:

  1. Identify the snapshot in Cloud Console
  2. Create a new disk from the snapshot
  3. Update your VM manifest to attach the restored disk:

apiVersion: lowops.manifests.v1
kind: VirtualMachine
metadata:
  name: kurrent-db
spec:
  configuration:
    volumeMounts:
      /mnt/data:
        diskConfig:
          sizeGb: 200
          type: pd-ssd
          # Disk will be created from latest snapshot automatically
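
Steps 1 and 2 above can also be done from the CLI; a sketch using standard gcloud compute commands (snapshot and disk names are hypothetical):

# 1. Find the most recent snapshot of the data disk
gcloud compute snapshots list \
  --filter="sourceDisk~kurrent-db-data" \
  --sort-by=~creationTimestamp --limit=1

# 2. Create a new disk from that snapshot
gcloud compute disks create kurrent-db-data-restored \
  --source-snapshot=<SNAPSHOT_NAME> \
  --zone=us-central1-a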

Note: Custom snapshot frequency and retention are configured at the organizational level by the engine's centralized policy runners, not in individual VM manifests.


Recovery from Complete Project Deletion

The Problem

A project was accidentally deleted (extremely rare due to Infrastream's safeguards).

Solution

Since all infrastructure is defined in Git, you can recreate the entire project:

# The manifests still exist in Git
cd organizational-unit/payments/environment/production/project/astrapay-prod

# Revert the commit that deleted the project
git revert <deletion-commit>
git push origin main

Infrastream will recreate the entire project including:

  • GCP Project
  • Networks and firewall rules
  • All databases (from latest automated backup)
  • All applications
  • All storage buckets

Critical Data Recovery: Databases and storage buckets are restored from the most recent automated backup.


Monitoring and Alerting

Configure notification channels for disaster recovery alerts:

apiVersion: lowops.manifests.v1
kind: Alerting
metadata:
  name: emergency-alerts
  project: customer-portal
  environment: production
spec:
  notifications:
    slack:
      channel: "#ops-incidents"
    email:
      recipients:
        - oncall@company.com
        - dba-team@company.com

Note: Alert policies (e.g., backup failures, high query rates, database CPU) are provisioned directly by the platform's core runners. The Alerting manifest defines where alerts are sent, not the alert conditions themselves.

Infrastream provides Smart Alerts by default for:

  • Database backup failures
  • Disk space > 80%
  • CPU utilization > 90%
  • Memory pressure
  • Replication lag

Troubleshooting

Problem: Database Restoration Fails

Error: Cannot restore from selected backup

Possible Causes:

  1. Backup is outside retention window (> 30 days old)
  2. Source database never existed

Solution:

  • Ensure backup is within retention period
  • Check source database status
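
A quick sketch for both checks using gcloud (cluster and region values follow the examples above):

# Confirm the backup exists and is within the 30-day retention window
gcloud alloydb backups list \
  --region=us-central1 \
  --project=customer-portal-prod

# Confirm the source cluster is healthy
gcloud alloydb clusters describe main \
  --region=us-central1 \
  --format="value(state)"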

Security Considerations

Recovery Access Control

Only administrators can perform recovery operations:

spec:
  permissions:
    administrators:
      groups:
        - incident-response-team
        - platform-leads

Audit Recovery Actions

All recovery actions are audited:

  • Git commit history (who initiated)
  • PR review history (who approved)
  • Cloud Audit Logs (when executed)
  • Infrastream pipeline logs (detailed actions)
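
For example, recovery-related admin activity can be pulled from Cloud Audit Logs with gcloud (the filter is illustrative):

gcloud logging read \
  'protoPayload.serviceName="alloydb.googleapis.com" AND protoPayload.methodName:"Restore"' \
  --project=customer-portal-prod \
  --limit=10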


Next Steps

  • Set up backup monitoring and alerts
  • Schedule quarterly disaster recovery drills
  • Document your organization's recovery runbooks
  • Review Managing Secrets for secure credential recovery