
Disaster Recovery and Rollback

Business Objective: Panic-Proof Business Continuity

This guide teaches you how to achieve instant recovery from disasters using Infrastream's "Recovery as Code" approach. By the end of this guide, you'll be able to:

Roll back bad deployments in seconds using Git revert
Recover from data corruption by restoring from the latest automated backup
Respond to ransomware attacks by restoring to a known-clean state
Prevent panicked mistakes through a two-step code review process
Maintain complete incident audit trails for compliance and post-mortems

Why "Panic-Proof"? During a live incident, stressed teams make mistakes. Infrastream's Recovery as Code prevents rushed UI clicks by requiring code changes with peer review—forcing deliberate action even under pressure.


Core Principle: Recovery as Code

Infrastream intentionally does not provide a "Restore" button in any UI. Instead, recovery is performed through declarative manifest changes, ensuring:

Two-step verification: Code review prevents panicked mistakes
Audit trail: Every recovery action is permanently recorded in Git
Reproducibility: Recovery procedures are version-controlled and testable
No panic decisions: Force deliberate action during high-stress incidents
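
As a concrete illustration, a rollback under this model is nothing more than a reviewed change to the application manifest. A sketch in diff form; the image registry path below is hypothetical, and only the tag changes:

# application/fmv-customer-app.yaml (diff shown for illustration)
 spec:
   container:
-    image: registry.example.com/fmv/customer-app:v2.1.0
+    image: registry.example.com/fmv/customer-app:v2.0.5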

How This Works in Practice

| Recovery Type | Traditional Approach | Infrastream Approach |
|---|---|---|
| Bad deployment | SSH into servers, manual rollback | Git revert → auto-redeploy |
| Database corruption | UI restore button (risky!) | Restore from clean automated backup |
| Ransomware | Pray backup works, click restore | Restore via gcloud → import clean DB → update app |
| Audit trail | Maybe logs exist? | Git history is the audit trail |

Recovery Scenario 1: Rolling Back a Bad Deployment

The Problem

You deployed version v2.1.0 of your application, and it's causing errors. You need to immediately roll back to the previous working version v2.0.5.

Solution: Git Revert

Since your application deployment is defined in a manifest, you can roll back by reverting the Git commit that introduced the bad version.

Step-by-Step Recovery

1. Identify the Bad Commit

cd organizational-unit/find-my-venue/environment/integration/project/fmv-uae
git log application/fmv-customer-app.yaml

Output:

commit abc123... (HEAD -> main)
Author: Developer <dev@company.com>
Date:   Mon Feb 3 14:30:00 2025

    Update customer app to v2.1.0

commit def456...
Author: Developer <dev@company.com>
Date:   Sat Feb 1 10:15:00 2025

    Update customer app to v2.0.5

2. Revert the Commit

git revert abc123

This creates a new commit that undoes the changes from abc123, preserving history.
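
Before pushing, you can confirm the revert touches only the application manifest (standard git, no Infrastream-specific assumptions):

git show --stat HEAD
# Expect a single changed file: application/fmv-customer-app.yaml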

3. Submit Emergency PR

# Push the revert on a branch (branch name is illustrative)
git push origin HEAD:rollback-customer-app
# Create PR titled: "EMERGENCY: Rollback customer app to v2.0.5"
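
If your repository is hosted on GitHub, the PR can also be opened from the command line; a sketch using the gh CLI (adapt the title and base branch to your setup):

gh pr create \
  --base main \
  --title "EMERGENCY: Rollback customer app to v2.0.5" \
  --body "Reverts abc123; restores v2.0.5 after elevated error rates."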

4. Fast-Track Approval

Request emergency approval from the on-call engineer, or use your organization's emergency change process.

5. Merge and Deploy

Once merged, Infrastream automatically:

  1. Pulls the previous container image (v2.0.5)
  2. Deploys it to Cloud Run/GKE
  3. Routes traffic to the stable version
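
You can confirm the cutover from the CLI. A sketch assuming the app runs on Cloud Run under the hypothetical service name fmv-customer-app:

# Show which revision is currently serving traffic
gcloud run services describe fmv-customer-app \
  --region=us-central1 \
  --format="yaml(status.traffic)"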

Recovery Time: Typically 2-5 minutes from merge to live traffic restoration.



Recovery Scenario 2: Restoring from Automated Backup

The Problem

A critical failure or accidental deletion has occurred, and you need to restore your database to the latest known-good state.

Solution: Provision a New Database from Backup

You'll create a new database restored from your latest backup, validate it, then cut over your application.

Step-by-Step Recovery

1. Restore via Cloud Console or CLI

Use the gcloud CLI to restore the most recent backup:

# List recent automated backups to find the latest one
gcloud alloydb backups list \
  --region=us-central1 \
  --project=customer-portal-prod

# Create a new AlloyDB cluster from the chosen backup
gcloud alloydb clusters restore main-recovered \
  --backup=<BACKUP_ID> \
  --region=us-central1 \
  --project=customer-portal-prod

2. Import Recovered Database to Infrastream

Once the database restoration is complete, create an Infrastream manifest to manage the recovered database:

File: database/main-recovered.yaml

apiVersion: lowops.manifests.v1
kind: Database
metadata:
  name: main-recovered
  project: customer-portal
  environment: production
spec:
  description: Emergency recovery database.
  cpuCount: 4
  clusterSize: 3

3. Submit PR and Merge

Once merged, Infrastream will begin managing the recovered instance, allowing you to validate data and cut over traffic as described in Scenario 3.
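
Before cutting over, validate the recovered data. A minimal sketch assuming PostgreSQL access and a hypothetical orders table:

# Spot-check row counts and data recency on the recovered instance
psql "host=<RECOVERED_DB_IP> dbname=app user=app" \
  -c "SELECT count(*), max(created_at) FROM orders;"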


Recovery Scenario 3: Ransomware Attack Response

The Problem

At 03:00, you detect ransomware encryption of your production database. Logs indicate the attack began at 02:45.

Response Strategy

  1. Isolate: Immediately block all network egress
  2. Assess: Determine last known good timestamp
  3. Recover: Provision clean database from pre-attack backup
  4. Investigate: Analyze how attackers gained access

Step-by-Step Recovery

1. Emergency Isolation

Update the project manifest to block all egress:

# project/customer-portal.yaml
spec:
  allowedEgress: [] # Block all outbound traffic immediately

2. Restore Clean Database (Two-Phase Process)

Phase 1: Restore via Cloud Console/CLI

Restore to the latest clean automated backup:

# Restore a new cluster from the latest clean (pre-attack) backup
gcloud alloydb clusters restore main-clean \
  --backup=<PRE_ATTACK_BACKUP_ID> \
  --region=us-central1 \
  --project=customer-portal-prod

Phase 2: Import to Infrastream

Once restoration completes, create a manifest to import and manage the clean database:

apiVersion: lowops.manifests.v1
kind: Database
metadata:
  name: main-clean # Matches the restored cluster name
  project: customer-portal
  environment: production
spec:
  description: |
    SECURITY INCIDENT: Ransomware recovery
    This manifest imports the clean, pre-attack database.
  cpuCount: 4
  clusterSize: 3

3. Deploy Application to Clean Database

Update the application deployment manifest:

spec:
  container:
    env:
      - name: DATABASE_NAME
        value: main-clean

4. Restore Network Access (Selectively)

Once the application is validated, restore only the necessary egress:

spec:
  allowedEgress:
    - api.stripe.com # Payment processing
    - api.sendgrid.com # Email service
    # Add only verified endpoints

5. Incident Analysis

The old database is preserved in its attacked state for forensic analysis, so the security team can examine it without risk to production.

Recovery Time: Typically 15-30 minutes from detection to clean service restoration.


Recovery Scenario 4: Rolling Back Infrastructure Changes

The Problem

You changed the database from cpuCount: 2 to cpuCount: 8, expecting better performance. Instead, connection pooling issues are causing failures. You need to revert.

Solution

# Find the commit that increased CPU
git log database/main.yaml

# Revert it
git revert <commit-hash>

# Push the revert on a branch and open an emergency PR
git push origin HEAD:rollback-db-cpu-increase

Infrastream will automatically scale the database back down to 2 CPUs.
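
Once the pipeline has run, you can verify the scale-down directly against AlloyDB (the instance name primary is hypothetical):

gcloud alloydb instances describe primary \
  --cluster=main \
  --region=us-central1 \
  --format="value(machineConfig.cpuCount)"
# Expected output: 2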


Backup Strategy Reference

Infrastream automatically manages backups for all stateful resources:

AlloyDB Databases

spec:
  backupConfig:
    quantityBasedRetention: 30 # Keep last 30 automated backups
    # Automated backups occur daily

Capabilities:

  • Automated daily backups: No manual intervention required
  • Cost-optimized storage: Backups stored in low-cost regional storage

Virtual Machines with Persistent Disks

spec:
  configuration:
    volumeMounts:
      /mnt/data:
        diskConfig:
          sizeGb: 200
          type: pd-ssd
          # Snapshots created automatically every 24 hours
          # Retained for 14 days

Recovery Time Objectives (RTO)

| Scenario | Typical RTO | Factors |
|---|---|---|
| Application Rollback | 2-5 minutes | Container image pull time |
| Database Restore (Small) | 5-10 minutes | Database provisioning from backup |
| Database Restore (Large) | 15-30 minutes | Database provisioning from backup |
| VM Disk Restore | 10-15 minutes | Snapshot restoration + VM boot |
| Full Environment Rebuild | 30-60 minutes | Complete infrastructure recreation |

Best Practices

1. Test Your Recovery Procedures

Disaster Recovery Drill (Quarterly):

  1. In your staging environment, perform a test restoration via the Console
  2. Restore a copy of the production database to a known-good state
  3. Create Infrastream manifest to manage the test recovery instance
  4. Validate the drill:
    • Recovery completes successfully
    • Data integrity is maintained
    • Application can connect
    • Performance is acceptable
  5. Document results and decommission test instance

2. Document Recovery Runbooks

Create an incident response runbook:

# Database Corruption Runbook

1. Identify corruption timestamp from logs
2. Calculate recovery point: corruption time - 1 minute
3. Perform PITR via Cloud Console to restore at recovery point
4. Create Infrastream manifest for recovered instance
5. Validate data before cutover
6. Update application configuration to use recovered database
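
Step 3 can also be performed with gcloud instead of the Console; a sketch using AlloyDB point-in-time recovery (the timestamp is illustrative):

# Restore a new cluster to one minute before the corruption
gcloud alloydb clusters restore main-pitr \
  --source-cluster=main \
  --point-in-time=2025-02-03T02:44:00Z \
  --region=us-central1 \
  --project=customer-portal-prod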

3. Maintain Communication During Incidents

The Git PR serves as the incident timeline:

PR #1234: EMERGENCY - Ransomware Recovery

- 03:05: Ransomware detected by SOC
- 03:07: Network egress blocked
- 03:12: Clean database provisioning started
- 03:23: Database provisioned and validated
- 03:26: Application cut over to clean database
- 03:30: Service restored, attack contained

4. Preserve Evidence

Never delete the compromised resource immediately:

# Mark as decommissioned but keep for investigation
metadata:
  name: main-compromised # Rename to preserve
spec:
  description: |
    SECURITY INCIDENT - DO NOT DELETE
    Compromised database preserved for forensic analysis
    Incident: SEC-2025-089

Advanced: VM Disk Snapshots

For critical stateful VMs, Infrastream automatically creates periodic disk snapshots. Snapshot schedules are managed via GCP Resource Policies natively provisioned by the Go engine.

To restore a VM from snapshot:

  1. Identify the snapshot in Cloud Console
  2. Create a new disk from the snapshot
  3. Update your VM manifest to attach the restored disk:

apiVersion: lowops.manifests.v1
kind: VirtualMachine
metadata:
  name: kurrent-db
spec:
  configuration:
    volumeMounts:
      /mnt/data:
        diskConfig:
          sizeGb: 200
          type: pd-ssd
          # Disk will be created from latest snapshot automatically
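
Steps 1 and 2 above can also be done from the CLI; a sketch using standard gcloud compute commands (snapshot and disk names are hypothetical):

# 1. Find the most recent snapshot of the data disk
gcloud compute snapshots list \
  --filter="sourceDisk~kurrent-db-data" \
  --sort-by=~creationTimestamp --limit=1

# 2. Create a new disk from that snapshot
gcloud compute disks create kurrent-db-data-restored \
  --source-snapshot=<SNAPSHOT_NAME> \
  --zone=us-central1-a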

Note: Custom snapshot frequency and retention are configured at the organizational level by the engine's centralized policy runners, not in individual VM manifests.


Recovery from Complete Project Deletion

The Problem

A project was accidentally deleted (extremely rare due to Infrastream's safeguards).

Solution

Since all infrastructure is defined in Git, you can recreate the entire project:

# The manifests still exist in Git
cd organizational-unit/payments/environment/production/project/astrapay-prod

# Revert the commit that deleted the project
git revert <deletion-commit>
git push origin main

Infrastream will recreate the entire project including:

  • GCP Project
  • Networks and firewall rules
  • All databases (from latest automated backup)
  • All applications
  • All storage buckets

Critical Data Recovery: Databases and storage buckets are restored from the most recent automated backup.


Monitoring and Alerting

Configure notification channels for disaster recovery alerts:

apiVersion: lowops.manifests.v1
kind: Alerting
metadata:
  name: emergency-alerts
  project: customer-portal
  environment: production
spec:
  notifications:
    slack:
      channel: "#ops-incidents"
    email:
      recipients:
        - oncall@company.com
        - dba-team@company.com

Note: Alert policies (e.g., backup failures, high query rates, database CPU) are provisioned directly by the platform's core runners. The Alerting manifest defines where alerts are sent, not the alert conditions themselves.

Infrastream provides Smart Alerts by default for:

  • Database backup failures
  • Disk space > 80%
  • CPU utilization > 90%
  • Memory pressure
  • Replication lag

Troubleshooting

Problem: Database Restoration Fails

Error: Cannot restore from selected backup

Possible Causes:

  1. Backup is outside retention window (> 30 days old)
  2. Source database never existed

Solution:

  • Ensure backup is within retention period
  • Check source database status
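
A quick sketch for both checks using gcloud (cluster and region values follow the examples above):

# Confirm the backup exists and is within the 30-day retention window
gcloud alloydb backups list \
  --region=us-central1 \
  --project=customer-portal-prod

# Confirm the source cluster is healthy
gcloud alloydb clusters describe main \
  --region=us-central1 \
  --format="value(state)"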

Security Considerations

Recovery Access Control

Only administrators can perform recovery operations:

spec:
  permissions:
    administrators:
      groups:
        - incident-response-team
        - platform-leads

Audit Recovery Actions

All recovery actions are audited:

  • Git commit history (who initiated)
  • PR review history (who approved)
  • Cloud Audit Logs (when executed)
  • Infrastream pipeline logs (detailed actions)
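
For example, recovery-related admin activity can be pulled from Cloud Audit Logs with gcloud (the filter is illustrative):

gcloud logging read \
  'protoPayload.serviceName="alloydb.googleapis.com" AND protoPayload.methodName:"Restore"' \
  --project=customer-portal-prod \
  --limit=10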


Next Steps

  • Set up backup monitoring and alerts
  • Schedule quarterly disaster recovery drills
  • Document your organization's recovery runbooks
  • Review Managing Secrets for secure credential recovery