Database Point-in-Time Recovery (PITR) Example

This example demonstrates Infrastream's "panic-proof" database recovery capabilities, implementing the Security Resilience and Recovery use case.

Scenario: Data Corruption Recovery

Timeline:

  • 14:25 UTC: Production database running normally
  • 14:30 UTC: A developer accidentally runs an UPDATE statement without a WHERE clause, corrupting customer records
  • 14:45 UTC: Corruption discovered by QA team
  • 14:46 UTC: Incident response initiated

Objective: Restore database to 14:29 UTC (one minute before corruption) without disrupting production service.


Why Not "Restore in Place"?

Infrastream deliberately prevents restoring over a live database to avoid:

  • Overwriting good data that arrived between corruption and discovery
  • No rollback if you restore to the wrong timestamp
  • Panic mistakes during high-stress incidents
  • Loss of forensic evidence (the corrupted database is destroyed)

Instead: create a NEW database from the backup, validate it, then cut your application over.


Step 1: Restore Database via Cloud Console/CLI

[!IMPORTANT] Critical: Infrastream does not perform PITR directly. You must restore via Google Cloud Console or gcloud CLI first, then import the recovered database to Infrastream.

Phase 1: Perform PITR (Outside Infrastream)

Use the gcloud CLI to restore the corrupted database to a specific point in time:

# Navigate to your project
cd organizational-unit/banking/environment/production

# Execute PITR to create a new cluster from backup
gcloud alloydb clusters restore main-recovered \
  --source-cluster=main \
  --source-cluster-region=us-central1 \
  --region=us-central1 \
  --point-in-time=2025-02-03T14:29:00Z \
  --project=customer-portal-prod

# Monitor restoration progress
gcloud alloydb operations list \
  --region=us-central1 \
  --filter="targetId:main-recovered"
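
If you prefer to block until the restore finishes, a small polling loop works. This is a sketch, not an Infrastream feature; it assumes the same region and project as above:

# Poll the restored cluster until it reports READY
until [ "$(gcloud alloydb clusters describe main-recovered \
  --region=us-central1 \
  --project=customer-portal-prod \
  --format='value(state)' 2>/dev/null)" = "READY" ]; do
  echo "Waiting for main-recovered to finish provisioning..."
  sleep 30
done
echo "Cluster main-recovered is READY"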

Alternative: Cloud Console

  1. Navigate to AlloyDB → Clusters → main
  2. Click "Restore" → "Point-in-time recovery"
  3. Timestamp: 2025-02-03T14:29:00Z
  4. New cluster name: main-recovered
  5. Click "Restore"

Wait 5-10 minutes for provisioning to complete.
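
Whichever path you used, you can confirm the cluster is ready before moving on (a quick CLI check, assuming the same region and project as above):

# Confirm the restored cluster reached the READY state
gcloud alloydb clusters describe main-recovered \
  --region=us-central1 \
  --project=customer-portal-prod \
  --format='value(state)'
# Expected output: READY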


Step 2: Import Recovered Database to Infrastream

Phase 2: Create Manifest (After Restoration Completes)

Once the GCP restore operation finishes, create an Infrastream manifest to import and manage the recovered database.

File: organizational-unit/banking/environment/production/project/customer-portal/database/main-recovered.yaml

apiVersion: lowops.manifests.v1
kind: Database
metadata:
  name: main-recovered # MUST match the cluster name from Phase 1
  project: customer-portal
  environment: production
  organizational-unit: banking
  organization: acme-corp
spec:
  description: |
    EMERGENCY RECOVERY DATABASE

    Incident: INC-2025-047
    Issue: Accidental data corruption from UPDATE without WHERE clause
    Corruption Time: 2025-02-03 14:30:00 UTC
    Recovery Point: 2025-02-03 14:29:00 UTC (1 minute before corruption)

    Restored via: gcloud alloydb clusters restore (Phase 1)
    This manifest imports the already-restored cluster into Infrastream.

    Incident Commander: ops-lead@acme-corp.com
    Approved By: cto@acme-corp.com

    DO NOT DELETE - Preserve for post-incident review

  # Match the original database specifications exactly
  cpuCount: 8
  clusterSize: 3 # High availability configuration

  configuration:
    maxConnections: 1000
    googleDbAdvisor: true

  # Backup retention for the recovered database
  backupConfig:
    quantityBasedRetention: 30 # Keep 30 days of backups

Step 3: Submit PR to Import Database

# Navigate to project directory
cd organizational-unit/banking/environment/production/project/customer-portal

# Create the manifest for the ALREADY-RESTORED database
cat > database/main-recovered.yaml << 'EOF'
apiVersion: lowops.manifests.v1
kind: Database
metadata:
  name: main-recovered # Must match restored cluster from Phase 1
  project: customer-portal
  environment: production
  organizational-unit: banking
  organization: acme-corp
spec:
  description: |
    EMERGENCY RECOVERY: Data corruption at 14:30 UTC
    Incident: INC-2025-047

    Restored via: gcloud PITR at 2025-02-03T14:29:00Z
    This manifest imports the recovered database.

  cpuCount: 8
  clusterSize: 3
  configuration:
    maxConnections: 1000
    googleDbAdvisor: true
  backupConfig:
    quantityBasedRetention: 30
EOF
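
# Optional sanity check: confirm the manifest parses as valid YAML
# (a hedged extra step; assumes python3 with PyYAML is available)
python3 -c 'import yaml; yaml.safe_load(open("database/main-recovered.yaml"))' && echo "Manifest OK"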

# Commit with detailed message
git add database/main-recovered.yaml
git commit -m "EMERGENCY: Provision recovered database from 14:29 UTC

Incident: INC-2025-047
Issue: Data corruption from accidental UPDATE without WHERE clause
Corrupted Database: main (DO NOT DELETE - preserved for forensics)
Recovery Database: main-recovered
Recovery Timestamp: 2025-02-03T14:29:00Z (1 min before corruption)

Incident Commander: ops-lead@acme-corp.com
Approved By: cto@acme-corp.com (verbal approval, ticket #INC-2025-047)"

# Push to remote
git push origin emergency/database-recovery-inc-2025-047

# Create PR (use emergency change process for fast-track approval)
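
If the repository is hosted on GitHub, the PR can be opened from the same shell. A sketch using the GitHub CLI; it assumes gh is installed and authenticated, and that an "emergency" label exists in your repo:

# Open the emergency PR with the incident ID front and center
gh pr create \
  --title "EMERGENCY DATABASE RECOVERY - INC-2025-047" \
  --body "PITR import of main-recovered (recovery point 14:29 UTC). Details in commit message." \
  --label emergency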

Step 4: Infrastream Imports Database

Once PR is merged, Infrastream:

  1. Discovers the existing cluster main-recovered in GCP (restored in Step 1)
  2. Imports it into Infrastream management
  3. Configures networking and IAM permissions
  4. Enables monitoring and logging
  5. Manages backups going forward

Import Time: Typically 2-5 minutes after merge.

Monitor Progress: Infrastream Portal → Projects → customer-portal → Databases → main-recovered

Step 5: Validate Recovered Data

CRITICAL: Before cutover, validate the recovered database contains correct data and NO corruption.

Connection String

Infrastream provides the connection details via automatic DNS:

main-recovered.customer-portal.production.banking.acme.internal:5432
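
Before running the full validation, it is worth confirming the record resolves and the port answers (a quick sanity check; assumes standard Linux tooling with netcat installed):

# Confirm the Infrastream-provided DNS record resolves
getent hosts main-recovered.customer-portal.production.banking.acme.internal

# Confirm Postgres answers on port 5432
nc -zv main-recovered.customer-portal.production.banking.acme.internal 5432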

Validation Queries

Connect from a shell:

psql "host=main-recovered.customer-portal.production.banking.acme.internal \
      port=5432 \
      dbname=postgres \
      user=postgres \
      sslmode=require"

Then run the validation queries:

-- 1. Verify record count matches expectations
SELECT COUNT(*) FROM customers;
-- Expected: ~1,250,000 (pre-corruption count)

-- 2. Verify no corrupted records exist
SELECT COUNT(*) FROM customers WHERE email IS NULL;
-- Expected: 0 (corruption set emails to NULL)

-- 3. Verify critical customer data integrity
SELECT customer_id, email, created_at
FROM customers
WHERE customer_id IN (12345, 67890, 11111)
LIMIT 10;
-- Verify known good customers have correct data

-- 4. Check timestamp of latest record
SELECT MAX(updated_at) FROM customers;
-- Should be <= 2025-02-03 14:29:00 UTC

-- 5. Verify referential integrity
SELECT COUNT(*)
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;
-- Expected: 0 (no orphaned orders)
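
The same checks can be scripted so they run consistently and fail loudly mid-incident. A sketch, assuming psql is installed and credentials come from PGPASSWORD or ~/.pgpass:

#!/usr/bin/env bash
# Automated validation for the recovered database; exits non-zero on any failure
set -euo pipefail
export PGSSLMODE=require

HOST=main-recovered.customer-portal.production.banking.acme.internal
PSQL="psql -h $HOST -p 5432 -U postgres -d postgres -tA"

# Corruption indicator: the bad UPDATE set emails to NULL
nulls=$($PSQL -c "SELECT COUNT(*) FROM customers WHERE email IS NULL;")
[ "$nulls" -eq 0 ] || { echo "FAIL: $nulls NULL emails"; exit 1; }

# Orphaned orders would mean broken referential integrity
orphans=$($PSQL -c "SELECT COUNT(*) FROM orders o LEFT JOIN customers c ON o.customer_id = c.customer_id WHERE c.customer_id IS NULL;")
[ "$orphans" -eq 0 ] || { echo "FAIL: $orphans orphaned orders"; exit 1; }

# Latest record must predate the corruption window
$PSQL -c "SELECT MAX(updated_at) <= TIMESTAMPTZ '2025-02-03 14:29:00+00' FROM customers;" | grep -q '^t$' \
  || { echo "FAIL: records newer than recovery point"; exit 1; }

echo "PASS: all validation checks succeeded"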

Validation Checklist

  • Record counts match expectations
  • No corrupted/NULL values in critical fields
  • Known test records have correct data
  • Latest timestamp is before corruption time
  • Referential integrity maintained
  • Sample queries return expected results

Step 6: Application Cutover

Once validation is complete, update the application to use the recovered database.

File: deployment-config/customer-portal-api.yaml

Before (Using Corrupted Database)

apiVersion: lowops.manifests.v1
kind: DeploymentConfig
metadata:
  name: customer-portal-api
  organization: acme-corp
  organizational-unit: banking
  environment: production
  project: customer-portal
spec:
  description: Customer portal API service
  version: "1.0.0"

  container:
    env:
      - name: DATABASE_HOST
        value: main.customer-portal.production.banking.acme.internal
      - name: DATABASE_NAME
        value: postgres
      - name: LOG_LEVEL
        value: info

After (Using Recovered Database)

apiVersion: lowops.manifests.v1
kind: DeploymentConfig
metadata:
  name: customer-portal-api
  organization: acme-corp
  organizational-unit: banking
  environment: production
  project: customer-portal
spec:
  description: Customer portal API service
  version: "1.0.0"

  container:
    env:
      - name: DATABASE_HOST
        value: main-recovered.customer-portal.production.banking.acme.internal # ← CHANGED
      - name: DATABASE_NAME
        value: postgres
      - name: LOG_LEVEL
        value: info
      # Track recovery incident
      - name: RECOVERY_INCIDENT
        value: INC-2025-047

Cutover Process

# Update deployment config
vim deployment-config/customer-portal-api.yaml

# Commit the change
git add deployment-config/customer-portal-api.yaml
git commit -m "EMERGENCY: Cut over to recovered database

Incident: INC-2025-047
Validation: PASSED (see incident notes)
Cutover Time: 2025-02-03 15:05 UTC
Approved By: cto@acme-corp.com"

# Push and merge
git push origin emergency/database-recovery-inc-2025-047

Result: Application immediately reconnects to the clean database. Service restored within 2-5 minutes of merge.
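
One way to confirm the cutover took effect is to watch application connections arrive on the recovered database (a sketch using Postgres's pg_stat_activity view; assumes the same psql access as during validation):

# Count live connections on the recovered database
psql -h main-recovered.customer-portal.production.banking.acme.internal -p 5432 -U postgres -d postgres -tA \
  -c "SELECT COUNT(*) FROM pg_stat_activity WHERE datname = 'postgres';"
# A steadily non-zero count indicates the API has reconnected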


Step 7: Preserve Corrupted Database for Forensics

DO NOT DELETE the corrupted database immediately. Preserve it for:

  • Post-incident analysis
  • Security investigation
  • Data recovery of any transactions between 14:30 and 14:45 UTC (see the export sketch below)
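
Those in-window transactions can be exported from the corrupted database for manual review and possible replay. A hedged sketch, assuming the corrupted cluster is still reachable and that customers carries an updated_at column:

# Export rows touched during the corruption window for manual review/replay
psql -h main.customer-portal.production.banking.acme.internal -p 5432 -U postgres -d postgres \
  -c "\copy (SELECT * FROM customers WHERE updated_at BETWEEN '2025-02-03 14:30:00+00' AND '2025-02-03 14:45:00+00') TO 'corruption-window.csv' CSV HEADER"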

Rename for clarity:

# Rename the corrupted database manifest for preservation
apiVersion: lowops.manifests.v1
kind: Database
metadata:
  name: main-corrupted-inc-2025-047 # ← Renamed
  project: customer-portal
  environment: production
  organizational-unit: banking
  organization: acme-corp
spec:
  description: |
    CORRUPTED DATABASE - PRESERVED FOR FORENSIC ANALYSIS

    DO NOT DELETE WITHOUT SECURITY TEAM APPROVAL

    Incident: INC-2025-047
    Corruption Time: 2025-02-03 14:30:00 UTC
    Preserved for: Forensic analysis, transaction recovery

    Contact: security@acme-corp.com
    Ticket: INC-2025-047

  cpuCount: 2 # Downsize to save costs while preserved
  clusterSize: 1

  configuration:
    maxConnections: 50 # Minimal connections for analysis only

  backupConfig:
    quantityBasedRetention: 7 # Minimal backup retention

Step 8: Post-Recovery Cleanup (After 30 Days)

Once incident analysis is complete and any needed data is recovered:

1. Rename Recovered Database to Primary

# Delete old corrupted database
git rm database/main-corrupted-inc-2025-047.yaml

# Rename recovered database manifest
mv database/main-recovered.yaml database/main.yaml

# Update metadata
sed -i 's/name: main-recovered/name: main/g' database/main.yaml

# Update description
vim database/main.yaml
# Remove incident details, restore normal description

# Commit
git commit -m "Post-incident cleanup: Promote recovered DB to primary

Incident INC-2025-047 resolved.
Corrupted database deleted after 30-day preservation.
Recovered database promoted to production primary."

2. Update Application Config

# Revert to standard database name
vim deployment-config/customer-portal-api.yaml
# Change: main-recovered → main

# Remove incident tracking
# Delete: RECOVERY_INCIDENT env var

Alternative Scenario: Ransomware Attack

Two-Phase Recovery for Ransomware

Phase 1: Restore via Console/CLI

gcloud alloydb clusters restore main-clean \
  --source-cluster=main \
  --point-in-time=2025-02-03T02:44:00Z \
  --region=us-central1 \
  --project=customer-portal-prod

Phase 2: Import via Infrastream Manifest

apiVersion: lowops.manifests.v1
kind: Database
metadata:
  name: main-clean # Matches cluster name from Phase 1
  project: customer-portal
  environment: production
  organizational-unit: banking
  organization: acme-corp
spec:
  description: |
    SECURITY INCIDENT RESPONSE

    Incident: SEC-2025-089
    Attack Type: Ransomware (database encryption detected)
    Attack Start: 2025-02-03 02:45:00 UTC (per security logs)
    Recovery Point: 2025-02-03 02:44:00 UTC (before attack)

    Restored via: gcloud alloydb PITR (Phase 1)
    This manifest imports the clean, pre-attack database.

    Incident Commander: security-lead@acme-corp.com
    CISO Approval: ciso@acme-corp.com

    Compromised database preserved for forensic analysis

  cpuCount: 8
  clusterSize: 3

  configuration:
    maxConnections: 1000
    googleDbAdvisor: true

  backupConfig:
    quantityBasedRetention: 90 # Extended retention for security incident

Recovery Time Objectives (RTO)

Database Size | Provisioning Time | Validation Time | Total RTO
--------------|-------------------|-----------------|----------
< 10GB        | 5 minutes         | 5 minutes       | 10 min
10-100GB      | 7 minutes         | 10 minutes      | 17 min
100-500GB     | 15 minutes        | 15 minutes      | 30 min
> 500GB       | 30 minutes        | 20 minutes      | 50 min

Real-World Example: A healthcare company recovered from a ransomware attack in 18 minutes (10GB database, pre-attack timestamp).


Best Practices

1. Document Recovery Timestamp Selection

Always explain WHY you chose a specific timestamp:

spec:
  description: |
    Recovery Timestamp: 2025-02-03 14:29:00 UTC

    Rationale:
    - Corruption detected in audit logs at 14:30:12 UTC
    - Last known-good transaction timestamp: 14:29:47 UTC
    - Recovery set to 14:29:00 UTC for 1-minute safety buffer
    - Estimated data loss: 0-60 seconds of transactions
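
The last known-good timestamp can usually be pulled from an audit trail before you pick the recovery point. A sketch, assuming a hypothetical audit_log table with a logged_at column; adapt it to whatever audit tooling you actually run:

# Find the last transaction recorded before the corruption was detected
# (audit_log is a hypothetical table name)
psql -h main.customer-portal.production.banking.acme.internal -p 5432 -U postgres -d postgres -tA \
  -c "SELECT MAX(logged_at) FROM audit_log WHERE logged_at < TIMESTAMPTZ '2025-02-03 14:30:12+00';"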

2. Validate Before Cutover

Never cut over to a recovered database without validation:

## Validation Checklist - INC-2025-047

- [x] Database provisioned successfully
- [x] Connection established
- [x] Record counts verified (1,249,837 customers)
- [x] Sample queries return expected data
- [x] No NULL emails (corruption indicator)
- [x] Referential integrity intact
- [x] Latest timestamp: 2025-02-03 14:29:42 UTC ✓
- [x] QA team approval
- [x] CTO verbal approval

Validated By: ops-lead@acme-corp.com
Validation Time: 2025-02-03 15:02 UTC

3. Communicate During Recovery

Use the Git PR as the incident timeline:

PR #5678: EMERGENCY DATABASE RECOVERY - INC-2025-047

Timeline:
- 14:45: Corruption discovered
- 14:46: Incident declared, response team assembled
- 14:48: Recovery manifest created (14:29 UTC timestamp)
- 14:50: PR submitted and emergency-approved
- 14:52: Database provisioning started
- 15:02: Database ready, validation PASSED
- 15:05: Application cutover complete
- 15:07: Service fully restored

Data Loss: ~30 seconds of transactions (14:29:00 - 14:29:30)
Recovery Time: 22 minutes (discovery to full restoration)