Database Point-in-Time Recovery (PITR) Example
This example demonstrates Infrastream's "panic-proof" database recovery capabilities, implementing the Security Resilience and Recovery use case.
Scenario: Data Corruption Recovery
Timeline:
- 14:25 UTC: Production database running normally
- 14:30 UTC: Developer accidentally runs UPDATE statement without WHERE clause, corrupting customer records
- 14:45 UTC: Corruption discovered by QA team
- 14:46 UTC: Incident response initiated
Objective: Restore database to 14:29 UTC (one minute before corruption) without disrupting production service.
Why Not "Restore in Place"?
Infrastream deliberately prevents restoring over a live database to avoid:
❌ Overwriting good data that arrived between corruption and discovery
❌ Losing any rollback path if you restore to the wrong timestamp
❌ Panic-driven mistakes during a high-stress incident
❌ Destroying forensic evidence (the corrupted database would be gone)
✅ Instead: Create a NEW database from the backup, validate it, then cut over your application
Step 1: Restore Database via Cloud Console/CLI
[!IMPORTANT] Critical: Infrastream does not perform PITR directly. You must restore via the Google Cloud Console or the gcloud CLI first, then import the recovered database into Infrastream.
Phase 1: Perform PITR (Outside Infrastream)
Use the gcloud CLI to restore the corrupted database to a specific point in time:
# Navigate to your project
cd organizational-unit/banking/environment/production
# Execute PITR to create new cluster from backup
gcloud alloydb clusters restore main-recovered \
--source-cluster=main \
--source-cluster-region=us-central1 \
--region=us-central1 \
--point-in-time=2025-02-03T14:29:00Z \
--project=customer-portal-prod
# Monitor restoration progress
gcloud alloydb operations list --filter="targetId:main-recovered"
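If you want the shell to block until the restore finishes rather than re-running the list command, you can poll the operation. A minimal sketch, assuming you copy the operation ID from the list output above (the ID shown here is a placeholder); the loop reads the standard long-running-operation done flag:
# Placeholder: copy the real operation ID from the list output above
OPERATION="operation-restore-main-recovered-placeholder"
# Poll the long-running operation until it reports done
until [ "$(gcloud alloydb operations describe "$OPERATION" \
  --region=us-central1 \
  --project=customer-portal-prod \
  --format='value(done)')" = "True" ]; do
  echo "Restore still running..."
  sleep 30
done
echo "Restore complete."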
Alternative: Cloud Console
- Navigate to AlloyDB → Clusters → main
- Click "Restore" → "Point-in-time recovery"
- Timestamp: 2025-02-03T14:29:00Z
- New cluster name: main-recovered
- Click "Restore"
Wait 5-10 minutes for provisioning to complete.
Step 2: Import Recovered Database to Infrastream
Phase 2: Create Manifest (After Restoration Completes)
Once the GCP restore operation finishes, create an Infrastream manifest to import and manage the recovered database.
File: organizational-unit/banking/environment/production/project/customer-portal/database/main-recovered.yaml
apiVersion: lowops.manifests.v1
kind: Database
metadata:
name: main-recovered # MUST match the cluster name from Phase 1
project: customer-portal
environment: production
organizational-unit: banking
organization: acme-corp
spec:
description: |
EMERGENCY RECOVERY DATABASE
Incident: INC-2025-047
Issue: Accidental data corruption from UPDATE without WHERE clause
Corruption Time: 2025-02-03 14:30:00 UTC
Recovery Point: 2025-02-03 14:29:00 UTC (1 minute before corruption)
Restored via: gcloud alloydb clusters restore (Phase 1)
This manifest imports the already-restored cluster into Infrastream.
Incident Commander: ops-lead@acme-corp.com
Approved By: cto@acme-corp.com
DO NOT DELETE - Preserve for post-incident review
# Match the original database specifications exactly
cpuCount: 8
clusterSize: 3 # High availability configuration
configuration:
maxConnections: 1000
googleDbAdvisor: true
# Backup retention for the recovered database
backupConfig:
quantityBasedRetention: 30 # Keep 30 days of backups
Step 3: Submit PR to Import Database
# Navigate to project directory
cd organizational-unit/banking/environment/production/project/customer-portal
# Create the manifest for the ALREADY-RESTORED database
cat > database/main-recovered.yaml << 'EOF'
apiVersion: lowops.manifests.v1
kind: Database
metadata:
name: main-recovered # Must match restored cluster from Phase 1
project: customer-portal
environment: production
organizational-unit: banking
organization: acme-corp
spec:
description: |
EMERGENCY RECOVERY: Data corruption at 14:30 UTC
Incident: INC-2025-047
Restored via: gcloud PITR at 2025-02-03T14:29:00Z
This manifest imports the recovered database.
cpuCount: 8
clusterSize: 3
configuration:
maxConnections: 1000
googleDbAdvisor: true
backupConfig:
quantityBasedRetention: 30
EOF
# Commit with detailed message
git add database/main-recovered.yaml
git commit -m "EMERGENCY: Provision recovered database from 14:29 UTC
Incident: INC-2025-047
Issue: Data corruption from accidental UPDATE without WHERE clause
Corrupted Database: main (DO NOT DELETE - preserved for forensics)
Recovery Database: main-recovered
Recovery Timestamp: 2025-02-03T14:29:00Z (1 min before corruption)
Incident Commander: ops-lead@acme-corp.com
Approved By: cto@acme-corp.com (verbal approval, ticket #INC-2025-047)"
# Push to remote
git push origin emergency/database-recovery-inc-2025-047
# Create PR (use emergency change process for fast-track approval)
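If your team uses the GitHub CLI, the emergency PR can be opened straight from the branch; a sketch (gh, the label name, and the base branch are assumptions about your tooling, not Infrastream requirements):
# Open the emergency PR from the current branch (GitHub CLI assumed)
gh pr create \
  --base main \
  --title "EMERGENCY DATABASE RECOVERY - INC-2025-047" \
  --body "Import PITR-restored cluster main-recovered (recovery point 2025-02-03T14:29:00Z). See incident INC-2025-047." \
  --label emergency-change   # label name is illustrative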
Step 4: Infrastream Imports Database
Once the PR is merged, Infrastream:
- Discovers the existing cluster main-recovered in GCP (restored in Step 1)
- Imports it into Infrastream management
- Configures networking and IAM permissions
- Enables monitoring and logging
- Manages backups going forward
Import Time: Typically 2-5 minutes after merge.
Monitor Progress: Infrastream Portal → Projects → customer-portal → Databases → main-recovered
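You can also confirm from the CLI that the underlying cluster is healthy before moving on to validation; a quick check, using the same region and project as Phase 1:
# Confirm the restored cluster is in the READY state
gcloud alloydb clusters describe main-recovered \
  --region=us-central1 \
  --project=customer-portal-prod \
  --format='value(state)'
# Expected output: READY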
Step 5: Validate Recovered Data
CRITICAL: Before cutover, validate the recovered database contains correct data and NO corruption.
Connection String
Infrastream provides the connection details via automatic DNS:
main-recovered.customer-portal.production.banking.acme.internal:5432
Validation Queries
-- Connect to recovered database
psql "host=main-recovered.customer-portal.production.banking.acme.internal \
port=5432 \
dbname=postgres \
user=postgres \
sslmode=require"
-- 1. Verify record count matches expectations
SELECT COUNT(*) FROM customers;
-- Expected: ~1,250,000 (pre-corruption count)
-- 2. Verify no corrupted records exist
SELECT COUNT(*) FROM customers WHERE email IS NULL;
-- Expected: 0 (corruption set emails to NULL)
-- 3. Verify critical customer data integrity
SELECT customer_id, email, created_at
FROM customers
WHERE customer_id IN (12345, 67890, 11111)
LIMIT 10;
-- Verify known good customers have correct data
-- 4. Check timestamp of latest record
SELECT MAX(updated_at) FROM customers;
-- Should be <= 2025-02-03 14:29:00 UTC
-- 5. Verify referential integrity
SELECT COUNT(*)
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;
-- Expected: 0 (no orphaned orders)
Validation Checklist
- Record counts match expectations
- No corrupted/NULL values in critical fields
- Known test records have correct data
- Latest timestamp is before corruption time
- Referential integrity maintained
- Sample queries return expected results
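To make re-running these checks repeatable (and to capture evidence for the incident record), the queries above can be scripted; a minimal sketch using psql, with the expected values from this scenario hard-coded as placeholders (authentication and sslmode options are omitted for brevity):
#!/usr/bin/env bash
# Run the core validation queries against the recovered database.
# Expected values are placeholders from this scenario; adjust for your data.
set -euo pipefail

HOST="main-recovered.customer-portal.production.banking.acme.internal"
run_sql() { psql -h "$HOST" -p 5432 -U postgres -d postgres -At -c "$1"; }

null_emails=$(run_sql "SELECT COUNT(*) FROM customers WHERE email IS NULL;")
orphans=$(run_sql "SELECT COUNT(*) FROM orders o LEFT JOIN customers c ON o.customer_id = c.customer_id WHERE c.customer_id IS NULL;")
latest=$(run_sql "SELECT MAX(updated_at) FROM customers;")

echo "NULL emails:     $null_emails (expected 0)"
echo "Orphaned orders: $orphans (expected 0)"
echo "Latest record:   $latest (must be <= 2025-02-03 14:29:00 UTC)"

if [ "$null_emails" -eq 0 ] && [ "$orphans" -eq 0 ]; then
  echo "Automated checks PASSED - proceed with manual spot checks"
else
  echo "Automated checks FAILED - do NOT cut over"
fi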
Step 6: Application Cutover
Once validation is complete, update the application to use the recovered database.
File: deployment-config/customer-portal-api.yaml
Before (Using Corrupted Database)
apiVersion: lowops.manifests.v1
kind: DeploymentConfig
metadata:
name: customer-portal-api
organization: acme-corp
organizational-unit: banking
environment: production
project: customer-portal
spec:
description: Customer portal API service
version: "1.0.0"
container:
env:
- name: DATABASE_HOST
value: main.customer-portal.production.banking.acme.internal
- name: DATABASE_NAME
value: postgres
- name: LOG_LEVEL
value: info
After (Using Recovered Database)
apiVersion: lowops.manifests.v1
kind: DeploymentConfig
metadata:
name: customer-portal-api
organization: acme-corp
organizational-unit: banking
environment: production
project: customer-portal
spec:
description: Customer portal API service
version: "1.0.0"
container:
env:
- name: DATABASE_HOST
value: main-recovered.customer-portal.production.banking.acme.internal # ← CHANGED
- name: DATABASE_NAME
value: postgres
- name: LOG_LEVEL
value: info
# Track recovery incident
- name: RECOVERY_INCIDENT
value: INC-2025-047
Cutover Process
# Update deployment config
vim deployment-config/customer-portal-api.yaml
# Commit the change
git add deployment-config/customer-portal-api.yaml
git commit -m "EMERGENCY: Cut over to recovered database
Incident: INC-2025-047
Validation: PASSED (see incident notes)
Cutover Time: 2025-02-03 15:05 UTC
Approved By: cto@acme-corp.com"
# Push and merge
git push origin emergency/database-recovery-inc-2025-047
Result: Application immediately reconnects to the clean database. Service restored within 2-5 minutes of merge.
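A quick post-cutover smoke test against the new host, reusing the connection string from Step 5, confirms the application tier can actually reach the recovered database:
# Post-cutover smoke test: the recovered database should answer on its DNS name
psql "host=main-recovered.customer-portal.production.banking.acme.internal \
      port=5432 dbname=postgres user=postgres sslmode=require" \
     -c "SELECT 1;" && echo "Recovered database reachable"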
Step 7: Preserve Corrupted Database for Forensics
DO NOT DELETE the corrupted database immediately. Preserve it for:
- Post-incident analysis
- Security investigation
- Data recovery of any transactions between 14:30-14:45
Rename for clarity:
# Rename the corrupted database manifest for preservation
apiVersion: lowops.manifests.v1
kind: Database
metadata:
name: main-corrupted-inc-2025-047 # ← Renamed
project: customer-portal
environment: production
organizational-unit: banking
organization: acme-corp
spec:
description: |
CORRUPTED DATABASE - PRESERVED FOR FORENSIC ANALYSIS
DO NOT DELETE WITHOUT SECURITY TEAM APPROVAL
Incident: INC-2025-047
Corruption Time: 2025-02-03 14:30:00 UTC
Preserved for: Forensic analysis, transaction recovery
Contact: security@acme-corp.com
Ticket: INC-2025-047
cpuCount: 2 # Downsize to save costs while preserved
clusterSize: 1
configuration:
maxConnections: 50 # Minimal connections for analysis only
backupConfig:
quantityBasedRetention: 7 # Minimal backup retention
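If you need to recover legitimate transactions that landed between corruption and discovery (14:30-14:45 UTC), they can be exported from the preserved database for manual review and selective replay; a sketch, assuming the renamed cluster is reachable via an Infrastream DNS name following the same pattern as above:
# Export customer rows written during the corruption window for manual review
psql "host=main-corrupted-inc-2025-047.customer-portal.production.banking.acme.internal \
      port=5432 dbname=postgres user=postgres sslmode=require" \
     -c "COPY (SELECT * FROM customers
               WHERE updated_at >= '2025-02-03 14:30:00+00'
                 AND updated_at <  '2025-02-03 14:45:00+00')
         TO STDOUT WITH CSV HEADER" \
  > corruption-window-customers.csv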
Step 8: Post-Recovery Cleanup (After 30 Days)
Once incident analysis is complete and any needed data is recovered:
1. Rename Recovered Database to Primary
# Delete old corrupted database
git rm database/main-corrupted-inc-2025-047.yaml
# Rename recovered database manifest
mv database/main-recovered.yaml database/main.yaml
# Update metadata
sed -i 's/name: main-recovered/name: main/g' database/main.yaml
# Update description
vim database/main.yaml
# Remove incident details, restore normal description
# Commit
git commit -m "Post-incident cleanup: Promote recovered DB to primary
Incident INC-2025-047 resolved.
Corrupted database deleted after 30-day preservation.
Recovered database promoted to production primary."
2. Update Application Config
# Revert to standard database name
vim deployment-config/customer-portal-api.yaml
# Change: main-recovered → main
# Remove incident tracking
# Delete: RECOVERY_INCIDENT env var
Alternative Scenario: Ransomware Attack
Two-Phase Recovery for Ransomware
Phase 1: Restore via Console/CLI
gcloud alloydb clusters restore main-clean \
--source-cluster=main \
--point-in-time=2025-02-03T02:44:00Z \
--region=us-central1 \
--project=customer-portal-prod
Phase 2: Import via Infrastream Manifest
apiVersion: lowops.manifests.v1
kind: Database
metadata:
name: main-clean # Matches cluster name from Phase 1
project: customer-portal
environment: production
organizational-unit: banking
organization: acme-corp
spec:
description: |
SECURITY INCIDENT RESPONSE
Incident: SEC-2025-089
Attack Type: Ransomware (database encryption detected)
Attack Start: 2025-02-03 02:45:00 UTC (per security logs)
Recovery Point: 2025-02-03 02:44:00 UTC (before attack)
Restored via: gcloud alloydb PITR (Phase 1)
This manifest imports the clean, pre-attack database.
Incident Commander: security-lead@acme-corp.com
CISO Approval: ciso@acme-corp.com
Compromised database preserved for forensic analysis
cpuCount: 8
clusterSize: 3
configuration:
maxConnections: 1000
googleDbAdvisor: true
backupConfig:
quantityBasedRetention: 90 # Extended retention for security incident
Recovery Time Objectives (RTO)
| Database Size | Provisioning Time | Validation Time | Total RTO |
|---|---|---|---|
| < 10GB | 5 minutes | 5 minutes | 10 min |
| 10-100GB | 7 minutes | 10 minutes | 17 min |
| 100-500GB | 15 minutes | 15 minutes | 30 min |
| > 500GB | 30 minutes | 20 minutes | 50 min |
Real-World Example: A healthcare company recovered from a ransomware attack in 18 minutes (10GB database, pre-attack timestamp).
Best Practices
1. Document Recovery Timestamp Selection
Always explain WHY you chose a specific timestamp:
spec:
description: |
Recovery Timestamp: 2025-02-03 14:29:00 UTC
Rationale:
- Corruption detected in audit logs at 14:30:12 UTC
- Last known-good transaction timestamp: 14:29:47 UTC
- Recovery set to 14:29:00 UTC for 1-minute safety buffer
- Estimated data loss: 0-60 seconds of transactions
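If your application keeps an audit trail, the rationale above can be grounded in a query rather than guesswork; a sketch against the original database, assuming a hypothetical audit_log table with an occurred_at column (adapt to whatever change tracking you actually have):
# Find the last write recorded before the corruption signature at 14:30:12 UTC
# (audit_log and occurred_at are assumed names - adapt to your schema)
psql "host=main.customer-portal.production.banking.acme.internal \
      port=5432 dbname=postgres user=postgres sslmode=require" \
     -c "SELECT MAX(occurred_at) AS last_known_good
         FROM audit_log
         WHERE occurred_at < '2025-02-03 14:30:12+00';"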
2. Validate Before Cutover
Never cut over to a recovered database without validation:
## Validation Checklist - INC-2025-047
- [x] Database provisioned successfully
- [x] Connection established
- [x] Record counts verified (1,249,837 customers)
- [x] Sample queries return expected data
- [x] No NULL emails (corruption indicator)
- [x] Referential integrity intact
- [x] Latest timestamp: 2025-02-03 14:29:42 UTC ✓
- [x] QA team approval
- [x] CTO verbal approval
Validated By: ops-lead@acme-corp.com
Validation Time: 2025-02-03 15:02 UTC
3. Communicate During Recovery
Use the Git PR as the incident timeline:
PR #5678: EMERGENCY DATABASE RECOVERY - INC-2025-047
Timeline:
- 14:45: Corruption discovered
- 14:46: Incident declared, response team assembled
- 14:48: Recovery manifest created (14:29 UTC timestamp)
- 14:50: PR submitted and emergency-approved
- 14:52: Database provisioning started
- 15:02: Database ready, validation PASSED
- 15:05: Application cutover complete
- 15:07: Service fully restored
Data Loss: ~30 seconds of transactions (14:29:00 - 14:29:30)
Recovery Time: 22 minutes (discovery to full restoration)