Database Point-in-Time Recovery (PITR) Example

This example demonstrates Infrastream's "panic-proof" database recovery capabilities, implementing the Security Resilience and Recovery use case.

Scenario: Data Corruption Recovery

Timeline:

  • 14:25 UTC: Production database running normally
  • 14:30 UTC: A developer accidentally runs an UPDATE statement without a WHERE clause, corrupting customer records
  • 14:45 UTC: Corruption discovered by QA team
  • 14:46 UTC: Incident response initiated

Objective: Restore database to 14:29 UTC (one minute before corruption) without disrupting production service.


Why Not "Restore in Place"?

Infrastream deliberately prevents restoring over a live database to avoid:

  • Overwriting good data that arrived between corruption and discovery
  • No rollback if you restore to the wrong timestamp
  • Panic mistakes during high-stress incidents
  • Loss of forensic evidence (the corrupted database is destroyed)

Instead: create a NEW database from the backup, validate it, then cut your application over.


Step 1: Restore Database via Cloud Console/CLI

[!IMPORTANT] Critical: Infrastream does not perform PITR directly. You must restore via Google Cloud Console or gcloud CLI first, then import the recovered database to Infrastream.

Phase 1: Perform PITR (Outside Infrastream)

Use the gcloud CLI to restore the corrupted database to a specific point in time:

# Navigate to your project
cd organizational-unit/banking/environment/production

# Execute PITR to create a new cluster from backup
gcloud alloydb clusters restore main-recovered \
  --source-cluster=main \
  --source-cluster-region=us-central1 \
  --region=us-central1 \
  --point-in-time=2025-02-03T14:29:00Z \
  --project=customer-portal-prod

# Monitor restoration progress
gcloud alloydb operations list \
  --region=us-central1 \
  --filter="targetId:main-recovered"
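
If you prefer to block until the restore finishes, a small polling loop works. This is a sketch, not an Infrastream feature; it assumes the same region and project as above:

# Poll the restored cluster until it reports READY
until [ "$(gcloud alloydb clusters describe main-recovered \
  --region=us-central1 \
  --project=customer-portal-prod \
  --format='value(state)' 2>/dev/null)" = "READY" ]; do
  echo "Waiting for main-recovered to finish provisioning..."
  sleep 30
done
echo "Cluster main-recovered is READY"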

Alternative: Cloud Console

  1. Navigate to AlloyDB → Clusters → main
  2. Click "Restore" → "Point-in-time recovery"
  3. Timestamp: 2025-02-03T14:29:00Z
  4. New cluster name: main-recovered
  5. Click "Restore"

Wait 5-10 minutes for provisioning to complete.
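
Whichever path you used, you can confirm the cluster is ready before moving on (a quick CLI check, assuming the same region and project as above):

# Confirm the restored cluster reached the READY state
gcloud alloydb clusters describe main-recovered \
  --region=us-central1 \
  --project=customer-portal-prod \
  --format='value(state)'
# Expected output: READY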


Step 2: Import Recovered Database to Infrastream

Phase 2: Create Manifest (After Restoration Completes)

Once the GCP restore operation finishes, create an Infrastream manifest to import and manage the recovered database.

File: organizational-unit/banking/environment/production/project/customer-portal/database/main-recovered.yaml

apiVersion: lowops.manifests.v1
kind: Database
metadata:
  name: main-recovered # MUST match the cluster name from Phase 1
  project: customer-portal
  environment: production
  organizational-unit: banking
  organization: acme-corp
spec:
  description: |
    EMERGENCY RECOVERY DATABASE

    Incident: INC-2025-047
    Issue: Accidental data corruption from UPDATE without WHERE clause
    Corruption Time: 2025-02-03 14:30:00 UTC
    Recovery Point: 2025-02-03 14:29:00 UTC (1 minute before corruption)

    Restored via: gcloud alloydb clusters restore (Phase 1)
    This manifest imports the already-restored cluster into Infrastream.

    Incident Commander: ops-lead@acme-corp.com
    Approved By: cto@acme-corp.com

    DO NOT DELETE - Preserve for post-incident review

  # Match the original database specifications exactly
  cpuCount: 8
  clusterSize: 3 # High availability configuration

  configuration:
    maxConnections: 1000
    googleDbAdvisor: true

  # Backup retention for the recovered database
  backupConfig:
    quantityBasedRetention: 30 # Keep 30 days of backups

Step 3: Submit PR to Import Database

# Navigate to project directory
cd organizational-unit/banking/environment/production/project/customer-portal

# Create the manifest for the ALREADY-RESTORED database
cat > database/main-recovered.yaml << 'EOF'
apiVersion: lowops.manifests.v1
kind: Database
metadata:
  name: main-recovered # Must match restored cluster from Phase 1
  project: customer-portal
  environment: production
  organizational-unit: banking
  organization: acme-corp
spec:
  description: |
    EMERGENCY RECOVERY: Data corruption at 14:30 UTC
    Incident: INC-2025-047

    Restored via: gcloud PITR at 2025-02-03T14:29:00Z
    This manifest imports the recovered database.

  cpuCount: 8
  clusterSize: 3
  configuration:
    maxConnections: 1000
    googleDbAdvisor: true
  backupConfig:
    quantityBasedRetention: 30
EOF
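
# Optional sanity check: confirm the manifest parses as valid YAML
# (a hedged extra step; assumes python3 with PyYAML is available)
python3 -c 'import yaml; yaml.safe_load(open("database/main-recovered.yaml"))' && echo "Manifest OK"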

# Commit with detailed message
git add database/main-recovered.yaml
git commit -m "EMERGENCY: Provision recovered database from 14:29 UTC

Incident: INC-2025-047
Issue: Data corruption from accidental UPDATE without WHERE clause
Corrupted Database: main (DO NOT DELETE - preserved for forensics)
Recovery Database: main-recovered
Recovery Timestamp: 2025-02-03T14:29:00Z (1 min before corruption)

Incident Commander: ops-lead@acme-corp.com
Approved By: cto@acme-corp.com (verbal approval, ticket #INC-2025-047)"

# Push to remote
git push origin emergency/database-recovery-inc-2025-047

# Create PR (use emergency change process for fast-track approval)
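
If the repository is hosted on GitHub, the PR can be opened from the same shell. A sketch using the GitHub CLI; it assumes gh is installed and authenticated, and that an "emergency" label exists in your repo:

# Open the emergency PR with the incident ID front and center
gh pr create \
  --title "EMERGENCY DATABASE RECOVERY - INC-2025-047" \
  --body "PITR import of main-recovered (recovery point 14:29 UTC). Details in commit message." \
  --label emergency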

Step 4: Infrastream Imports Database

Once PR is merged, Infrastream:

  1. Discovers the existing cluster main-recovered in GCP (restored in Step 1)
  2. Imports it into Infrastream management
  3. Configures networking and IAM permissions
  4. Enables monitoring and logging
  5. Manages backups going forward

Import Time: Typically 2-5 minutes after merge.

Monitor Progress: Infrastream Portal → Projects → customer-portal → Databases → main-recovered

Step 5: Validate Recovered Data

CRITICAL: Before cutover, validate the recovered database contains correct data and NO corruption.

Connection String

Infrastream provides the connection details via automatic DNS:

main-recovered.customer-portal.production.banking.acme.internal:5432
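
Before running the full validation, it is worth confirming the record resolves and the port answers (a quick sanity check; assumes standard Linux tooling with netcat installed):

# Confirm the Infrastream-provided DNS record resolves
getent hosts main-recovered.customer-portal.production.banking.acme.internal

# Confirm Postgres answers on port 5432
nc -zv main-recovered.customer-portal.production.banking.acme.internal 5432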

Validation Queries

Connect from a shell:

psql "host=main-recovered.customer-portal.production.banking.acme.internal \
      port=5432 \
      dbname=postgres \
      user=postgres \
      sslmode=require"

Then run the validation queries:

-- 1. Verify record count matches expectations
SELECT COUNT(*) FROM customers;
-- Expected: ~1,250,000 (pre-corruption count)

-- 2. Verify no corrupted records exist
SELECT COUNT(*) FROM customers WHERE email IS NULL;
-- Expected: 0 (corruption set emails to NULL)

-- 3. Verify critical customer data integrity
SELECT customer_id, email, created_at
FROM customers
WHERE customer_id IN (12345, 67890, 11111)
LIMIT 10;
-- Verify known good customers have correct data

-- 4. Check timestamp of latest record
SELECT MAX(updated_at) FROM customers;
-- Should be <= 2025-02-03 14:29:00 UTC

-- 5. Verify referential integrity
SELECT COUNT(*)
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;
-- Expected: 0 (no orphaned orders)
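
The same checks can be scripted so they run consistently and fail loudly mid-incident. A sketch, assuming psql is installed and credentials come from PGPASSWORD or ~/.pgpass:

#!/usr/bin/env bash
# Automated validation for the recovered database; exits non-zero on any failure
set -euo pipefail
export PGSSLMODE=require

HOST=main-recovered.customer-portal.production.banking.acme.internal
PSQL="psql -h $HOST -p 5432 -U postgres -d postgres -tA"

# Corruption indicator: the bad UPDATE set emails to NULL
nulls=$($PSQL -c "SELECT COUNT(*) FROM customers WHERE email IS NULL;")
[ "$nulls" -eq 0 ] || { echo "FAIL: $nulls NULL emails"; exit 1; }

# Orphaned orders would mean broken referential integrity
orphans=$($PSQL -c "SELECT COUNT(*) FROM orders o LEFT JOIN customers c ON o.customer_id = c.customer_id WHERE c.customer_id IS NULL;")
[ "$orphans" -eq 0 ] || { echo "FAIL: $orphans orphaned orders"; exit 1; }

# Latest record must predate the corruption window
$PSQL -c "SELECT MAX(updated_at) <= TIMESTAMPTZ '2025-02-03 14:29:00+00' FROM customers;" | grep -q '^t$' \
  || { echo "FAIL: records newer than recovery point"; exit 1; }

echo "PASS: all validation checks succeeded"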

Validation Checklist

  • Record counts match expectations
  • No corrupted/NULL values in critical fields
  • Known test records have correct data
  • Latest timestamp is before corruption time
  • Referential integrity maintained
  • Sample queries return expected results

Step 6: Application Cutover

Once validation is complete, update the application to use the recovered database.

File: deployment-config/customer-portal-api.yaml

Before (Using Corrupted Database)

apiVersion: lowops.manifests.v1
kind: DeploymentConfig
metadata:
  name: customer-portal-api
  organization: acme-corp
  organizational-unit: banking
  environment: production
  project: customer-portal
spec:
  description: Customer portal API service
  version: "1.0.0"

  container:
    env:
      - name: DATABASE_HOST
        value: main.customer-portal.production.banking.acme.internal
      - name: DATABASE_NAME
        value: postgres
      - name: LOG_LEVEL
        value: info

After (Using Recovered Database)

apiVersion: lowops.manifests.v1
kind: DeploymentConfig
metadata:
  name: customer-portal-api
  organization: acme-corp
  organizational-unit: banking
  environment: production
  project: customer-portal
spec:
  description: Customer portal API service
  version: "1.0.0"

  container:
    env:
      - name: DATABASE_HOST
        value: main-recovered.customer-portal.production.banking.acme.internal # ← CHANGED
      - name: DATABASE_NAME
        value: postgres
      - name: LOG_LEVEL
        value: info
      # Track recovery incident
      - name: RECOVERY_INCIDENT
        value: INC-2025-047

Cutover Process

# Update deployment config
vim deployment-config/customer-portal-api.yaml

# Commit the change
git add deployment-config/customer-portal-api.yaml
git commit -m "EMERGENCY: Cut over to recovered database

Incident: INC-2025-047
Validation: PASSED (see incident notes)
Cutover Time: 2025-02-03 15:05 UTC
Approved By: cto@acme-corp.com"

# Push and merge
git push origin emergency/database-recovery-inc-2025-047

Result: Application immediately reconnects to the clean database. Service restored within 2-5 minutes of merge.
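
One way to confirm the cutover took effect is to watch application connections arrive on the recovered database (a sketch using Postgres's pg_stat_activity view; assumes the same psql access as during validation):

# Count live connections on the recovered database
psql -h main-recovered.customer-portal.production.banking.acme.internal -p 5432 -U postgres -d postgres -tA \
  -c "SELECT COUNT(*) FROM pg_stat_activity WHERE datname = 'postgres';"
# A steadily non-zero count indicates the API has reconnected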


Step 7: Preserve Corrupted Database for Forensics

DO NOT DELETE the corrupted database immediately. Preserve it for:

  • Post-incident analysis
  • Security investigation
  • Data recovery of any transactions between 14:30 and 14:45 UTC (see the export sketch below)
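
Those in-window transactions can be exported from the corrupted database for manual review and possible replay. A hedged sketch, assuming the corrupted cluster is still reachable and that customers carries an updated_at column:

# Export rows touched during the corruption window for manual review/replay
psql -h main.customer-portal.production.banking.acme.internal -p 5432 -U postgres -d postgres \
  -c "\copy (SELECT * FROM customers WHERE updated_at BETWEEN '2025-02-03 14:30:00+00' AND '2025-02-03 14:45:00+00') TO 'corruption-window.csv' CSV HEADER"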

Rename for clarity:

# Rename the corrupted database manifest for preservation
apiVersion: lowops.manifests.v1
kind: Database
metadata:
  name: main-corrupted-inc-2025-047 # ← Renamed
  project: customer-portal
  environment: production
  organizational-unit: banking
  organization: acme-corp
spec:
  description: |
    CORRUPTED DATABASE - PRESERVED FOR FORENSIC ANALYSIS

    DO NOT DELETE WITHOUT SECURITY TEAM APPROVAL

    Incident: INC-2025-047
    Corruption Time: 2025-02-03 14:30:00 UTC
    Preserved for: Forensic analysis, transaction recovery

    Contact: security@acme-corp.com
    Ticket: INC-2025-047

  cpuCount: 2 # Downsize to save costs while preserved
  clusterSize: 1

  configuration:
    maxConnections: 50 # Minimal connections for analysis only

  backupConfig:
    quantityBasedRetention: 7 # Minimal backup retention

Step 8: Post-Recovery Cleanup (After 30 Days)

Once incident analysis is complete and any needed data is recovered:

1. Rename Recovered Database to Primary

# Delete old corrupted database
git rm database/main-corrupted-inc-2025-047.yaml

# Rename recovered database manifest
mv database/main-recovered.yaml database/main.yaml

# Update metadata
sed -i 's/name: main-recovered/name: main/g' database/main.yaml

# Update description
vim database/main.yaml
# Remove incident details, restore normal description

# Commit
git commit -m "Post-incident cleanup: Promote recovered DB to primary

Incident INC-2025-047 resolved.
Corrupted database deleted after 30-day preservation.
Recovered database promoted to production primary."

2. Update Application Config

# Revert to standard database name
vim deployment-config/customer-portal-api.yaml
# Change: main-recovered → main

# Remove incident tracking
# Delete: RECOVERY_INCIDENT env var

Alternative Scenario: Ransomware Attack

Two-Phase Recovery for Ransomware

Phase 1: Restore via Console/CLI

gcloud alloydb clusters restore main-clean \
  --source-cluster=main \
  --point-in-time=2025-02-03T02:44:00Z \
  --region=us-central1 \
  --project=customer-portal-prod

Phase 2: Import via Infrastream Manifest

apiVersion: lowops.manifests.v1
kind: Database
metadata:
  name: main-clean # Matches cluster name from Phase 1
  project: customer-portal
  environment: production
  organizational-unit: banking
  organization: acme-corp
spec:
  description: |
    SECURITY INCIDENT RESPONSE

    Incident: SEC-2025-089
    Attack Type: Ransomware (database encryption detected)
    Attack Start: 2025-02-03 02:45:00 UTC (per security logs)
    Recovery Point: 2025-02-03 02:44:00 UTC (before attack)

    Restored via: gcloud alloydb PITR (Phase 1)
    This manifest imports the clean, pre-attack database.

    Incident Commander: security-lead@acme-corp.com
    CISO Approval: ciso@acme-corp.com

    Compromised database preserved for forensic analysis

  cpuCount: 8
  clusterSize: 3

  configuration:
    maxConnections: 1000
    googleDbAdvisor: true

  backupConfig:
    quantityBasedRetention: 90 # Extended retention for security incident

Recovery Time Objectives (RTO)

Database Size | Provisioning Time | Validation Time | Total RTO
--------------|-------------------|-----------------|----------
< 10GB        | 5 minutes         | 5 minutes       | 10 min
10-100GB      | 7 minutes         | 10 minutes      | 17 min
100-500GB     | 15 minutes        | 15 minutes      | 30 min
> 500GB       | 30 minutes        | 20 minutes      | 50 min

Real-World Example: A healthcare company recovered from a ransomware attack in 18 minutes (10GB database, pre-attack timestamp).


Best Practices

1. Document Recovery Timestamp Selection

Always explain WHY you chose a specific timestamp:

spec:
  description: |
    Recovery Timestamp: 2025-02-03 14:29:00 UTC

    Rationale:
    - Corruption detected in audit logs at 14:30:12 UTC
    - Last known-good transaction timestamp: 14:29:47 UTC
    - Recovery set to 14:29:00 UTC for 1-minute safety buffer
    - Estimated data loss: 0-60 seconds of transactions
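
The last known-good timestamp can usually be pulled from an audit trail before you pick the recovery point. A sketch, assuming a hypothetical audit_log table with a logged_at column; adapt it to whatever audit tooling you actually run:

# Find the last transaction recorded before the corruption was detected
# (audit_log is a hypothetical table name)
psql -h main.customer-portal.production.banking.acme.internal -p 5432 -U postgres -d postgres -tA \
  -c "SELECT MAX(logged_at) FROM audit_log WHERE logged_at < TIMESTAMPTZ '2025-02-03 14:30:12+00';"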

2. Validate Before Cutover

Never cut over to a recovered database without validation:

## Validation Checklist - INC-2025-047

- [x] Database provisioned successfully
- [x] Connection established
- [x] Record counts verified (1,249,837 customers)
- [x] Sample queries return expected data
- [x] No NULL emails (corruption indicator)
- [x] Referential integrity intact
- [x] Latest timestamp: 2025-02-03 14:29:42 UTC ✓
- [x] QA team approval
- [x] CTO verbal approval

Validated By: ops-lead@acme-corp.com
Validation Time: 2025-02-03 15:02 UTC

3. Communicate During Recovery

Use the Git PR as the incident timeline:

PR #5678: EMERGENCY DATABASE RECOVERY - INC-2025-047

Timeline:
- 14:45: Corruption discovered
- 14:46: Incident declared, response team assembled
- 14:48: Recovery manifest created (14:29 UTC timestamp)
- 14:50: PR submitted and emergency-approved
- 14:52: Database provisioning started
- 15:02: Database ready, validation PASSED
- 15:05: Application cutover complete
- 15:07: Service fully restored

Data Loss: ~30 seconds of transactions (14:29:00 - 14:29:30)
Recovery Time: 22 minutes (discovery to full restoration)