Lesson 21: Multi-datacenter Setup

After this lesson, you will be able to:

  • Design cross-datacenter replication architecture.
  • Implement cascading replication topology.
  • Handle network latency and failures.
  • Configure disaster recovery for multiple sites.
  • Load balance across geographic locations.

1. Multi-DC Architecture Patterns

1.1. Active-Passive (DR standby)

TEXT
Primary DC (Active):
  ├─ node1 (Leader)
  ├─ node2 (Replica)
  └─ node3 (Replica)
    ↓ Async replication
DR DC (Passive):
  ├─ node4 (Standby)
  └─ node5 (Standby)

Use case: Disaster recovery
RPO: Minutes to hours
RTO: Minutes to hours
Cost: Lower (minimal resources in DR)

1.2. Active-Active (Multi-master)

TEXT
DC1 (Active):
  ├─ node1 (Leader)
  └─ node2 (Replica)
    ↕ Bi-directional logical replication
DC2 (Active):
  ├─ node3 (Leader)
  └─ node4 (Replica)

Use case: Global applications with regional traffic
RPO: Near-zero
RTO: Near-zero
Cost: Higher (full resources in both DCs)
Note: Requires conflict resolution

1.3. Hub-and-Spoke (Cascading)

TEXT
Primary DC (Hub):
  └─ node1 (Leader)
      ├─ node2 (Replica) ← DC1 local
      ├─ node3 (Cascade) → DC2
      └─ node4 (Cascade) → DC3

DC2 (Spoke):
  └─ node5 (Replica) ← receives from node3 (DC1 cascade)

DC3 (Spoke):
  └─ node6 (Replica) ← receives from node4 (DC1 cascade)

Use case: Multiple regional read replicas
RPO: Seconds to minutes
Cost: Medium

2. Cascading Replication Setup

2.1. Architecture

TEXT
DC1 (us-east):
  ├─ pg-us-east-1 (Leader) - 10.1.1.11
  ├─ pg-us-east-2 (Replica) - 10.1.1.12
  └─ pg-us-east-3 (Cascade) - 10.1.1.13
        ↓ WAN replication
DC2 (us-west):
  └─ pg-us-west-1 (Replica) - 10.2.1.11
      ├─ Receives from pg-us-east-3
      └─ pg-us-west-2 (Replica) - 10.2.1.12

2.2. Configure cascading node (DC1)

YAML
# /etc/patroni/patroni.yml on pg-us-east-3 (cascade node)
scope: postgres-cluster
name: pg-us-east-3

restapi:
  listen: 10.1.1.13:8008
  connect_address: 10.1.1.13:8008

etcd:
  hosts: 10.1.1.11:2379,10.1.1.12:2379,10.1.1.13:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.1.1.13:5432
  data_dir: /var/lib/postgresql/18/data
  bin_dir: /usr/lib/postgresql/18/bin

  authentication:
    replication:
      username: replicator
      password: rep_password
    superuser:
      username: postgres
      password: postgres_password

  parameters:
    # Enable cascading replication
    hot_standby: on
    wal_level: replica
    max_wal_senders: 10  # Extra slots for downstream replicas
    max_replication_slots: 10
    hot_standby_feedback: on
    
    # Performance tuning for WAN
    wal_sender_timeout: 60s
    wal_receiver_timeout: 60s

  # Allow downstream replicas to connect
  pg_hba:
    - host replication replicator 10.2.1.0/24 scram-sha-256  # DC2 subnet

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: true  # Can be used as clone source
  nosync: false
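
A quick sanity check after applying this configuration (the config path and member name are taken from the example above; adjust for your environment):

BASH
# Ask Patroni to re-read its configuration and apply pending changes
patronictl -c /etc/patroni/patroni.yml reload postgres-cluster pg-us-east-3 --force

# Confirm the settings that matter for cascading are in effect
sudo -u postgres psql -c "SHOW wal_level;"
sudo -u postgres psql -c "SHOW max_wal_senders;"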

2.3. Configure downstream replica (DC2)

YAML
# /etc/patroni/patroni.yml on pg-us-west-1
scope: postgres-cluster-dc2  # Different scope!
name: pg-us-west-1

restapi:
  listen: 10.2.1.11:8008
  connect_address: 10.2.1.11:8008

etcd:
  # Separate etcd cluster for DC2
  hosts: 10.2.1.11:2379,10.2.1.12:2379,10.2.1.13:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    standby_cluster:
      # Point to DC1 cascade node
      host: 10.1.1.13  # pg-us-east-3
      port: 5432
      primary_slot_name: pg_us_west_1_slot
      create_replica_methods:
        - basebackup

  method: basebackup
  basebackup:
    max-rate: '100M'
    checkpoint: 'fast'
    waldir: /var/lib/postgresql/18/data/pg_wal

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.2.1.11:5432
  data_dir: /var/lib/postgresql/18/data
  bin_dir: /usr/lib/postgresql/18/bin

  authentication:
    replication:
      username: replicator
      password: rep_password
    superuser:
      username: postgres
      password: postgres_password

  parameters:
    hot_standby: on
    wal_level: replica
    max_wal_senders: 5
    max_replication_slots: 5
    hot_standby_feedback: on
    
    # WAN-optimized settings
    wal_sender_timeout: 120s  # Higher for WAN
    wal_receiver_timeout: 120s
    wal_retrieve_retry_interval: 10s

  pg_hba:
    - host replication replicator 10.2.1.0/24 scram-sha-256
    - host all all 10.2.1.0/24 scram-sha-256

tags:
  nofailover: false  # Can become leader in DC2
  noloadbalance: false
  clonefrom: false

2.4. Create replication slot on cascade node

BASH
# On pg-us-east-3 (cascade node)
sudo -u postgres psql -c "
  SELECT pg_create_physical_replication_slot('pg_us_west_1_slot');
"

2.5. Start DC2 replica

BASH
# On pg-us-west-1
sudo systemctl start patroni
sudo systemctl status patroni

# Check replication status
patronictl -c /etc/patroni/patroni.yml list

# Verify it is receiving WAL from the cascade node
# (on a standby, pg_stat_wal_receiver shows the upstream connection)
sudo -u postgres psql -c "
  SELECT status, sender_host, sender_port, slot_name,
         pg_last_wal_receive_lsn() AS receive_lsn,
         pg_last_wal_replay_lsn() AS replay_lsn
  FROM pg_stat_wal_receiver;
"

3. Network Latency Handling

3.1. Measure inter-DC latency

BASH
# Ping test
ping -c 10 10.2.1.11

# TCP latency test
nc -vz 10.2.1.11 5432

# PostgreSQL connection latency
psql "host=10.2.1.11 user=postgres" -c "SELECT now();"

# iPerf bandwidth test
# On DC2:
iperf3 -s
# On DC1:
iperf3 -c 10.2.1.11 -t 30
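
Network numbers alone do not show how far behind the replica actually is; a simple time-based check on the DC2 replica (only meaningful while the primary is processing writes) is:

BASH
# On pg-us-west-1: approximate replay delay behind the primary
sudo -u postgres psql -c "
  SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;
"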

3.2. Optimize for high latency

SQL
-- Increase timeouts for WAN
ALTER SYSTEM SET wal_sender_timeout = '120s';  -- Default 60s
ALTER SYSTEM SET wal_receiver_timeout = '120s';
ALTER SYSTEM SET wal_retrieve_retry_interval = '10s';

-- TCP keepalive settings
ALTER SYSTEM SET tcp_keepalives_idle = 60;
ALTER SYSTEM SET tcp_keepalives_interval = 10;
ALTER SYSTEM SET tcp_keepalives_count = 6;

-- Reload
SELECT pg_reload_conf();
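
On a Patroni-managed cluster these parameters are usually changed through the DCS so every member stays consistent, rather than with ALTER SYSTEM on a single node. A sketch with patronictl (cluster name from this lesson, values illustrative):

BASH
patronictl -c /etc/patroni/patroni.yml edit-config postgres-cluster \
  -s "postgresql.parameters.wal_sender_timeout=120s" \
  -s "postgresql.parameters.wal_receiver_timeout=120s" \
  --force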

3.3. Use WAL compression

SQL
-- Enable compression of full-page writes in WAL
-- (boolean setting since PostgreSQL 9.5; 15+ also accepts lz4/zstd)
ALTER SYSTEM SET wal_compression = on;

-- Can substantially reduce WAL volume, and therefore WAN traffic, for write-heavy workloads
SELECT pg_reload_conf();

3.4. Limit replication bandwidth

YAML
# In patroni.yml
bootstrap:
  method: basebackup
  basebackup:
    max-rate: '50M'  # Limit to 50 MB/s to avoid saturating WAN
    checkpoint: 'fast'

4. Disaster Recovery Scenarios

4.1. DC1 total failure

BASH
# Promote DC2 to primary
# On pg-us-west-1

# Remove the standby_cluster section from DC2's dynamic configuration
# (setting a key to null removes it; alternatively run edit-config without -s
# and delete the block in the editor)
patronictl -c /etc/patroni/patroni.yml edit-config postgres-cluster-dc2 -s standby_cluster=null --force

# With standby_cluster gone, Patroni promotes the standby leader; to force a
# specific member, trigger a failover explicitly
patronictl -c /etc/patroni/patroni.yml failover postgres-cluster-dc2 --candidate pg-us-west-1 --force

# Verify
patronictl -c /etc/patroni/patroni.yml list
# + Cluster: postgres-cluster-dc2 ----+------------+----+-----------+
# | Member         | Host       | Role   | State     | Lag in MB |
# +----------------+------------+--------+-----------+-----------+
# | pg-us-west-1   | 10.2.1.11  | Leader | running   |         0 |
# | pg-us-west-2   | 10.2.1.12  | Replica| streaming |         0 |
# +----------------+------------+--------+-----------+-----------+
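
After the promotion, verify that the new DC2 leader actually accepts writes, i.e. it is no longer in recovery:

BASH
sudo -u postgres psql -h 10.2.1.11 -c "SELECT pg_is_in_recovery();"
# Expected result: f (false) on the promoted leader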

4.2. DC1 recovery after failure

BASH
# When DC1 comes back online, reintegrate it

# Option 1: Make DC1 follow DC2 (temporary)
# Edit patroni.yml on DC1 nodes to add standby_cluster pointing to DC2

# Option 2: Failback to DC1
# Wait for DC2 to be fully synchronized
# Perform planned switchover back to DC1

patronictl switchover postgres-cluster-dc2 \
  --leader pg-us-west-1 \
  --candidate pg-us-east-1 \
  --scheduled 'now'

4.3. Split-brain prevention

YAML
# Patroni relies on the DCS (etcd/consul) for leader election, so split-brain
# prevention ultimately depends on DCS quorum.
# In this lesson each DC runs its own etcd cluster, which keeps the DCs
# independent but means cross-DC failover is an operator decision.
# Stretching a single etcd cluster across DCs is only safe with an odd number
# of members spread over three or more sites, for example:

etcd:
  hosts:
    - 10.1.1.11:2379  # DC1
    - 10.1.1.12:2379
    - 10.2.1.11:2379  # DC2
    - 10.2.1.12:2379
    - 10.3.1.11:2379  # Example witness site (tiebreaker)

Note: For true split-brain prevention, consider:

  1. Odd number of sites (3+ DCs) with witness node.
  2. Fencing mechanisms (STONITH).
  3. Quorum-based decisions.

5. Geographic Load Balancing

5.1. HAProxy with geo-awareness

TEXT
Architecture:

Users (us-east) → HAProxy-DC1 → PG-DC1 (primary)
Users (us-west) → HAProxy-DC2 → PG-DC2 (replica) - read-only
                             ↳ PG-DC1 (primary) - writes
BASH
# /etc/haproxy/haproxy.cfg on HAProxy-DC1 (us-east)
frontend postgres_front
  bind *:5432
  mode tcp
  default_backend postgres_master

backend postgres_master
  mode tcp
  option tcp-check
  tcp-check connect
  tcp-check send-binary 0000000804d2162f  # Full SSLRequest packet (length 8 + code 80877103)
  tcp-check expect binary 4e              # 'N' = server responded, SSL not in use
  server pg-us-east-1 10.1.1.11:5432 check inter 3000
  server pg-us-east-2 10.1.1.12:5432 check inter 3000 backup
BASH
# /etc/haproxy/haproxy.cfg on HAProxy-DC2 (us-west)
frontend postgres_front_read
  bind *:5432
  mode tcp
  default_backend postgres_replicas

frontend postgres_front_write
  bind *:5433
  mode tcp
  default_backend postgres_master_remote

backend postgres_replicas
  # Local read replicas
  mode tcp
  balance roundrobin
  option tcp-check
  server pg-us-west-1 10.2.1.11:5432 check inter 3000
  server pg-us-west-2 10.2.1.12:5432 check inter 3000

backend postgres_master_remote
  # Write to primary in DC1
  mode tcp
  option tcp-check
  server pg-us-east-1 10.1.1.11:5432 check inter 3000
  server pg-us-east-2 10.1.1.12:5432 check inter 3000 backup
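
The TCP checks above only prove that PostgreSQL answers; they cannot tell the primary from a replica. Patroni's REST API exposes role-aware health endpoints that HAProxy (via httpchk against port 8008) or a quick manual test can use, for example:

BASH
# Returns HTTP 200 only on the current leader, 503 otherwise
curl -s -o /dev/null -w '%{http_code}\n' http://10.1.1.11:8008/primary

# Returns HTTP 200 only on a healthy replica
curl -s -o /dev/null -w '%{http_code}\n' http://10.2.1.11:8008/replica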

5.2. DNS-based routing

TEXT
# Use DNS with geo-location
# GeoDNS service (Route53, Cloudflare, etc.)

# US-East users resolve to:
postgres.example.com → 10.1.1.100 (HAProxy-DC1)

# US-West users resolve to:
postgres.example.com → 10.2.1.100 (HAProxy-DC2)

# Configure health checks to failover on DC failure
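
A quick way to verify the geo-routing is to resolve the service name from a client in each region; each should get its local HAProxy address:

BASH
dig +short postgres.example.com
# From a us-east client → 10.1.1.100
# From a us-west client → 10.2.1.100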

5.3. Application-level routing

PYTHON
# Python example with psycopg2
import time

import psycopg2

def get_postgres_endpoint():
    """Pick the PostgreSQL endpoint with the lowest measured connection latency."""
    # Measure connection latency to each DC
    latencies = {}
    for dc, host in [('dc1', '10.1.1.11'), ('dc2', '10.2.1.11')]:
        try:
            start = time.time()
            conn = psycopg2.connect(
                host=host, user='app', password='pass',
                dbname='mydb', connect_timeout=3
            )
            conn.close()
            latencies[dc] = time.time() - start
        except psycopg2.OperationalError:
            latencies[dc] = float('inf')
    
    # Return DC with lowest latency
    best_dc = min(latencies, key=latencies.get)
    return '10.1.1.11' if best_dc == 'dc1' else '10.2.1.11'

# Use it
conn = psycopg2.connect(
    host=get_postgres_endpoint(),
    user='app', password='pass', dbname='mydb'
)

6. Cross-DC Monitoring

6.1. Monitor replication lag

SQL
-- On cascade node (DC1)
SELECT client_addr, application_name,
       state, sync_state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS sending_lag,
       pg_wal_lsn_diff(sent_lsn, write_lsn) AS write_lag,
       pg_wal_lsn_diff(write_lsn, flush_lsn) AS flush_lag,
       pg_wal_lsn_diff(flush_lsn, replay_lsn) AS replay_lag
FROM pg_stat_replication
WHERE application_name LIKE '%west%';
TEXT
# Expected replication lag for cross-DC:
# - Low latency WAN (< 10ms): 0-10 MB lag
# - Medium latency WAN (10-50ms): 10-50 MB lag
# - High latency WAN (> 50ms): 50-200 MB lag
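
From PostgreSQL 10 onwards, pg_stat_replication also reports time-based lag, which is often easier to reason about across a WAN than byte counts:

BASH
sudo -u postgres psql -c "
  SELECT application_name, write_lag, flush_lag, replay_lag
  FROM pg_stat_replication;
"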

6.2. Prometheus exporters

YAML
# prometheus.yml
scrape_configs:
  - job_name: 'postgres-dc1'
    static_configs:
      - targets:
          - '10.1.1.11:9187'
          - '10.1.1.12:9187'
          - '10.1.1.13:9187'
        labels:
          datacenter: 'us-east'

  - job_name: 'postgres-dc2'
    static_configs:
      - targets:
          - '10.2.1.11:9187'
          - '10.2.1.12:9187'
        labels:
          datacenter: 'us-west'

6.3. Alert rules for cross-DC

YAML
# /etc/prometheus/alerts/multi-dc.yml
groups:
  - name: multi-dc
    rules:
      - alert: CrossDCReplicationLag
        expr: |
          pg_replication_lag{datacenter="us-west"} > 100 * 1024 * 1024
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High replication lag to DC2"
          description: "Replication lag to {{ $labels.instance }} is {{ $value | humanize }}B"
      
      - alert: CrossDCReplicationBroken
        expr: |
          pg_replication_status{datacenter="us-west"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Replication to DC2 is broken"
      
      - alert: CrossDCLatency
        expr: |
          probe_duration_seconds{job="blackbox-dc2"} > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High network latency to DC2"

7. Backup Strategy for Multi-DC

7.1. Per-DC backups

BASH
# DC1 backup
pgbackrest --stanza=main --type=full backup

# DC2 backup (can use DC1's backup repo over WAN)
pgbackrest --stanza=main --type=diff backup --repo1-host=10.1.1.11
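
Confirm that both backups are registered in the repository and usable:

BASH
pgbackrest --stanza=main info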

7.2. Geo-replicated backup storage

INI
# pgbackrest.conf
[global]
repo1-type=s3
repo1-s3-bucket=my-postgres-backups-us-east
repo1-s3-region=us-east-1
repo1-s3-endpoint=s3.amazonaws.com

# S3 cross-region replication enabled:
# us-east-1 → us-west-2

7.3. Backup verification

BASH
# Restore test in DR site
pgbackrest --stanza=main \
  --type=time \
  --target="2024-11-25 10:00:00" \
  restore \
  --repo1-host=backup-server \
  --pg1-path=/var/lib/postgresql/18/restore_test
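
Recent pgBackRest releases can also validate repository integrity without performing a full restore (assuming a version that includes the verify command):

BASH
pgbackrest --stanza=main verify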

8. Best Practices

✅ DO

  1. Use cascading replication: Reduces load on primary.
  2. Separate etcd clusters: Per-DC for independence.
  3. Monitor replication lag: Alert on high lag.
  4. Test failover regularly: Quarterly DR drills.
  5. Use replication slots: Prevent WAL deletion.
  6. Compress WAL: Reduce WAN bandwidth.
  7. Limit base backup rate: Avoid WAN saturation.
  8. Implement geo-routing: Reduce latency for users.
  9. Document topology: Clear architecture diagrams.
  10. Automate failover: But with human approval for DR.

❌ DON'T

  1. Don't use sync replication cross-DC: Too slow.
  2. Don't share etcd across WAN: Split-brain risk.
  3. Don't ignore network latency: Tune timeouts.
  4. Don't forget about WAL retention: Use slots.
  5. Don't skip DR testing: Must validate regularly.
  6. Don't use single DC for backups: Geo-replicate.
  7. Don't over-complicate: Start simple, add complexity as needed.

9. Lab Exercises

Lab 1: Setup cascading replication

Tasks:

  1. Configure cascade node in DC1.
  2. Setup downstream replica in DC2.
  3. Create replication slot.
  4. Verify replication lag.
  5. Monitor with Prometheus.

Lab 2: Test DR failover

Tasks:

  1. Simulate DC1 failure (stop all nodes).
  2. Promote DC2 to primary.
  3. Verify application connectivity.
  4. Document RTO/RPO.
  5. Plan failback procedure.

Lab 3: Geo-aware load balancing

Tasks:

  1. Setup HAProxy in each DC.
  2. Configure geo-based routing.
  3. Test read/write routing.
  4. Measure latency improvement.
  5. Implement health checks.

Lab 4: Cross-DC monitoring

Tasks:

  1. Configure Prometheus multi-DC scraping.
  2. Create Grafana dashboard with DC labels.
  3. Setup alert rules for cross-DC lag.
  4. Test alerting on simulated failure.
  5. Document runbook for alerts.

10. Advanced Topics

10.1. Three-datacenter setup

TEXT
DC1 (us-east):
  └─ pg1 (Leader)
      ├─ pg2 (Replica)
      └─ pg3 (Cascade) → DC2

DC2 (us-west):
  └─ pg4 (Replica from pg3)
      ├─ pg5 (Replica)
      └─ pg6 (Cascade) → DC3

DC3 (eu-central):
  └─ pg7 (Replica from pg6)
      └─ pg8 (Replica)

Use case: Global application with regional reads

10.2. Active-active with logical replication

SQL
-- DC1 publication
CREATE PUBLICATION dc1_pub FOR ALL TABLES;

-- DC2 subscription
CREATE SUBSCRIPTION dc2_sub
CONNECTION 'host=10.1.1.11 dbname=mydb user=replicator'
PUBLICATION dc1_pub
WITH (copy_data = true);

-- DC2 publication (for bi-directional)
CREATE PUBLICATION dc2_pub FOR ALL TABLES;

-- DC1 subscription
CREATE SUBSCRIPTION dc1_sub
CONNECTION 'host=10.2.1.11 dbname=mydb user=replicator'
PUBLICATION dc2_pub
WITH (copy_data = false);  -- Already have data

-- Conflict resolution required!
-- See: https://www.postgresql.org/docs/current/logical-replication-conflicts.html
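
Two practical caveats for this bi-directional sketch: with FOR ALL TABLES publications on both sides, changes can loop back to their origin unless they are filtered (PostgreSQL 16+ offers the subscription option origin = none for this; older releases need table-level filtering or an extension), and subscription health should be checked on each side:

BASH
# Run on each DC to confirm the apply worker is alive and receiving changes
sudo -u postgres psql -d mydb -c "
  SELECT subname, received_lsn, latest_end_time
  FROM pg_stat_subscription;
"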

10.3. Quorum-based commit

YAML
# For strong consistency across DCs
postgresql:
  parameters:
    synchronous_standby_names: 'ANY 2 (pg-us-east-2, pg-us-west-1, pg-eu-central-1)'
    synchronous_commit: 'remote_apply'

# Requires 2 of 3 DCs to acknowledge commit
# Provides strong durability but higher latency
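
To see which standbys currently satisfy the quorum, check sync_state on the primary; members matched by the ANY clause report as quorum:

BASH
sudo -u postgres psql -c "
  SELECT application_name, sync_state FROM pg_stat_replication;
"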

11. Summary

Multi-DC Strategies

Pattern             | RPO       | RTO       | Complexity | Cost
--------------------+-----------+-----------+------------+-------
Active-Passive (DR) | Minutes   | Minutes   | Low        | Low
Cascading Replicas  | Seconds   | Seconds   | Medium     | Medium
Active-Active       | Near-zero | Near-zero | High       | High
Hub-and-Spoke       | Seconds   | Minutes   | Medium     | Medium

Key Metrics

TEXT
Replication Lag: < 50 MB for WAN
Network Latency: < 100 ms acceptable
Throughput: 50-100 MB/s typical for WAN
RPO Target: < 5 minutes
RTO Target: < 15 minutes

Checklist

  • Cascading replication configured
  • Separate etcd per DC
  • Replication slots created
  • WAL compression enabled
  • Timeouts tuned for WAN
  • Geo-aware load balancing
  • Cross-DC monitoring
  • DR failover tested
  • Backup geo-replication
  • Documentation complete

Next Steps

Lesson 22 will cover Patroni on Kubernetes:

  • StatefulSets configuration
  • Patroni Kubernetes operator
  • PersistentVolumes setup
  • Helm charts usage
  • K8s-specific considerations
