Understanding RTGS: High Availability and Performance Design

  1. 1 Availability Requirements
  2. 2 High Availability Architecture
  3. 3 Performance Optimization
  4. 4 Scalability Design
  5. 5 Monitoring and Observability
  6. 6 Operational Excellence
  7. 7 Series Summary
  8. 8 Summary

High availability and performance are non-negotiable for RTGS systems. This final article in our series explores the architectural patterns, design strategies, and operational practices that ensure RTGS systems meet their demanding requirements.

1 Availability Requirements

1.1 RTGS Availability Standards

⚡ RTGS Availability Expectations

RTGS systems operate under stringent availability requirements:

Operating Hours Availability

  • 99.99%+ during business hours
  • Typically 18-24 hours/day
  • Scheduled maintenance windows only

Recovery Objectives

  • RTO: < 2 minutes for failover
  • RPO: Zero data loss
  • Graceful degradation under stress

Planned Downtime

  • Minimal scheduled maintenance
  • Weekend/off-hours only
  • Advanced notification required

1.2 Availability Calculation

graph LR subgraph "Availability Levels" A1["99.9%
8.76 hours/year"] A2["99.99%
52.6 minutes/year"] A3["99.999%
5.26 minutes/year"] A4["99.9999%
31.5 seconds/year"] end subgraph "RTGS Target" B["RTGS Target:
99.99%+"] end A2 -.-> B A3 -.-> B style B fill:#1976d2,stroke:#0d47a1,color:#fff style A2 fill:#e8f5e9,stroke:#388e3c style A3 fill:#e8f5e9,stroke:#388e3c

Availability Metrics:

Metric Formula RTGS Target
Availability % (Uptime / Total Time) × 100 > 99.99%
MTBF Total Uptime / Number of Failures > 8760 hours
MTTR Total Downtime / Number of Failures < 2 minutes
MTTF Total Uptime / Number of Units > 100,000 hours

2 High Availability Architecture

2.1 Redundancy Patterns

Active-Active Configuration:

graph TB subgraph "Site A (Active)" A1[Load Balancer] A2[App Cluster Node 1] A3[App Cluster Node 2] A4[Database Primary] end subgraph "Site B (Active)" B1[Load Balancer] B2[App Cluster Node 3] B3[App Cluster Node 4] B4[Database Primary] end subgraph "Global Load Balancer" G1[Traffic Distribution] end G1 --> A1 G1 --> B1 A1 --> A2 A1 --> A3 A2 --> A4 A3 --> A4 B1 --> B2 B1 --> B3 B2 --> B4 B3 --> B4 A4 -.->|Sync Replication| B4 style G1 fill:#1976d2,stroke:#0d47a1,color:#fff style A1 fill:#e8f5e9,stroke:#388e3c style B1 fill:#e8f5e9,stroke:#388e3c style A4 fill:#fff3e0,stroke:#f57c00 style B4 fill:#fff3e0,stroke:#f57c00

Active-Passive Configuration:

graph TB subgraph "Primary Site (Active)" A1[Load Balancer] A2[App Cluster] A3[Database Primary] end subgraph "Secondary Site (Passive)" B1[Load Balancer] B2[App Cluster] B3[Database Standby] end subgraph "Health Monitor" H1[Continuous Monitoring] H2[Automatic Failover] end A1 --> A2 A2 --> A3 B1 --> B2 B2 --> B3 A3 -.->|Sync Replication| B3 H1 --> A1 H1 --> B1 H2 -.->|On Failure| B1 style A1 fill:#1976d2,stroke:#0d47a1,color:#fff style A2 fill:#1976d2,stroke:#0d47a1,color:#fff style B1 fill:#e3f2fd,stroke:#1976d2 style B2 fill:#e3f2fd,stroke:#1976d2 style H1 fill:#fff3e0,stroke:#f57c00

2.2 Database High Availability

Synchronous Replication:

sequenceDiagram participant App as Application participant Primary as DB Primary participant Replica as DB Replica App->>Primary: Write Transaction Primary->>Primary: Write Local Primary->>Replica: Send Changes Replica->>Replica: Write Replica Replica-->>Primary: Acknowledge Primary-->>App: Commit Confirm Note over Primary,Replica: Zero data loss
Higher latency style Primary fill:#1976d2,stroke:#0d47a1,color:#fff style Replica fill:#e3f2fd,stroke:#1976d2

Asynchronous Replication:

sequenceDiagram participant App as Application participant Primary as DB Primary participant Replica as DB Replica App->>Primary: Write Transaction Primary->>Primary: Write Local Primary-->>App: Commit Confirm Primary->>Replica: Send Changes (async) Replica->>Replica: Write Replica Replica-->>Primary: Acknowledge Note over Primary,Replica: Lower latency
Potential data loss style Primary fill:#1976d2,stroke:#0d47a1,color:#fff style Replica fill:#e3f2fd,stroke:#1976d2

Replication Comparison:

Aspect Synchronous Asynchronous
Data Loss Zero Possible
Latency Higher Lower
Distance Limited (< 100km) Unlimited
Throughput Constrained Higher
Use Case Critical data Disaster recovery

2.3 Failover Strategies

flowchart TD A[Failure Detected] --> B{Failure Type} B -->|Application| C[Redirect to Healthy Node] B -->|Database| D[Promote Replica] B -->|Network| E[Route via Alternate Path] B -->|Site| F[Activate DR Site] C --> G[Update Load Balancer] D --> H[Update Connection Strings] E --> I[Update DNS/Routing] F --> J[Activate Standby Systems] G --> K[Verify Service Restored] H --> K I --> K J --> K K --> L{Success?} L -->|Yes| M[Monitor Recovery] L -->|No| N[Escalate] style A fill:#ffebee,stroke:#c62828 style M fill:#e8f5e9,stroke:#388e3c style N fill:#fff3e0,stroke:#f57c00

3 Performance Optimization

3.1 Performance Requirements

Metric Target Measurement
Latency (P50) < 100ms Message receipt to response
Latency (P99) < 500ms 99th percentile
Throughput > 1000 TPS Transactions per second
Queue Processing < 30 seconds Time in queue
Settlement Time < 1 second End-to-end settlement

3.2 Performance Architecture

graph TB subgraph "Performance Layers" A[Load Balancing] B[Caching Layer] C[Connection Pooling] D[Database Optimization] E[Asynchronous Processing] end subgraph "Optimization Techniques" F[Horizontal Scaling] G[Query Optimization] H[Batch Processing] I[Memory Management] end A --> B B --> C C --> D D --> E E --> F F --> G G --> H H --> I style A fill:#e3f2fd,stroke:#1976d2 style B fill:#fff3e0,stroke:#f57c00 style D fill:#e8f5e9,stroke:#388e3c style F fill:#f3e5f5,stroke:#7b1fa2

3.3 Caching Strategy

Multi-Level Caching:

graph LR subgraph "Cache Hierarchy" A[L1: In-Memory Cache] B[L2: Distributed Cache] C[L3: Database Cache] end subgraph "Data Types" D[Session Data] E[Configuration] F[Account Balances] G[Reference Data] end A --> D B --> E B --> F C --> G style A fill:#1976d2,stroke:#0d47a1,color:#fff style B fill:#fff3e0,stroke:#f57c00 style C fill:#e8f5e9,stroke:#388e3c

Cache Implementation:

// Conceptual caching layer for RTGS
interface RTGSCache {
    
    /**
     * L1 Cache: In-memory, ultra-fast
     * Use: Session data, frequently accessed
     */
    <T> T getFromL1(String key);
    void putInL1(String key, Object value, Duration ttl);
    
    /**
     * L2 Cache: Distributed (Redis/Memcached)
     * Use: Shared state, account balances
     */
    <T> CompletableFuture<T> getFromL2(String key);
    CompletableFuture<Void> putInL2(String key, Object value, Duration ttl);
    
    /**
     * L3 Cache: Database query cache
     * Use: Reference data, historical data
     */
    <T> T getFromL3(String key);
    void invalidateL3(String pattern);
    
    /**
     * Cache-aside pattern for account balances
     */
    default BigDecimal getAccountBalance(String accountId) {
        String key = "balance:" + accountId;
        
        // Try L1 first
        BigDecimal balance = getFromL1(key);
        if (balance != null) return balance;
        
        // Try L2
        balance = getFromL2(key).join();
        if (balance != null) {
            putInL1(key, balance, Duration.ofSeconds(10));
            return balance;
        }
        
        // Load from database
        balance = loadFromDatabase(accountId);
        putInL2(key, balance, Duration.ofMinutes(1));
        putInL1(key, balance, Duration.ofSeconds(10));
        
        return balance;
    }
}

3.4 Database Optimization

Indexing Strategy:

graph TB subgraph "Critical Indexes" A[Transaction ID Index] B[Account ID Index] C[Timestamp Index] D[Status Index] end subgraph "Composite Indexes" E[Account + Timestamp] F[Status + Timestamp] G[Participant + Date] end A --> H[Query Optimization] B --> H C --> H D --> H E --> H F --> H G --> H style A fill:#e3f2fd,stroke:#1976d2 style E fill:#fff3e0,stroke:#f57c00 style H fill:#e8f5e9,stroke:#388e3c

Query Optimization Examples:

-- Optimized query for transaction lookup
-- Uses covering index to avoid table scan
CREATE INDEX idx_transaction_lookup 
ON transactions (account_id, created_at DESC) 
INCLUDE (amount, currency, status);

-- Partitioned table for large transaction history
CREATE TABLE transactions_partitioned (
    id UUID PRIMARY KEY,
    account_id UUID NOT NULL,
    amount DECIMAL(19,4) NOT NULL,
    created_at TIMESTAMP NOT NULL,
    -- ... other columns
) PARTITION BY RANGE (created_at);

-- Queue processing with SKIP LOCKED
SELECT id, payment_data
FROM payment_queue
WHERE status = 'PENDING'
ORDER BY priority, enqueue_time
FOR UPDATE SKIP LOCKED
LIMIT 100;

4 Scalability Design

4.1 Scaling Patterns

Horizontal Scaling:

graph TB subgraph "Scale-Out Architecture" A[Load Balancer] B[App Node 1] C[App Node 2] D[App Node 3] E[App Node N...] end subgraph "Shared Resources" F[Database Cluster] G[Message Queue] H[Cache Cluster] end A --> B A --> C A --> D A --> E B --> F C --> F D --> F E --> F B --> G C --> G D --> G E --> G B --> H C --> H D --> H E --> H style A fill:#1976d2,stroke:#0d47a1,color:#fff style F fill:#e8f5e9,stroke:#388e3c style G fill:#fff3e0,stroke:#f57c00 style H fill:#f3e5f5,stroke:#7b1fa2

Scaling Strategies:

Component Scaling Approach Considerations
Application Horizontal (stateless) Session affinity, Load balancing
Database Read replicas, Sharding Write bottleneck, Consistency
Cache Cluster expansion Memory distribution, Eviction
Message Queue Partitioning Order preservation, Consumer groups

4.2 Queue Scaling

graph LR subgraph "Producers" P1[Producer 1] P2[Producer 2] P3[Producer 3] end subgraph "Message Queue" Q1[Partition 1] Q2[Partition 2] Q3[Partition 3] end subgraph "Consumers" C1[Consumer Group A] C2[Consumer Group B] end P1 --> Q1 P2 --> Q2 P3 --> Q3 Q1 --> C1 Q2 --> C1 Q3 --> C1 Q1 --> C2 Q2 --> C2 Q3 --> C2 style Q1 fill:#1976d2,stroke:#0d47a1,color:#fff style Q2 fill:#1976d2,stroke:#0d47a1,color:#fff style Q3 fill:#1976d2,stroke:#0d47a1,color:#fff

5 Monitoring and Observability

5.1 Monitoring Architecture

graph TB subgraph "Data Collection" A1[Application Metrics] A2[System Metrics] A3[Business Metrics] A4[Security Events] end subgraph "Processing" B1[Metrics Aggregation] B2[Log Processing] B3[Event Correlation] end subgraph "Storage" C1[Time Series DB] C2[Log Storage] C3[Event Store] end subgraph "Visualization" D1[Dashboards] D2[Alerts] D3[Reports] end A1 --> B1 A2 --> B1 A3 --> B1 A4 --> B3 B1 --> C1 B2 --> C2 B3 --> C3 C1 --> D1 C2 --> D1 C3 --> D2 style A1 fill:#e3f2fd,stroke:#1976d2 style B1 fill:#fff3e0,stroke:#f57c00 style C1 fill:#e8f5e9,stroke:#388e3c style D2 fill:#ffebee,stroke:#c62828

5.2 Key Performance Indicators

System Health Metrics:

Category Metric Threshold Alert Level
Availability Uptime % < 99.99% Critical
Latency P99 Response Time > 500ms High
Throughput Transactions/sec < 100 High
Error Rate Failed Transactions > 0.1% Critical
Queue Depth Pending Payments > 1000 Medium
Database Connection Pool Usage > 80% Medium

Business Metrics Dashboard:

graph LR subgraph "Real-Time Metrics" A[Transaction Volume] B[Settlement Value] C[Queue Wait Time] D[Success Rate] end subgraph "Trend Analysis" E[Hourly Volume] F[Daily Value] G[Peak Detection] H[Anomaly Detection] end A --> E B --> F C --> G D --> H style A fill:#1976d2,stroke:#0d47a1,color:#fff style B fill:#1976d2,stroke:#0d47a1,color:#fff style E fill:#e8f5e9,stroke:#388e3c style F fill:#e8f5e9,stroke:#388e3c

5.3 Distributed Tracing

sequenceDiagram participant G as API Gateway participant A as Payment Service participant V as Validation Service participant L as Liquidity Service participant S as Settlement Service participant D as Database Note over G,D: Trace ID: abc123 G->>A: Submit Payment
Span: 1 A->>V: Validate
Span: 1.1 V->>D: Check Rules
Span: 1.1.1 D-->>V: Result
10ms V-->>A: Valid
15ms total A->>L: Check Liquidity
Span: 1.2 L->>D: Get Balance
Span: 1.2.1 D-->>L: Balance
5ms L-->>A: Sufficient
20ms total A->>S: Settle
Span: 1.3 S->>D: Update Accounts
Span: 1.3.1 D-->>S: Committed
15ms S-->>A: Settled
25ms total A-->>G: Response
Span: 1 complete G-->>Client: Confirm
Total: 75ms Note over G,D: Each span tracked
for performance analysis

6 Operational Excellence

6.1 Deployment Pipeline

flowchart LR A[Code Commit] --> B[Automated Tests] B --> C[Security Scan] C --> D[Build Artifact] D --> E[Deploy to Dev] E --> F[Integration Tests] F --> G[Deploy to Staging] G --> H[Performance Tests] H --> I[Security Audit] I --> J[Production Approval] J --> K[Canary Deployment] K --> L[Full Rollout] style A fill:#e3f2fd,stroke:#1976d2 style B fill:#fff3e0,stroke:#f57c00 style C fill:#ffebee,stroke:#c62828 style L fill:#e8f5e9,stroke:#388e3c

6.2 Change Management

Change Type Approval Testing Deployment Window
Critical Security Emergency Minimal Immediate
Bug Fix Tech Lead Regression Off-peak
Feature Change Board Full Suite Scheduled
Infrastructure Architecture Performance Maintenance Window

6.3 Capacity Planning

graph LR subgraph "Capacity Inputs" A[Historical Growth] B[Business Forecasts] C[Seasonal Patterns] D[New Features] end subgraph "Planning Process" E[Trend Analysis] F[Capacity Modeling] G[Resource Planning] end subgraph "Outputs" H[Infrastructure Plan] I[Budget Forecast] J[Scaling Timeline] end A --> E B --> E C --> E D --> E E --> F F --> G G --> H G --> I G --> J style A fill:#e3f2fd,stroke:#1976d2 style E fill:#fff3e0,stroke:#f57c00 style H fill:#e8f5e9,stroke:#388e3c

7 Series Summary

7.1 Complete Series Overview

📚 RTGS Series Complete

All five articles in this series:

| Part | Topic | Key Takeaways | |------|-------|---------------| | Part 1 | Core Concepts | RTGS fundamentals, Real-time vs. net settlement | | Part 2 | System Architecture | Components, Data flow, Integration | | Part 3 | Message Standards | ISO 20022, SWIFT migration, Validation | | Part 4 | Security & Risk | Threats, Controls, Compliance | | Part 5 | High Availability | Redundancy, Performance, Operations |

7.2 Key Concepts Recap

mindmap root((RTGS)) Core Concepts Real-Time Settlement Gross vs. Net Finality Architecture Payment Processor Queue Manager Settlement Engine Messages ISO 20022 pacs.008 pacs.009 Security HSM Encryption Fraud Detection Availability Redundancy Failover Monitoring

7.3 Further Learning

Topic Resources
ISO 20022 iso20022.org, SWIFT documentation
Payment Systems BIS Publications, Central Bank guides
Security NIST Frameworks, PCI DSS
Architecture Enterprise Architecture patterns

8 Summary

📋 Key Takeaways

Essential high availability and performance concepts:

High Availability Architecture

  • Active-Active or Active-Passive redundancy
  • Synchronous replication for zero data loss
  • Automatic failover with health monitoring

Performance Optimization

  • Multi-level caching strategy
  • Database indexing and partitioning
  • Connection pooling and async processing

Scalability Design

  • Horizontal scaling for stateless components
  • Queue partitioning for parallel processing
  • Read replicas for database scaling

Monitoring and Observability

  • Comprehensive metrics collection
  • Distributed tracing
  • Real-time alerting

Operational Excellence

  • Automated deployment pipeline
  • Change management processes
  • Capacity planning

Footnotes for this article:

Note: For a complete list of all acronyms used in the RTGS series, see the RTGS Acronyms and Abbreviations Reference.

Share