Understanding RTGS: High Availability and Performance Design

Created 2025-12-20 Updated 2026-03-18

1 Availability Requirements
2 High Availability Architecture
3 Performance Optimization
4 Scalability Design
5 Monitoring and Observability
6 Operational Excellence
7 Series Summary
8 Summary

High availability and performance are non-negotiable for RTGS systems. This final article in our series explores the architectural patterns, design strategies, and operational practices that ensure RTGS systems meet their demanding requirements.

1 Availability Requirements

1.1 RTGS Availability Standards

⚡ RTGS Availability Expectations

RTGS systems operate under stringent availability requirements:

✅ Operating Hours Availability

99.99%+ during business hours
Typically 18-24 hours/day
Scheduled maintenance windows only

✅ Recovery Objectives

RTO: < 2 minutes for failover
RPO: Zero data loss
Graceful degradation under stress

✅ Planned Downtime

Minimal scheduled maintenance
Weekend/off-hours only
Advance notification required

1.2 Availability Calculation

graph LR
    subgraph "Availability Levels"
        A1["99.9%<br/>8.76 hours/year"]
        A2["99.99%<br/>52.6 minutes/year"]
        A3["99.999%<br/>5.26 minutes/year"]
        A4["99.9999%<br/>31.5 seconds/year"]
    end
    subgraph "RTGS Target"
        B["RTGS Target:<br/>99.99%+"]
    end
    A2 -.-> B
    A3 -.-> B
    style B fill:#1976d2,stroke:#0d47a1,color:#fff
    style A2 fill:#e8f5e9,stroke:#388e3c
    style A3 fill:#e8f5e9,stroke:#388e3c

Availability Metrics:

Metric	Formula	RTGS Target
Availability %	(Uptime / Total Time) × 100	> 99.99%
MTBF	Total Uptime / Number of Failures	> 8760 hours
MTTR	Total Downtime / Number of Failures	< 2 minutes
MTTF	Total Uptime / Number of Units	> 100,000 hours
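
The formulas in the table can be checked with a little arithmetic. The sketch below is illustrative only: the class and method names are hypothetical, and a 365-day year (8760 hours) is assumed. It uses the standard steady-state relation availability = MTBF / (MTBF + MTTR) and converts an availability percentage into a yearly downtime budget.

```java
import java.time.Duration;

/** Illustrative availability arithmetic; names are hypothetical, not from any RTGS product. */
final class AvailabilityMath {
    private static final double HOURS_PER_YEAR = 8760.0; // assumes a 365-day year

    /** Steady-state availability as a fraction: MTBF / (MTBF + MTTR), same unit for both. */
    static double availability(double mtbfHours, double mttrHours) {
        return mtbfHours / (mtbfHours + mttrHours);
    }

    /** Downtime allowed per year for a given availability percentage (e.g. 99.99). */
    static Duration annualDowntimeBudget(double availabilityPercent) {
        double downtimeHours = HOURS_PER_YEAR * (1.0 - availabilityPercent / 100.0);
        return Duration.ofMillis(Math.round(downtimeHours * 3_600_000.0));
    }
}
```

With the table's targets (MTBF of 8760 hours, MTTR of 2 minutes), `availability(8760, 2.0 / 60.0)` comes out at roughly 99.9996%, comfortably above the 99.99% target. `annualDowntimeBudget(99.99)` is about 52.6 minutes, matching the figure in the diagram above.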

2 High Availability Architecture

2.1 Redundancy Patterns

Active-Active Configuration:

Note: The diagram below illustrates a proposed Active-Active architectural pattern. The specific components and replication mechanisms can vary based on infrastructure and RPO/RTO objectives.

graph TB subgraph "Site A (Active)" A1[Load Balancer] A2[App Cluster Node 1] A3[App Cluster Node 2] A4[Database Primary] end subgraph "Site B (Active)" B1[Load Balancer] B2[App Cluster Node 3] B3[App Cluster Node 4] B4[Database Primary] end subgraph "Global Load Balancer" G1[Traffic Distribution] end G1 --> A1 G1 --> B1 A1 --> A2 A1 --> A3 A2 --> A4 A3 --> A4 B1 --> B2 B1 --> B3 B2 --> B4 B3 --> B4 A4 -.->|Sync Replication| B4 style G1 fill:#1976d2,stroke:#0d47a1,color:#fff style A1 fill:#e8f5e9,stroke:#388e3c style B1 fill:#e8f5e9,stroke:#388e3c style A4 fill:#fff3e0,stroke:#f57c00 style B4 fill:#fff3e0,stroke:#f57c00

Active-Passive Configuration:

Note: The diagram below illustrates a proposed Active-Passive architectural pattern. The specific components, health monitoring, and failover mechanisms can vary based on infrastructure and RPO/RTO objectives.

graph TB subgraph "Primary Site (Active)" A1[Load Balancer] A2[App Cluster] A3[Database Primary] end subgraph "Secondary Site (Passive)" B1[Load Balancer] B2[App Cluster] B3[Database Standby] end subgraph "Health Monitor" H1[Continuous Monitoring] H2[Automatic Failover] end A1 --> A2 A2 --> A3 B1 --> B2 B2 --> B3 A3 -.->|Sync Replication| B3 H1 --> A1 H1 --> B1 H2 -.->|On Failure| B1 style A1 fill:#1976d2,stroke:#0d47a1,color:#fff style A2 fill:#1976d2,stroke:#0d47a1,color:#fff style B1 fill:#e3f2fd,stroke:#1976d2 style B2 fill:#e3f2fd,stroke:#1976d2 style H1 fill:#fff3e0,stroke:#f57c00
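
One way to picture the "Continuous Monitoring" and "Automatic Failover" boxes is a detector that triggers failover only after several consecutive failed health probes, so that a single dropped probe does not cause a site switch. This is a minimal sketch under that assumption; `FailoverDetector` and its threshold are hypothetical and not taken from any specific product.

```java
/** Hypothetical failure detector: trips after N consecutive failed health probes. */
final class FailoverDetector {
    private final int failureThreshold;
    private int consecutiveFailures = 0;
    private boolean failedOver = false;

    FailoverDetector(int failureThreshold) {
        if (failureThreshold < 1) {
            throw new IllegalArgumentException("failureThreshold must be >= 1");
        }
        this.failureThreshold = failureThreshold;
    }

    /**
     * Records one probe result.
     * Returns true exactly once: on the probe that should start the failover.
     */
    synchronized boolean record(boolean healthy) {
        if (failedOver) {
            return false; // already failed over; nothing more to trigger
        }
        if (healthy) {
            consecutiveFailures = 0; // any success resets the streak
            return false;
        }
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            failedOver = true;
            return true;
        }
        return false;
    }

    synchronized boolean hasFailedOver() {
        return failedOver;
    }
}
```

Detection time counts against the recovery objective. For example, with a probe every 10 seconds and a threshold of 3, detection alone can take about 30 seconds of a 2-minute RTO, leaving the rest for promotion and traffic redirection.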

2.2 Database High Availability

Synchronous Replication:

Note: The sequence diagram below illustrates a proposed flow for synchronous database replication. While it ensures zero data loss, it introduces higher latency compared to asynchronous replication.

sequenceDiagram
    participant App as Application
    participant Primary as DB Primary
    participant Replica as DB Replica
    App->>Primary: Write Transaction
    Primary->>Primary: Write Local
    Primary->>Replica: Send Changes
    Replica->>Replica: Write Replica
    Replica-->>Primary: Acknowledge
    Primary-->>App: Commit Confirm
    Note over Primary,Replica: Zero data loss<br/>Higher latency
    style Primary fill:#1976d2,stroke:#0d47a1,color:#fff
    style Replica fill:#e3f2fd,stroke:#1976d2

Asynchronous Replication:

Note: The sequence diagram below illustrates a proposed flow for asynchronous database replication. While it offers lower latency, there is a potential for data loss in the event of a primary database failure.

sequenceDiagram
    participant App as Application
    participant Primary as DB Primary
    participant Replica as DB Replica
    App->>Primary: Write Transaction
    Primary->>Primary: Write Local
    Primary-->>App: Commit Confirm
    Primary->>Replica: Send Changes (async)
    Replica->>Replica: Write Replica
    Replica-->>Primary: Acknowledge
    Note over Primary,Replica: Lower latency<br/>Potential data loss
    style Primary fill:#1976d2,stroke:#0d47a1,color:#fff
    style Replica fill:#e3f2fd,stroke:#1976d2

Replication Comparison:

Aspect	Synchronous	Asynchronous
Data Loss	Zero	Possible
Latency	Higher	Lower
Distance	Limited (< 100km)	Unlimited
Throughput	Constrained	Higher
Use Case	Critical data	Disaster recovery
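
The "Data Loss" row is easiest to see in a toy model. The sketch below is a deliberately simplified, single-threaded illustration with hypothetical names: it ignores networking, durability, and acknowledgements, and only tracks which transactions the replica has at any moment.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy model of a primary/replica pair; not a real replication protocol. */
final class ReplicationModel {
    private final List<String> primary = new ArrayList<>();
    private final List<String> replica = new ArrayList<>();
    private final Deque<String> inFlight = new ArrayDeque<>(); // committed but not yet shipped

    /** Synchronous: the commit returns only once the replica also holds the transaction. */
    void commitSync(String txn) {
        primary.add(txn);
        replica.add(txn); // stands in for "send changes" plus "acknowledge"
    }

    /** Asynchronous: the commit returns at once; the replica catches up later. */
    void commitAsync(String txn) {
        primary.add(txn);
        inFlight.addLast(txn);
    }

    /** Background shipping of pending changes to the replica, in commit order. */
    void shipPending() {
        while (!inFlight.isEmpty()) {
            replica.add(inFlight.removeFirst());
        }
    }

    /** Transactions that would be missing after failover if the primary died right now. */
    List<String> lostIfPrimaryFailsNow() {
        return new ArrayList<>(inFlight);
    }

    List<String> replicaContents() {
        return new ArrayList<>(replica);
    }
}
```

After `commitSync("T1")` nothing is at risk. After `commitAsync("T2")`, `lostIfPrimaryFailsNow()` contains `T2` until `shipPending()` runs. That window is the non-zero RPO of asynchronous replication.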

2.3 Failover Strategies

Note: The flowchart below illustrates a proposed failover strategy. The specific steps and actions will depend on the type of failure and the recovery procedures defined for the RTGS system.

flowchart TD A[Failure Detected] --> B{Failure Type} B -->|Application| C[Redirect to Healthy Node] B -->|Database| D[Promote Replica] B -->|Network| E[Route via Alternate Path] B -->|Site| F[Activate DR Site] C --> G[Update Load Balancer] D --> H[Update Connection Strings] E --> I[Update DNS/Routing] F --> J[Activate Standby Systems] G --> K[Verify Service Restored] H --> K I --> K J --> K K --> L{Success?} L -->|Yes| M[Monitor Recovery] L -->|No| N[Escalate] style A fill:#ffebee,stroke:#c62828 style M fill:#e8f5e9,stroke:#388e3c style N fill:#fff3e0,stroke:#f57c00
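
The branches of the flowchart can be written down as a simple lookup from failure type to recovery steps. This is only a sketch of the decision structure: the enum and class names are hypothetical, the step strings are copied from the flowchart, and a real runbook would invoke automation rather than return labels.

```java
import java.util.Arrays;
import java.util.List;

/** Failure categories taken from the flowchart. */
enum FailureType { APPLICATION, DATABASE, NETWORK, SITE }

/** Hypothetical mapping of each failure type to the ordered steps in the flowchart. */
final class FailoverPlan {

    static List<String> stepsFor(FailureType type) {
        switch (type) {
            case APPLICATION:
                return Arrays.asList("Redirect to Healthy Node", "Update Load Balancer",
                        "Verify Service Restored");
            case DATABASE:
                return Arrays.asList("Promote Replica", "Update Connection Strings",
                        "Verify Service Restored");
            case NETWORK:
                return Arrays.asList("Route via Alternate Path", "Update DNS/Routing",
                        "Verify Service Restored");
            case SITE:
                return Arrays.asList("Activate DR Site", "Activate Standby Systems",
                        "Verify Service Restored");
            default:
                throw new IllegalArgumentException("Unknown failure type: " + type);
        }
    }

    /** Final decision node: monitor on success, escalate otherwise. */
    static String afterVerification(boolean serviceRestored) {
        return serviceRestored ? "Monitor Recovery" : "Escalate";
    }
}
```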

3 Performance Optimization

3.1 Performance Requirements

Metric	Target	Measurement
Latency (P50)	< 100ms	Message receipt to response
Latency (P99)	< 500ms	99th percentile
Throughput	> 1000 TPS	Transactions per second
Queue Processing	< 30 seconds	Time in queue
Settlement Time	< 1 second	End-to-end settlement
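
Percentile targets such as P50 and P99 are computed over a set of latency samples rather than averaged. The sketch below uses the nearest-rank method, which is one of several common percentile definitions. The class name is hypothetical, and the 100 ms and 500 ms limits are the targets from the table above.

```java
import java.util.Arrays;

/** Illustrative nearest-rank percentile calculation over latency samples in milliseconds. */
final class LatencyPercentiles {

    /** Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it. */
    static long percentile(long[] samplesMillis, double p) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samplesMillis.clone(); // do not reorder the caller's array
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0); // 1-based rank
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Checks the P50 < 100 ms and P99 < 500 ms targets from the requirements table. */
    static boolean meetsTargets(long[] samplesMillis) {
        return percentile(samplesMillis, 50) < 100 && percentile(samplesMillis, 99) < 500;
    }
}
```

A mean would hide the slow tail. A handful of 2-second outliers barely move the average of thousands of samples, but they show up directly in P99.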

3.2 Performance Architecture

Note: The diagram below illustrates a proposed performance architecture with various optimization layers and techniques. The specific components and their configuration will vary based on the performance requirements and system design.

graph TB subgraph "Performance Layers" A[Load Balancing] B[Caching Layer] C[Connection Pooling] D[Database Optimization] E[Asynchronous Processing] end subgraph "Optimization Techniques" F[Horizontal Scaling] G[Query Optimization] H[Batch Processing] I[Memory Management] end A --> B B --> C C --> D D --> E E --> F F --> G G --> H H --> I style A fill:#e3f2fd,stroke:#1976d2 style B fill:#fff3e0,stroke:#f57c00 style D fill:#e8f5e9,stroke:#388e3c style F fill:#f3e5f5,stroke:#7b1fa2

3.3 Caching Strategy

Multi-Level Caching:

Note: The diagram below illustrates a proposed multi-level caching hierarchy. The specific cache types, data stored, and eviction policies will depend on the application’s performance characteristics.

graph LR subgraph "Cache Hierarchy" A[L1: In-Memory Cache] B[L2: Distributed Cache] C[L3: Database Cache] end subgraph "Data Types" D[Session Data] E[Configuration] F[Account Balances] G[Reference Data] end A --> D B --> E B --> F C --> G style A fill:#1976d2,stroke:#0d47a1,color:#fff style B fill:#fff3e0,stroke:#f57c00 style C fill:#e8f5e9,stroke:#388e3c

Cache Implementation:

Note: The Java code snippet below provides a proposed conceptual interface and example implementation for a caching layer. The actual implementation will vary significantly based on the chosen caching framework and specific data access patterns.

// Conceptual caching layer for RTGS
import java.math.BigDecimal;
import java.time.Duration;
import java.util.concurrent.CompletableFuture;

interface RTGSCache {

    /**
     * L1 Cache: In-memory, ultra-fast
     * Use: Session data, frequently accessed
     */
    <T> T getFromL1(String key);
    void putInL1(String key, Object value, Duration ttl);

    /**
     * L2 Cache: Distributed (Redis/Memcached)
     * Use: Shared state, account balances
     */
    <T> CompletableFuture<T> getFromL2(String key);
    CompletableFuture<Void> putInL2(String key, Object value, Duration ttl);

    /**
     * L3 Cache: Database query cache
     * Use: Reference data, historical data
     */
    <T> T getFromL3(String key);
    void invalidateL3(String pattern);

    /**
     * Authoritative balance lookup, used on a cache miss.
     * Declared here so the default method below compiles.
     */
    BigDecimal loadFromDatabase(String accountId);

    /**
     * Cache-aside pattern for account balances.
     * Cached values can be up to one TTL old, so callers must tolerate staleness.
     */
    default BigDecimal getAccountBalance(String accountId) {
        String key = "balance:" + accountId;

        // Try L1 first
        BigDecimal balance = getFromL1(key);
        if (balance != null) return balance;

        // Try L2 (explicit type argument: inference does not flow through join())
        balance = this.<BigDecimal>getFromL2(key).join();
        if (balance != null) {
            putInL1(key, balance, Duration.ofSeconds(10));
            return balance;
        }

        // Load from database, then populate both cache levels
        balance = loadFromDatabase(accountId);
        putInL2(key, balance, Duration.ofMinutes(1)); // asynchronous; result not awaited
        putInL1(key, balance, Duration.ofSeconds(10));

        return balance;
    }
}

3.4 Database Optimization

Indexing Strategy:

Note: The diagram below illustrates a proposed indexing strategy. The specific indexes and their composition should be determined based on the database schema, query patterns, and performance requirements.

graph TB subgraph "Critical Indexes" A[Transaction ID Index] B[Account ID Index] C[Timestamp Index] D[Status Index] end subgraph "Composite Indexes" E[Account + Timestamp] F[Status + Timestamp] G[Participant + Date] end A --> H[Query Optimization] B --> H C --> H D --> H E --> H F --> H G --> H style A fill:#e3f2fd,stroke:#1976d2 style E fill:#fff3e0,stroke:#f57c00 style H fill:#e8f5e9,stroke:#388e3c

Query Optimization Examples:

Note: The SQL snippets below are proposed examples of database query optimization techniques. The specific optimizations will depend on the database system, data volume, and query patterns.

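-- Covering index for transaction lookup by account, newest first
-- INCLUDE columns let the query be answered from the index alone
-- (INCLUDE is supported by e.g. PostgreSQL 11+ and SQL Server)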
CREATE INDEX idx_transaction_lookup 
ON transactions (account_id, created_at DESC) 
INCLUDE (amount, currency, status);

-- Partitioned table for large transaction history
-- (PostgreSQL requires the partition key to be part of the primary key)
CREATE TABLE transactions_partitioned (
    id UUID NOT NULL,
    account_id UUID NOT NULL,
    amount DECIMAL(19,4) NOT NULL,
    created_at TIMESTAMP NOT NULL,
    -- ... other columns
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

-- Queue processing with SKIP LOCKED
-- Rows locked by other workers are skipped, so workers do not block each other
SELECT id, payment_data
FROM payment_queue
WHERE status = 'PENDING'
ORDER BY priority, enqueue_time
LIMIT 100
FOR UPDATE SKIP LOCKED;

4 Scalability Design

4.1 Scaling Patterns

Horizontal Scaling:

Note: The diagram below illustrates a proposed horizontal scaling architecture. The specific components, load balancing strategies, and shared resources will vary based on the application’s scalability requirements.

graph TB subgraph "Scale-Out Architecture" A[Load Balancer] B[App Node 1] C[App Node 2] D[App Node 3] E[App Node N...] end subgraph "Shared Resources" F[Database Cluster] G[Message Queue] H[Cache Cluster] end A --> B A --> C A --> D A --> E B --> F C --> F D --> F E --> F B --> G C --> G D --> G E --> G B --> H C --> H D --> H E --> H style A fill:#1976d2,stroke:#0d47a1,color:#fff style F fill:#e8f5e9,stroke:#388e3c style G fill:#fff3e0,stroke:#f57c00 style H fill:#f3e5f5,stroke:#7b1fa2

Scaling Strategies:

Component	Scaling Approach	Considerations
Application	Horizontal (stateless)	Session affinity, Load balancing
Database	Read replicas, Sharding	Write bottleneck, Consistency
Cache	Cluster expansion	Memory distribution, Eviction
Message Queue	Partitioning	Order preservation, Consumer groups

4.2 Queue Scaling

Note: The diagram below illustrates a proposed queue scaling architecture. The specific partitioning strategy and consumer group configuration will depend on the message processing requirements and chosen message queue technology.

graph LR subgraph "Producers" P1[Producer 1] P2[Producer 2] P3[Producer 3] end subgraph "Message Queue" Q1[Partition 1] Q2[Partition 2] Q3[Partition 3] end subgraph "Consumers" C1[Consumer Group A] C2[Consumer Group B] end P1 --> Q1 P2 --> Q2 P3 --> Q3 Q1 --> C1 Q2 --> C1 Q3 --> C1 Q1 --> C2 Q2 --> C2 Q3 --> C2 style Q1 fill:#1976d2,stroke:#0d47a1,color:#fff style Q2 fill:#1976d2,stroke:#0d47a1,color:#fff style Q3 fill:#1976d2,stroke:#0d47a1,color:#fff
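
The "order preservation" consideration from the scaling table is usually handled by routing every message with the same ordering key to the same partition. The sketch below assumes ordering only matters per account and that the partition count is fixed. The class name is hypothetical, and real brokers typically ship their own partitioners.

```java
/** Hypothetical key-based partition router: same key, same partition. */
final class PartitionRouter {

    /** Maps an ordering key (for example, an account ID) to a partition in [0, partitionCount). */
    static int partitionFor(String orderingKey, int partitionCount) {
        if (partitionCount < 1) {
            throw new IllegalArgumentException("partitionCount must be >= 1");
        }
        // floorMod keeps the result non-negative even when hashCode() is negative
        return Math.floorMod(orderingKey.hashCode(), partitionCount);
    }
}
```

Changing the partition count remaps keys, so in-flight messages for an account can end up split across the old and new partitions. Repartitioning therefore needs to be planned rather than done on the fly.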

5 Monitoring and Observability

5.1 Monitoring Architecture

Note: The diagram below illustrates a proposed monitoring architecture. The specific tools, data sources, and correlation mechanisms will vary based on the observability requirements.

graph TB subgraph "Data Collection" A1[Application Metrics] A2[System Metrics] A3[Business Metrics] A4[Security Events] end subgraph "Processing" B1[Metrics Aggregation] B2[Log Processing] B3[Event Correlation] end subgraph "Storage" C1[Time Series DB] C2[Log Storage] C3[Event Store] end subgraph "Visualization" D1[Dashboards] D2[Alerts] D3[Reports] end A1 --> B1 A2 --> B1 A3 --> B1 A4 --> B3 B1 --> C1 B2 --> C2 B3 --> C3 C1 --> D1 C2 --> D1 C3 --> D2 style A1 fill:#e3f2fd,stroke:#1976d2 style B1 fill:#fff3e0,stroke:#f57c00 style C1 fill:#e8f5e9,stroke:#388e3c style D2 fill:#ffebee,stroke:#c62828

5.2 Key Performance Indicators

System Health Metrics:

Note: The table below presents proposed system health metrics with illustrative thresholds and alert levels. The specific KPIs and their targets should be established based on the RTGS system’s SLA and operational requirements.

| Category | Metric | Threshold | Alert Level |
|----------|--------|-----------|-------------|
| Availability | Uptime % | < 99.99% | Critical |
| Latency | P99 Response Time | > 500ms | High |
| Throughput | Transactions/sec | < 100 | High |
| Error Rate | Failed Transactions | > 0.1% | Critical |
| Queue Depth | Pending Payments | > 1000 | Medium |
| Database | Connection Pool Usage | > 80% | Medium |
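
A table like this maps naturally onto a small rule evaluator. The sketch below hard-codes the illustrative thresholds from the table. The class name and method signature are hypothetical, and a production setup would normally express these rules in the monitoring tool's own alerting configuration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical evaluator for the illustrative KPI thresholds in the table. */
final class KpiAlerts {

    /** Returns one "Level: Metric" entry per breached threshold, in table order. */
    static List<String> evaluate(double uptimePercent, double p99Millis, double tps,
                                 double errorRatePercent, int queueDepth,
                                 double poolUsagePercent) {
        List<String> alerts = new ArrayList<>();
        if (uptimePercent < 99.99)   alerts.add("Critical: Uptime %");
        if (p99Millis > 500)         alerts.add("High: P99 Response Time");
        if (tps < 100)               alerts.add("High: Transactions/sec");
        if (errorRatePercent > 0.1)  alerts.add("Critical: Failed Transactions");
        if (queueDepth > 1000)       alerts.add("Medium: Pending Payments");
        if (poolUsagePercent > 80)   alerts.add("Medium: Connection Pool Usage");
        return alerts;
    }
}
```

For example, a snapshot with 99.995% uptime, a 620 ms P99, 850 TPS, a 0.02% error rate, 1500 queued payments, and 60% pool usage yields two alerts: the High latency alert and the Medium queue depth alert.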

Business Metrics Dashboard:

Note: The diagram below illustrates a proposed structure for a business metrics dashboard. The specific metrics, their visualization, and aggregation methods can vary based on business needs.

```mermaid
graph LR
    subgraph "Real-Time Metrics"
        A[Transaction Volume]
        B[Settlement Value]
        C[Queue Wait Time]
        D[Success Rate]
    end

    subgraph "Trend Analysis"
        E[Hourly Volume]
        F[Daily Value]
        G[Peak Detection]
        H[Anomaly Detection]
    end

    A --> E
    B --> F
    C --> G
    D --> H

    style A fill:#1976d2,stroke:#0d47a1,color:#fff
    style B fill:#1976d2,stroke:#0d47a1,color:#fff
    style E fill:#e8f5e9,stroke:#388e3c
    style F fill:#e8f5e9,stroke:#388e3c
```

5.3 Distributed Tracing

Note: The sequence diagram below illustrates a proposed distributed tracing flow. The specific services, spans, and metrics captured will depend on the tracing framework and microservices architecture.

sequenceDiagram
    participant G as API Gateway
    participant A as Payment Service
    participant V as Validation Service
    participant L as Liquidity Service
    participant S as Settlement Service
    participant D as Database
    Note over G,D: Trace ID: abc123
    G->>A: Submit Payment<br/>Span: 1
    A->>V: Validate<br/>Span: 1.1
    V->>D: Check Rules<br/>Span: 1.1.1
    D-->>V: Result<br/>10ms
    V-->>A: Valid<br/>15ms total
    A->>L: Check Liquidity<br/>Span: 1.2
    L->>D: Get Balance<br/>Span: 1.2.1
    D-->>L: Balance<br/>5ms
    L-->>A: Sufficient<br/>20ms total
    A->>S: Settle<br/>Span: 1.3
    S->>D: Update Accounts<br/>Span: 1.3.1
    D-->>S: Committed<br/>15ms
    S-->>A: Settled<br/>25ms total
    A-->>G: Response<br/>Span: 1 complete
    G-->>Client: Confirm<br/>Total: 75ms
    Note over G,D: Each span tracked<br/>for performance analysis
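
The value of the span hierarchy is that it separates time spent inside a service from time spent waiting on its callees. The sketch below models spans as a small tree and computes each span's "self time". The `Span` class is hypothetical and not tied to any tracing library. The durations come from the diagram, and the 75 ms total is treated as the duration of span 1, which is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal span tree for illustration; real tracers also record timestamps and trace IDs. */
final class Span {
    final String id;
    final String name;
    final long durationMillis;
    private final List<Span> children = new ArrayList<>();

    Span(String id, String name, long durationMillis) {
        this.id = id;
        this.name = name;
        this.durationMillis = durationMillis;
    }

    /** Adds a child span and returns this span to allow chaining. */
    Span add(Span child) {
        children.add(child);
        return this;
    }

    /** Time spent in this span itself, excluding its direct children (assumes sequential calls). */
    long selfTimeMillis() {
        long childTotal = 0;
        for (Span child : children) {
            childTotal += child.durationMillis;
        }
        return durationMillis - childTotal;
    }
}
```

Built from the diagram's numbers, span 1 (75 ms) has children of 15, 20, and 25 ms, so its self time is 15 ms. Span 1.3 (25 ms) spends 15 ms in the database and 10 ms in the settlement service itself.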

6 Operational Excellence

6.1 Deployment Pipeline

Note: The flowchart below illustrates a proposed deployment pipeline. The specific stages, automation tools, and approval gates will vary based on the CI/CD practices and regulatory requirements.

flowchart LR A[Code Commit] --> B[Automated Tests] B --> C[Security Scan] C --> D[Build Artifact] D --> E[Deploy to Dev] E --> F[Integration Tests] F --> G[Deploy to Staging] G --> H[Performance Tests] H --> I[Security Audit] I --> J[Production Approval] J --> K[Canary Deployment] K --> L[Full Rollout] style A fill:#e3f2fd,stroke:#1976d2 style B fill:#fff3e0,stroke:#f57c00 style C fill:#ffebee,stroke:#c62828 style L fill:#e8f5e9,stroke:#388e3c

6.2 Change Management

Note: The table below presents a proposed example of change management policies. The specific approval workflows, testing requirements, and deployment windows should be defined based on the organization’s risk appetite and operational procedures.

| Change Type | Approval | Testing | Deployment Window |
|-------------|----------|---------|-------------------|
| Critical Security | Emergency | Minimal | Immediate |
| Bug Fix | Tech Lead | Regression | Off-peak |
| Feature | Change Board | Full Suite | Scheduled |
| Infrastructure | Architecture | Performance | Maintenance Window |

6.3 Capacity Planning

Note: The diagram below illustrates a proposed capacity planning process. The specific inputs, modeling techniques, and outputs will vary based on the system’s growth patterns and resource management strategies.

graph LR subgraph "Capacity Inputs" A[Historical Growth] B[Business Forecasts] C[Seasonal Patterns] D[New Features] end subgraph "Planning Process" E[Trend Analysis] F[Capacity Modeling] G[Resource Planning] end subgraph "Outputs" H[Infrastructure Plan] I[Budget Forecast] J[Scaling Timeline] end A --> E B --> E C --> E D --> E E --> F F --> G G --> H G --> I G --> J style A fill:#e3f2fd,stroke:#1976d2 style E fill:#fff3e0,stroke:#f57c00 style H fill:#e8f5e9,stroke:#388e3c
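
The "Trend Analysis" and "Capacity Modeling" steps often start with a simple compound-growth projection. The sketch below assumes a constant monthly growth rate and a fixed headroom reserve. Both are simplifying assumptions, the class name is hypothetical, and the numbers in the example are made up for illustration.

```java
/** Illustrative compound-growth capacity projection. */
final class CapacityForecast {

    /**
     * Returns the first month (0 = now) in which projected peak load exceeds usable capacity,
     * or -1 if that does not happen within the planning horizon.
     *
     * @param headroomFraction share of capacity kept in reserve, e.g. 0.30 for 30%
     */
    static int monthsUntilExhaustion(double currentPeakTps, double monthlyGrowthRate,
                                     double capacityTps, double headroomFraction,
                                     int horizonMonths) {
        double usableTps = capacityTps * (1.0 - headroomFraction);
        double projectedTps = currentPeakTps;
        for (int month = 0; month <= horizonMonths; month++) {
            if (projectedTps > usableTps) {
                return month;
            }
            projectedTps *= (1.0 + monthlyGrowthRate); // compound monthly growth
        }
        return -1;
    }
}
```

With a 400 TPS peak today, 5% monthly growth, 1000 TPS of capacity, and 30% headroom, the usable 700 TPS is first exceeded in month 12, so `monthsUntilExhaustion(400, 0.05, 1000, 0.30, 36)` returns 12. That date feeds the scaling timeline.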

7 Series Summary

7.1 Complete Series Overview

📚 RTGS Series Complete

All five articles in this series:

| Part | Topic | Key Takeaways |
|------|-------|---------------|
| Part 1 | Core Concepts | RTGS fundamentals, Real-time vs. net settlement |
| Part 2 | System Architecture | Components, Data flow, Integration |
| Part 3 | Message Standards | ISO 20022, SWIFT migration, Validation |
| Part 4 | Security & Risk | Threats, Controls, Compliance |
| Part 5 | High Availability | Redundancy, Performance, Operations |

7.2 Key Concepts Recap

mindmap
  root((RTGS))
    Core Concepts
      Real-Time Settlement
      Gross vs. Net
      Finality
    Architecture
      Payment Processor
      Queue Manager
      Settlement Engine
    Messages
      ISO 20022
      pacs.008
      pacs.009
    Security
      HSM
      Encryption
      Fraud Detection
    Availability
      Redundancy
      Failover
      Monitoring

7.3 Further Learning

Topic	Resources
ISO 20022	iso20022.org, SWIFT documentation
Payment Systems	BIS Publications, Central Bank guides
Security	NIST Frameworks, PCI DSS
Architecture	Enterprise Architecture patterns

8 Summary

📋 Key Takeaways

Essential high availability and performance concepts:

✅ High Availability Architecture

Active-Active or Active-Passive redundancy
Synchronous replication for zero data loss
Automatic failover with health monitoring

✅ Performance Optimization

Multi-level caching strategy
Database indexing and partitioning
Connection pooling and async processing

✅ Scalability Design

Horizontal scaling for stateless components
Queue partitioning for parallel processing
Read replicas for database scaling

✅ Monitoring and Observability

Comprehensive metrics collection
Distributed tracing
Real-time alerting

✅ Operational Excellence

Automated deployment pipeline
Change management processes
Capacity planning

Footnotes for this article:

Note: For a complete list of all acronyms used in the RTGS series, see the RTGS Acronyms and Abbreviations Reference.
