- Context and Problem
- Solution
- Issues and Considerations
- When to Use This Pattern
- Example Architecture
- Related Patterns
- References
Many services use throttling to control resource consumption, imposing limits on the rate at which applications can access them. The rate limiting pattern helps you avoid throttling errors and accurately predict throughput, especially for large-scale repetitive tasks like batch processing.
Context and Problem
Performing large numbers of operations against a throttled service can result in increased traffic and reduced efficiency. You’ll need to track rejected requests and retry operations, potentially requiring multiple passes to complete your work.
Consider this example of ingesting data into a database:
- Your application needs to ingest 10,000 records. Each record costs 10 Request Units (RUs), requiring 100,000 RUs total.
- Your database instance has 20,000 RUs of provisioned capacity.
- You send all 10,000 records. 2,000 succeed, 8,000 are rejected.
- You retry with 8,000 records. 2,000 succeed, 6,000 are rejected.
- You retry with 6,000 records. 2,000 succeed, 4,000 are rejected.
- You retry with 4,000 records. 2,000 succeed, 2,000 are rejected.
- You retry with 2,000 records. All succeed.
The job completed, but only after sending 30,000 records—three times the actual dataset size.
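A minimal sketch of this retry-everything loop reproduces the arithmetic above. The `try_ingest` function here is hypothetical; it simply models a service that accepts 20,000 RUs worth of records per pass and rejects the rest:

```python
# Naive approach: send everything, then retry whatever was throttled.
RECORD_COST_RUS = 10
CAPACITY_RUS = 20_000  # provisioned capacity available per pass

def try_ingest(records):
    """Hypothetical call: accepts records up to capacity, rejects the rest."""
    accepted_count = CAPACITY_RUS // RECORD_COST_RUS      # 2,000 records per pass
    return records[:accepted_count], records[accepted_count:]

pending = list(range(10_000))
total_sent = 0
while pending:
    total_sent += len(pending)      # every pending record is sent again
    _, pending = try_ingest(pending)

print(total_sent)                   # 30,000 sends for a 10,000-record dataset
```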
Additional problems with this naive approach:
- Error handling overhead: 20,000 errors need logging and processing, consuming memory and storage.
- Unpredictable completion time: Without knowing throttling limits, you can’t estimate how long processing will take.
Solution
Rate limiting reduces traffic and improves throughput by controlling the number of records sent to a service over time.
Services throttle based on different metrics:
- Number of operations (e.g., 20 requests per second)
- Amount of data (e.g., 2 GiB per minute)
- Relative cost of operations (e.g., 20,000 RUs per second)
Your rate limiting implementation must control operations sent to the service, optimizing usage without exceeding capacity.
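One common way to do this (a sketch, not tied to any particular service or SDK) is a token bucket that refills at the service's advertised rate and is charged in whatever unit the service meters, whether that's operations, bytes, or RUs:

```python
import time

class TokenBucket:
    """Capacity-based limiter; units can be operations, bytes, or RUs."""

    def __init__(self, rate_per_second: float, burst: float):
        self.rate = rate_per_second
        self.burst = burst
        self.tokens = burst
        self.last = time.monotonic()

    def acquire(self, cost: float = 1.0) -> None:
        """Block until `cost` units of capacity are available, then consume them."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= cost:
                self.tokens -= cost
                return
            time.sleep((cost - self.tokens) / self.rate)

# A service that allows 20,000 RUs per second, with records costing 10 RUs each:
limiter = TokenBucket(rate_per_second=20_000, burst=20_000)
# limiter.acquire(cost=10)  # call before each write, passing the operation's cost
```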
Using a Durable Messaging System
When your APIs can handle requests faster than throttled services allow, you need to manage ingestion speed. Simply buffering requests is risky—if your application crashes, you lose buffered data.
Instead, send records to a durable messaging system that can handle your full ingestion rate. Use job processors to read records at a controlled rate within the throttled service’s limits.
Durable messaging options include:
- Message queues (e.g., RabbitMQ, ActiveMQ)
- Event streaming platforms (e.g., Apache Kafka)
- Cloud-based queue services
(High Rate)"] --> B["Durable
Message Queue"] B --> C["Job Processor 1"] B --> D["Job Processor 2"] B --> E["Job Processor 3"] C --> F["Throttled Service
(Limited Rate)"] D --> F E --> F style A fill:#e1f5ff style B fill:#fff4e1 style F fill:#ffe1e1
Granular Time Intervals
Services often throttle based on human-comprehensible timespans (per second or per minute), but computers process requests much faster than that. Rather than releasing one large batch at the top of each second, send smaller amounts more frequently to:
- Keep resource consumption (memory, CPU, network) flowing evenly
- Prevent bottlenecks from sudden request bursts
For example, if a service allows 100 operations per second, release 20 operations every 200 milliseconds:
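A sketch of that schedule, assuming a caller-supplied `send` function for the throttled call:

```python
import time

OPS_PER_INTERVAL = 20      # 100 operations per second ...
INTERVAL_SECONDS = 0.2     # ... released as 20 operations every 200 ms

def paced_send(operations, send):
    """Release small, even batches instead of one burst at the top of each second."""
    for i in range(0, len(operations), OPS_PER_INTERVAL):
        window_start = time.monotonic()
        for op in operations[i : i + OPS_PER_INTERVAL]:
            send(op)
        # Sleep for whatever remains of the 200 ms window.
        elapsed = time.monotonic() - window_start
        time.sleep(max(0.0, INTERVAL_SECONDS - elapsed))
```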
Managing Multiple Uncoordinated Processes
When multiple processes share a throttled service, logically partition the service’s capacity and use a distributed mutual exclusion system to manage locks on those partitions.
Example:
If a throttled system allows 500 requests per second:
- Create 20 partitions worth 25 requests per second each
- A process needing 100 requests asks for four partitions
- The system grants two partitions for 10 seconds
- The process rate limits to 50 requests per second, completes in 2 seconds, and releases the lock
Implementation approach:
Use blob storage to create one small file per logical partition. Applications obtain exclusive leases on these files for short periods (e.g., 15 seconds). For each lease granted, the application can use that partition’s capacity.
25 req/s"] L2["Lease 2
25 req/s"] L3["Lease 3
25 req/s"] end space:3 block:service:3 S["Throttled Service
500 req/s total"] end P1 --> L1 P2 --> L2 P3 --> L3 L1 --> S L2 --> S L3 --> S style processes fill:#e1f5ff style leases fill:#fff4e1 style service fill:#ffe1e1
To reduce latency, allocate a small amount of exclusive capacity for each process. Processes only seek shared capacity leases when exceeding their reserved capacity.
Alternative technologies for lease management include Zookeeper, Consul, etcd, and Redis/Redsync.
Issues and Considerations
💡 Key Considerations
Handle throttling errors: Rate limiting reduces errors but doesn't eliminate them. Your application must still handle any throttling errors that occur.
Multiple workstreams: If your application has multiple workstreams accessing the same throttled service (e.g., bulk loading and querying), integrate all into your rate limiting strategy or reserve separate capacity pools for each.
Multi-application usage: When multiple applications use the same throttled service, increased throttling errors might indicate contention. Consider temporarily reducing throughput until usage from other applications decreases.
When to Use This Pattern
Use this pattern to:
- Reduce throttling errors from throttle-limited services
- Reduce traffic compared to naive retry-on-error approaches
- Reduce memory consumption by dequeuing records only when there’s capacity to process them
- Improve predictability of batch processing completion times
Example Architecture
Consider an application where users submit records of various types to an API. Each record type has a unique job processor that performs validation, enrichment, and database insertion.
All components (API, job processors) are separate processes that scale independently and don’t directly communicate.
(Type A Records)"] API --> QB["Queue B
(Type B Records)"] QA --> JPA["Job Processor A"] QB --> JPB["Job Processor B"] JPA --> LS["Lease Storage
(Blob 0-9)"] JPB --> LS JPA --> DB["Database
(1000 req/s limit)"] JPB --> DB style API fill:#e1f5ff style QA fill:#fff4e1 style QB fill:#fff4e1 style LS fill:#f0e1ff style DB fill:#ffe1e1
Workflow:
- User submits 10,000 records of type A to the API
- API enqueues records in Queue A
- User submits 5,000 records of type B to the API
- API enqueues records in Queue B
- Job Processor A attempts to lease blob 2
- Job Processor B attempts to lease blob 2
- Job Processor A fails; Job Processor B obtains the lease for 15 seconds (100 req/s capacity)
- Job Processor B dequeues and writes 100 records
- After 1 second, both processors attempt additional leases
- Job Processor A obtains blob 6 (100 req/s); Job Processor B obtains blob 3 (now 200 req/s total)
- Processors continue competing for leases and processing records at their granted rates
- As leases expire (after 15 seconds), processors reduce their request rates accordingly
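Putting these pieces together, each job processor can run a loop like the sketch below. The helpers `try_acquire_lease`, `dequeue_batch`, and `write_to_database` are placeholders for the lease-storage, queue, and database clients described above:

```python
import time

BLOB_CAPACITY = 100   # each of blobs 0-9 represents 100 req/s of the 1,000 req/s limit
LEASE_SECONDS = 15

def job_processor_loop(try_acquire_lease, dequeue_batch, write_to_database):
    """Once per second: refresh held capacity, then write within that capacity."""
    held = []   # (lease, expiry) pairs
    while True:
        now = time.monotonic()
        # Drop expired leases so the request rate falls back immediately.
        held = [(lease, expiry) for (lease, expiry) in held if expiry > now]

        # Opportunistically compete for one more partition on every pass.
        lease = try_acquire_lease(duration=LEASE_SECONDS)
        if lease is not None:
            held.append((lease, now + LEASE_SECONDS))

        # The allowed rate is the combined capacity of the partitions held.
        allowed = len(held) * BLOB_CAPACITY
        for record in dequeue_batch(max_records=allowed):
            write_to_database(record)

        time.sleep(1)   # one pass per second keeps writes within `allowed` per second
```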
Related Patterns
Throttling: Rate limiting is typically implemented in response to a throttled service.
Retry: When requests result in throttling errors, retry after an appropriate interval.
Queue-Based Load Leveling: Similar to, but broader than, rate limiting. Key differences:
- Rate limiting doesn’t necessarily require queues but needs durable messaging
- Rate limiting introduces distributed mutual exclusion on partitions for managing capacity across uncoordinated processes
- Queue-based load leveling applies to any performance mismatch between services; rate limiting specifically addresses throttled services