How to Handle Model Rate Limits
Abstract
A LangSmith guidance article covering three complementary strategies for managing model provider rate limit errors in production and large-scale evaluation workflows. The first approach uses LangChain’s `InMemoryRateLimiter` — a client-side token bucket attached to a model instance that throttles outbound requests to a configurable average rate (`requests_per_second`) with a configurable burst ceiling (`max_bucket_size`). This is proactive: it prevents rate limit errors before they occur. The second approach is reactive: exponential backoff retry, either via LangChain’s `.with_retry(stop_after_attempt=N)` chain method (Python/JS) or external libraries (`tenacity`, `backoff`) for non-LangChain code. The third approach is concurrency limiting via `max_concurrency` on LangSmith’s `evaluate()`/`aevaluate()` functions, which parallelizes dataset evaluation across threads while capping simultaneous API calls — effective for large-scale offline evaluation jobs.
Key Concepts
- `InMemoryRateLimiter`: Client-side token bucket rate limiter attached to a LangChain model instance; configured with `requests_per_second` (average rate), `check_every_n_seconds` (polling interval), and `max_bucket_size` (maximum burst); Python-only
- Token Bucket Algorithm: Rate limiting approach that allows short bursts up to `max_bucket_size` while enforcing a long-run average of `requests_per_second`; the bucket refills at the configured rate (see the sketch after this list)
- Exponential Backoff Retry: Failed requests are retried with exponentially increasing wait times between attempts; prevents the thundering-herd problem, where simultaneous retries worsen provider congestion
- `.with_retry()` Method: LangChain chain method attaching exponential backoff retry logic to any model instance or chain; configurable with `stop_after_attempt=N`
- `max_concurrency` on `evaluate()`: LangSmith evaluation-level concurrency cap; splits the evaluation dataset across parallel threads; reduces peak simultaneous API calls without per-request throttling
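To make the bucket mechanics concrete, here is a minimal self-contained sketch of the algorithm. The `TokenBucket` class and its `acquire` method are illustrative stand-ins, not the actual internals of LangChain's `InMemoryRateLimiter`:

```python
import time

class TokenBucket:
    """Minimal token bucket: refills at `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate              # long-run average (requests_per_second analogue)
        self.capacity = capacity      # burst ceiling (max_bucket_size analogue)
        self.tokens = capacity        # start full so an initial burst is allowed
        self.last_refill = time.monotonic()

    def acquire(self) -> None:
        """Block until one token is available, then consume it."""
        while True:
            now = time.monotonic()
            # Refill in proportion to elapsed time, capped at capacity.
            elapsed = now - self.last_refill
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep(0.1)           # polling interval (check_every_n_seconds analogue)

# Usage: average of 0.1 requests/sec, with bursts of up to 10 allowed.
bucket = TokenBucket(rate=0.1, capacity=10)
bucket.acquire()                      # call once before each outbound model request
```

Starting the bucket full mirrors the burst behavior described above: up to `capacity` requests can go out immediately, after which the long-run `rate` dominates.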
Key Claims and Findings
- Rate limiters and retries address different failure modes: rate limiters prevent hitting limits proactively; retries recover from limit errors reactively — both are needed in robust production systems
- `max_concurrency` is evaluation-specific — it parallelizes dataset processing while capping concurrency, but is not a general-purpose rate limiter for production agents
- Non-LangChain code can use the `tenacity` (Python) or `backoff` (Python) libraries for retry logic (see the sketch after this list), or implement exponential backoff from scratch following the OpenAI docs pattern
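As a concrete instance of the library route, the sketch below uses `tenacity` decorators; `call_model` is a hypothetical placeholder for a raw provider call, and `wait_exponential_jitter` assumes a reasonably recent tenacity release (8.1+):

```python
from tenacity import retry, stop_after_attempt, wait_exponential_jitter

@retry(
    stop=stop_after_attempt(6),                       # propagate after 6 failed attempts
    wait=wait_exponential_jitter(initial=1, max=60),  # exponential backoff with jitter
)
def call_model(prompt: str) -> str:
    """Hypothetical placeholder for a raw provider SDK call.

    Any exception raised here (e.g. the provider's rate limit error)
    triggers another attempt after the backoff wait configured above.
    """
    ...
```

Note that by default `tenacity` retries on any exception; in production you would narrow this with `retry_if_exception_type` to your provider's rate limit error class.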
Strategy Comparison
| Strategy | Mechanism | When to Use |
|---|---|---|
| `InMemoryRateLimiter` | Proactive throttling (token bucket) | Known rate limit; want to prevent errors before they occur |
| `.with_retry()` / exponential backoff | Reactive retry on failure | Unpredictable bursts; want resilience to transient limit errors |
| `max_concurrency` | Parallel evaluation concurrency cap | Large LangSmith evaluation jobs hitting rate limits |
Implementation Reference
```python
from langchain.chat_models import init_chat_model
from langchain_core.rate_limiters import InMemoryRateLimiter
from langsmith import aevaluate

# Proactive: throttle to an average of 0.1 req/sec, allowing bursts up to 10
rate_limiter = InMemoryRateLimiter(
    requests_per_second=0.1,
    check_every_n_seconds=0.1,
    max_bucket_size=10,
)
model = init_chat_model("gpt-5.4", rate_limiter=rate_limiter)

# Reactive: retry up to 6 times with exponential backoff
model_with_retry = init_chat_model("gpt-5.4-mini").with_retry(stop_after_attempt=6)

# Evaluation: cap concurrency at 4 parallel threads (inside an async context)
results = await aevaluate(..., max_concurrency=4)
```

Terminology
- Token Bucket: Rate limiting algorithm maintaining a “bucket” of available request credits that refill at `requests_per_second` and allow bursts up to `max_bucket_size`
- Thundering Herd: Failure pattern where many simultaneous retries after a rate limit error collectively worsen congestion at the provider endpoint; exponential backoff with jitter mitigates this (see the sketch after this list)
- `tenacity`/`backoff`: Python libraries providing decorator-based retry logic with configurable exponential backoff strategies for non-LangChain code
- `stop_after_attempt`: `.with_retry()` parameter specifying the maximum number of retry attempts before propagating the exception
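For code that should not take on a dependency, a from-scratch version in the spirit of the OpenAI docs pattern mentioned above looks roughly like this; `send_request` is a hypothetical placeholder, and the bare `except Exception` stands in for the provider's specific rate limit error:

```python
import random
import time

def with_backoff(send_request, max_retries: int = 6):
    """Retry `send_request` with exponential backoff and full jitter."""
    delay = 1.0
    for attempt in range(max_retries):
        try:
            return send_request()
        except Exception:  # narrow to your provider's rate limit error in real code
            if attempt == max_retries - 1:
                raise  # out of attempts: propagate the last error
            time.sleep(delay * random.random())  # full jitter: sleep in [0, delay)
            delay *= 2  # double the backoff window each attempt
```

The random jitter is what breaks up the thundering herd: clients that failed at the same moment no longer retry at the same moment.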
Connections to Existing Wiki Pages
- Retry Pattern — the exponential backoff retry approach here is a direct application of the retry pattern; `.with_retry()` and `tenacity` are concrete LangChain/Python instantiations of that pattern
- Transient Fault Handling — Best Practices — rate limit errors are transient faults; the strategies here (retry with backoff, concurrency limiting) are implementations of the transient fault handling best practices described there
- Log, Trace, and Monitor Portkey Integrations — Portkey’s built-in exponential backoff retry (up to 5 attempts) provides the same reactive recovery at the gateway level rather than the LangChain client level