How to Handle Model Rate Limits

Abstract

A LangSmith guidance article covering three complementary strategies for managing model provider rate limit errors in production and large-scale evaluation workflows. The first approach uses LangChain’s InMemoryRateLimiter — a client-side token bucket attached to a model instance that throttles outbound requests to a configurable requests_per_second with a configurable burst ceiling (max_bucket_size). This is proactive: it prevents rate limit errors before they occur. The second approach is reactive: exponential backoff retry, either via LangChain’s .with_retry(stop_after_attempt=N) chain method (Python/JS) or external libraries (tenacity, backoff) for non-LangChain code. The third approach is concurrency limiting via max_concurrency on LangSmith’s evaluate()/aevaluate() functions, which parallelizes dataset evaluation across threads while capping simultaneous API calls — effective for large-scale offline evaluation jobs.


Key Concepts

  • InMemoryRateLimiter: Client-side token bucket rate limiter attached to a LangChain model instance; configured with requests_per_second (average rate), check_every_n_seconds (polling interval), and max_bucket_size (maximum burst); Python-only
  • Token Bucket Algorithm: Rate limiting approach that allows short bursts up to max_bucket_size while enforcing a long-run average of requests_per_second; the bucket refills at the configured rate (a minimal sketch follows this list)
  • Exponential Backoff Retry: Failed requests are retried with exponentially increasing wait times between attempts; mitigates the thundering-herd problem, in which simultaneous retries worsen provider congestion
  • .with_retry() Method: LangChain chain method attaching exponential backoff retry logic to any model instance or chain; configurable with stop_after_attempt=N
  • max_concurrency on evaluate(): LangSmith evaluation-level concurrency cap; splits the evaluation dataset across parallel threads; reduces peak simultaneous API calls without per-request throttling
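
To make the token bucket mechanics concrete, here is a minimal illustrative sketch of the algorithm itself, not LangChain's InMemoryRateLimiter source; the TokenBucket class and its acquire() method are hypothetical names:

import time

class TokenBucket:
    """Illustrative token bucket; not LangChain's implementation.

    Tokens refill continuously at `rate` per second up to `capacity`.
    Each request consumes one token, so bursts of up to `capacity`
    requests are allowed while the long-run average stays at `rate`.
    """

    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # analogous to requests_per_second
        self.capacity = capacity    # analogous to max_bucket_size
        self.tokens = capacity      # start full, so an initial burst is allowed
        self.last_refill = time.monotonic()

    def acquire(self) -> None:
        while True:
            now = time.monotonic()
            elapsed = now - self.last_refill
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            # Sleep until roughly one full token has accrued
            time.sleep((1 - self.tokens) / self.rate)

LangChain's InMemoryRateLimiter applies the same idea, with check_every_n_seconds controlling how often a waiting caller re-checks for an available token.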

Key Claims and Findings

  • Rate limiters and retries address different failure modes: rate limiters prevent hitting limits proactively; retries recover from limit errors reactively — both are needed in robust production systems
  • max_concurrency is evaluation-specific — it parallelizes dataset processing while capping concurrency, but is not a general-purpose rate limiter for production agents
  • Non-LangChain code can use the tenacity or backoff Python libraries for retry logic, or implement exponential backoff from scratch following the OpenAI docs pattern (a tenacity sketch follows this list)
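
As a concrete instance of the last claim, the following is a hedged sketch using tenacity with the OpenAI Python SDK (the SDK choice is an assumption; the article does not prescribe a client); the wait parameters follow the OpenAI docs pattern, and completion_with_backoff is an illustrative name:

from openai import OpenAI, RateLimitError
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_random_exponential,
)

client = OpenAI()

@retry(
    retry=retry_if_exception_type(RateLimitError),  # only retry rate limit errors
    wait=wait_random_exponential(min=1, max=60),    # jittered exponential backoff, 1s to 60s
    stop=stop_after_attempt(6),                     # give up after 6 total attempts
)
def completion_with_backoff(**kwargs):
    return client.chat.completions.create(**kwargs)

reply = completion_with_backoff(
    model="gpt-5.4-mini",
    messages=[{"role": "user", "content": "Hello"}],
)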

Strategy Comparison

Strategy                            | Mechanism                            | When to Use
InMemoryRateLimiter                 | Proactive throttling (token bucket)  | Known rate limit; want to prevent errors before they occur
.with_retry() / exponential backoff | Reactive retry on failure            | Unpredictable bursts; want resilience to transient limit errors
max_concurrency                     | Parallel evaluation concurrency cap  | Large LangSmith evaluation jobs hitting rate limits

Implementation Reference

from langchain.chat_models import init_chat_model
from langchain_core.rate_limiters import InMemoryRateLimiter
from langsmith import aevaluate
 
# Proactive: throttle to 0.1 req/sec (one request every 10s on average), burst up to 10
rate_limiter = InMemoryRateLimiter(
    requests_per_second=0.1,
    check_every_n_seconds=0.1,
    max_bucket_size=10,
)
model = init_chat_model("gpt-5.4", rate_limiter=rate_limiter)
 
# Reactive: exponential backoff, up to 6 total attempts (initial call + 5 retries)
model_with_retry = init_chat_model("gpt-5.4-mini").with_retry(stop_after_attempt=6)
 
# Evaluation: cap concurrency at 4 parallel threads (call from within an async function)
results = await aevaluate(..., max_concurrency=4)
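
Because the proactive and reactive strategies address different failure modes, they can be layered on one model, which is what the claim above that "both are needed" suggests. A sketch under the same assumptions as the snippet above:

# Combined: throttle outbound requests, and retry any that still fail
robust_model = init_chat_model(
    "gpt-5.4",
    rate_limiter=rate_limiter,
).with_retry(stop_after_attempt=6)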

Terminology

  • Token Bucket: Rate limiting algorithm maintaining a “bucket” of available request credits that refill at requests_per_second and allow bursts up to max_bucket_size
  • Thundering Herd: Failure pattern where many simultaneous retries after a rate limit error collectively worsen congestion at the provider endpoint; exponential backoff with jitter mitigates this (a from-scratch sketch follows this list)
  • tenacity / backoff: Python libraries providing decorator-based retry logic with configurable exponential backoff strategies for non-LangChain code
  • stop_after_attempt: .with_retry() parameter specifying the maximum number of retry attempts before propagating the exception
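
For code that cannot take a tenacity or backoff dependency, the same jittered backoff can be written from scratch, in the spirit of the OpenAI docs pattern cited above. A minimal sketch with illustrative names and defaults:

import random
import time

def retry_with_exponential_backoff(func, max_attempts=6, base_delay=1.0, max_delay=60.0):
    """Call func(), retrying failures with exponentially growing, jittered waits."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:  # in practice, catch only the provider's rate limit exception
            if attempt == max_attempts - 1:
                raise  # out of attempts; propagate the last error
            delay = min(base_delay * 2 ** attempt, max_delay)
            # Full jitter: a random wait in [0, delay] keeps retries from synchronizing
            time.sleep(random.uniform(0, delay))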

Connections to Existing Wiki Pages

  • Retry Pattern — the exponential backoff retry approach here is a direct application of the retry pattern; .with_retry() and tenacity are concrete LangChain/Python instantiations of that pattern
  • Transient Fault Handling — Best Practices — rate limit errors are transient faults; the strategies here (retry with backoff, concurrency limiting) are implementations of the transient fault handling best practices described there
  • Log, Trace, and Monitor Portkey Integrations — Portkey’s built-in exponential backoff retry (up to 5 attempts) provides the same reactive recovery at the gateway level rather than the LangChain client level