How to Handle Model Rate Limits

Abstract

A LangSmith guidance article covering three complementary strategies for managing model provider rate limit errors in production and large-scale evaluation workflows. The first approach uses LangChain’s InMemoryRateLimiter — a client-side token bucket attached to a model instance that throttles outbound requests to a configurable requests_per_second with a configurable burst ceiling (max_bucket_size). This is proactive: it prevents rate limit errors before they occur. The second approach is reactive: exponential backoff retry, either via LangChain’s .with_retry(stop_after_attempt=N) chain method (Python/JS) or external libraries (tenacity, backoff) for non-LangChain code. The third approach is concurrency limiting via max_concurrency on LangSmith’s evaluate()/aevaluate() functions, which parallelizes dataset evaluation across threads while capping simultaneous API calls — effective for large-scale offline evaluation jobs.


Key Concepts

  • InMemoryRateLimiter: Client-side token bucket rate limiter attached to a LangChain model instance; configured with requests_per_second (average rate), check_every_n_seconds (polling interval), and max_bucket_size (maximum burst); Python-only
  • Token Bucket Algorithm: Rate limiting approach that allows short bursts up to max_bucket_size while enforcing a long-run average of requests_per_second; the bucket refills at the configured rate (a minimal sketch follows this list)
  • Exponential Backoff Retry: Failed requests are retried with exponentially increasing wait times between attempts; mitigates the thundering-herd problem, in which simultaneous retries worsen provider congestion
  • .with_retry() Method: LangChain chain method attaching exponential backoff retry logic to any model instance or chain; configurable with stop_after_attempt=N
  • max_concurrency on evaluate(): LangSmith evaluation-level concurrency cap; splits the evaluation dataset across parallel threads; reduces peak simultaneous API calls without per-request throttling
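
To make the token bucket mechanics concrete, here is a minimal illustrative sketch of the algorithm itself, not LangChain's InMemoryRateLimiter source; the TokenBucket class and its acquire() method are hypothetical names:

import time

class TokenBucket:
    """Illustrative token bucket; not LangChain's implementation.

    Tokens refill continuously at `rate` per second up to `capacity`.
    Each request consumes one token, so bursts of up to `capacity`
    requests are allowed while the long-run average stays at `rate`.
    """

    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # analogous to requests_per_second
        self.capacity = capacity    # analogous to max_bucket_size
        self.tokens = capacity      # start full, so an initial burst is allowed
        self.last_refill = time.monotonic()

    def acquire(self) -> None:
        while True:
            now = time.monotonic()
            elapsed = now - self.last_refill
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            # Sleep until roughly one full token has accrued
            time.sleep((1 - self.tokens) / self.rate)

LangChain's InMemoryRateLimiter applies the same idea, with check_every_n_seconds controlling how often a waiting caller re-checks for an available token.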

Key Claims and Findings

  • Rate limiters and retries address different failure modes: rate limiters prevent hitting limits proactively; retries recover from limit errors reactively — both are needed in robust production systems
  • max_concurrency is evaluation-specific — it parallelizes dataset processing while capping concurrency, but is not a general-purpose rate limiter for production agents
  • Non-LangChain code can use the tenacity or backoff Python libraries for retry logic, or implement exponential backoff from scratch following the OpenAI docs pattern (a tenacity sketch follows this list)
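
As a concrete instance of the last claim, the following is a hedged sketch using tenacity with the OpenAI Python SDK (the SDK choice is an assumption; the article does not prescribe a client); the wait parameters follow the OpenAI docs pattern, and completion_with_backoff is an illustrative name:

from openai import OpenAI, RateLimitError
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_random_exponential,
)

client = OpenAI()

@retry(
    retry=retry_if_exception_type(RateLimitError),  # only retry rate limit errors
    wait=wait_random_exponential(min=1, max=60),    # jittered exponential backoff, 1s to 60s
    stop=stop_after_attempt(6),                     # give up after 6 total attempts
)
def completion_with_backoff(**kwargs):
    return client.chat.completions.create(**kwargs)

reply = completion_with_backoff(
    model="gpt-5.4-mini",
    messages=[{"role": "user", "content": "Hello"}],
)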

Strategy Comparison

Strategy                            | Mechanism                            | When to Use
InMemoryRateLimiter                 | Proactive throttling (token bucket)  | Known rate limit; want to prevent errors before they occur
.with_retry() / exponential backoff | Reactive retry on failure            | Unpredictable bursts; want resilience to transient limit errors
max_concurrency                     | Parallel evaluation concurrency cap  | Large LangSmith evaluation jobs hitting rate limits

Implementation Reference

from langchain.chat_models import init_chat_model
from langchain_core.rate_limiters import InMemoryRateLimiter
from langsmith import aevaluate
 
# Proactive: throttle to 0.1 req/sec (one request every 10s on average), burst up to 10
rate_limiter = InMemoryRateLimiter(
    requests_per_second=0.1,
    check_every_n_seconds=0.1,
    max_bucket_size=10,
)
model = init_chat_model("gpt-5.4", rate_limiter=rate_limiter)
 
# Reactive: exponential backoff, up to 6 total attempts (initial call + 5 retries)
model_with_retry = init_chat_model("gpt-5.4-mini").with_retry(stop_after_attempt=6)
 
# Evaluation: cap concurrency at 4 parallel threads (call from within an async function)
results = await aevaluate(..., max_concurrency=4)
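
Because the proactive and reactive strategies address different failure modes, they can be layered on one model, which is what the claim above that "both are needed" suggests. A sketch under the same assumptions as the snippet above:

# Combined: throttle outbound requests, and retry any that still fail
robust_model = init_chat_model(
    "gpt-5.4",
    rate_limiter=rate_limiter,
).with_retry(stop_after_attempt=6)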

Terminology

  • Token Bucket: Rate limiting algorithm maintaining a “bucket” of available request credits that refill at requests_per_second and allow bursts up to max_bucket_size
  • Thundering Herd: Failure pattern where many simultaneous retries after a rate limit error collectively worsen congestion at the provider endpoint; exponential backoff with jitter mitigates this (a from-scratch sketch follows this list)
  • tenacity / backoff: Python libraries providing decorator-based retry logic with configurable exponential backoff strategies for non-LangChain code
  • stop_after_attempt: .with_retry() parameter specifying the maximum number of retry attempts before propagating the exception
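
For code that cannot take a tenacity or backoff dependency, the same jittered backoff can be written from scratch, in the spirit of the OpenAI docs pattern cited above. A minimal sketch with illustrative names and defaults:

import random
import time

def retry_with_exponential_backoff(func, max_attempts=6, base_delay=1.0, max_delay=60.0):
    """Call func(), retrying failures with exponentially growing, jittered waits."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:  # in practice, catch only the provider's rate limit exception
            if attempt == max_attempts - 1:
                raise  # out of attempts; propagate the last error
            delay = min(base_delay * 2 ** attempt, max_delay)
            # Full jitter: a random wait in [0, delay] keeps retries from synchronizing
            time.sleep(random.uniform(0, delay))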

Connections to Existing Wiki Pages

  • Retry Pattern — the exponential backoff retry approach here is a direct application of the retry pattern; .with_retry() and tenacity are concrete LangChain/Python instantiations of that pattern
  • Transient Fault Handling — Best Practices — rate limit errors are transient faults; the strategies here (retry with backoff, concurrency limiting) are implementations of the transient fault handling best practices described there
  • Log, Trace, and Monitor Portkey Integrations — Portkey’s built-in exponential backoff retry (up to 5 attempts) provides the same reactive recovery at the gateway level rather than the LangChain client level