Circuit Breaker Pattern

Azure Architecture Center — companion to Retry Pattern and Transient Fault Handling.

Abstract

This Azure Architecture Center article describes the Circuit Breaker pattern — a distributed systems resilience mechanism that prevents cascading failures by acting as a proxy between an application and a potentially faulty remote service. The proxy monitors recent failure rates and transitions through three states (Closed, Open, Half-Open) to control whether requests are forwarded or rejected immediately. Unlike the Retry pattern, which handles short transient faults, the circuit breaker handles faults of variable and potentially long duration by temporarily halting traffic to allow the downstream service to recover. The article covers the state machine, implementation considerations, operational concerns (monitoring, manual override, multiregion), and an Azure Cosmos DB example. For agents, the circuit breaker is the canonical pattern for managing tool call failures, protecting downstream LLM or API dependencies, and preventing runaway retry storms that could cascade across an agent graph.


Key Concepts

The Three-State Machine

The circuit breaker is implemented as a proxy with three states:

| State | Behaviour | Transition |
| --- | --- | --- |
| Closed (normal) | Requests pass through; the failure counter increments on each failure | → Open when the failure count exceeds the threshold within a time window |
| Open (tripped) | All requests fail immediately; no calls are made to the downstream service | → Half-Open after a configurable time-out timer expires |
| Half-Open (probing) | A limited number of trial requests are allowed through | → Closed if all trials succeed; → Open immediately if any trial fails |

The failure counter in the Closed state is time-based: it resets at periodic intervals, preventing occasional failures from triggering the Open state. Only a burst of failures within a window triggers the transition.
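A minimal sketch of that windowed counter in Python (the class name and the 60-second default are illustrative assumptions, not from the article):

```python
import time

class WindowedFailureCounter:
    """Failure counter that resets whenever its time window rolls over."""

    def __init__(self, window_seconds: float = 60.0):
        self.window_seconds = window_seconds
        self.window_start = time.monotonic()
        self.failures = 0

    def record_failure(self) -> int:
        now = time.monotonic()
        # Reset when the window expires: isolated failures spread over time
        # never accumulate toward the threshold; only a burst of failures
        # inside a single window can trip the breaker.
        if now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.failures = 0
        self.failures += 1
        return self.failures
```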

Why the Circuit Breaker Complements Retry

The Retry pattern (see Retry Pattern) works well for brief transient faults but is counterproductive for prolonged outages: continued retries consume threads, connections, and memory, potentially causing resource exhaustion in the calling service — the “cascading failure” scenario. The circuit breaker cuts off all traffic during the outage window, freeing resources and allowing the downstream service to recover without being hammered.
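One plausible way to compose the two patterns, as a sketch (the `breaker` object and `CircuitOpenError` follow the breaker sketches later in this section; all names are illustrative):

```python
import time

class CircuitOpenError(Exception):
    """Raised by a breaker that is Open, before any network call is made."""

def call_with_retry(operation, breaker, attempts: int = 3, base_delay: float = 0.5):
    """Retries cover brief transient faults; the breaker stops the retry
    loop from hammering a service that is in a prolonged outage."""
    for attempt in range(attempts):
        try:
            return breaker.call(operation)
        except CircuitOpenError:
            raise  # Circuit is Open: fail fast, never retry against it.
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
```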

Graceful Degradation

When the circuit is Open, the application should not simply surface the failure: it should degrade gracefully — serving cached or default responses, routing to a backup service, queueing requests for later replay, or informing the user of temporary unavailability. The circuit breaker raises a state-change event that operations teams can wire to alerts.
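A sketch of one degradation strategy, serving from a cache while the circuit is Open (`CircuitOpenError` and the breaker interface are assumptions carried over from the other sketches in this section):

```python
class CircuitOpenError(Exception):
    """Raised by the breaker while it is Open."""

def get_product(product_id, breaker, fetch_live, cache):
    """Serve live data when the circuit allows it; fall back to a cached
    (stale but usable) copy while the circuit is Open."""
    try:
        product = breaker.call(lambda: fetch_live(product_id))
        cache[product_id] = product  # keep the fallback reasonably fresh
        return product
    except CircuitOpenError:
        # Degrade gracefully instead of surfacing the outage to the user.
        return cache.get(product_id, {"id": product_id, "status": "temporarily unavailable"})
```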

Adaptive and Advanced Patterns

  • Increasing time-out timer: Start with a short Open period (seconds); if the fault persists, progressively lengthen it (minutes). A sketch of this policy follows the list.
  • Ping-based Half-Open trigger: Instead of a fixed timer, periodically ping the target’s health endpoint to determine when to probe.
  • Accelerated tripping: If an error response indicates a prolonged outage (e.g., a Retry-After: 120 header), trip immediately and set the time-out to match the server’s signal (also shown in the sketch after this list).
  • Service mesh circuit breaking: Many service meshes (e.g., Istio/Envoy) implement circuit breaking at the infrastructure layer as a sidecar, without modifying application code.
  • Multiregion: Use global load balancers or region-aware circuit breaking strategies to enable controlled failover across regions.
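A sketch combining the first and third ideas: an Open period that lengthens while the fault persists, and that defers to a server-supplied Retry-After. All defaults and names are illustrative:

```python
class AdaptiveOpenTimer:
    """Policy for how long the breaker stays Open after each trip."""

    def __init__(self, initial=5.0, factor=2.0, maximum=300.0):
        self.initial = initial   # first Open period, in seconds
        self.factor = factor     # growth factor while the fault persists
        self.maximum = maximum   # cap on the Open period
        self.current = initial

    def on_trip(self, retry_after=None):
        """Return the Open period (seconds) for this trip."""
        if retry_after is not None:
            # Accelerated tripping: trust the server's own recovery estimate.
            self.current = min(retry_after, self.maximum)
        return self.current

    def on_probe_failed(self):
        # A Half-Open probe failed: the fault persists, so lengthen the period.
        self.current = min(self.current * self.factor, self.maximum)
        return self.current

    def on_recovered(self):
        self.current = self.initial  # back to the short period for next time
```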

Key Algorithms

State machine transitions (formal):

Closed:
  on success → return result (the failure counter resets on its time window, not per request)
  on failure → increment failure counter
  if failure_count > threshold within window → Open; start time-out timer

Open:
  on any request → raise exception immediately (no call made)
  on timer expired → Half-Open

Half-Open:
  on request → allow through (up to a limited trial count)
  if all trials succeed → Closed; reset failure counter
  if any trial fails → Open; restart time-out timer
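A direct Python translation of this state machine, reusing the windowed-reset idea from the counter sketch above. This is a sketch: thresholds, timings, and the single-trial Half-Open policy are illustrative choices, and it is not thread-safe.

```python
import time

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "Closed", "Open", "Half-Open"

    def __init__(self, failure_threshold=5, window_seconds=60.0,
                 open_seconds=30.0, half_open_trials=1):
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds  # Closed-state counting window
        self.open_seconds = open_seconds      # time-out timer duration
        self.half_open_trials = half_open_trials
        self.state = self.CLOSED
        self.failures = 0
        self.window_start = time.monotonic()
        self.opened_at = 0.0
        self.trials_left = 0

    def call(self, operation):
        if self.state == self.OPEN:
            if time.monotonic() - self.opened_at < self.open_seconds:
                raise CircuitOpenError("circuit open; failing fast")
            # Timer expired: move to Half-Open and allow limited probes.
            self.state = self.HALF_OPEN
            self.trials_left = self.half_open_trials
        try:
            result = operation()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        if self.state == self.HALF_OPEN:
            self._trip()  # any probe failure reopens immediately
            return
        now = time.monotonic()
        if now - self.window_start >= self.window_seconds:
            self.window_start, self.failures = now, 0  # windowed reset
        self.failures += 1
        if self.failures > self.failure_threshold:
            self._trip()

    def _on_success(self):
        if self.state == self.HALF_OPEN:
            self.trials_left -= 1
            if self.trials_left <= 0:  # all trials succeeded
                self.state = self.CLOSED
                self.failures = 0

    def _trip(self):
        self.state = self.OPEN
        self.opened_at = time.monotonic()  # start (or restart) the timer
```

Any callable can be wrapped, e.g. `breaker.call(lambda: http_get(url))`; the state-change event the article mentions would hang naturally off `_trip`.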

Azure Cosmos DB example flow:

  • Normal operation (Closed): requests reach the database; no HTTP 429 (request rate too large) responses.
  • 429 received → breaker trips to Open; subsequent requests return cached/default responses.
  • Azure Monitor detects pattern, notifies operations team.
  • After time-out (or team approval for scaling): Half-Open trial requests; if clean, returns to Closed.
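A sketch of this flow; `query_cosmos` stands in for a real Azure Cosmos DB read and `TooManyRequests` for the SDK’s 429 error type (both are illustrative stand-ins, not actual SDK names):

```python
class TooManyRequests(Exception):
    """Stand-in for the SDK error raised on HTTP 429 (request rate too large)."""

class CircuitOpenError(Exception):
    """Raised by the breaker while it is Open (see the sketch above)."""

def read_item(item_id, breaker, query_cosmos, cache):
    try:
        # A 429 raised inside `query_cosmos` counts as a breaker failure,
        # so a sustained throttling burst trips the circuit to Open.
        item = breaker.call(lambda: query_cosmos(item_id))
        cache[item_id] = item
        return item
    except (TooManyRequests, CircuitOpenError):
        # While tripped (or tripping), serve the cached copy; Azure Monitor
        # alerting on the breaker's state-change event is wired separately.
        return cache.get(item_id)
```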

Key Claims and Findings

  • The circuit breaker provides stability by preventing resource exhaustion under prolonged fault conditions — threads and connections are freed instead of queuing against a timing-out service.
  • It minimises impact on performance by failing fast rather than waiting for each request to time out.
  • A circuit breaker that opens too quickly (low threshold) causes unnecessary degradation; one that opens too slowly (high threshold) permits excessive resource consumption before reacting — tuning is empirical.
  • For multi-shard data stores, a single circuit breaker covering all shards is an anti-pattern: one failing shard should not block access to healthy shards. A per-shard registry sketch follows this list.
  • Circuit breakers should expose clear observability via state-change events and distributed traces — without these, operators cannot distinguish a tripped breaker from a service outage.
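A sketch of per-shard isolation (the registry and factory names are illustrative; `CircuitBreaker` is the class sketched earlier):

```python
from collections import defaultdict

class ShardedBreakerRegistry:
    """Keeps one independent breaker per shard key, so a failing shard
    trips only its own circuit and healthy shards remain reachable."""

    def __init__(self, breaker_factory):
        self._breakers = defaultdict(breaker_factory)

    def call(self, shard_key, operation):
        return self._breakers[shard_key].call(operation)

# Usage: registry = ShardedBreakerRegistry(CircuitBreaker)
#        registry.call("shard-7", lambda: read_from_shard("shard-7", key))
```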

Terminology

| Term | Definition |
| --- | --- |
| Cascading failure | Failure in one service propagating through the system as dependent services exhaust resources waiting for it |
| Closed / Open / Half-Open | The three states of the circuit breaker state machine |
| Failure threshold | Number of failures within a time window required to trip the breaker to Open |
| Time-out timer | Duration the breaker stays in Open before transitioning to Half-Open |
| Graceful degradation | Serving a reduced but functional response (cached, default) when a dependency is unavailable |
| Sidecar | Service mesh pattern in which a proxy process handles cross-cutting concerns (including circuit breaking) alongside the application container |
| Health endpoint | API endpoint on a service that reports its operational status, usable as a Half-Open probe |

Connections

  • Retry Pattern — the circuit breaker is the recommended complement to the retry pattern; retries handle brief faults, the circuit breaker handles prolonged outages
  • Transient Fault Handling — covers the broader context of when and how to detect transient vs terminal faults, with concrete guidance on retry budgets and anti-patterns
  • Design Considerations for Advanced Agentic AI — agentic systems with tool-calling loops are especially vulnerable to cascading failures; the circuit breaker directly addresses the failure modes that arise when an agent invokes an unavailable downstream tool or API