Circuit Breaker Pattern

Azure Architecture Center — companion to Retry Pattern and Transient Fault Handling.

Abstract

This Azure Architecture Center article describes the Circuit Breaker pattern — a distributed systems resilience mechanism that prevents cascading failures by acting as a proxy between an application and a potentially faulty remote service. The proxy monitors recent failure rates and transitions through three states (Closed, Open, Half-Open) to control whether requests are forwarded or rejected immediately. Unlike the Retry pattern, which handles short transient faults, the circuit breaker handles faults of variable and potentially long duration by temporarily halting traffic to allow the downstream service to recover. The article covers the state machine, implementation considerations, operational concerns (monitoring, manual override, multiregion), and an Azure Cosmos DB example. For agents, the circuit breaker is the canonical pattern for managing tool call failures, protecting downstream LLM or API dependencies, and preventing runaway retry storms that could cascade across an agent graph.


Key Concepts

The Three-State Machine

The circuit breaker is implemented as a proxy with three states:

| State | Behaviour | Transition |
| --- | --- | --- |
| Closed (normal) | Requests pass through; the failure counter increments on each failure | → Open when the failure count exceeds the threshold within a time window |
| Open (tripped) | All requests fail immediately; no calls are made to the downstream service | → Half-Open after a configurable time-out timer expires |
| Half-Open (probing) | A limited number of trial requests are allowed through | → Closed if all trials succeed; → Open immediately if any trial fails |

The failure counter in the Closed state is time-based: it resets at periodic intervals, preventing occasional failures from triggering the Open state. Only a burst of failures within a window triggers the transition.
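A minimal sketch of that windowed counter in Python (the class name and the 60-second default are illustrative assumptions, not from the article):

```python
import time

class WindowedFailureCounter:
    """Failure counter that resets whenever its time window rolls over."""

    def __init__(self, window_seconds: float = 60.0):
        self.window_seconds = window_seconds
        self.window_start = time.monotonic()
        self.failures = 0

    def record_failure(self) -> int:
        now = time.monotonic()
        # Reset when the window expires: isolated failures spread over time
        # never accumulate toward the threshold; only a burst of failures
        # inside a single window can trip the breaker.
        if now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.failures = 0
        self.failures += 1
        return self.failures
```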

Why the Circuit Breaker Complements Retry

The Retry pattern (see Retry Pattern) works well for brief transient faults but is counterproductive for prolonged outages: continued retries consume threads, connections, and memory, potentially causing resource exhaustion in the calling service — the “cascading failure” scenario. The circuit breaker cuts off all traffic during the outage window, freeing resources and allowing the downstream service to recover without being hammered.
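One plausible way to compose the two patterns, as a sketch (the `breaker` object and `CircuitOpenError` follow the breaker sketches later in this section; all names are illustrative):

```python
import time

class CircuitOpenError(Exception):
    """Raised by a breaker that is Open, before any network call is made."""

def call_with_retry(operation, breaker, attempts: int = 3, base_delay: float = 0.5):
    """Retries cover brief transient faults; the breaker stops the retry
    loop from hammering a service that is in a prolonged outage."""
    for attempt in range(attempts):
        try:
            return breaker.call(operation)
        except CircuitOpenError:
            raise  # Circuit is Open: fail fast, never retry against it.
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
```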

Graceful Degradation

When the circuit is Open, the application should not simply surface the failure: it should degrade gracefully — serving cached or default responses, routing to a backup service, queueing requests for later replay, or informing the user of temporary unavailability. The circuit breaker raises a state-change event that operations teams can wire to alerts.
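A sketch of one degradation strategy, serving from a cache while the circuit is Open (`CircuitOpenError` and the breaker interface are assumptions carried over from the other sketches in this section):

```python
class CircuitOpenError(Exception):
    """Raised by the breaker while it is Open."""

def get_product(product_id, breaker, fetch_live, cache):
    """Serve live data when the circuit allows it; fall back to a cached
    (stale but usable) copy while the circuit is Open."""
    try:
        product = breaker.call(lambda: fetch_live(product_id))
        cache[product_id] = product  # keep the fallback reasonably fresh
        return product
    except CircuitOpenError:
        # Degrade gracefully instead of surfacing the outage to the user.
        return cache.get(product_id, {"id": product_id, "status": "temporarily unavailable"})
```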

Adaptive and Advanced Patterns

  • Increasing time-out timer: Start with a short Open period (seconds); if the fault persists, progressively lengthen it (minutes). A sketch of this policy follows the list.
  • Ping-based Half-Open trigger: Instead of a fixed timer, periodically ping the target’s health endpoint to determine when to probe.
  • Accelerated tripping: If an error response indicates a prolonged outage (e.g., a Retry-After: 120 header), trip immediately and set the time-out to match the server’s signal (also shown in the sketch after this list).
  • Service mesh circuit breaking: Many service meshes (e.g., Istio/Envoy) implement circuit breaking at the infrastructure layer as a sidecar, without modifying application code.
  • Multiregion: Use global load balancers or region-aware circuit breaking strategies to enable controlled failover across regions.
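A sketch combining the first and third ideas: an Open period that lengthens while the fault persists, and that defers to a server-supplied Retry-After. All defaults and names are illustrative:

```python
class AdaptiveOpenTimer:
    """Policy for how long the breaker stays Open after each trip."""

    def __init__(self, initial=5.0, factor=2.0, maximum=300.0):
        self.initial = initial   # first Open period, in seconds
        self.factor = factor     # growth factor while the fault persists
        self.maximum = maximum   # cap on the Open period
        self.current = initial

    def on_trip(self, retry_after=None):
        """Return the Open period (seconds) for this trip."""
        if retry_after is not None:
            # Accelerated tripping: trust the server's own recovery estimate.
            self.current = min(retry_after, self.maximum)
        return self.current

    def on_probe_failed(self):
        # A Half-Open probe failed: the fault persists, so lengthen the period.
        self.current = min(self.current * self.factor, self.maximum)
        return self.current

    def on_recovered(self):
        self.current = self.initial  # back to the short period for next time
```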

Key Algorithms

State machine transitions (formal):

Closed:
  on success → return result (the failure counter resets on its time window, not per request)
  on failure → increment failure counter
  if failure_count > threshold within window → Open; start time-out timer

Open:
  on any request → raise exception immediately (no call made)
  on timer expired → Half-Open

Half-Open:
  on request → allow through (up to a limited trial count)
  if all trials succeed → Closed; reset failure counter
  if any trial fails → Open; restart time-out timer
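A direct Python translation of this state machine, reusing the windowed-reset idea from the counter sketch above. This is a sketch: thresholds, timings, and the single-trial Half-Open policy are illustrative choices, and it is not thread-safe.

```python
import time

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "Closed", "Open", "Half-Open"

    def __init__(self, failure_threshold=5, window_seconds=60.0,
                 open_seconds=30.0, half_open_trials=1):
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds  # Closed-state counting window
        self.open_seconds = open_seconds      # time-out timer duration
        self.half_open_trials = half_open_trials
        self.state = self.CLOSED
        self.failures = 0
        self.window_start = time.monotonic()
        self.opened_at = 0.0
        self.trials_left = 0

    def call(self, operation):
        if self.state == self.OPEN:
            if time.monotonic() - self.opened_at < self.open_seconds:
                raise CircuitOpenError("circuit open; failing fast")
            # Timer expired: move to Half-Open and allow limited probes.
            self.state = self.HALF_OPEN
            self.trials_left = self.half_open_trials
        try:
            result = operation()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        if self.state == self.HALF_OPEN:
            self._trip()  # any probe failure reopens immediately
            return
        now = time.monotonic()
        if now - self.window_start >= self.window_seconds:
            self.window_start, self.failures = now, 0  # windowed reset
        self.failures += 1
        if self.failures > self.failure_threshold:
            self._trip()

    def _on_success(self):
        if self.state == self.HALF_OPEN:
            self.trials_left -= 1
            if self.trials_left <= 0:  # all trials succeeded
                self.state = self.CLOSED
                self.failures = 0

    def _trip(self):
        self.state = self.OPEN
        self.opened_at = time.monotonic()  # start (or restart) the timer
```

Any callable can be wrapped, e.g. `breaker.call(lambda: http_get(url))`; the state-change event the article mentions would hang naturally off `_trip`.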

Azure Cosmos DB example flow:

  • Normal operation (Closed): requests reach the database; no HTTP 429 (request rate too large) responses.
  • 429 received → breaker trips to Open; subsequent requests return cached/default responses.
  • Azure Monitor detects pattern, notifies operations team.
  • After time-out (or team approval for scaling): Half-Open trial requests; if clean, returns to Closed.
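A sketch of this flow; `query_cosmos` stands in for a real Azure Cosmos DB read and `TooManyRequests` for the SDK’s 429 error type (both are illustrative stand-ins, not actual SDK names):

```python
class TooManyRequests(Exception):
    """Stand-in for the SDK error raised on HTTP 429 (request rate too large)."""

class CircuitOpenError(Exception):
    """Raised by the breaker while it is Open (see the sketch above)."""

def read_item(item_id, breaker, query_cosmos, cache):
    try:
        # A 429 raised inside `query_cosmos` counts as a breaker failure,
        # so a sustained throttling burst trips the circuit to Open.
        item = breaker.call(lambda: query_cosmos(item_id))
        cache[item_id] = item
        return item
    except (TooManyRequests, CircuitOpenError):
        # While tripped (or tripping), serve the cached copy; Azure Monitor
        # alerting on the breaker's state-change event is wired separately.
        return cache.get(item_id)
```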

Key Claims and Findings

  • The circuit breaker provides stability by preventing resource exhaustion under prolonged fault conditions — threads and connections are freed instead of queuing against a timing-out service.
  • It minimises impact on performance by failing fast rather than waiting for each request to time out.
  • A circuit breaker that opens too quickly (low threshold) causes unnecessary degradation; one that opens too slowly (high threshold) permits excessive resource consumption before reacting — tuning is empirical.
  • For multi-shard data stores, a single circuit breaker covering all shards is an anti-pattern: one failing shard should not block access to healthy shards. A per-shard registry sketch follows this list.
  • Circuit breakers should expose clear observability via state-change events and distributed traces — without these, operators cannot distinguish a tripped breaker from a service outage.
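A sketch of per-shard isolation (the registry and factory names are illustrative; `CircuitBreaker` is the class sketched earlier):

```python
from collections import defaultdict

class ShardedBreakerRegistry:
    """Keeps one independent breaker per shard key, so a failing shard
    trips only its own circuit and healthy shards remain reachable."""

    def __init__(self, breaker_factory):
        self._breakers = defaultdict(breaker_factory)

    def call(self, shard_key, operation):
        return self._breakers[shard_key].call(operation)

# Usage: registry = ShardedBreakerRegistry(CircuitBreaker)
#        registry.call("shard-7", lambda: read_from_shard("shard-7", key))
```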

Terminology

| Term | Definition |
| --- | --- |
| Cascading failure | Failure in one service propagating through the system as dependent services exhaust resources waiting for it |
| Closed / Open / Half-Open | The three states of the circuit breaker state machine |
| Failure threshold | Number of failures within a time window required to trip the breaker to Open |
| Time-out timer | Duration the breaker stays in Open before transitioning to Half-Open |
| Graceful degradation | Serving a reduced but functional response (cached, default) when a dependency is unavailable |
| Sidecar | Service mesh pattern in which a proxy process handles cross-cutting concerns (including circuit breaking) alongside the application container |
| Health endpoint | API endpoint on a service that reports its operational status, usable as a Half-Open probe |

Connections

  • Retry Pattern — the circuit breaker is the recommended complement to the retry pattern; retries handle brief faults, the circuit breaker handles prolonged outages
  • Transient Fault Handling — covers the broader context of when and how to detect transient vs terminal faults, with concrete guidance on retry budgets and anti-patterns
  • Design Considerations for Advanced Agentic AI — agentic systems with tool-calling loops are especially vulnerable to cascading failures; the circuit breaker directly addresses the failure modes that arise when an agent invokes an unavailable downstream tool or API