Retry Pattern

Azure Architecture Center — companion to Circuit Breaker Pattern and Transient Fault Handling.

Abstract

This Azure Architecture Center article describes the Retry pattern — wrapping all remote service calls in logic that transparently retries failed operations a configurable number of times with a configurable delay before surfacing the failure to the caller. The article covers three high-level strategies (cancel, immediate retry, delayed retry), key considerations (idempotency, exception type, transaction consistency), and the integration point with the Circuit Breaker pattern for handling long-lasting faults. It is the most focused and tactical of the three Azure resilience articles, providing a concise decision framework for when and how to apply the retry pattern. For agents, the retry pattern is the baseline mechanism for handling transient failures in tool calls, API requests, and any remote dependency.


Key Concepts

Three Retry Strategies

Strategy          | When to use                                                       | Notes
Cancel            | Fault is clearly not transient, or the operation logic is broken  | Report exception immediately; do not retry
Retry immediately | Fault is rare and likely caused by a transient packet-level event | Attempt once only; if it fails, switch to a delayed strategy
Retry after delay | Fault is caused by connectivity or service-busy conditions        | Use exponential back-off or incremental delay; add jitter for multi-instance clients

If the operation fails after the maximum number of retries, it should be treated as a definitive exception — not silently ignored or retried again. At this point, refer to the Circuit Breaker Pattern for protecting the system from continued attempts.

Delay Strategies

  • Exponential back-off: delay doubles after each attempt (e.g., 1s, 2s, 4s, 8s). Most effective for service-busy faults; pairs with jitter to prevent thundering herd.
  • Incremental: delay grows by a fixed amount (e.g., 3s, 7s, 11s). Suited to moderate-severity faults.
  • Regular interval: fixed delay. Simplest; avoid at high concurrency without jitter.
  • Immediate: no delay; appropriate for rare packet-level faults only, and only once.
  • Randomisation: adds per-instance random offset to any strategy; prevents synchronised retry waves from multiple client instances.
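The deterministic strategies above are easy to compare side by side. The following sketch (illustrative helper names, not from the article) prints the first few waits each one produces:

```python
def exponential(attempt, base=1.0):
    # wait doubles after each failed attempt: 1s, 2s, 4s, 8s, ...
    return base * (2 ** attempt)

def incremental(attempt, start=3.0, step=4.0):
    # wait grows by a fixed step: 3s, 7s, 11s, ...
    return start + step * attempt

def regular(attempt, interval=5.0):
    # fixed wait regardless of how many attempts have failed
    return interval

print([exponential(a) for a in range(4)])  # [1.0, 2.0, 4.0, 8.0]
print([incremental(a) for a in range(3)])  # [3.0, 7.0, 11.0]
```

In practice any of these can be combined with the randomisation strategy by adding a small per-instance offset to the returned wait.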

Idempotency Requirement

Before applying the retry pattern to an operation, determine whether it is idempotent — i.e., whether executing it multiple times produces the same result with no additional side effects. Examples:

  • Idempotent (safe to retry): read operations, GET requests, PUT with full resource body.
  • Non-idempotent (dangerous to retry without guards): POST with auto-generated IDs, payment transactions, counter increments, message-send operations.

For non-idempotent operations, apply idempotency keys (unique request IDs) so the server can detect and deduplicate retried requests. See the Idempotence pattern for full guidance.
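A minimal sketch of the idempotency-key idea (the handler name and in-memory cache are illustrative assumptions, not from the article): the client attaches a unique request ID, and the server caches the result under that ID so a retried request is deduplicated instead of re-executed:

```python
import uuid

processed = {}  # server-side cache: idempotency key -> prior result

def server_handle(key, amount):
    # A retried request carrying the same key returns the cached result
    # instead of performing the side effect (e.g. a charge) twice.
    if key in processed:
        return processed[key]
    result = f"charged {amount}"  # the non-idempotent side effect
    processed[key] = result
    return result

key = str(uuid.uuid4())            # one key per logical request
first = server_handle(key, 100)
retried = server_handle(key, 100)  # e.g. the client retried after a timeout
assert first == retried and len(processed) == 1
```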

Logging Model

Log early failures (retried attempts that subsequently succeeded) as informational entries, not errors — they represent normal transient fault recovery and should not trigger alerts. Log the final failure of all retry attempts as an actual error. This prevents alert fatigue while preserving failure signal.
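In practice this model might look like the following sketch, using Python's standard logging module (the function name and parameters are illustrative):

```python
import logging

log = logging.getLogger("retry")

def log_attempt_outcome(attempt, succeeded, retries_exhausted):
    if succeeded and attempt > 0:
        # Early failure that recovered: informational, should not page anyone.
        log.info("operation succeeded on attempt %d after %d transient failures",
                 attempt + 1, attempt)
    elif not succeeded and retries_exhausted:
        # Final failure of all retry attempts: a real error worth alerting on.
        log.error("operation failed after %d attempts", attempt + 1)
```

A success on the first attempt produces no entry at all, keeping the log focused on actual fault activity.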

Relationship to Circuit Breaker

The retry and circuit breaker patterns are complementary:

  • Retry handles brief, self-correcting faults transparently.
  • Circuit Breaker handles prolonged or unresolvable faults by halting traffic and allowing the downstream service to recover.

When the retry pattern exhausts its attempts, the circuit breaker’s Open state provides the fallback: subsequent calls fail immediately without further retrying, preventing resource exhaustion. See Circuit Breaker Pattern for the state machine.
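Combining the two might look like the following sketch. The breaker here is deliberately bare-bones (consecutive-failure counter, timed reset) rather than the full state machine from the Circuit Breaker article; each exhausted retry sequence counts as one failure against it:

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker is Open: fail fast, do not retry."""

class Breaker:
    """Minimal breaker: opens after `threshold` consecutive failures."""
    def __init__(self, threshold=3, reset_after=30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise CircuitOpenError()      # Open: fail immediately
            self.opened_at = None             # Half-Open: allow a trial call
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0                     # success resets the count
        return result
```

Callers would wrap their retry loop inside `breaker.call(...)`, so that once the threshold of exhausted retry sequences is reached, further calls fail fast without touching the downstream service.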


Key Algorithms

Generic retry wrapper (runnable Python sketch; the two exception classes stand in for whatever fault taxonomy the client library exposes):

import time

class TransientError(Exception): pass   # e.g. timeout, 503, throttling
class PermanentError(Exception): pass   # e.g. 400, auth failure, bad input

def retry(operation, policy):
    for attempt in range(policy.max_retries):
        try:
            return operation()
        except TransientError:
            if attempt == policy.max_retries - 1:
                raise  # retries exhausted: surface as a definitive error
            time.sleep(policy.delay(attempt))  # exponential back-off + jitter
        except PermanentError:
            raise  # cancel strategy: fault is not transient, do not retry

Exponential back-off + jitter:

import random

def delay(attempt, base=1.0, cap=60.0, jitter=0.5):
    # 1s, 2s, 4s, ... capped at `cap`, plus up to `jitter` seconds of noise
    return min(cap, base * (2 ** attempt)) + random.uniform(0, jitter)
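A quick sanity check of this function's behavior (the function is repeated here so the snippet is self-contained): the wait grows geometrically but stays bounded by the cap, with at most `jitter` seconds of noise on top:

```python
import random

def delay(attempt, base=1.0, cap=60.0, jitter=0.5):
    return min(cap, base * (2 ** attempt)) + random.uniform(0, jitter)

for attempt in (0, 3, 10):
    d = delay(attempt)
    # base delay is 1s, 8s, then capped at 60s; jitter adds at most 0.5s
    assert min(60.0, 2 ** attempt) <= d <= min(60.0, 2 ** attempt) + 0.5
```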

Key Claims and Findings

  • Aggressive retry policies with minimal delay can further degrade an already overloaded service, extending the outage rather than resolving it.
  • Nested retry layers (caller and callee both retrying) multiply the actual request rate to the downstream service; prefer fast-fail at lower layers.
  • For non-critical interactive operations: fewer retries, shorter delays, inform the user if all fail.
  • For batch/background operations: more retries, longer exponential delays.
  • Consider using general-purpose retry libraries (Polly for .NET, Resilience4j for Java) before writing custom retry logic — they handle edge cases and provide tested implementations.
  • If retries are applied to a transaction, consider the transaction consistency implications: partial success in a multi-step operation may require compensating transactions if a later step fails after retrying.
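The nested-retry multiplication is easy to demonstrate with a sketch (illustrative numbers): if both a caller and a callee retry three times, a persistent fault produces nine downstream requests instead of three:

```python
calls = {"downstream": 0}

def downstream():
    calls["downstream"] += 1
    raise TimeoutError("service unavailable")

def with_retries(operation, attempts=3):
    for i in range(attempts):
        try:
            return operation()
        except TimeoutError:
            if i == attempts - 1:
                raise  # exhausted: propagate to the layer above

def callee():
    return with_retries(downstream)   # inner layer retries 3x

try:
    with_retries(callee)              # outer layer retries the callee 3x
except TimeoutError:
    pass

assert calls["downstream"] == 9       # 3 x 3 request amplification
```

This is why the guidance above prefers fast-fail at lower layers: letting only one layer own the retry policy keeps the downstream request rate proportional to a single retry budget.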

Terminology

Term                 | Definition
Retry pattern        | Design pattern: wrap remote calls in retry logic that transparently re-executes on transient failure
Cancel strategy      | Variant of retry that immediately abandons the operation and reports an exception
Exponential back-off | Delay strategy where wait time doubles after each failed attempt
Jitter               | Random offset on retry intervals; prevents synchronised retry waves across multiple clients
Idempotency          | Property of an operation whose repeated execution produces identical results with no additional side effects
Polly                | Popular .NET retry and resilience library
Resilience4j         | Popular Java retry and resilience library

Connections

  • Circuit Breaker Pattern — the recommended pattern to apply when the retry pattern’s attempts are exhausted; combining both provides comprehensive fault handling
  • Transient Fault Handling — the comprehensive best-practices guide covering retry budget, dead-letter queues, anti-patterns, and testing — the broader context for this pattern
  • Design Considerations for Advanced Agentic AI — CODE1 and CODE5 implement basic retry loops in their Global Agents; the structured retry policy and circuit-breaker integration described here is the production-grade version