Retry Pattern

Azure Architecture Center — companion to Circuit Breaker Pattern and Transient Fault Handling.

Abstract

This Azure Architecture Center article describes the Retry pattern — wrapping all remote service calls in logic that transparently retries failed operations a configurable number of times with a configurable delay before surfacing the failure to the caller. The article covers three high-level strategies (cancel, immediate retry, delayed retry), key considerations (idempotency, exception type, transaction consistency), and the integration point with the Circuit Breaker pattern for handling long-lasting faults. It is the most focused and tactical of the three Azure resilience articles, providing a concise decision framework for when and how to apply the retry pattern. For agents, the retry pattern is the baseline mechanism for handling transient failures in tool calls, API requests, and any remote dependency.


Key Concepts

Three Retry Strategies

Strategy          | When to use                                                       | Notes
Cancel            | Fault is clearly not transient, or the operation logic is broken  | Report exception immediately; do not retry
Retry immediately | Fault is rare and likely caused by a transient packet-level event | Attempt once only; if it fails, switch to a delayed strategy
Retry after delay | Fault is caused by connectivity or service-busy conditions        | Use exponential back-off or incremental delay; add jitter for multi-instance clients

If the operation fails after the maximum number of retries, it should be treated as a definitive exception — not silently ignored or retried again. At this point, refer to the Circuit Breaker Pattern for protecting the system from continued attempts.

Delay Strategies

  • Exponential back-off: delay doubles after each attempt (e.g., 1s, 2s, 4s, 8s). Most effective for service-busy faults; pairs with jitter to prevent thundering herd.
  • Incremental: delay grows by a fixed amount (e.g., 3s, 7s, 11s). Suited to moderate-severity faults.
  • Regular interval: fixed delay. Simplest; avoid at high concurrency without jitter.
  • Immediate: no delay; appropriate for rare packet-level faults only, and only once.
  • Randomisation: adds per-instance random offset to any strategy; prevents synchronised retry waves from multiple client instances.
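The deterministic strategies above are easy to compare side by side. The following sketch (illustrative helper names, not from the article) prints the first few waits each one produces:

```python
def exponential(attempt, base=1.0):
    # wait doubles after each failed attempt: 1s, 2s, 4s, 8s, ...
    return base * (2 ** attempt)

def incremental(attempt, start=3.0, step=4.0):
    # wait grows by a fixed step: 3s, 7s, 11s, ...
    return start + step * attempt

def regular(attempt, interval=5.0):
    # fixed wait regardless of how many attempts have failed
    return interval

print([exponential(a) for a in range(4)])  # [1.0, 2.0, 4.0, 8.0]
print([incremental(a) for a in range(3)])  # [3.0, 7.0, 11.0]
```

In practice any of these can be combined with the randomisation strategy by adding a small per-instance offset to the returned wait.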

Idempotency Requirement

Before applying the retry pattern to an operation, determine whether it is idempotent — i.e., whether executing it multiple times produces the same result with no additional side effects. Examples:

  • Idempotent (safe to retry): read operations, GET requests, PUT with full resource body.
  • Non-idempotent (dangerous to retry without guards): POST with auto-generated IDs, payment transactions, counter increments, message-send operations.

For non-idempotent operations, apply idempotency keys (unique request IDs) so the server can detect and deduplicate retried requests. See the Idempotence pattern for full guidance.
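A minimal sketch of the idempotency-key idea (the handler name and in-memory cache are illustrative assumptions, not from the article): the client attaches a unique request ID, and the server caches the result under that ID so a retried request is deduplicated instead of re-executed:

```python
import uuid

processed = {}  # server-side cache: idempotency key -> prior result

def server_handle(key, amount):
    # A retried request carrying the same key returns the cached result
    # instead of performing the side effect (e.g. a charge) twice.
    if key in processed:
        return processed[key]
    result = f"charged {amount}"  # the non-idempotent side effect
    processed[key] = result
    return result

key = str(uuid.uuid4())            # one key per logical request
first = server_handle(key, 100)
retried = server_handle(key, 100)  # e.g. the client retried after a timeout
assert first == retried and len(processed) == 1
```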

Logging Model

Log early failures (retried attempts that subsequently succeeded) as informational entries, not errors — they represent normal transient fault recovery and should not trigger alerts. Log the final failure of all retry attempts as an actual error. This prevents alert fatigue while preserving failure signal.
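In practice this model might look like the following sketch, using Python's standard logging module (the function name and parameters are illustrative):

```python
import logging

log = logging.getLogger("retry")

def log_attempt_outcome(attempt, succeeded, retries_exhausted):
    if succeeded and attempt > 0:
        # Early failure that recovered: informational, should not page anyone.
        log.info("operation succeeded on attempt %d after %d transient failures",
                 attempt + 1, attempt)
    elif not succeeded and retries_exhausted:
        # Final failure of all retry attempts: a real error worth alerting on.
        log.error("operation failed after %d attempts", attempt + 1)
```

A success on the first attempt produces no entry at all, keeping the log focused on actual fault activity.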

Relationship to Circuit Breaker

The retry and circuit breaker patterns are complementary:

  • Retry handles brief, self-correcting faults transparently.
  • Circuit Breaker handles prolonged or unresolvable faults by halting traffic and allowing the downstream service to recover.

When the retry pattern exhausts its attempts, the circuit breaker’s Open state provides the fallback: subsequent calls fail immediately without further retrying, preventing resource exhaustion. See Circuit Breaker Pattern for the state machine.
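Combining the two might look like the following sketch. The breaker here is deliberately bare-bones (consecutive-failure counter, timed reset) rather than the full state machine from the Circuit Breaker article; each exhausted retry sequence counts as one failure against it:

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker is Open: fail fast, do not retry."""

class Breaker:
    """Minimal breaker: opens after `threshold` consecutive failures."""
    def __init__(self, threshold=3, reset_after=30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise CircuitOpenError()      # Open: fail immediately
            self.opened_at = None             # Half-Open: allow a trial call
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0                     # success resets the count
        return result
```

Callers would wrap their retry loop inside `breaker.call(...)`, so that once the threshold of exhausted retry sequences is reached, further calls fail fast without touching the downstream service.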


Key Algorithms

Generic retry wrapper (runnable Python sketch; the two exception classes stand in for whatever fault taxonomy the client library exposes):

import time

class TransientError(Exception): pass   # e.g. timeout, 503, throttling
class PermanentError(Exception): pass   # e.g. 400, auth failure, bad input

def retry(operation, policy):
    for attempt in range(policy.max_retries):
        try:
            return operation()
        except TransientError:
            if attempt == policy.max_retries - 1:
                raise  # retries exhausted: surface as a definitive error
            time.sleep(policy.delay(attempt))  # exponential back-off + jitter
        except PermanentError:
            raise  # cancel strategy: fault is not transient, do not retry

Exponential back-off + jitter:

import random

def delay(attempt, base=1.0, cap=60.0, jitter=0.5):
    # 1s, 2s, 4s, ... capped at `cap`, plus up to `jitter` seconds of noise
    return min(cap, base * (2 ** attempt)) + random.uniform(0, jitter)
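A quick sanity check of this function's behavior (the function is repeated here so the snippet is self-contained): the wait grows geometrically but stays bounded by the cap, with at most `jitter` seconds of noise on top:

```python
import random

def delay(attempt, base=1.0, cap=60.0, jitter=0.5):
    return min(cap, base * (2 ** attempt)) + random.uniform(0, jitter)

for attempt in (0, 3, 10):
    d = delay(attempt)
    # base delay is 1s, 8s, then capped at 60s; jitter adds at most 0.5s
    assert min(60.0, 2 ** attempt) <= d <= min(60.0, 2 ** attempt) + 0.5
```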

Key Claims and Findings

  • Aggressive retry policies with minimal delay can further degrade an already overloaded service, extending the outage rather than resolving it.
  • Nested retry layers (caller and callee both retrying) multiply the actual request rate to the downstream service; prefer fast-fail at lower layers.
  • For non-critical interactive operations: fewer retries, shorter delays, inform the user if all fail.
  • For batch/background operations: more retries, longer exponential delays.
  • Consider using general-purpose retry libraries (Polly for .NET, Resilience4j for Java) before writing custom retry logic — they handle edge cases and provide tested implementations.
  • If retries are applied to a transaction, consider the transaction consistency implications: partial success in a multi-step operation may require compensating transactions if a later step fails after retrying.
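The nested-retry multiplication is easy to demonstrate with a sketch (illustrative numbers): if both a caller and a callee retry three times, a persistent fault produces nine downstream requests instead of three:

```python
calls = {"downstream": 0}

def downstream():
    calls["downstream"] += 1
    raise TimeoutError("service unavailable")

def with_retries(operation, attempts=3):
    for i in range(attempts):
        try:
            return operation()
        except TimeoutError:
            if i == attempts - 1:
                raise  # exhausted: propagate to the layer above

def callee():
    return with_retries(downstream)   # inner layer retries 3x

try:
    with_retries(callee)              # outer layer retries the callee 3x
except TimeoutError:
    pass

assert calls["downstream"] == 9       # 3 x 3 request amplification
```

This is why the guidance above prefers fast-fail at lower layers: letting only one layer own the retry policy keeps the downstream request rate proportional to a single retry budget.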

Terminology

Term                 | Definition
Retry pattern        | Design pattern: wrap remote calls in retry logic that transparently re-executes on transient failure
Cancel strategy      | Variant of retry that immediately abandons the operation and reports an exception
Exponential back-off | Delay strategy where wait time doubles after each failed attempt
Jitter               | Random offset on retry intervals; prevents synchronised retry waves across multiple clients
Idempotency          | Property of an operation whose repeated execution produces identical results with no additional side effects
Polly                | Popular .NET retry and resilience library
Resilience4j         | Popular Java retry and resilience library

Connections

  • Circuit Breaker Pattern — the recommended pattern to apply when the retry pattern’s attempts are exhausted; combining both provides comprehensive fault handling
  • Transient Fault Handling — the comprehensive best-practices guide covering retry budget, dead-letter queues, anti-patterns, and testing — the broader context for this pattern
  • Design Considerations for Advanced Agentic AI — CODE1 and CODE5 implement basic retry loops in their Global Agents; the structured retry policy and circuit-breaker integration described here is the production-grade version