Code of the Day

Retry logic

Transient failures are a fact of life in networked pipelines. Learn when to retry, when not to, and how exponential backoff with jitter prevents a thundering herd.

By the end of this lesson you will be able to:
  • Explain exponential backoff with jitter and why both components matter
  • Identify transient failures (worth retrying) versus permanent failures (not)
  • Understand why blind, synchronous retries can worsen an already-overloaded service

Not every failure is final. A network blip, a momentary rate-limit response, or a cloud service restarting behind a load balancer are all transient — retry the same request a few seconds later and it will likely succeed. Treating every failure as permanent and aborting wastes work; treating every failure as transient and retrying immediately can make things worse.

Knowing the difference, and knowing how to retry, is what separates a robust pipeline from a fragile one.

Transient vs permanent failures

A transient failure is one that may resolve on its own without any change to the request or the environment:

  • HTTP 429 (Too Many Requests) — back off and try again.
  • HTTP 503 (Service Unavailable) — the upstream may be restarting.
  • ConnectionResetError / TimeoutError — packet loss or a proxy hiccup.

A permanent failure will not resolve no matter how many times you retry:

  • HTTP 401 / 403 — your credentials are wrong or missing; retrying is pointless.
  • HTTP 404 — the resource does not exist; it will not appear by itself.
  • FileNotFoundError for a local file — it is not going to create itself.
  • Validation errors — bad data in, bad data out, every time.

Retrying a permanent failure wastes time at best and hammers a downstream service at worst. Your retry logic should only catch exception types that correspond to transient conditions.
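The two lists above can be turned into a small classifier. This is only a sketch: HTTPStatusError is a hypothetical exception class invented here to carry a status code, and is_transient is an illustrative name, not part of any library.

```python
class HTTPStatusError(Exception):
    """Hypothetical exception that carries an HTTP status code."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


# Status codes and exception types this lesson lists as transient.
TRANSIENT_STATUS = frozenset({429, 503})
TRANSIENT_EXCEPTIONS = (ConnectionResetError, TimeoutError)


def is_transient(exc: BaseException) -> bool:
    """Return True if retrying the same request might succeed."""
    if isinstance(exc, HTTPStatusError):
        return exc.status_code in TRANSIENT_STATUS
    return isinstance(exc, TRANSIENT_EXCEPTIONS)
```

Anything the classifier does not recognise is treated as permanent, so an unexpected error fails fast instead of being retried by accident.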

Exponential backoff

A simple fixed-delay retry (time.sleep(1) between attempts) is already better than nothing, but it has a problem: if many clients all fail at the same moment and all sleep for exactly one second, they all hammer the server again simultaneously.

Exponential backoff spaces retries further apart with each attempt:

Attempt 1 fails → wait 1 s
Attempt 2 fails → wait 2 s
Attempt 3 fails → wait 4 s
Attempt 4 fails → wait 8 s
Attempt 5 fails → give up

Wait times grow as base ** attempt (here base = 2, counting attempts from zero), so each client retries less and less often, and the load on a struggling upstream service drops sharply with every round of retries.
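The schedule shown above can be reproduced in a few lines. This is a minimal sketch; the function name backoff_schedule and its parameters are illustrative assumptions rather than an existing API.

```python
def backoff_schedule(
    max_attempts: int = 5, base: float = 2.0, initial: float = 1.0
) -> list[float]:
    """Return the wait (in seconds) after each failed attempt except the last.

    The final attempt has no wait because the caller gives up after it.
    """
    return [initial * base**n for n in range(max_attempts - 1)]


print(backoff_schedule())  # [1.0, 2.0, 4.0, 8.0]
```

With five attempts the client spends at most 15 seconds waiting in total, which is worth keeping in mind when choosing the attempt limit for a pipeline step with its own deadline.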

Jitter

Exponential backoff still has a synchronisation problem if all clients start retrying at the same moment: they will all back off by the same amounts and continue to hit the server in waves.

Jitter adds a random offset to each wait so clients desynchronise:

wait = min(base ** attempt, max_wait) + random.uniform(0, 1)

With jitter, a thousand clients that all fail at second zero spread their retries across an interval rather than piling up at the same instant. That pile-up of synchronised retries is the thundering herd problem; jitter mitigates it by desynchronising the clients.
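Putting the pieces together, a hand-rolled retry loop using the wait formula above might look like the following. It is a sketch under stated assumptions: retry_with_backoff and all of its parameters are hypothetical names, and the injectable sleep argument exists only so the loop can be exercised without real delays.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 5,
    base: float = 2.0,
    max_wait: float = 30.0,
    transient: tuple[type[Exception], ...] = (ConnectionResetError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying transient failures with backoff plus jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except transient:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Capped exponential wait plus up to one second of random noise.
            wait = min(base**attempt, max_wait) + random.uniform(0, 1)
            sleep(wait)
    raise AssertionError("unreachable")
```

Exceptions outside the transient tuple are not caught at all, so a ValueError from bad input propagates on the first attempt with no waiting.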

"Full jitter" picks the delay uniformly from [0, computed_wait] rather than adding a small noise term. AWS Engineering popularised the analysis of jitter strategies in a 2015 post that remains the canonical reference.
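For comparison, full jitter can be sketched as a one-line change to the wait calculation. The function name full_jitter and its default values are assumptions made for illustration.

```python
import random


def full_jitter(attempt: int, base: float = 2.0, max_wait: float = 60.0) -> float:
    """Pick a delay uniformly from [0, computed_wait].

    computed_wait is the capped exponential backoff for this attempt,
    counting attempts from zero.
    """
    computed_wait = min(base**attempt, max_wait)
    return random.uniform(0, computed_wait)
```

Because the whole delay is random rather than just a small offset, clients spread across the entire backoff window, at the cost of some retries happening sooner than the nominal wait.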

The tenacity library

Implementing backoff, jitter, attempt limits, and per-exception routing from scratch is error-prone. The tenacity library packages all of it into clean decorators:

import requests
from requests.exceptions import RequestException
from tenacity import (
    retry,
    wait_exponential,
    stop_after_attempt,
    retry_if_exception_type,
)

@retry(
    # Exponential waits, clamped to between 1 and 10 seconds.
    wait=wait_exponential(multiplier=1, min=1, max=10),
    # Give up after the fifth attempt.
    stop=stop_after_attempt(5),
    # Retry only on exceptions raised by requests.
    retry=retry_if_exception_type(RequestException),
)
def fetch_report(url: str) -> dict:
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raises HTTPError on 4xx and 5xx responses
    return response.json()

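wait_exponential handles the backoff curve; it adds no jitter by itself, and tenacity ships wait_random_exponential for that. stop_after_attempt enforces a hard limit. retry_if_exception_type ensures that only RequestException and its subclasses trigger a retry; an unrelated exception, such as a KeyError raised while processing the parsed data, still propagates immediately. Be aware that the HTTPError raised by raise_for_status() is itself a RequestException subclass, so this predicate also retries permanent responses such as 401 or 404. In recent versions of requests, the JSONDecodeError raised by response.json() is a RequestException subclass as well, so a malformed body is retried too.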

Where to go next

Next: retry in practice — a runnable example that simulates a flaky function, wraps it with tenacity, and shows the retry attempts in real time.
