Retry logic
Transient failures are a fact of life in networked pipelines. Learn when to retry, when not to, and how exponential backoff with jitter prevents a thundering herd.
- Explain exponential backoff with jitter and why both components matter
- Identify transient failures (worth retrying) versus permanent failures (not)
- Understand why blind, synchronous retries can worsen an already-overloaded service
Not every failure is final. A network blip, a momentary rate-limit response, or a cloud service restarting behind a load balancer are all transient — retry the same request a few seconds later and it will likely succeed. Treating every failure as permanent and aborting wastes work; treating every failure as transient and retrying immediately can make things worse.
Knowing the difference, and knowing how to retry, is what separates a robust pipeline from a fragile one.
Transient vs permanent failures
A transient failure is one that may resolve on its own without any change to the request or the environment:
- HTTP 429 (Too Many Requests) — back off and try again.
- HTTP 503 (Service Unavailable) — the upstream may be restarting.
- ConnectionResetError / TimeoutError — packet loss or a proxy hiccup.
A permanent failure will not resolve no matter how many times you retry:
- HTTP 401 / 403 — your credentials are wrong or missing; retrying is pointless.
- HTTP 404 — the resource does not exist; it will not appear by itself.
- FileNotFoundError for a local file — it is not going to create itself.
- Validation errors — bad data in, bad data out, every time.
Retrying a permanent failure wastes time at best and hammers a downstream service at worst. Your retry logic should only catch exception types that correspond to transient conditions.
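One way to encode the two lists above is a pair of small predicates that the retry logic consults before sleeping. This is a minimal sketch: the names TRANSIENT_STATUS, is_transient_status, and is_transient_error are made up for illustration, and the exact set of retryable codes is an assumption that depends on your upstream service.

```python
# Status codes treated as transient in this sketch; adjust for your upstream.
TRANSIENT_STATUS = {429, 503}
# Built-in exception types treated as transient.
TRANSIENT_EXCEPTIONS = (ConnectionResetError, TimeoutError)


def is_transient_status(status_code: int) -> bool:
    """Return True if the HTTP status suggests that retrying may help."""
    return status_code in TRANSIENT_STATUS


def is_transient_error(exc: BaseException) -> bool:
    """Return True for exception types that usually signal a passing fault."""
    return isinstance(exc, TRANSIENT_EXCEPTIONS)


print(is_transient_status(503))                 # True
print(is_transient_status(404))                 # False
print(is_transient_error(TimeoutError()))       # True
print(is_transient_error(FileNotFoundError()))  # False
```

Keeping the classification in one place means the decision about what counts as transient is made once, rather than scattered across individual except clauses.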
Exponential backoff
A simple fixed-delay retry (time.sleep(1) between attempts) is already better
than nothing, but it has a problem: if many clients all fail at the same moment and
all sleep for exactly one second, they all hammer the server again simultaneously.
Exponential backoff spaces retries further apart with each attempt:
Attempt 1 fails → wait 1 s
Attempt 2 fails → wait 2 s
Attempt 3 fails → wait 4 s
Attempt 4 fails → wait 8 s
Attempt 5 fails → give up
Wait times double with each retry (base ** n, where base = 2 and n counts the retries already made), so each client contacts the upstream service less and less often the longer the outage lasts, which gives the service room to recover.
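The schedule above can be reproduced in a few lines. This is a sketch under stated assumptions: backoff_schedule and its max_wait cap are illustrative names, and the exponent starts at zero for the first retry so that the waits come out as 1, 2, 4, 8.

```python
def backoff_schedule(retries: int, base: float = 2.0, max_wait: float = 60.0) -> list[float]:
    """Waits in seconds before each retry: base**0, base**1, ..., capped at max_wait."""
    return [min(base ** n, max_wait) for n in range(retries)]


print(backoff_schedule(4))       # [1.0, 2.0, 4.0, 8.0]
print(sum(backoff_schedule(4)))  # 15.0
```

Four retries cost at most 15 seconds of waiting in total, which is why a cap and an attempt limit are usually set together: without them the waits grow without bound.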
Jitter
Exponential backoff still has a synchronisation problem if all clients start retrying at the same moment: they will all back off by the same amounts and continue to hit the server in waves.
Jitter adds a random offset to each wait so clients desynchronise:
wait = min(base ** attempt, max_wait) + random.uniform(0, 1)
With jitter, a thousand clients that all fail at second zero spread their retries across an interval rather than piling up at the same instant. That synchronised pile-up is the thundering herd problem; jitter breaks it up.
"Full jitter" picks the delay uniformly from [0, computed_wait] rather than
adding a small noise term. AWS Engineering popularised the analysis of jitter
strategies in a 2015 post that remains the canonical reference.
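The two variants described above differ only in where the randomness is applied. The sketch below puts them side by side; the function names and default values are assumptions chosen for illustration, not part of any library.

```python
import random


def additive_jitter(attempt: int, base: float = 2.0, max_wait: float = 10.0) -> float:
    """Capped exponential wait plus up to one second of random noise."""
    return min(base ** attempt, max_wait) + random.uniform(0, 1)


def full_jitter(attempt: int, base: float = 2.0, max_wait: float = 10.0) -> float:
    """Wait drawn uniformly from zero up to the capped exponential wait."""
    return random.uniform(0, min(base ** attempt, max_wait))


# Print a few sample waits; the values differ on every run.
for attempt in range(4):
    print(attempt, round(additive_jitter(attempt), 2), round(full_jitter(attempt), 2))
```

With additive jitter the spread between clients stays about one second wide however long the base wait becomes, whereas full jitter widens the spread as the wait grows.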
The tenacity library
Implementing backoff, jitter, attempt limits, and per-exception routing from
scratch is error-prone. The tenacity library packages all of it into clean
decorators:
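import requests
from requests.exceptions import RequestException
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)


@retry(
    # Exponential backoff between attempts, clamped to the range 1-10 seconds.
    wait=wait_exponential(multiplier=1, min=1, max=10),
    # Give up after five attempts in total.
    stop=stop_after_attempt(5),
    # Retry only when a requests exception is raised.
    retry=retry_if_exception_type(RequestException),
)
def fetch_report(url: str) -> dict:
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raises HTTPError for 4xx and 5xx responses
    return response.json()

wait_exponential handles the backoff curve. stop_after_attempt enforces a hard limit. retry_if_exception_type ensures that only RequestException and its subclasses trigger a retry; a ValueError raised by your own validation of the response data will still propagate immediately. Note that the HTTPError raised by raise_for_status() is itself a RequestException subclass, so as written this decorator also retries 401, 403, and 404 responses. Narrow the predicate if those should fail fast.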
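To make the three ingredients concrete, here is a rough standard-library sketch of the behaviour such a decorator provides: a wait strategy, an attempt limit, and an exception filter. It is not how tenacity is implemented. The name call_with_retry, the injectable sleep parameter, and the flaky function are all hypothetical, and full jitter is used for the wait as an assumption.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    func: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 5,
    base: float = 2.0,
    max_wait: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying only on the retry_on types, with full-jitter backoff."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the last error
            sleep(random.uniform(0, min(base ** attempt, max_wait)))
    raise ValueError("max_attempts must be at least 1")


calls = {"n": 0}


def flaky() -> str:
    """Hypothetical function that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated transient failure")
    return "ok"


# A no-op sleep keeps the demonstration instant.
print(call_with_retry(flaky, retry_on=(TimeoutError,), sleep=lambda s: None))  # ok
```

Any exception type not listed in retry_on escapes the except clause on the first attempt, which mirrors the per-exception routing described above.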
Where to go next
Next: retry in practice — a runnable example that simulates a flaky function, wraps it with tenacity, and shows the retry attempts in real time.