Circuit Breaker
Stops your app from hammering a failing dependency, so one slow service can't drag the rest down with it.
What is it?
When a service you depend on (a payment gateway, a database, an external API) starts failing, the worst thing your app can do is keep calling it. Every doomed request ties up a thread or a connection while it waits to time out, and those pile up until your own app grinds to a halt. The failure spreads upward instead of staying contained.
A circuit breaker borrows the idea from household electrical wiring: when it detects trouble, it "trips" and cuts the connection. While the breaker is tripped, calls fail instantly instead of hanging, which gives the struggling dependency room to recover and keeps your app responsive. After a cool-down it cautiously tests whether the dependency is healthy again, and only then restores normal traffic. In Baldur this is the Circuit Breaker, the most fundamental of the resilience patterns.
Why it matters
The failure a circuit breaker prevents is cascading failure: the domino effect where one unhealthy dependency exhausts your app's threads and connection pool, which then makes your service look unhealthy to its callers, and so on up the chain. A breaker turns a slow, resource-draining failure into a fast, contained one:
- Fail fast. Once the breaker is open, calls return immediately instead of blocking on a timeout.
- Give the dependency room. Pausing traffic lets an overloaded service catch up instead of being kept underwater by retries.
- Recover automatically. The breaker probes for recovery on its own and trips again at the first sign the dependency is still broken, so you don't have to babysit it.
How it works in Baldur
You wrap a call with the @baldur.protected facade (which combines the breaker with retry and
fallback) or the circuit_breaker decorator directly. From then on, Baldur tracks that call's health
and moves the breaker through three observable states:
stateDiagram-v2
[*] --> CLOSED
CLOSED --> OPEN: failures cross threshold (or operator force-open)
OPEN --> HALF_OPEN: recovery timeout elapses
HALF_OPEN --> CLOSED: trial calls succeed
HALF_OPEN --> OPEN: a trial call fails
- CLOSED is the normal state: calls flow straight through.
- OPEN is the tripped state: calls are rejected instantly, without reaching the dependency.
- HALF_OPEN is the probing state: after the cool-down, a few trial calls are allowed through to test the waters.
| What you observe | When it happens |
|---|---|
| Calls pass straight through | CLOSED — the dependency is healthy |
| Calls are rejected instantly, without touching the dependency | OPEN — failures crossed the threshold (or an operator forced it open) |
| A handful of trial calls are let through | HALF_OPEN — the recovery timeout elapsed and Baldur is probing whether the dependency recovered |
| Normal traffic resumes | enough trial calls succeed → back to CLOSED |
| The breaker snaps back to rejecting | a single trial call fails → straight back to OPEN |
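The transitions in the diagram and table can be sketched in a few dozen lines of plain Python. This is an illustration of the pattern only, not Baldur's implementation: the class name `SketchBreaker`, the `CircuitOpenError` exception, and the injectable `clock` are hypothetical, the sketch counts consecutive failures rather than failures in a window, and it is not thread-safe.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the dependency."""


class SketchBreaker:
    """Illustrative three-state breaker (hypothetical, not Baldur's code)."""

    def __init__(self, failure_threshold=5, recovery_timeout=60.0,
                 half_open_max_calls=3, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls
        self.clock = clock  # injectable so the cool-down can be simulated
        self.state = State.CLOSED
        self.failures = 0
        self.trial_successes = 0
        self.opened_at = 0.0

    def _trip(self):
        self.state = State.OPEN
        self.opened_at = self.clock()

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.recovery_timeout:
                # Fail fast: the dependency is never touched.
                raise CircuitOpenError("circuit is open")
            # Cool-down elapsed: let trial calls through.
            self.state = State.HALF_OPEN
            self.trial_successes = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            if self.state is State.HALF_OPEN:
                self._trip()  # a single failed trial goes straight back to OPEN
            else:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self._trip()
            raise
        if self.state is State.HALF_OPEN:
            self.trial_successes += 1
            if self.trial_successes >= self.half_open_max_calls:
                self.state = State.CLOSED
                self.failures = 0
        else:
            # Simplification: any success resets the consecutive-failure count.
            self.failures = 0
        return result
```

Wrapping a call looks like `breaker.call(fetch_invoice, invoice_id)`, where `fetch_invoice` stands in for any function that talks to the dependency. Callers catch `CircuitOpenError` to serve a fallback while the breaker is open.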
How the breaker decides to trip, and who can override that, comes with a few wrinkles:
- Low-traffic services won't trip on noise. The breaker requires a minimum number of calls in the window before it will open, so a single early failure on a barely-used endpoint can't flip it.
- Rate-limit storms trip it too. A burst of HTTP 429 (Too Many Requests) responses from a dependency is treated as a failure signal and can open the breaker, so your app stops amplifying the overload.
- Operators stay in control. You can force a breaker open (force_open_circuit) to take a dependency out of rotation during maintenance, force it closed (force_close_circuit) once you know it has recovered, or hand control back to automatic mode at any time.
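The three rules above can be pictured as one decision function. The sketch below is a hedged illustration, not Baldur's code: `TripDecision`, `Override`, `minimum_calls`, and `window_size` are hypothetical names with made-up defaults, and counting HTTP 5xx responses as failures is an assumption added for the example (the text above only states that 429 counts).

```python
from collections import deque
from enum import Enum


class Override(Enum):
    AUTO = "auto"
    FORCED_OPEN = "forced_open"
    FORCED_CLOSED = "forced_closed"


def is_failure(status_code):
    """Classify a response: 429 counts as a failure; 5xx is an assumed extra."""
    return status_code == 429 or status_code >= 500


class TripDecision:
    """Decides whether the breaker should be open (illustrative only)."""

    def __init__(self, failure_threshold=5, minimum_calls=10, window_size=20):
        self.failure_threshold = failure_threshold
        self.minimum_calls = minimum_calls
        # Sliding window of recent outcomes; True marks a failure.
        self.outcomes = deque(maxlen=window_size)
        self.override = Override.AUTO

    def record(self, status_code):
        self.outcomes.append(is_failure(status_code))

    def should_be_open(self):
        # Operator overrides win over anything the window says.
        if self.override is Override.FORCED_OPEN:
            return True
        if self.override is Override.FORCED_CLOSED:
            return False
        # Too few calls to judge: stay closed so noise cannot trip the breaker.
        if len(self.outcomes) < self.minimum_calls:
            return False
        return sum(self.outcomes) >= self.failure_threshold
```

Setting `override` back to `Override.AUTO` corresponds to handing control back to automatic mode: the decision again depends only on the recorded window.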
Get notified when it trips
Set a Slack webhook URL and Baldur posts to your channel the moment a breaker
opens, then again when it recovers: a 🔴 when traffic is cut and a 🟢 when it is
restored. This is the one notification the OSS tier delivers on its own, and it
works on the most minimal install, with no message broker or background worker
running. Set BALDUR_META_WATCHDOG_SLACK_WEBHOOK_URL to turn it on; the URL
lives under the self-monitoring namespace, but on OSS the circuit-breaker push is
what reads it. Leave it unset and the open and close events are still logged,
just not posted.
The OSS push is deliberately plain: one message per transition, with no grouping or rate-limiting, so a breaker that flaps posts every time. Deduplication, cooldown, multi-channel routing, and on-call escalation belong to Unified Notification on PRO. The OSS vs PRO tier model lays out the full split.
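To make the behaviour concrete, here is a minimal sketch of a "log always, post only if a webhook is set" notifier. It is an assumption-laden illustration rather than Baldur's code: the function names and the message wording are hypothetical, and the only facts relied on are the environment variable named above and Slack's incoming-webhook convention of accepting a JSON body with a `text` field.

```python
import json
import logging
import os
import urllib.request

logger = logging.getLogger("breaker_sketch")

WEBHOOK_ENV = "BALDUR_META_WATCHDOG_SLACK_WEBHOOK_URL"


def build_message(name, new_state):
    """Build a Slack payload; the wording here is hypothetical."""
    icon = "🔴" if new_state == "OPEN" else "🟢"
    return {"text": f"{icon} circuit '{name}' is now {new_state}"}


def post_to_slack(url, payload):
    """POST the payload as JSON to a Slack incoming webhook."""
    body = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


def notify_transition(name, new_state, send=post_to_slack, environ=os.environ):
    """Log every transition; post only open/close events when a URL is set."""
    logger.warning("circuit %s -> %s", name, new_state)
    if new_state not in ("OPEN", "CLOSED"):
        return False  # assumption: only open and recovery are pushed
    url = environ.get(WEBHOOK_ENV)
    if not url:
        return False  # unset: the event is logged, not posted
    send(url, build_message(name, new_state))  # one message per transition
    return True
```

Because there is no grouping or cooldown in this sketch either, calling `notify_transition` on every flap would post every time, which mirrors the plain behaviour described above.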
Configuration
These are the knobs operators set most often. The full list lives in the API reference.
| Env Var | Default | What it controls |
|---|---|---|
| BALDUR_CB_FAILURE_THRESHOLD | 5 | How many failures in the window trip the breaker from CLOSED to OPEN |
| BALDUR_CB_RECOVERY_TIMEOUT | 60 | Seconds the breaker stays OPEN before letting trial calls through |
| BALDUR_CB_HALF_OPEN_MAX_CALLS | 3 | How many trial calls are allowed through while probing for recovery |
| BALDUR_EVENT_LOGGING_CB_LOG_LEVEL | WARNING | Log level for circuit state-change events |
| BALDUR_META_WATCHDOG_SLACK_WEBHOOK_URL | (unset) | Slack incoming-webhook URL for the open/close push; unset means the events are logged, not posted |
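As a rough picture of how these variables and their defaults fit together, the sketch below reads them into a settings object. It is not how Baldur loads its configuration: `BreakerSettings` and `load_settings` are hypothetical, and only the variable names and default values come from the table.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BreakerSettings:
    """Hypothetical container mirroring the table's defaults."""
    failure_threshold: int = 5
    recovery_timeout: float = 60.0
    half_open_max_calls: int = 3
    log_level: str = "WARNING"
    slack_webhook_url: Optional[str] = None


def load_settings(environ=os.environ):
    """Read each variable, falling back to the documented default."""
    return BreakerSettings(
        failure_threshold=int(environ.get("BALDUR_CB_FAILURE_THRESHOLD", "5")),
        recovery_timeout=float(environ.get("BALDUR_CB_RECOVERY_TIMEOUT", "60")),
        half_open_max_calls=int(environ.get("BALDUR_CB_HALF_OPEN_MAX_CALLS", "3")),
        log_level=environ.get("BALDUR_EVENT_LOGGING_CB_LOG_LEVEL", "WARNING"),
        # An empty string is treated the same as unset.
        slack_webhook_url=environ.get("BALDUR_META_WATCHDOG_SLACK_WEBHOOK_URL") or None,
    )
```

Passing a plain dictionary, for example `load_settings({"BALDUR_CB_RECOVERY_TIMEOUT": "30"})`, is a convenient way to see the effect of a single override without touching the real environment.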
See also
- Getting Started — set it up
- Circuit Breaker API Reference — full options and signatures
- Environment Variables — the complete operator-tunable list