Circuit Breaker¶
Circuit breakers protect your system against cascading failures by automatically stopping requests to unhealthy providers. When a provider fails repeatedly, the circuit "opens" and requests are rejected immediately (or rerouted to a fallback) until the provider recovers.
Unlike rule-based features (deduplication, suppression, etc.), circuit breakers operate at the infrastructure level and apply automatically to every request targeting a provider — no rules required.
How It Works¶
stateDiagram-v2
[*] --> Closed
Closed --> Open : Consecutive failures >= threshold
Open --> HalfOpen : Recovery timeout elapsed
HalfOpen --> Closed : Consecutive successes >= threshold
HalfOpen --> Open : Any failure
States¶
| State | Behavior |
|---|---|
| Closed | Normal operation. Requests flow through to the provider. Failures are counted. |
| Open | Provider is unhealthy. Requests are rejected immediately with CircuitOpen (or rerouted to a fallback). |
| HalfOpen | Recovery probe. A single request is allowed through to test provider health. Additional requests are rejected until the probe completes. |
Transition Rules¶
- Closed -> Open: After failure_threshold consecutive retryable failures (connection errors, timeouts). Non-retryable errors (validation, auth) do not count.
- Open -> HalfOpen: After recovery_timeout elapses since the last failure.
- HalfOpen -> Closed: After success_threshold consecutive successful probes.
- HalfOpen -> Open: On any probe failure.
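Read as code, these rules form a small state machine. A minimal, self-contained sketch of that logic (hypothetical types and field names, not the gateway's internals):
// Sketch only: hypothetical types mirroring the transition rules above.
use std::time::{Duration, Instant};

#[derive(Clone, Copy, PartialEq)]
enum CircuitState { Closed, Open, HalfOpen }

struct Circuit {
    state: CircuitState,
    failures: u32,            // consecutive retryable failures while Closed
    successes: u32,           // consecutive successful probes while HalfOpen
    last_failure: Option<Instant>,
    failure_threshold: u32,
    success_threshold: u32,
    recovery_timeout: Duration,
}

impl Circuit {
    /// Record a request result. `retryable` is false for validation/auth
    /// errors, which never count toward the failure threshold.
    fn record(&mut self, success: bool, retryable: bool) {
        match (self.state, success) {
            (CircuitState::Closed, true) => self.failures = 0,
            (CircuitState::Closed, false) if retryable => {
                self.failures += 1;
                self.last_failure = Some(Instant::now());
                if self.failures >= self.failure_threshold {
                    self.state = CircuitState::Open; // Closed -> Open
                }
            }
            (CircuitState::HalfOpen, true) => {
                self.successes += 1;
                if self.successes >= self.success_threshold {
                    self.state = CircuitState::Closed; // HalfOpen -> Closed
                    self.failures = 0;
                }
            }
            (CircuitState::HalfOpen, false) => {
                self.state = CircuitState::Open; // HalfOpen -> Open on any failure
                self.successes = 0;
                self.last_failure = Some(Instant::now());
            }
            _ => {} // non-retryable failures, and requests rejected while Open
        }
    }

    /// Open -> HalfOpen once the recovery timeout has elapsed.
    fn maybe_half_open(&mut self) {
        if self.state == CircuitState::Open
            && self.last_failure.map_or(true, |t| t.elapsed() >= self.recovery_timeout)
        {
            self.state = CircuitState::HalfOpen;
            self.successes = 0;
        }
    }
}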
Configuration¶
TOML (Server)¶
# ─── Circuit Breaker ────────────────────────────────────
[circuit_breaker]
enabled = true
failure_threshold = 5 # Consecutive failures to open
success_threshold = 2 # Consecutive successes to close
recovery_timeout_seconds = 60 # Seconds before probing
# Per-provider overrides
[circuit_breaker.providers.email]
failure_threshold = 10
recovery_timeout_seconds = 120
fallback_provider = "webhook"
[circuit_breaker.providers.sms]
fallback_provider = "push-notification"
Parameters¶
Default (applies to all providers)¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable circuit breakers |
| failure_threshold | u32 | 5 | Consecutive failures before opening |
| success_threshold | u32 | 2 | Consecutive successes in HalfOpen to close |
| recovery_timeout_seconds | u64 | 60 | Seconds in Open before probing |
Per-provider overrides¶
| Parameter | Type | Required | Description |
|---|---|---|---|
| failure_threshold | u32 | No | Override default failure threshold |
| success_threshold | u32 | No | Override default success threshold |
| recovery_timeout_seconds | u64 | No | Override default recovery timeout |
| fallback_provider | string | No | Provider to reroute to when the circuit is open |
Per-provider fields inherit from the defaults when not specified.
Rust API¶
use acteon_gateway::{GatewayBuilder, CircuitBreakerConfig};
use std::time::Duration;
let gateway = GatewayBuilder::new()
.state(state)
.lock(lock)
// Default circuit breaker for all providers
.circuit_breaker(CircuitBreakerConfig {
failure_threshold: 5,
success_threshold: 2,
recovery_timeout: Duration::from_secs(60),
fallback_provider: None,
})
// Per-provider override with fallback
.circuit_breaker_provider("email", CircuitBreakerConfig {
failure_threshold: 10,
success_threshold: 2,
recovery_timeout: Duration::from_secs(120),
fallback_provider: Some("webhook".to_string()),
})
.provider(email_provider)
.provider(webhook_provider)
.build()?;
Fallback Routing¶
When a provider's circuit opens and a fallback_provider is configured, traffic is automatically rerouted to the fallback instead of being rejected.
flowchart LR
A["Action<br/>provider: email"] --> B{Circuit open?}
B -->|No| C["Execute via Email"]
B -->|Yes| D{Fallback configured?}
D -->|Yes| E["Execute via Webhook"]
D -->|No| F["Return CircuitOpen"]
E --> G["Return Rerouted"]
Recursive Fallback Chains¶
Fallback chains are resolved recursively. If the fallback's circuit is also open and it has its own fallback configured, the gateway continues walking the chain until it finds a healthy provider or exhausts the chain. This enables multi-region failover scenarios where multiple providers can be chained.
flowchart TD
A["Action: region-us"] --> B{"region-us<br/>circuit open?"}
B -->|No| C["Execute via region-us"]
B -->|Yes| D{"region-eu<br/>circuit open?"}
D -->|No| E["Execute via region-eu<br/>(Rerouted)"]
D -->|Yes| F{"region-ap<br/>circuit open?"}
F -->|No| G["Execute via region-ap<br/>(Rerouted)"]
F -->|Yes| H["Return CircuitOpen<br/>fallback_chain: [region-eu, region-ap]"]
Example: Multi-Region Failover¶
[circuit_breaker]
enabled = true
[circuit_breaker.providers.region-us]
fallback_provider = "region-eu"
[circuit_breaker.providers.region-eu]
fallback_provider = "region-ap"
let gateway = GatewayBuilder::new()
.state(state)
.lock(lock)
.circuit_breaker(CircuitBreakerConfig {
failure_threshold: 5,
success_threshold: 2,
recovery_timeout: Duration::from_secs(60),
fallback_provider: None,
})
.circuit_breaker_provider("region-us", CircuitBreakerConfig {
fallback_provider: Some("region-eu".to_string()),
..CircuitBreakerConfig::default()
})
.circuit_breaker_provider("region-eu", CircuitBreakerConfig {
fallback_provider: Some("region-ap".to_string()),
..CircuitBreakerConfig::default()
})
.provider(us_provider)
.provider(eu_provider)
.provider(ap_provider)
.build()?;
When region-us and region-eu are both down, traffic cascades automatically to region-ap. The Rerouted outcome reports the final destination:
{
"outcome": "Rerouted",
"original_provider": "region-us",
"new_provider": "region-ap",
"response": { "status": "success", "body": {} }
}
If all providers in the chain are open, the CircuitOpen outcome lists every fallback that was attempted; see the Response section below.
Build-Time Validation¶
Fallback provider names and chains are validated at build time. The gateway returns a configuration error if a fallback_provider:
- References a provider that isn't registered
- References itself (self-referencing fallback)
- Creates a cycle (e.g., A→B→C→A)
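As a sketch using the builder API shown earlier, a two-provider cycle (email -> webhook -> email) should surface a configuration error from build(); the surrounding variables are assumed to be in scope as in the previous examples.
// Sketch: a cyclic fallback chain is rejected when the gateway is built.
let result = GatewayBuilder::new()
    .state(state)
    .lock(lock)
    .circuit_breaker_provider("email", CircuitBreakerConfig {
        fallback_provider: Some("webhook".to_string()),
        ..CircuitBreakerConfig::default()
    })
    .circuit_breaker_provider("webhook", CircuitBreakerConfig {
        fallback_provider: Some("email".to_string()),
        ..CircuitBreakerConfig::default()
    })
    .provider(email_provider)
    .provider(webhook_provider)
    .build();

assert!(result.is_err()); // cycle: email -> webhook -> email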
Probe Limiting (Thundering Herd Prevention)¶
In HalfOpen state, only one probe request is allowed at a time. This prevents a burst of requests from overwhelming a recovering provider.
- When a probe is in flight, additional requests are rejected with CircuitOpen.
- If the probe succeeds, the probe slot is released and the next request can probe again (until success_threshold is met).
- If the probe fails, the circuit reopens and the recovery timeout restarts.
- Probes that don't complete within 30 seconds are considered stale and the slot is freed.
This works correctly across multiple gateway instances because probe state is tracked in the shared state store.
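As an illustration only, the probe-slot rule can be expressed like this; the types here are hypothetical, and the gateway itself keeps this state in the shared state store rather than in a local struct.
// Illustrative probe-slot limiting with a 30-second staleness window.
use std::time::{Duration, Instant};

const PROBE_STALE_AFTER: Duration = Duration::from_secs(30);

struct ProbeSlot {
    in_flight_since: Option<Instant>,
}

impl ProbeSlot {
    /// Claim the single HalfOpen probe slot, reclaiming it if a previous
    /// probe has gone stale. Returns false while another probe is in flight.
    fn try_acquire(&mut self) -> bool {
        match self.in_flight_since {
            Some(started) if started.elapsed() < PROBE_STALE_AFTER => false,
            _ => {
                self.in_flight_since = Some(Instant::now());
                true
            }
        }
    }

    /// Release the slot once the probe completes, successfully or not.
    fn release(&mut self) {
        self.in_flight_since = None;
    }
}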
Distributed State¶
Circuit breaker state is persisted in the configured state store (StateStore) and mutations are serialized via the distributed lock (DistributedLock). This means:
- Multiple gateway instances share the same view of provider health.
- When one instance detects a provider failure, all instances see the circuit open.
- Probe coordination works across instances — only one instance sends the probe.
| Backend | Circuit Breaker Accuracy |
|---|---|
| Memory | Perfect (single process only) |
| Redis | Perfect (distributed) |
| PostgreSQL | Perfect (distributed) |
If the state store or lock is unavailable, circuit breakers fail open — requests are allowed through rather than being blocked.
Response¶
When the circuit is open and no fallback is configured, the request is rejected with a CircuitOpen outcome.
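An illustrative shape (field names other than outcome are assumptions, modeled on the Rerouted payload below):
{
  "outcome": "CircuitOpen",
  "provider": "email"
}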
When the circuit is open and the full fallback chain is exhausted, the CircuitOpen outcome also reports every fallback that was attempted.
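An illustrative shape (the fallback_chain field mirrors the multi-region diagram above; the other names are assumptions):
{
  "outcome": "CircuitOpen",
  "provider": "region-us",
  "fallback_chain": ["region-eu", "region-ap"]
}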
When the circuit is open and a fallback (direct or via chain) succeeds:
{
"outcome": "Rerouted",
"original_provider": "email",
"new_provider": "webhook",
"response": {
"status": "success",
"body": {"sent": true}
}
}
Metrics¶
The gateway tracks circuit breaker activity:
| Metric | Description |
|---|---|
| circuit_open | Requests rejected because the circuit was open (no fallback) |
| circuit_fallbacks | Requests rerouted to a fallback provider |
| circuit_transitions | Total state transitions (Closed->Open, Open->HalfOpen, etc.) |
Example: Rust API with Simulation¶
use std::sync::Arc;
use std::time::Duration;
use acteon_core::ActionOutcome;
use acteon_gateway::{CircuitBreakerConfig, GatewayBuilder};
use acteon_simulation::provider::{FailureMode, RecordingProvider};
use acteon_state_memory::{MemoryDistributedLock, MemoryStateStore};
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let state = Arc::new(MemoryStateStore::new());
let lock = Arc::new(MemoryDistributedLock::new());
let primary = Arc::new(
RecordingProvider::new("email")
.with_failure_mode(FailureMode::Always),
);
let fallback = Arc::new(RecordingProvider::new("webhook"));
let gateway = GatewayBuilder::new()
.state(state)
.lock(lock)
.circuit_breaker(CircuitBreakerConfig {
failure_threshold: 3,
success_threshold: 2,
recovery_timeout: Duration::from_secs(60),
fallback_provider: None,
})
.circuit_breaker_provider("email", CircuitBreakerConfig {
failure_threshold: 3,
success_threshold: 2,
recovery_timeout: Duration::from_secs(60),
fallback_provider: Some("webhook".to_string()),
})
.provider(primary.clone() as Arc<dyn acteon_provider::DynProvider>)
.provider(fallback.clone() as Arc<dyn acteon_provider::DynProvider>)
.build()?;
// First 3 requests fail and trip the circuit
for _ in 0..3 {
let action = acteon_core::Action::new(
"ns", "tenant", "email", "send", serde_json::json!({}),
);
let outcome = gateway.dispatch(action, None).await?;
assert!(matches!(outcome, ActionOutcome::Failed(_)));
}
// Subsequent requests are rerouted to the webhook fallback
let action = acteon_core::Action::new(
"ns", "tenant", "email", "send", serde_json::json!({}),
);
let outcome = gateway.dispatch(action, None).await?;
assert!(matches!(outcome, ActionOutcome::Rerouted { .. }));
gateway.shutdown().await;
Ok(())
}
Running the Full Simulation
A comprehensive 5-scenario simulation is included. It demonstrates basic circuit opening, fallback routing, the full recovery lifecycle, independent per-provider circuits, and multi-level fallback chains.
Admin API¶
Operators can manually trip and reset circuit breakers via the HTTP admin API without restarting the gateway. This is useful during incidents when you need to immediately isolate a failing provider or restore traffic after a manual fix.
All admin endpoints require authentication with the admin or operator role (CircuitBreakerManage permission).
List Circuit Breakers¶
Returns all registered circuit breakers with their distributed state (read from the shared state store, not a local cache) and configuration.
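Assuming the list endpoint sits under the same path prefix as the trip and reset endpoints below (the exact path is an assumption):
curl http://localhost:8080/admin/circuit-breakers \
  -H "Authorization: Bearer <token>"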
{
"circuit_breakers": [
{
"provider": "email",
"state": "closed",
"failure_threshold": 5,
"success_threshold": 2,
"recovery_timeout_seconds": 60,
"fallback_provider": "webhook"
},
{
"provider": "webhook",
"state": "open",
"failure_threshold": 5,
"success_threshold": 2,
"recovery_timeout_seconds": 60
}
]
}
Trip (Force Open)¶
Force-opens the circuit for a provider, immediately rejecting all requests (or rerouting to its fallback). The last_failure_time is set to now, so the normal recovery_timeout applies from this point forward.
curl -X POST http://localhost:8080/admin/circuit-breakers/email/trip \
-H "Authorization: Bearer <token>"
Reset (Force Close)¶
Force-closes the circuit, restoring normal request flow. All failure counters are cleared.
curl -X POST http://localhost:8080/admin/circuit-breakers/email/reset \
-H "Authorization: Bearer <token>"
Rust API¶
The trip() and reset() methods are also available programmatically:
if let Some(registry) = gateway.circuit_breakers() {
if let Some(cb) = registry.get("email") {
cb.trip().await; // Force open
cb.reset().await; // Force close
}
}
Design Notes¶
- Only retryable errors (connection failures, timeouts) count toward the failure threshold. Non-retryable errors like authentication failures or validation errors do not trip the circuit.
- recovery_timeout = 0 is allowed and useful for testing: the circuit transitions to HalfOpen immediately (see the snippet after this list).
- Circuit breakers are independent per provider. One provider's failures never affect another provider's circuit.
- The circuit breaker runs before the executor's retry logic. If the circuit is open, the request is rejected without any retry attempts.
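For instance, a test setup might use the config type shown earlier with a zero recovery timeout; the threshold values here are arbitrary.
// Test-only sketch: the circuit can probe as soon as it opens.
let test_config = CircuitBreakerConfig {
    failure_threshold: 1,
    success_threshold: 1,
    recovery_timeout: Duration::from_secs(0),
    fallback_provider: None,
};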