Provider Health Dashboard

The Provider Health Dashboard provides real-time visibility into the health and performance of all registered providers. Unlike circuit breakers (which operate reactively when providers fail), the health dashboard offers comprehensive observability — success rates, latency percentiles, health check status, and circuit breaker state — all accessible via a single API endpoint and in the Admin UI.

This feature is infrastructure-level and operates automatically. No configuration or rules are required — metrics collection begins as soon as providers are registered and start handling requests.

How It Works

The gateway tracks three orthogonal dimensions of provider health:

Health Check Status: Result of the provider's health() method (supports readiness checks, ping tests, etc.)
Circuit Breaker State: Whether the circuit is open/closed/half-open (if circuit breakers are enabled)
Execution Metrics: Per-provider counters and latency percentiles collected during normal operation

These three data sources are combined into a unified health report that updates in real time as actions are dispatched.

Metric Collection

The gateway collects per-provider statistics automatically during execute_action():

Success/failure counters: Incremented on each execution based on the provider's response
Latency samples: Each request's duration is recorded in microseconds
Last request timestamp: Unix timestamp (in milliseconds) of the most recent request
Last error: The most recent error message from the provider (if any)

All metrics are ephemeral (in-memory only) and reset on server restart. For historical analysis and long-term monitoring, export gateway metrics to Prometheus.
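The counters described above can be sketched with standard-library atomics. This is a minimal illustration, not the gateway's actual code: the `ProviderCounters` type, its field names, and the `record` signature are all assumptions made for the example.

```rust
use std::sync::atomic::{AtomicI64, AtomicU64, Ordering};

/// Hypothetical per-provider counters; all names here are illustrative.
#[derive(Default)]
pub struct ProviderCounters {
    total: AtomicU64,
    successes: AtomicU64,
    failures: AtomicU64,
    total_latency_us: AtomicU64,
    /// Unix milliseconds; 0 stands for "never executed" in this sketch.
    last_request_at_ms: AtomicI64,
}

impl ProviderCounters {
    /// Record one execution outcome without taking any lock.
    pub fn record(&self, success: bool, latency_us: u64, now_ms: i64) {
        self.total.fetch_add(1, Ordering::Relaxed);
        if success {
            self.successes.fetch_add(1, Ordering::Relaxed);
        } else {
            self.failures.fetch_add(1, Ordering::Relaxed);
        }
        self.total_latency_us.fetch_add(latency_us, Ordering::Relaxed);
        self.last_request_at_ms.store(now_ms, Ordering::Relaxed);
    }

    /// (total, successes, failures) as currently observed.
    pub fn totals(&self) -> (u64, u64, u64) {
        (
            self.total.load(Ordering::Relaxed),
            self.successes.load(Ordering::Relaxed),
            self.failures.load(Ordering::Relaxed),
        )
    }

    /// Average latency in milliseconds, or 0.0 if nothing was recorded.
    pub fn avg_latency_ms(&self) -> f64 {
        let total = self.total.load(Ordering::Relaxed);
        if total == 0 {
            return 0.0;
        }
        self.total_latency_us.load(Ordering::Relaxed) as f64 / total as f64 / 1000.0
    }

    /// Timestamp of the last request, or None if never executed.
    pub fn last_request_at(&self) -> Option<i64> {
        match self.last_request_at_ms.load(Ordering::Relaxed) {
            0 => None,
            ms => Some(ms),
        }
    }
}
```

Because every field is an independent atomic, a reader may briefly observe a total that is one ahead of successes plus failures. For a dashboard snapshot that skew is usually acceptable.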

Latency Percentiles

The gateway maintains a rolling window of the most recent 1,000 latency samples per provider. When you query the health dashboard, percentiles (p50, p95, p99) are computed from this buffer using a selection algorithm.

Important: The window is bounded by sample count, not by time. For low-to-medium traffic providers (< 100 req/s), 1,000 samples span at least 10 seconds and give representative percentiles. For high-throughput providers (1,000+ req/s), the same buffer covers at most ~1 second of traffic. For production-grade long-term latency monitoring, use Prometheus metrics and precomputed histogram buckets.
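As an illustration of percentile computation by selection, the sketch below uses the standard library's `select_nth_unstable` on a copy of the samples. The function name `percentile_ms` and the nearest-rank definition of a percentile are assumptions; the gateway may use a different rank convention.

```rust
/// Nearest-rank percentile over a copy of the samples.
/// Input samples are in microseconds; the result is in milliseconds.
/// Returns None when there are no samples.
pub fn percentile_ms(samples_us: &[u64], pct: f64) -> Option<f64> {
    if samples_us.is_empty() {
        return None;
    }
    // Work on a copy so the caller's buffer keeps its arrival order.
    let mut buf = samples_us.to_vec();
    // Nearest rank: the smallest value with at least pct% of samples at or below it.
    let rank = (pct * buf.len() as f64 / 100.0).ceil() as usize;
    let idx = rank.clamp(1, buf.len()) - 1;
    // Selection places the idx-th smallest element at position idx in O(n) on average,
    // without fully sorting the buffer.
    let (_, value, _) = buf.select_nth_unstable(idx);
    Some(*value as f64 / 1000.0)
}
```

Calling `percentile_ms(&samples, 99.0)` on a full 1,000-sample window reads the 990th smallest sample, so the p99 value is determined by the ten slowest requests in the window.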

API Reference

Get Provider Health

GET /v1/providers/health

Returns health status, circuit breaker state, and execution metrics for all registered providers.

Authentication: Requires a valid API token (all roles).

Response (200):

{
  "providers": [
    {
      "provider": "email",
      "healthy": true,
      "health_check_error": null,
      "circuit_breaker_state": "closed",
      "total_requests": 15482,
      "successes": 15301,
      "failures": 181,
      "success_rate": 98.83,
      "avg_latency_ms": 47.3,
      "p50_latency_ms": 32.0,
      "p95_latency_ms": 125.4,
      "p99_latency_ms": 280.0,
      "last_request_at": 1707900123456,
      "last_error": null
    },
    {
      "provider": "webhook",
      "healthy": false,
      "health_check_error": "connection refused",
      "circuit_breaker_state": "open",
      "total_requests": 230,
      "successes": 45,
      "failures": 185,
      "success_rate": 19.57,
      "avg_latency_ms": 1850.2,
      "p50_latency_ms": 1420.0,
      "p95_latency_ms": 5000.0,
      "p99_latency_ms": 10000.0,
      "last_request_at": 1707899000000,
      "last_error": "timeout after 10s"
    }
  ]
}

Response Fields

Field	Type	Description
`provider`	string	Provider name
`healthy`	bool	Whether the provider's health check passed
`health_check_error`	string?	Health check error message (null if healthy)
`circuit_breaker_state`	string?	Circuit state (`closed`, `open`, `half_open`) — null if circuit breakers are disabled
`total_requests`	u64	Total requests routed to this provider since startup (0 if never used)
`successes`	u64	Successful executions
`failures`	u64	Failed executions
`success_rate`	f64	Success rate as a percentage (0.0 to 100.0)
`avg_latency_ms`	f64	Average latency in milliseconds
`p50_latency_ms`	f64	Median (50th percentile) latency in milliseconds
`p95_latency_ms`	f64	95th percentile latency in milliseconds
`p99_latency_ms`	f64	99th percentile latency in milliseconds
`last_request_at`	i64?	Unix milliseconds of the last request (null if never executed)
`last_error`	string?	Most recent error message (null if none)
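The relationship between `successes`, `total_requests`, and `success_rate` can be checked against the sample response with a small helper. The function name is hypothetical, and returning 0.0 for a provider with no requests is an assumption; the source does not state what the gateway reports in that case.

```rust
/// Success rate as a percentage in the range 0.0 to 100.0.
/// Returns 0.0 when no requests have been recorded (an assumption of this sketch).
pub fn success_rate(successes: u64, total_requests: u64) -> f64 {
    if total_requests == 0 {
        return 0.0;
    }
    successes as f64 / total_requests as f64 * 100.0
}
```

With the sample data above, 15,301 successes out of 15,482 requests rounds to 98.83, and 45 out of 230 rounds to 19.57, matching the `success_rate` values in the example response.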

Health Status Determination

A provider is marked healthy: true if its health check passes. Health checks are implemented by the provider's health() method and can perform any validation — network connectivity, credential checks, rate limit status, etc.

Independent of circuit breakers: A provider can be healthy: true (health check passes) while circuit_breaker_state: "open" (too many recent failures). The health check validates potential readiness; the circuit breaker tracks actual operational health.

Example: An email provider's health check might verify SMTP credentials and connection, returning healthy: true. However, if the SMTP server then starts timing out during actual message sends, the circuit breaker will trip to open while healthy remains true (credentials are still valid, server is reachable, but performance is degraded).
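A client consuming the API might combine the two signals as follows. This is only a sketch: the `Diagnosis` enum and `diagnose` function are hypothetical names, and the state strings are taken from the response field table.

```rust
/// Hypothetical client-side interpretation of the two independent signals.
#[derive(Debug, PartialEq)]
pub enum Diagnosis {
    /// Health check passes and the circuit is closed (or breakers are disabled).
    Normal,
    /// Health check fails: likely connectivity or credential problems.
    HealthCheckFailing,
    /// Health check passes, but real traffic has tripped the breaker.
    PassingButTripped,
}

pub fn diagnose(healthy: bool, circuit_breaker_state: Option<&str>) -> Diagnosis {
    match (healthy, circuit_breaker_state) {
        (false, _) => Diagnosis::HealthCheckFailing,
        (true, Some("open")) | (true, Some("half_open")) => Diagnosis::PassingButTripped,
        // Covers "closed" and None (circuit breakers disabled).
        (true, _) => Diagnosis::Normal,
    }
}
```

The SMTP scenario above corresponds to `diagnose(true, Some("open"))`, which yields `PassingButTripped`.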

Memory Usage

Per-provider memory overhead is approximately 8 KB:

Latency sample buffer: 1,000 × 8 bytes (u64) = 8,000 bytes
Counters: ~64 bytes (atomic u64s for total/success/failure/latency)
Last error string: ~100 bytes average (variable)

For a typical deployment with 5-10 providers, total memory overhead is ~40-80 KB. Even with 100 providers, the total footprint is under 1 MB.
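The estimate can be reproduced with a few constants. The constant and function names are illustrative, and the counter and error-string sizes are the approximate figures quoted above rather than measured values.

```rust
use std::mem::size_of;

/// Figures taken from the breakdown above; names are illustrative.
pub const WINDOW_SAMPLES: usize = 1_000;
pub const COUNTER_BYTES: usize = 64;
pub const AVG_LAST_ERROR_BYTES: usize = 100;

/// Approximate bytes of metric state per provider.
pub const fn per_provider_bytes() -> usize {
    WINDOW_SAMPLES * size_of::<u64>() + COUNTER_BYTES + AVG_LAST_ERROR_BYTES
}

/// Approximate bytes of metric state for a whole deployment.
pub const fn total_bytes(providers: usize) -> usize {
    providers * per_provider_bytes()
}
```

This gives 8,164 bytes per provider, about 82 KB for 10 providers, and about 816 KB for 100 providers.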

Thread Safety

All metrics use lock-free atomic operations where possible (counters, timestamps) and a short-duration parking_lot::Mutex for the latency sample buffer and last-error string. The latency buffer lock is held only for the time required to:

Push a new sample (O(1) with VecDeque::push_back)
Evict the oldest sample if the buffer is full (O(1) with VecDeque::pop_front)

Percentile computation (during snapshot()) acquires the lock once to copy the buffer, then releases it and computes percentiles on the copy. This ensures dashboard queries do not block live action dispatch.
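The push, evict, and copy-then-release pattern described above might look like the following sketch. It uses `std::sync::Mutex` as a stand-in for `parking_lot::Mutex` to stay dependency-free, and the `LatencyWindow` type and its method names are assumptions.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// Hypothetical bounded window of latency samples (microseconds).
pub struct LatencyWindow {
    capacity: usize,
    samples: Mutex<VecDeque<u64>>,
}

impl LatencyWindow {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "window capacity must be non-zero");
        Self {
            capacity,
            samples: Mutex::new(VecDeque::with_capacity(capacity)),
        }
    }

    /// Called on the dispatch path: two O(1) operations under the lock.
    pub fn push(&self, latency_us: u64) {
        let mut buf = self.samples.lock().unwrap();
        if buf.len() == self.capacity {
            buf.pop_front(); // evict the oldest sample
        }
        buf.push_back(latency_us);
    }

    /// Called on the query path: copy under the lock, then release it.
    /// The caller computes percentiles on the returned Vec without holding the lock.
    pub fn snapshot(&self) -> Vec<u64> {
        let buf = self.samples.lock().unwrap();
        buf.iter().copied().collect()
    }
}
```

Copying 1,000 `u64` values is a short, bounded operation, so dispatch threads calling `push` wait only for that copy and never for the percentile computation itself.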

Integration with Circuit Breaker

The health dashboard displays the current circuit breaker state for each provider. When circuit breakers are enabled, you'll see:

circuit_breaker_state: "closed" — Normal operation
circuit_breaker_state: "open" — Circuit is open (requests rejected or rerouted)
circuit_breaker_state: "half_open" — Circuit is probing for recovery

If circuit breakers are disabled, circuit_breaker_state is null.

The circuit breaker state is read from the distributed state store on each health dashboard request, ensuring multi-instance deployments show consistent data.

Admin UI

The Admin UI includes a dedicated Provider Health page accessible from the main navigation. The dashboard displays:

Provider list with status indicators (green = healthy, red = unhealthy)
Success rate as a percentage with visual bar chart
Latency percentiles (p50/p95/p99) in milliseconds
Circuit breaker badge showing current state (Closed/Open/Half-Open)
Last error message (if any)
Last request timestamp in human-readable format
Auto-refresh every 5 seconds (configurable)

The UI uses the same GET /v1/providers/health API endpoint consumed by external dashboards.

Configuration

No special configuration is required. The provider health dashboard works automatically when:

Providers are registered via GatewayBuilder::provider()
The server is running

Health checks run on-demand when the dashboard is queried (not on a background schedule). This ensures the health status is always fresh without adding background load.
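On-demand checking can be pictured as calling each provider's `health()` at query time and recording the outcome. The sketch below is synchronous and simplified: the `Provider` trait shape, the `Result<(), String>` return type, `HealthStatus`, `check_all`, and `StubProvider` are all assumptions. Only the `health()` method name comes from this document.

```rust
/// Hypothetical provider interface; the real trait may be async and richer.
pub trait Provider {
    fn name(&self) -> &str;
    fn health(&self) -> Result<(), String>;
}

/// Subset of the response fields produced by a health check.
pub struct HealthStatus {
    pub provider: String,
    pub healthy: bool,
    pub health_check_error: Option<String>,
}

/// Run every health check at query time; nothing runs in the background.
pub fn check_all(providers: &[Box<dyn Provider>]) -> Vec<HealthStatus> {
    providers
        .iter()
        .map(|p| match p.health() {
            Ok(()) => HealthStatus {
                provider: p.name().to_string(),
                healthy: true,
                health_check_error: None,
            },
            Err(e) => HealthStatus {
                provider: p.name().to_string(),
                healthy: false,
                health_check_error: Some(e),
            },
        })
        .collect()
}

/// Stub used only to exercise the sketch.
pub struct StubProvider {
    pub name: &'static str,
    pub result: Result<(), String>,
}

impl Provider for StubProvider {
    fn name(&self) -> &str {
        self.name
    }
    fn health(&self) -> Result<(), String> {
        self.result.clone()
    }
}
```

One consequence of this design is that the response time of a dashboard query depends on how long the providers' health checks take.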

Circuit breaker state is only included if circuit breakers are enabled via the [circuit_breaker] config section. See the Circuit Breaker documentation for details.

Use Cases

Incident Response

When investigating an outage or degraded performance, the health dashboard provides immediate visibility into which providers are failing and why:

curl -H "Authorization: Bearer <token>" \
  http://localhost:8080/v1/providers/health | jq '.providers[] | select(.success_rate < 90)'

This query returns all providers with < 90% success rate, showing you where to focus remediation efforts.

Capacity Planning

Use latency percentiles to identify providers approaching saturation. If p95 or p99 latencies start climbing, it may indicate the provider is reaching capacity limits or experiencing network congestion.

Circuit Breaker Tuning

The health dashboard helps calibrate circuit breaker thresholds. If a provider's success_rate is 85% but its circuit is open, you may want to increase the failure_threshold. Conversely, if a provider's success_rate is 50% but the circuit is closed, the threshold may be too lenient.
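The two mismatch cases can be turned into a simple check over the dashboard data. This is a heuristic sketch only: `TuningHint`, `tuning_hint`, and the cutoff parameters are hypothetical. Since `total_requests` is counted since startup, `success_rate` is presumably cumulative as well, while the breaker reacts to recent failures, so a mismatch is a prompt to investigate rather than proof of a bad threshold.

```rust
#[derive(Debug, PartialEq)]
pub enum TuningHint {
    /// Circuit is open although the success rate looks acceptable.
    ThresholdMayBeTooStrict,
    /// Circuit is closed although the success rate looks poor.
    ThresholdMayBeTooLenient,
    NoObviousMismatch,
}

/// `acceptable_rate` and `poor_rate` are caller-chosen cutoffs in percent.
pub fn tuning_hint(
    success_rate: f64,
    circuit_breaker_state: Option<&str>,
    acceptable_rate: f64,
    poor_rate: f64,
) -> TuningHint {
    match circuit_breaker_state {
        Some("open") if success_rate >= acceptable_rate => TuningHint::ThresholdMayBeTooStrict,
        Some("closed") if success_rate <= poor_rate => TuningHint::ThresholdMayBeTooLenient,
        // Covers half_open, disabled breakers, and consistent combinations.
        _ => TuningHint::NoObviousMismatch,
    }
}
```

Using the figures from the paragraph above as cutoffs, `tuning_hint(85.0, Some("open"), 85.0, 50.0)` reports `ThresholdMayBeTooStrict`.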

SLA Monitoring

Integrate the health dashboard API into your monitoring stack (Grafana, DataDog, etc.) to track provider SLAs over time. Set alerts when success rates drop below acceptable thresholds or when p99 latencies exceed SLA targets.
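An alert rule over a single provider entry could be sketched as below. The `SlaTargets` struct, the `sla_violations` function, and the target values in the usage note are hypothetical. Only the `success_rate` and `p99_latency_ms` inputs correspond to fields in the API response.

```rust
/// Hypothetical SLA targets chosen by the operator.
pub struct SlaTargets {
    pub min_success_rate: f64,   // percent
    pub max_p99_latency_ms: f64, // milliseconds
}

/// Return one human-readable message per violated target.
pub fn sla_violations(
    provider: &str,
    success_rate: f64,
    p99_latency_ms: f64,
    targets: &SlaTargets,
) -> Vec<String> {
    let mut violations = Vec::new();
    if success_rate < targets.min_success_rate {
        violations.push(format!(
            "{}: success rate {:.2}% is below target {:.2}%",
            provider, success_rate, targets.min_success_rate
        ));
    }
    if p99_latency_ms > targets.max_p99_latency_ms {
        violations.push(format!(
            "{}: p99 latency {:.1}ms exceeds target {:.1}ms",
            provider, p99_latency_ms, targets.max_p99_latency_ms
        ));
    }
    violations
}
```

With targets of 99% success and 500 ms p99, the `webhook` entry from the sample response (19.57%, 10,000 ms) violates both targets, while the `email` entry (98.83%, 280 ms) violates only the success-rate target.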

Example: Rust Client

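use acteon_client::ActeonClient;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // The gateway URL and API token are placeholders; substitute your own.
    let client = ActeonClient::new("http://localhost:8080", "your-api-token")?;

    // Fetch the current health snapshot for all registered providers.
    let health = client.list_provider_health().await?;

    for provider in &health.providers {
        // One summary line per provider: name, health check result, success rate, p99.
        println!("{}: {} ({:.2}% success, p99: {:.1}ms)",
            provider.provider,
            if provider.healthy { "✓" } else { "✗" },
            provider.success_rate,
            provider.p99_latency_ms
        );

        // None when circuit breakers are disabled.
        if let Some(state) = &provider.circuit_breaker_state {
            println!("  Circuit: {}", state);
        }

        // None when the provider has not reported an error.
        if let Some(err) = &provider.last_error {
            println!("  Last error: {}", err);
        }
    }

    Ok(())
}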

Limitations

In-Memory Only

All metrics are ephemeral and reset on server restart. This is by design — the health dashboard is intended for real-time operational visibility, not long-term trend analysis.

For historical metrics and dashboards, use the gateway's Prometheus /metrics endpoint, which exports the same counters in a format compatible with long-term storage and alerting (Prometheus, Grafana, Thanos, etc.).

High-Throughput Latency Accuracy

The 1,000-sample latency buffer provides accurate percentiles for providers handling up to ~100 req/s, where it spans at least 10 seconds of traffic. Beyond that, the buffer covers less than 10 seconds (about 1 second at 1,000 req/s), which may not reflect long-term performance.

For high-throughput providers (1000+ req/s), use Prometheus histogram metrics with precomputed buckets instead of the in-memory percentile buffer.
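The time span covered by the buffer follows directly from the request rate. The helper below is a hypothetical illustration of that arithmetic, assuming a steady request rate.

```rust
/// Seconds of traffic represented by a full sample window at a steady request rate.
/// Returns None for a non-positive rate, where the span is undefined.
pub fn window_coverage_secs(window_samples: u32, requests_per_sec: f64) -> Option<f64> {
    if requests_per_sec <= 0.0 {
        return None;
    }
    Some(window_samples as f64 / requests_per_sec)
}
```

For the 1,000-sample window this gives 100 seconds at 10 req/s, 10 seconds at 100 req/s, and 1 second at 1,000 req/s.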

No Historical Trend Data

The dashboard shows current snapshot data only. It cannot answer questions like "What was the p99 latency 3 hours ago?" or "How has success rate changed over the last week?"

For time-series queries, export metrics to Prometheus and query using PromQL.

Design Notes

Why in-memory instead of state store? Metrics are ephemeral by nature and don't need durability. Storing them in the state backend would add latency and storage overhead with no operational benefit.
Why 1,000 samples? This strikes a balance between accuracy (sufficient for stable p99 estimates) and memory overhead (~8 KB per provider). Increasing to 10,000 samples would improve accuracy for high-throughput providers but increase memory usage 10x.
Why compute percentiles on-query instead of pre-aggregating? Pre-aggregation (e.g., maintaining sorted buckets) would reduce query latency but increase complexity and lock contention during high-throughput dispatch. The current approach optimizes for dispatch performance (recording a latency sample holds the buffer lock only for an O(1) push and eviction) at the expense of slightly slower dashboard queries (percentile computation takes ~1-2ms for 1,000 samples).
Why include health check status? Some failures are environmental (network issues, credential expiry) rather than operational (high latency, rate limits). The health check status disambiguates these failure modes.