Grafana Dashboard Templates

Acteon ships with pre-built Grafana dashboards and a Prometheus scrape configuration that provide observability into the gateway out of the box. Two dashboards cover the full metric surface: an Overview dashboard for gateway-wide throughput and outcomes, and a Provider Health dashboard for per-provider latency percentiles and success rates.

No external dependencies are required beyond Prometheus and Grafana. The gateway exposes a lightweight Prometheus text-format endpoint (GET /metrics/prometheus) that is implemented without any third-party metrics crate.

[Figure: Acteon Overview Dashboard]

[Figure: Acteon Provider Health Dashboard]

Quick Start

The fastest way to get the full monitoring stack running is with Docker Compose:

docker compose --profile monitoring up

This starts four services:

Service	Port	Description
`acteon`	8080	Acteon gateway (API + metrics)
`redis`	6379	State backend
`prometheus`	9090	Metrics scraper (15s interval)
`grafana`	3000	Dashboard UI

Grafana is available at http://localhost:3000, using the default credentials configured in deploy/grafana/grafana.ini. Both dashboards are provisioned automatically under the Acteon folder.

Prometheus Endpoint

GET /metrics/prometheus

Returns all gateway metrics in Prometheus text exposition format (text/plain; version=0.0.4). No authentication is required for this endpoint.

The endpoint reads atomic in-memory counters and serializes them directly to the text format. There is no dependency on the prometheus crate -- the exporter is a hand-written ~80-line Axum handler that formats # HELP, # TYPE, and metric lines.
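The serialization step is simple enough to sketch. The Python snippet below is an illustration only, not the actual Rust handler: `render_metric` is a hypothetical helper that shows how a `# HELP` line, a `# TYPE` line, and one sample line per label set fit together in the text exposition format.

```python
def render_metric(name, kind, help_text, samples):
    """Render one metric family in Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs; an empty dict
    produces an unlabeled sample line. (Hypothetical helper, not Acteon code.)
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {kind}"]
    for labels, value in samples:
        if labels:
            parts = []
            for key, raw in sorted(labels.items()):
                # Label values must escape backslash, double quote, and newline.
                escaped = (
                    str(raw)
                    .replace("\\", "\\\\")
                    .replace('"', '\\"')
                    .replace("\n", "\\n")
                )
                parts.append(f'{key}="{escaped}"')
            lines.append(f"{name}{{{','.join(parts)}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"
```

Calling `render_metric("acteon_actions_dispatched_total", "counter", "Total number of actions dispatched to the gateway.", [({}, 15482)])` yields the three-line block for that counter.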

Example Response

# HELP acteon_actions_dispatched_total Total number of actions dispatched to the gateway.
# TYPE acteon_actions_dispatched_total counter
acteon_actions_dispatched_total 15482

# HELP acteon_provider_success_rate Provider success rate percentage (0-100).
# TYPE acteon_provider_success_rate gauge
acteon_provider_success_rate{provider="email"} 99.12
acteon_provider_success_rate{provider="slack"} 97.50
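To consume this output from a script, for example in a smoke test, a small parser is enough. The sketch below makes simplifying assumptions: `parse_metrics` is a hypothetical helper that handles only plain `name{labels} value` lines like the ones shown above. It skips comment lines and lines carrying a timestamp, and it does not unescape label values.

```python
import re

# name, optional {labels}, then a single value token (assumed line shape).
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (metric_name, labels_dict, float_value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, # HELP, or # TYPE
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # unsupported line shape, e.g. one with a timestamp
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

Feeding it the example response returns three samples, two of which carry a `provider` label.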

Exported Metrics

All metrics use the acteon_ prefix. Counters are monotonically increasing from server start; gauges reflect current computed values.
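Because counters restart from zero when the server restarts, anything that turns two readings into a rate has to account for resets. The sketch below illustrates that idea with a hypothetical `counter_rate` helper. It is a simplification of what Prometheus's `rate()` does, not a reimplementation of it.

```python
def counter_rate(previous, current, elapsed_seconds):
    """Per-second rate between two readings of a monotonically increasing counter."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    if current < previous:
        # The counter went backwards, so the server restarted and counted up
        # from zero again; only the post-restart increase is known.
        increase = current
    else:
        increase = current - previous
    return increase / elapsed_seconds
```

For two scrapes 15 seconds apart, `counter_rate(15482, 15782, 15)` gives 20 actions per second. After a restart, `counter_rate(15482, 30, 15)` gives 2 per second rather than a negative value.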

Gateway Dispatch Counters

Metric	Type	Description
`acteon_actions_dispatched_total`	counter	Total actions dispatched to the gateway
`acteon_actions_executed_total`	counter	Actions successfully executed by a provider
`acteon_actions_deduplicated_total`	counter	Actions skipped (deduplication)
`acteon_actions_suppressed_total`	counter	Actions suppressed by a rule
`acteon_actions_rerouted_total`	counter	Actions rerouted to a different provider
`acteon_actions_throttled_total`	counter	Actions rejected (rate limiting)
`acteon_actions_failed_total`	counter	Actions that failed after all retries
`acteon_actions_pending_approval_total`	counter	Actions sent to human approval
`acteon_actions_scheduled_total`	counter	Actions scheduled for delayed execution

LLM Guardrail Counters

Metric	Type	Description
`acteon_llm_guardrail_allowed_total`	counter	Actions approved by the LLM guardrail
`acteon_llm_guardrail_denied_total`	counter	Actions blocked by the LLM guardrail
`acteon_llm_guardrail_errors_total`	counter	LLM guardrail evaluation errors

Chain (Workflow) Counters

Metric	Type	Description
`acteon_chains_started_total`	counter	Task chains initiated
`acteon_chains_completed_total`	counter	Task chains completed successfully
`acteon_chains_failed_total`	counter	Task chains that failed
`acteon_chains_cancelled_total`	counter	Task chains cancelled

Circuit Breaker Counters

Metric	Type	Description
`acteon_circuit_open_total`	counter	Actions rejected (circuit breaker open)
`acteon_circuit_transitions_total`	counter	Circuit breaker state transitions
`acteon_circuit_fallbacks_total`	counter	Actions rerouted to fallback provider

Recurring Action Counters

Metric	Type	Description
`acteon_recurring_dispatched_total`	counter	Recurring actions dispatched
`acteon_recurring_errors_total`	counter	Recurring action dispatch errors
`acteon_recurring_skipped_total`	counter	Recurring actions skipped
`acteon_recurring_active`	gauge	Recurring actions currently scheduled and eligible for dispatch

Quota Counters

Metric	Type	Description
`acteon_quota_exceeded_total`	counter	Actions blocked by tenant quota (HTTP 429)
`acteon_quota_warned_total`	counter	Actions passed with a quota warning
`acteon_quota_degraded_total`	counter	Actions degraded to fallback provider
`acteon_quota_notified_total`	counter	Quota threshold notifications sent

Retention Reaper Counters

Metric	Type	Description
`acteon_retention_deleted_state_total`	counter	State entries deleted by retention reaper
`acteon_retention_skipped_compliance_total`	counter	Entries skipped due to compliance hold
`acteon_retention_errors_total`	counter	Retention reaper errors

Embedding Cache Counters

These metrics are only emitted when an embedding provider is configured.

Metric	Type	Description
`acteon_embedding_topic_cache_hits_total`	counter	Topic embeddings served from cache
`acteon_embedding_topic_cache_misses_total`	counter	Topic embeddings requiring API call
`acteon_embedding_text_cache_hits_total`	counter	Text embeddings served from cache
`acteon_embedding_text_cache_misses_total`	counter	Text embeddings requiring API call
`acteon_embedding_errors_total`	counter	Embedding provider errors
`acteon_embedding_fail_open_total`	counter	Fail-open returns (similarity 0.0)

Per-Provider Metrics

These metrics carry a provider label and are emitted for each registered provider.

Metric	Type	Description
`acteon_provider_requests_total`	counter	Total requests to the provider
`acteon_provider_successes_total`	counter	Successful provider executions
`acteon_provider_failures_total`	counter	Failed provider executions
`acteon_provider_success_rate`	gauge	Success rate percentage (0-100)
`acteon_provider_avg_latency_ms`	gauge	Average latency in milliseconds
`acteon_provider_p50_latency_ms`	gauge	Median latency in milliseconds
`acteon_provider_p95_latency_ms`	gauge	95th percentile latency in milliseconds
`acteon_provider_p99_latency_ms`	gauge	99th percentile latency in milliseconds
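This page does not say how the gateway derives these gauges from raw observations. As a rough illustration of what the numbers mean, the sketch below computes the same set of values from a list of latency samples. The nearest-rank percentile method and the `provider_summary` helper are assumptions made for the example, not a description of Acteon's implementation.

```python
def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list; `pct` is an integer 1-100."""
    if not sorted_values:
        return 0.0
    # Integer ceiling of pct/100 * n avoids floating-point rounding surprises.
    rank = max(1, -(-pct * len(sorted_values) // 100))
    return sorted_values[rank - 1]


def provider_summary(latencies_ms, successes, failures):
    """Summarize one provider in the shape of the per-provider gauges."""
    ordered = sorted(latencies_ms)
    total = successes + failures
    return {
        "success_rate": round(100.0 * successes / total, 2) if total else 0.0,
        "avg_latency_ms": sum(ordered) / len(ordered) if ordered else 0.0,
        "p50_latency_ms": percentile(ordered, 50),
        "p95_latency_ms": percentile(ordered, 95),
        "p99_latency_ms": percentile(ordered, 99),
    }
```

The average is sensitive to a few slow requests, while p95 and p99 show how slow the slowest 5% and 1% of requests are. That is why the dashboard surfaces both.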

Overview Dashboard

The Acteon Overview dashboard (acteon-overview.json) provides a high-level view of gateway activity across seven collapsible row sections:

Throughput

Action Throughput (timeseries) -- Dispatched, executed, and failed action rates over time using rate(...[5m]).
Action Outcomes (stacked) (timeseries) -- Stacked area chart showing the breakdown of all outcome types (executed, suppressed, deduplicated, rerouted, throttled, failed).
Totals (stat) -- Absolute counter values for all nine action outcome types. Failed and throttled counters turn red/orange when non-zero.

LLM Guardrail

LLM Guardrail Decisions (timeseries) -- Allowed, denied, and error rates.
LLM Guardrail Totals (stat) -- Absolute counts with sparkline area graphs.

Chains (Workflows)

Chain Throughput (timeseries) -- Started, completed, failed, and cancelled chain rates.
Chain Success Rate (gauge) -- Completed / started ratio. Green above 95%, orange 90-95%, red below 90%.
Chain Totals (stat) -- Absolute chain lifecycle counts.

Circuit Breaker

Circuit Breaker Activity (timeseries) -- Open rejections, state transitions, and fallback reroutes.
Circuit Breaker Totals (stat) -- Absolute counts; non-zero values turn red.

Recurring Actions

Recurring Action Totals (stat) -- Dispatched, errors, and skipped counts. Error count turns red when non-zero.
Recurring Action Rate (timeseries) -- Dispatched and error rates over time.

Quotas & Retention

Quota Totals (stat) -- Exceeded (red when > 0), warned, degraded, and notified counts.
Retention Reaper Totals (stat) -- Deleted, compliance hold (skipped), and error counts.

Embedding Cache

Embedding Cache Hit/Miss Rate (timeseries) -- Topic and text cache hit/miss rates.
Topic Cache Hit Rate (gauge) -- Green above 80%, orange 50-80%, red below 50%.
Text Cache Hit Rate (gauge) -- Same thresholds as topic cache.
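The hit-rate gauges are a ratio of the hit and miss counters. The exact PromQL used by the panels is not reproduced here, so the sketch below only illustrates the arithmetic with a hypothetical `hit_rate_percent` helper, including the case where no lookups have happened yet.

```python
def hit_rate_percent(hits, misses):
    """Cache hit rate as a percentage, or None when there were no lookups."""
    total = hits + misses
    if total == 0:
        # No traffic yet: report "no data" instead of dividing by zero.
        return None
    return 100.0 * hits / total
```

With 80 hits and 20 misses the result is 80.0. With no lookups at all the function returns `None`, which a dashboard would typically render as "No data".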

Provider Health Dashboard

The Acteon Provider Health dashboard (acteon-provider-health.json) provides per-provider observability across three row sections:

Provider Success Rates

Success Rate by Provider (stat) -- Large stat panels showing each provider's success rate percentage. Color-coded: green (>= 99%), yellow (>= 95%), orange (>= 90%), red (< 90%).
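The color bands behave like a list of ascending threshold steps: a value takes the color of the highest step it reaches, and anything below the first step falls back to the base color. The sketch below models that with a hypothetical `color_for` helper, using the thresholds quoted above.

```python
# Base color first, then ascending (threshold, color) steps taken from the
# documented bands: >= 99 green, >= 95 yellow, >= 90 orange, otherwise red.
SUCCESS_RATE_STEPS = [(None, "red"), (90, "orange"), (95, "yellow"), (99, "green")]


def color_for(value, steps=SUCCESS_RATE_STEPS):
    """Return the color of the highest threshold step that `value` reaches."""
    color = steps[0][1]  # base color when no threshold is met
    for threshold, step_color in steps[1:]:
        if value >= threshold:
            color = step_color
    return color
```

For the sample values shown earlier, `color_for(99.12)` is green and `color_for(97.50)` is yellow.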

Request Volume

Request Rate by Provider (timeseries) -- Per-provider request rate using rate(acteon_provider_requests_total[5m]).
Failure Rate by Provider (timeseries) -- Per-provider failure rate.
Total Requests by Provider (stat) -- Absolute request counts per provider.

Latency

Average Latency by Provider (timeseries) -- Average latency in milliseconds per provider over time.
p99 Latency by Provider (timeseries) -- 99th percentile latency per provider over time.
Latency Percentile Summary (table) -- Combined table with columns: Provider, p50 Latency, p95 Latency, p99 Latency, Avg Latency, and Success Rate. The Success Rate column is color-coded by threshold.

Standalone Setup

If you are not using Docker Compose, you can set up monitoring manually.

1. Configure Prometheus

Add an Acteon scrape job to your prometheus.yml:

scrape_configs:
  - job_name: "acteon"
    metrics_path: "/metrics/prometheus"
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:8080"]
        labels:
          service: "acteon-gateway"

For multiple Acteon instances, list all targets:

static_configs:
  - targets:
    - "acteon-1:8080"
    - "acteon-2:8080"
    - "acteon-3:8080"
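When the instance list comes from an inventory or a deployment script, the scrape job can be generated instead of edited by hand. The sketch below uses a hypothetical `render_scrape_job` helper that emits the same structure as the configuration above. It writes the YAML by hand to stay dependency-free, relying on the fact that a JSON string is also a valid YAML double-quoted string.

```python
import json


def render_scrape_job(targets, job_name="acteon", interval="15s"):
    """Render a prometheus.yml scrape job for a list of host:port targets."""
    lines = [
        "scrape_configs:",
        f"  - job_name: {json.dumps(job_name)}",
        '    metrics_path: "/metrics/prometheus"',
        f"    scrape_interval: {interval}",
        "    static_configs:",
        "      - targets:",
    ]
    # json.dumps quotes and escapes each target so the YAML stays well-formed.
    lines += [f"          - {json.dumps(target)}" for target in targets]
    return "\n".join(lines) + "\n"
```

For example, `print(render_scrape_job(["acteon-1:8080", "acteon-2:8080"]))` prints a complete `scrape_configs` section with two targets.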

2. Import Dashboards into Grafana

Import the dashboard JSON files from the deploy/grafana/dashboards/ directory:

Open Grafana and navigate to Dashboards > Import.
Click Upload JSON file and select acteon-overview.json.
Select your Prometheus datasource for the DS_PROMETHEUS variable.
Repeat for acteon-provider-health.json.

Alternatively, use Grafana provisioning by copying the files from deploy/grafana/provisioning/ into your Grafana configuration directory:

/etc/grafana/provisioning/
  datasources/
    prometheus.yml          # Points to your Prometheus instance
  dashboards/
    dashboards.yml          # Points to the dashboard JSON directory

The provisioning datasource config (deploy/grafana/provisioning/datasources/prometheus.yml):

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false

The provisioning dashboard config (deploy/grafana/provisioning/dashboards/dashboards.yml):

apiVersion: 1
providers:
  - name: "Acteon"
    orgId: 1
    folder: "Acteon"
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: false

3. Verify

Open Prometheus at http://localhost:9090/targets and confirm the acteon job shows status UP.
Open Grafana at http://localhost:3000 and navigate to the Acteon folder. Both dashboards should be listed.
Dispatch a few actions through the gateway and watch metrics populate.
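The first check can also be scripted against the Prometheus HTTP API, whose `/api/v1/targets` endpoint reports the health of every active target. The sketch below is a minimal example under stated assumptions: the helper names are hypothetical, the base URL is the local default used on this page, and error handling is omitted.

```python
import json
import urllib.request


def job_is_up(targets_payload, job="acteon"):
    """True when the job has at least one active target and all are healthy."""
    active = targets_payload.get("data", {}).get("activeTargets", [])
    matching = [t for t in active if t.get("labels", {}).get("job") == job]
    return bool(matching) and all(t.get("health") == "up" for t in matching)


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch the target list from a running Prometheus (assumed reachable)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)
```

With the stack running, `job_is_up(fetch_targets())` should return `True` once Prometheus has completed its first successful scrape.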

Customization

Adding Custom Panels

Both dashboards use a DS_PROMETHEUS template variable for the datasource. To add custom panels:

Open a dashboard in Grafana and click Edit.
Add a new panel and select ${DS_PROMETHEUS} as the datasource.
Use any acteon_* metric in your PromQL queries.
Save the dashboard.

Provisioned dashboards are set to editable: true, so they can be modified in the Grafana UI. Grafana reloads file-provisioned dashboards from disk, however, so UI changes may be overwritten. To make changes permanent across deployments, export the modified dashboard JSON and commit it to deploy/grafana/dashboards/.

Useful PromQL Queries

Overall success rate:

(acteon_actions_executed_total / clamp_min(acteon_actions_dispatched_total, 1)) * 100

Action failure rate (5-minute window):

rate(acteon_actions_failed_total[5m]) / clamp_min(rate(acteon_actions_dispatched_total[5m]), 0.001) * 100

Provider with highest p99 latency:

topk(1, acteon_provider_p99_latency_ms)

Rate of quota-exceeded rejections (1-hour window):

rate(acteon_quota_exceeded_total[1h])
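These expressions can also be evaluated outside Grafana through the Prometheus HTTP API's instant query endpoint, `/api/v1/query`. The sketch below shows how to build the request URL and read an instant-vector response. Both helper names are hypothetical, and the base URL is the local default assumed on this page.

```python
from urllib.parse import urlencode


def instant_query_url(expr, base_url="http://localhost:9090"):
    """Build the URL for an instant query; the expression is URL-encoded."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


def vector_values(payload):
    """Extract (labels, float_value) pairs from an instant-vector response."""
    if payload.get("status") != "success":
        raise ValueError("query failed")
    results = payload.get("data", {}).get("result", [])
    # Each sample is {"metric": {...}, "value": [timestamp, "value-as-string"]}.
    return [(item["metric"], float(item["value"][1])) for item in results]
```

Fetching the URL with any HTTP client and passing the decoded JSON to `vector_values` yields one pair per matching series, for instance the single provider returned by the `topk(1, ...)` query.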

Alerting

Grafana supports alerting directly from dashboard panels. Recommended alert rules:

Alert	Condition	Severity
High failure rate	`rate(acteon_actions_failed_total[5m]) > 0.1`	Warning
Provider down	`acteon_provider_success_rate < 50`	Critical
Circuit breaker tripped	`increase(acteon_circuit_open_total[5m]) > 0`	Warning
Quota exceeded	`increase(acteon_quota_exceeded_total[5m]) > 0`	Warning
Retention errors	`increase(acteon_retention_errors_total[5m]) > 0`	Warning
Embedding cache degraded	`acteon_embedding_topic_cache_hits_total / (acteon_embedding_topic_cache_hits_total + acteon_embedding_topic_cache_misses_total) < 0.5`	Info

To configure alerts, edit a panel, switch to the Alert tab, and define thresholds. See the Grafana alerting documentation for details on notification channels and routing.

Configuration Reference

Prometheus Scrape Config

Setting	Default	Description
`scrape_interval`	15s	How often Prometheus scrapes the endpoint
`metrics_path`	`/metrics/prometheus`	Acteon metrics endpoint path
`storage.tsdb.retention.time`	30d	How long Prometheus retains time-series data (set with the --storage.tsdb.retention.time command-line flag, not in prometheus.yml)

Grafana Configuration

Grafana is configured via deploy/grafana/grafana.ini, which is mounted into the container at /etc/grafana/grafana.ini. Key settings:

Section	Key	Default	Description
`[security]`	`admin_user`	`admin`	Grafana admin username
`[security]`	`admin_password`	(encoded)	Grafana admin password
`[users]`	`allow_sign_up`	`false`	Disable self-registration

Dashboard Settings

Both dashboards share these settings:

Setting	Value	Description
Auto-refresh	30s	Dashboard refresh interval
Default time range	Last 1 hour	Initial time window
Timezone	Browser	Respects the viewer's local timezone
Tags	`acteon`	Dashboard tag for search/filtering
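After customizing a dashboard and re-exporting its JSON, a quick check can confirm that these shared settings survived. The sketch below assumes the usual Grafana dashboard JSON field names (`refresh`, `time.from`, `timezone`, `tags`). The `check_dashboard` helper is hypothetical and only inspects an already-loaded dictionary.

```python
def check_dashboard(dashboard):
    """Return a list of deviations from the shared dashboard settings."""
    problems = []
    if dashboard.get("refresh") != "30s":
        problems.append("auto-refresh is not 30s")
    if dashboard.get("time", {}).get("from") != "now-1h":
        problems.append("default time range is not the last 1 hour")
    if dashboard.get("timezone") != "browser":
        problems.append("timezone is not 'browser'")
    if "acteon" not in dashboard.get("tags", []):
        problems.append("missing 'acteon' tag")
    return problems
```

Load a file with `json.load` and pass the result in. An empty list means the export still matches the table above.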

Production Hardening

Authentication

The default Docker Compose setup uses credentials from deploy/grafana/grafana.ini. For production:

Change the admin password in grafana.ini or override it via the Grafana UI on first login.
Enable Grafana's built-in LDAP, OAuth, or SAML authentication.
Consider placing Prometheus behind a reverse proxy with authentication, since it serves the operational data scraped from the /metrics/prometheus endpoint.

# deploy/grafana/grafana.ini
[security]
admin_password = your_production_password

[auth.generic_oauth]
enabled = true
# ... OAuth config

Data Retention

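The configuration reference lists a 30-day retention default for this setup; note that stock Prometheus retains only 15 days unless --storage.tsdb.retention.time is set. Adjust based on your storage budget by passing flags to Prometheus: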

command:
  - "--storage.tsdb.retention.time=90d"
  - "--storage.tsdb.retention.size=10GB"
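To choose values for these flags, the Prometheus documentation gives a rule of thumb: needed disk space is roughly retention time in seconds, times ingested samples per second, times bytes per sample, with about 1 to 2 bytes per sample on average. The sketch below applies that formula. The series count in the usage note is a made-up figure for illustration, so substitute the number your own deployment exposes.

```python
def estimate_disk_bytes(series_count, scrape_interval_s, retention_days,
                        bytes_per_sample=2.0):
    """Rough TSDB size: retention_seconds * samples_per_second * bytes_per_sample."""
    samples_per_second = series_count / scrape_interval_s
    retention_seconds = retention_days * 86400
    return retention_seconds * samples_per_second * bytes_per_sample
```

With an assumed 1,500 series scraped every 15 seconds and 90 days of retention, `estimate_disk_bytes(1500, 15, 90)` comes to about 1.56 GB, comfortably under the 10GB size cap shown above.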

For long-term storage, consider Thanos or Cortex as a remote write backend.

High Availability

For production HA deployments:

Prometheus: Run two independent Prometheus instances scraping the same targets. Use Thanos or similar for deduplication and unified querying.
Grafana: Grafana supports HA mode with a shared PostgreSQL or MySQL database for session/dashboard storage.
Acteon: Each Acteon instance exposes its own /metrics/prometheus endpoint. List all instances as Prometheus targets and use sum(rate(...)) in PromQL to aggregate.
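The order of operations in `sum(rate(...))` matters: the rate is computed per instance first and the results are summed afterwards, so a restart of one instance does not distort the total. The sketch below mirrors that logic in plain Python. The `summed_rate` helper and the instance names are hypothetical.

```python
def summed_rate(previous, current, elapsed_seconds):
    """Sum of per-instance counter rates between two scrapes.

    `previous` and `current` map an instance name to its counter value.
    """
    total = 0.0
    for instance, value in current.items():
        before = previous.get(instance)
        if before is None:
            continue  # instance just appeared; no rate can be computed yet
        # A value lower than before means this instance restarted from zero.
        increase = value - before if value >= before else value
        total += increase / elapsed_seconds
    return total
```

Summing the raw counters first and taking a rate of the sum would instead show a large negative jump whenever a single instance restarts.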

Network Security

Restrict Prometheus scrape access to internal networks only.
Serve the Grafana UI over HTTPS (GF_SERVER_PROTOCOL=https), and use TLS for the connection between Grafana and Prometheus when it crosses an untrusted network.
The /metrics/prometheus endpoint does not require authentication by default. If your Acteon server is publicly accessible, use a reverse proxy or firewall rule to restrict access to the metrics endpoint.

File Layout

deploy/
  grafana/
    grafana.ini                      # Grafana server configuration (auth, security)
    dashboards/
      acteon-overview.json           # Gateway overview dashboard
      acteon-provider-health.json    # Per-provider health dashboard
    provisioning/
      dashboards/
        dashboards.yml               # Dashboard provisioning config
      datasources/
        prometheus.yml               # Prometheus datasource config
  prometheus/
    prometheus.yml                   # Prometheus scrape configuration