Incident Response Pipeline

This guide shows how to use Acteon as an incident response orchestration layer for operations teams. Alerts from monitoring tools flow through Acteon, which enforces suppression of test noise, throttles alert storms, deduplicates repeated firings, batches low-severity alerts, and routes critical incidents through a multi-step triage chain with a war-room sub-chain -- all backed by a full audit trail with field redaction.

Runnable Example

The examples/incident-response-pipeline/ directory contains a complete, runnable setup with PostgreSQL-backed state and audit, safety rules, alert routing, chain orchestration, and incident lifecycle management. Follow the quick start below to have the full pipeline running in minutes.

flowchart LR
    subgraph Monitoring
        D1[Datadog]
        D2[Prometheus]
        D3[CloudWatch]
        D4[Grafana]
    end

    subgraph Acteon
        GW[Gateway]
        RE[Rule Engine]
        CH[Chain Engine]
        CB[Circuit Breakers]
    end

    subgraph Providers
        PD[PagerDuty]
        SL[Slack]
        EM[Email]
        WH[Webhook Fallback]
        TK[Ticket System]
    end

    D1 & D2 & D3 & D4 -->|dispatch| GW
    GW --> RE --> CH --> CB
    CB --> PD & SL & EM & WH & TK
    GW -.->|audit + state| DB[(PostgreSQL)]

The scenario: an operations team receives alerts from multiple monitoring tools (Datadog, Prometheus, CloudWatch, Grafana). Instead of routing each tool directly to PagerDuty or Slack, all alerts flow through Acteon. The rule engine filters noise, prevents alert fatigue, and routes actionable incidents through a triage chain that classifies, escalates, and optionally spins up a war room.


What This Example Exercises

The example exercises 14 Acteon features through a single unified scenario:

| # | Feature | How |
|---|---------|-----|
| 1 | Chains | incident-triage chain: classify → escalate → war-room or ticket |
| 2 | Sub-chains | Critical path invokes war-room-setup sub-chain (3 steps) |
| 3 | Conditional branching | Branch on body.logged after classify; branch on success after escalate |
| 4 | Event state management | Incident lifecycle: open → acknowledged → investigating → resolved |
| 5 | Circuit breakers + fallback | PagerDuty → webhook-fallback; email-alerts → slack-alerts |
| 6 | Recurring actions | Health-check alert every 60 seconds via cron |
| 7 | Data retention | Audit 7 days, events 1 day |
| 8 | Event grouping | Low-severity alerts batched by service, 30s flush window |
| 9 | Quotas | 100 actions/hour for ops-team |
| 10 | Throttle | Alert storm protection: 20/min per tenant |
| 11 | Dedup | Same fingerprint deduplicated within 5 minutes |
| 12 | Suppress | Block alerts from test environments |
| 13 | Modify | Enrich all alerts with pipeline_version metadata |
| 14 | Audit + redaction | Full audit with api_key, webhook_url, pagerduty_key redacted |

Prerequisites

  • PostgreSQL (for durable state + audit)
  • jq (for script output formatting)
  • Rust 1.88+ and Cargo

Quick Start

1. Start PostgreSQL

docker compose --profile postgres up -d

2. Run Database Migrations

scripts/migrate.sh -c examples/incident-response-pipeline/acteon.toml

3. Start Acteon

cargo run -p acteon-server --features postgres -- \
  -c examples/incident-response-pipeline/acteon.toml

Wait until the server logs "Listening on 127.0.0.1:8080".

4. Create API Resources

cd examples/incident-response-pipeline
bash scripts/setup.sh

This creates the following resources via the REST API:

  • Quota: 100 actions/hour for ops-team
  • Retention policy: audit 7 days, events 1 day
  • Recurring action: health-check alert every 60 seconds

5. Fire Sample Alerts

bash scripts/send-alerts.sh

This sends 20 sample alerts covering all categories (see Expected Outcomes).
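
As a rough illustration of what one of these alerts might look like, the sketch below builds a single alert payload and prints it. The field names mirror what the rules in this guide match on (tenant, action_type, payload.severity, payload.environment, payload.service, payload.alert_id). The dispatch path /v1/dispatch, the provider value, the alert ID, and the SEND switch are assumptions made for the sketch; check scripts/send-alerts.sh for the real request shape.

```shell
set -eu

# Assumed base URL and dispatch path; neither is confirmed by this guide.
ACTEON_URL="${ACTEON_URL:-http://localhost:8080}"
DISPATCH_PATH="${DISPATCH_PATH:-/v1/dispatch}"

# Build one alert as JSON. The provider value is a placeholder assumption.
build_alert() {
  printf '{"namespace":"incidents","tenant":"ops-team","provider":"slack-alerts","action_type":"alert","payload":{"alert_id":"%s","service":"%s","severity":"%s","environment":"%s"}}\n' \
    "$1" "$2" "$3" "$4"
}

alert="$(build_alert "alert-001" "database" "critical" "production")"
echo "$alert"

# Only contact the server when SEND=1, so the sketch is safe to run offline.
if [ "${SEND:-0}" = "1" ]; then
  curl -sS -X POST "${ACTEON_URL}${DISPATCH_PATH}" \
    -H "Content-Type: application/json" \
    -d "$alert"
fi
```

Changing the severity to low or the environment to test is enough to steer the same alert toward grouping or suppression instead of the triage chain.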

6. Manage an Incident Lifecycle

bash scripts/manage-incident.sh

This demonstrates event state transitions: open → acknowledged → investigating → resolved.

7. View the Report

bash scripts/show-report.sh

This queries 8 API endpoints and displays: audit trail with outcome breakdown, chain status, event states, provider health, quotas, groups, recurring actions, and retention policies.

8. Cleanup

bash scripts/teardown.sh

Architecture

                    ┌─────────────────────────────┐
  Monitoring  ─────►│      Acteon Gateway          │
  (Datadog,        │                             │
   Prometheus,     │  Rules Engine               │          ┌──────────────────┐
   CloudWatch,     │  ┌─suppress test env───────┐│         │ pagerduty        │
   Grafana)        │  ├─throttle 20/min─────────┤│────────►│ (escalation)     │
                   │  ├─dedup 5min──────────────┤│         ├──────────────────┤
                   │  ├─enrich metadata─────────┤│         │ slack-alerts     │
                   │  ├─chain critical/high─────┤│────────►│ (classification) │
                   │  ├─group low severity──────┤│         ├──────────────────┤
                   │  ├─allow ops actions───────┤│         │ email-alerts     │
                   │  └─deny unmatched──────────┘│────────►│ (notifications)  │
                   │                             │         ├──────────────────┤
                   │  Chain Engine               │         │ webhook-fallback │
                   │  ┌─classify────────────────┐│────────►│ (CB fallback)    │
                   │  ├─escalate────────────────┤│         ├──────────────────┤
                   │  ├─war-room (sub-chain)────┤│         │ ticket-system    │
                   │  └─create-ticket───────────┘│────────►│ (ticketing)      │
                   │                             │         └──────────────────┘
                   │  Background Jobs            │
                   │  ├─group flush (10s)       │         ┌──────────────────┐
                   │  ├─recurring health-check  │         │ PostgreSQL       │
                   │  └─retention reaper        │────────►│ state + audit    │
                   └─────────────────────────────┘         └──────────────────┘

The gateway acts as a single entry point for all monitoring alerts. The rule engine decides what happens to each alert before any provider call is made. This means you can suppress test noise, tune throttle limits, or change routing logic by editing YAML rule files -- no code changes, no redeployment.


Provider Configuration

The acteon.toml configures five providers -- four log providers simulating real integrations plus a webhook fallback for circuit breaker testing:

# PagerDuty: escalation for critical incidents
[[providers]]
name = "pagerduty"
type = "log"

# Slack: alert classification and war-room channel creation
[[providers]]
name = "slack-alerts"
type = "log"

# Email: notification delivery
[[providers]]
name = "email-alerts"
type = "log"

# Webhook fallback: circuit breaker target (intentionally unreachable)
[[providers]]
name = "webhook-fallback"
type = "webhook"
url = "http://localhost:9999/fallback"

# Ticket system: JIRA/ServiceNow-style ticket creation
[[providers]]
name = "ticket-system"
type = "log"

Note

Log providers return {"provider": "<name>", "logged": true}. The chain engine uses body.logged == true for conditional branching. In production, replace these with real provider types (e.g., type = "webhook" for PagerDuty's Events API).


Rule Design

Rules are split across three files by concern:

Triage Rules (triage.yaml)

Triage rules handle the first line of defense -- suppressing noise and routing actionable alerts:

rules:
  # Block all alerts from test environments
  - name: suppress-test-env
    priority: 1
    description: "Block all alerts originating from test environments"
    condition:
      all:
        - field: action.tenant
          eq: "ops-team"
        - field: action.payload.environment
          eq: "test"
    action:
      type: suppress

  # Limit ops-team to 20 alerts per minute to prevent alert fatigue
  - name: throttle-alert-storm
    priority: 2
    description: "Limit ops-team to 20 alerts per minute"
    condition:
      field: action.tenant
      eq: "ops-team"
    action:
      type: throttle
      max_count: 20
      window_seconds: 60

  # Start incident triage chain for critical and high severity alerts
  - name: trigger-incident-triage
    priority: 5
    description: "Start incident triage chain for critical and high severity"
    condition:
      all:
        - field: action.tenant
          eq: "ops-team"
        - field: action.action_type
          eq: "alert"
        - field: action.payload.severity
          in_list: ["critical", "high"]
    action:
      type: chain
      chain: "incident-triage"

Routing Rules (routing.yaml)

Routing rules handle deduplication, metadata enrichment, grouping, and the default allow:

rules:
  # Deduplicate alerts sharing the same dedup_key within 5 minutes
  - name: dedup-alerts
    priority: 3
    condition:
      all:
        - field: action.tenant
          eq: "ops-team"
        - field: action.action_type
          eq: "alert"
    action:
      type: deduplicate
      ttl_seconds: 300

  # Add pipeline_version metadata to all ops-team actions
  - name: enrich-alert-metadata
    priority: 4
    condition:
      field: action.tenant
      eq: "ops-team"
    action:
      type: modify
      changes:
        pipeline_version: "1.0.0"

  # Batch low-severity alerts by service, flush every 30 seconds
  - name: group-low-severity
    priority: 6
    condition:
      all:
        - field: action.tenant
          eq: "ops-team"
        - field: action.action_type
          eq: "alert"
        - field: action.payload.severity
          eq: "low"
    action:
      type: group
      group_by:
        - payload.service
      group_wait_seconds: 30

  # Allow all remaining ops-team actions
  - name: allow-ops-actions
    priority: 10
    condition:
      field: action.tenant
      eq: "ops-team"
    action:
      type: allow
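
To make the dedup-alerts rule more concrete, here is a small model of TTL-based deduplication. It is only a sketch: it assumes the window is measured from the first time a key is seen and that keys contain no spaces or "=" characters. Acteon's actual bookkeeping lives in the state backend and may differ.

```shell
set -eu

TTL=300   # matches ttl_seconds in dedup-alerts
seen=""   # space-separated "key=first_seen_epoch" entries
result=""

# dedup KEY NOW: sets and prints "executed" or "deduplicated".
dedup() {
  key="$1"; now="$2"
  for entry in $seen; do
    k="${entry%%=*}"
    t="${entry##*=}"
    if [ "$k" = "$key" ] && [ $((now - t)) -lt "$TTL" ]; then
      result="deduplicated"; echo "$result"; return 0
    fi
  done
  seen="$seen $key=$now"
  result="executed"; echo "$result"
}

dedup db-down 0     # first firing
dedup db-down 120   # inside the 5-minute window
dedup db-down 200   # still inside the window
dedup db-down 400   # window expired, treated as new
```

The first three calls reproduce the "1 executed, 2 deduplicated" pattern that the duplicate alerts in send-alerts.sh are expected to show.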

Safety Rules (safety.yaml)

A catch-all rule ensures nothing slips through:

rules:
  # Block any action not matched by a higher-priority rule
  - name: deny-unmatched
    priority: 100
    condition:
      field: action.tenant
      eq: "ops-team"
    action:
      type: suppress

Rule Evaluation Order

Rules are evaluated by priority (lowest number = highest priority). The first matching terminal rule determines the outcome:

| Priority | Rule | File | Action |
|----------|------|------|--------|
| 1 | suppress-test-env | triage.yaml | Suppress |
| 2 | throttle-alert-storm | triage.yaml | Throttle 20/min |
| 3 | dedup-alerts | routing.yaml | Deduplicate 5min |
| 4 | enrich-alert-metadata | routing.yaml | Modify (add metadata) |
| 5 | trigger-incident-triage | triage.yaml | Chain → incident-triage |
| 6 | group-low-severity | routing.yaml | Group by service, 30s |
| 10 | allow-ops-actions | routing.yaml | Allow |
| 100 | deny-unmatched | safety.yaml | Suppress (catch-all) |
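
The priority walk can be approximated with a few lines of shell. This is a simplified model, not Acteon's evaluator: it assumes the throttle, dedup, and modify rules pass an alert on when they do not trigger, and it ignores their state entirely. Under those assumptions it reproduces the per-category outcomes listed under Expected Outcomes.

```shell
set -eu

# evaluate ENVIRONMENT SEVERITY: print the rule that decides a single ops-team alert.
evaluate() {
  environment="$1"; severity="$2"
  if [ "$environment" = "test" ]; then
    echo "suppress (suppress-test-env, priority 1)"
    return 0
  fi
  case "$severity" in
    critical|high) echo "chain incident-triage (trigger-incident-triage, priority 5)" ;;
    low)           echo "group by service (group-low-severity, priority 6)" ;;
    *)             echo "allow (allow-ops-actions, priority 10)" ;;
  esac
}

evaluate test critical
evaluate production high
evaluate production low
evaluate production medium
```

One thing the model makes visible: allow-ops-actions and deny-unmatched share the same condition, so under first-match semantics an ops-team action would reach the allow rule before the catch-all ever applies.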

Chain Orchestration

Main Chain: incident-triage

The incident-triage chain handles critical and high-severity alerts through a multi-step triage process with conditional branching and a sub-chain:

flowchart TD
    A[classify<br/>slack-alerts] --> B{body.logged == true?}
    B -->|yes| C[escalate<br/>pagerduty]
    B -->|no / default| C
    C --> D{success == true?}
    D -->|yes| E[war-room<br/>sub-chain: war-room-setup]
    D -->|no / default| F[create-ticket<br/>ticket-system]

[[chains.definitions]]
name = "incident-triage"
timeout_seconds = 300

[[chains.definitions.steps]]
name = "classify"
provider = "slack-alerts"
action_type = "classify_alert"
payload_template = { alert_id = "{{origin.payload.alert_id}}", severity = "{{origin.payload.severity}}" }

  [[chains.definitions.steps.branches]]
  field = "body.logged"
  operator = "eq"
  value = true
  target = "escalate"

[[chains.definitions.steps]]
name = "escalate"
provider = "pagerduty"
action_type = "create_incident"
payload_template = { service = "{{origin.payload.service}}", severity = "{{origin.payload.severity}}" }
# Step-level key: must appear before the branches table header below
default_next = "create-ticket"

  [[chains.definitions.steps.branches]]
  field = "success"
  operator = "eq"
  value = true
  target = "war-room"

[[chains.definitions.steps]]
name = "war-room"
sub_chain = "war-room-setup"

[[chains.definitions.steps]]
name = "create-ticket"
provider = "ticket-system"
action_type = "create_ticket"
payload_template = { alert_id = "{{origin.payload.alert_id}}", service = "{{origin.payload.service}}" }

Sub-Chain: war-room-setup

When escalation succeeds, the chain branches to a sub-chain that creates a dedicated Slack channel, pages on-call engineers, and opens a tracking ticket:

[[chains.definitions]]
name = "war-room-setup"
timeout_seconds = 120

[[chains.definitions.steps]]
name = "create-channel"
provider = "slack-alerts"
action_type = "create_channel"
payload_template = { name = "inc-{{origin.payload.alert_id}}" }

[[chains.definitions.steps]]
name = "page-oncall"
provider = "pagerduty"
action_type = "page_oncall"
payload_template = { urgency = "high", service = "{{origin.payload.service}}" }

[[chains.definitions.steps]]
name = "open-ticket"
provider = "ticket-system"
action_type = "create_ticket"
payload_template = { alert_id = "{{origin.payload.alert_id}}", type = "war-room" }

Sub-chains are first-class chain definitions. The parent chain's war-room step uses sub_chain = "war-room-setup" instead of a provider. When the sub-chain completes, execution returns to the parent chain.
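
The resulting step order can be traced with a short sketch that follows the flowchart above. It models only the escalate branch, since both outcomes of classify lead to escalate, and it is an illustration of the documented paths rather than the chain engine's logic.

```shell
set -eu

# triage_path ESCALATE_SUCCESS: print the steps a triaged alert passes through.
triage_path() {
  escalate_success="$1"
  path="classify -> escalate"
  if [ "$escalate_success" = "true" ]; then
    # war-room runs the war-room-setup sub-chain and its three steps
    path="$path -> war-room [create-channel -> page-oncall -> open-ticket]"
  else
    # default_next on the escalate step
    path="$path -> create-ticket"
  fi
  echo "$path"
}

triage_path true
triage_path false
```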


Circuit Breaker Fallbacks

Two circuit breaker fallback paths protect against provider outages:

[circuit_breaker]
enabled = true
failure_threshold = 3
success_threshold = 1
recovery_timeout_seconds = 30

# PagerDuty trips after 2 failures, falls back to webhook
[circuit_breaker.providers.pagerduty]
failure_threshold = 2
recovery_timeout_seconds = 60
fallback_provider = "webhook-fallback"

# Email trips after 2 failures, falls back to Slack
[circuit_breaker.providers.email-alerts]
failure_threshold = 2
recovery_timeout_seconds = 60
fallback_provider = "slack-alerts"

Testing Circuit Breaker Behavior

The webhook-fallback provider intentionally targets http://localhost:9999/fallback, which is not running by default. If PagerDuty trips its circuit breaker after 2 failures, the fallback will also fail -- demonstrating cascading circuit breaker behavior in the report.

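To see successful fallback routing, start a simple HTTP server before running:

python3 -m http.server 9999 &

Python's built-in server handles only GET and HEAD and answers POST with 501, so if the webhook provider delivers with POST (typical for webhooks), substitute any listener on port 9999 that returns a 2xx response to POST.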

After recovery_timeout_seconds (60s), the circuit breaker enters half-open state and allows one probe request through. If it succeeds, the circuit closes and normal routing resumes.
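
The closed, open, and half-open cycle can be sketched as a tiny state machine using the pagerduty settings above. This is the textbook circuit breaker pattern under two assumptions: a success in the closed state resets the failure count, and the global success_threshold of 1 applies to pagerduty. Acteon's implementation may track failures differently.

```shell
set -eu

FAILURE_THRESHOLD=2   # circuit_breaker.providers.pagerduty
SUCCESS_THRESHOLD=1   # inherited from [circuit_breaker] (assumption)
state="closed"; failures=0; successes=0

# record EVENT, where EVENT is success, failure, or timeout_elapsed.
record() {
  case "$state:$1" in
    closed:failure)
      failures=$((failures + 1))
      if [ "$failures" -ge "$FAILURE_THRESHOLD" ]; then state="open"; fi ;;
    closed:success)
      failures=0 ;;
    open:timeout_elapsed)
      state="half-open"; successes=0 ;;
    half-open:success)
      successes=$((successes + 1))
      if [ "$successes" -ge "$SUCCESS_THRESHOLD" ]; then state="closed"; failures=0; fi ;;
    half-open:failure)
      state="open" ;;
  esac
  echo "$1 -> $state"
}

record failure          # still closed
record failure          # threshold reached, circuit opens
record timeout_elapsed  # recovery timeout passed, probe allowed
record success          # probe succeeded, circuit closes
```

While the circuit is open, requests for pagerduty would go to the configured fallback_provider instead.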


Event State Management

The manage-incident.sh script demonstrates Acteon's event state machine, which tracks incidents through their lifecycle:

stateDiagram-v2
    [*] --> open : dispatch creates event
    open --> acknowledged : operator ACKs
    acknowledged --> investigating : team begins work
    investigating --> resolved : issue fixed
    resolved --> [*]

Each transition is driven by an API call:

# Acknowledge an incident
curl -X PUT "http://localhost:8080/v1/events/incident-db-001/transition" \
  -H "Content-Type: application/json" \
  -d '{"to": "acknowledged", "namespace": "incidents", "tenant": "ops-team"}'

# Begin investigation
curl -X PUT "http://localhost:8080/v1/events/incident-db-001/transition" \
  -H "Content-Type: application/json" \
  -d '{"to": "investigating", "namespace": "incidents", "tenant": "ops-team"}'

# Resolve the incident
curl -X PUT "http://localhost:8080/v1/events/incident-db-001/transition" \
  -H "Content-Type: application/json" \
  -d '{"to": "resolved", "namespace": "incidents", "tenant": "ops-team"}'
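
Since the three calls differ only in the target state, they can be folded into a loop. The sketch below reuses the endpoint and body shown above; the SEND switch and the ACTEON_URL and EVENT_ID variables are conveniences introduced here, so by default it only prints the requests it would make.

```shell
set -eu

ACTEON_URL="${ACTEON_URL:-http://localhost:8080}"
EVENT_ID="${EVENT_ID:-incident-db-001}"

# transition STATE: move the event to STATE, or print the request when SEND is not 1.
transition() {
  to="$1"
  body="{\"to\": \"$to\", \"namespace\": \"incidents\", \"tenant\": \"ops-team\"}"
  if [ "${SEND:-0}" = "1" ]; then
    curl -sS -X PUT "$ACTEON_URL/v1/events/$EVENT_ID/transition" \
      -H "Content-Type: application/json" \
      -d "$body"
  else
    echo "PUT $ACTEON_URL/v1/events/$EVENT_ID/transition $body"
  fi
}

for next in acknowledged investigating resolved; do
  transition "$next"
done
```

Run it with SEND=1 against a running server to perform the transitions for real.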

Background Processing

The [background] section enables five background processors:

| Processor | Interval | Purpose |
|-----------|----------|---------|
| Group flush | 10s | Flush accumulated low-severity alert batches |
| Timeout processing | 10s | Cancel chains that exceed timeout_seconds |
| Cleanup | 60s | Remove completed chains older than completed_chain_ttl_seconds |
| Recurring actions | 30s | Check and execute * * * * * health-check |
| Retention reaper | 60s | Delete expired audit records and event state |

[background]
enabled = true
group_flush_interval_seconds = 10
timeout_check_interval_seconds = 10
cleanup_interval_seconds = 60
enable_group_flush = true
enable_timeout_processing = true
enable_recurring_actions = true
recurring_check_interval_seconds = 30
max_recurring_actions_per_tenant = 10
enable_retention_reaper = true
retention_check_interval_seconds = 60
namespace = "incidents"
tenant = "ops-team"

Audit Trail and Redaction

Every dispatched alert is recorded in the PostgreSQL audit backend with full outcome details. Sensitive fields are automatically redacted:

[audit]
enabled = true
backend = "postgres"
url = "postgres://localhost:5432/acteon"
store_payload = true
ttl_seconds = 604800  # 7 days

[audit.redact]
enabled = true
fields = ["api_key", "webhook_url", "pagerduty_key"]
placeholder = "[REDACTED]"
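
To show the effect of this configuration on a stored payload, the sketch below applies the same field list and placeholder to a JSON string with sed. It is an illustration only, not how Acteon performs redaction: it handles flat string values without escaped quotes, and the sample payload is made up.

```shell
set -eu

# redact: replace the values of the configured fields with the placeholder.
redact() {
  sed -E 's/"(api_key|webhook_url|pagerduty_key)"[[:space:]]*:[[:space:]]*"[^"]*"/"\1": "[REDACTED]"/g'
}

printf '%s\n' '{"service": "database", "api_key": "sk-123", "pagerduty_key": "pd-456"}' | redact
```

Fields that are not in the list, such as service here, pass through unchanged.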

Query the audit trail to see what happened:

# All dispatches
curl -s "http://localhost:8080/v1/audit?namespace=incidents&tenant=ops-team&limit=50" | jq .

# Only suppressed actions
curl -s "http://localhost:8080/v1/audit?namespace=incidents&tenant=ops-team&outcome=suppressed" | jq .

# Only chain-started actions
curl -s "http://localhost:8080/v1/audit?namespace=incidents&tenant=ops-team&outcome=chain_started" | jq .

Expected Outcomes

When running send-alerts.sh, you should see these outcomes:

| Alerts | Count | Expected Outcome |
|--------|-------|------------------|
| Critical (database, api-gateway) | 2 | chain_started (incident-triage + war-room sub-chain) |
| High (cache-layer) | 3 | chain_started (incident-triage, simpler path) |
| Low (cdn, search, auth) | 5 | grouped (batched by service, 30s flush) |
| Storm (rapid-fire medium) | 5 | Some throttled (if >20/min reached) |
| Duplicate (same dedup_key) | 3 | 1 executed, 2 deduplicated |
| Test environment | 2 | suppressed |

The exact counts for throttling depend on timing -- if all 20 alerts are sent within one minute, the throttle limit of 20/min may cause later alerts to be throttled.
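
The arithmetic behind that timing dependence can be sketched with a fixed-window model. This is an assumption about how the window is counted (the first max_count actions in a window pass, the rest are throttled); Acteon may use a sliding window or count only some outcomes toward the limit.

```shell
set -eu

# throttled_count SENT MAX_COUNT: actions rejected within one window.
throttled_count() {
  sent="$1"; max_count="$2"
  if [ "$sent" -gt "$max_count" ]; then
    echo $((sent - max_count))
  else
    echo 0
  fi
}

throttled_count 20 20   # exactly at the limit
throttled_count 23 20   # three actions over the limit
```

Under this model, throttling only shows up once more than 20 counted actions land in the same 60-second window, for example when the script is run twice in quick succession.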


File Structure

incident-response-pipeline/
├── acteon.toml              # Server config (providers, chains, circuit breakers, audit)
├── rules/
│   ├── triage.yaml          # Suppress test, throttle storms, route to chains
│   ├── routing.yaml         # Dedup, enrich metadata, group low-severity, allow
│   └── safety.yaml          # Catch-all suppress
├── scripts/
│   ├── setup.sh             # Create quotas + retention + recurring via API
│   ├── send-alerts.sh       # Fire 20 sample alerts exercising all features
│   ├── manage-incident.sh   # Transition events through lifecycle via API
│   ├── show-report.sh       # Query audit/chains/events/health/quotas/groups
│   └── teardown.sh          # Clean up API-created resources
└── README.md

Extending the Pipeline

Adding Real PagerDuty Integration

Replace the log provider with a webhook targeting PagerDuty's Events API:

[[providers]]
name = "pagerduty"
type = "webhook"
url = "https://events.pagerduty.com/v2/enqueue"
headers = { "Content-Type" = "application/json" }

Then update the chain step payloads to match PagerDuty's event format:

[[chains.definitions.steps]]
name = "escalate"
provider = "pagerduty"
action_type = "create_incident"
# Keep any other step-level keys above this sub-table header
[chains.definitions.steps.payload_template]
routing_key = "{{origin.metadata.pd_routing_key}}"
event_action = "trigger"

[chains.definitions.steps.payload_template.payload]
summary = "{{origin.payload.alert_id}}: {{origin.payload.service}}"
severity = "{{origin.payload.severity}}"
source = "acteon-pipeline"

Adding Slack Webhook Notifications

Add a real Slack webhook for low-severity alert batches:

[[providers]]
name = "slack-alerts"
type = "webhook"
url = "https://hooks.slack.com/services/T00/B00/xxx"
headers = { "Content-Type" = "application/json" }

Adding an Escalation Timer

Use a recurring action to check for incidents that have been open longer than a threshold:

curl -X POST "http://localhost:8080/v1/recurring" \
  -H "Content-Type: application/json" \
  -d '{
    "namespace": "incidents",
    "tenant": "ops-team",
    "cron_expr": "*/5 * * * *",
    "timezone": "UTC",
    "enabled": true,
    "action_template": {
      "provider": "slack-alerts",
      "action_type": "escalation_check",
      "payload": {"check": "stale_incidents", "threshold_minutes": 30}
    },
    "description": "Check for stale incidents every 5 minutes"
  }'

Production Considerations

High Availability

For production, run multiple Acteon instances behind a load balancer. PostgreSQL state ensures consistency across instances for dedup, throttle, and event state:

[state]
backend = "postgres"
url = "postgres://pgbouncer:6432/acteon"

[executor]
max_retries = 3
timeout_seconds = 30
max_concurrent = 64

Monitoring

Use the Grafana dashboards to monitor:

  • Alert suppression and throttle rates
  • Chain completion times and failure rates
  • Circuit breaker state transitions
  • Quota usage per tenant
  • Event state distribution (open vs acknowledged vs resolved)

Alert Routing by Source

Add rules that route alerts differently based on their monitoring source:

- name: datadog-critical-to-pagerduty
  priority: 5
  condition:
    all:
      - field: action.metadata.source
        eq: "datadog"
      - field: action.payload.severity
        in_list: ["critical", "high"]
  action:
    type: chain
    chain: "incident-triage"

- name: cloudwatch-to-slack
  priority: 7
  condition:
    all:
      - field: action.metadata.source
        eq: "cloudwatch"
      - field: action.payload.severity
        eq: "medium"
  action:
    type: reroute
    target_provider: "slack-alerts"

Comparison: Acteon vs Custom Incident Pipeline

| Capability | Custom (PagerDuty + Lambda + SNS) | Acteon |
|------------|-----------------------------------|--------|
| Alert dedup | PagerDuty dedup keys (limited) | Configurable TTL, any field |
| Throttle / rate limit | Custom Lambda logic | YAML rule, per-tenant |
| Alert grouping | PagerDuty intelligent grouping | Configurable group-by fields + flush window |
| Circuit breaker fallback | Custom health checks + routing | Built-in with automatic recovery |
| Multi-step triage | Step Functions or custom code | Built-in chains with branching |
| Incident lifecycle | PagerDuty-only | Provider-agnostic state machine |
| Audit trail | CloudWatch logs (unstructured) | Structured audit with field redaction |
| Test noise suppression | Separate PagerDuty service | YAML rule by environment field |
| Configuration changes | Code deploy + PagerDuty UI | YAML edit, no redeployment |
| Recurring health checks | CloudWatch Events + Lambda | Built-in recurring actions with cron |

Acteon replaces the "glue" between monitoring tools and notification providers. Instead of building custom Lambda functions for dedup, throttle, routing, and escalation logic, you declare these behaviors in YAML rules and TOML configuration.