LLM Guardrails¶
LLM guardrails use AI-powered evaluation to gate actions based on their content: matching actions are sent to an LLM, and are blocked or flagged based on its assessment, with configurable confidence thresholds.
How It Works¶
```mermaid
flowchart LR
    A[Action] --> B{LLM Guardrail Rule?}
    B -->|Match| C[Send to LLM]
    C --> D{LLM Decision}
    D -->|Allowed| E[Continue Pipeline]
    D -->|Blocked| F[Suppress Action]
    D -->|Flagged| G[Route to Review Queue]
    B -->|No match| E
```

- An action matching an LLM guardrail rule is sent to the configured LLM endpoint.
- The LLM evaluates the action's payload against a system prompt.
- Based on the response (allowed, blocked, or flagged) and the configured policy, the action proceeds, is suppressed, or is routed for review.
Configuration¶
Server Configuration¶
```toml
[llm_guardrail]
endpoint = "https://api.openai.com/v1/chat/completions"
model = "gpt-4"
api_key_env = "OPENAI_API_KEY"  # Read API key from environment
policy = "block"                # "block" or "flag"
temperature = 0.0
max_tokens = 256
```
Rule Configuration¶
```yaml
rules:
  - name: content-safety-check
    priority: 1
    description: "Check message content for policy violations"
    condition:
      field: action.action_type
      eq: "send_message"
    action:
      type: llm_guardrail
      evaluator_name: "content-safety"
      block_on_flag: true
      send_to: "review-queue"
```
Parameters¶
| Parameter | Type | Required | Description |
|---|---|---|---|
| `evaluator_name` | string | Yes | Identifier for the LLM evaluator |
| `block_on_flag` | bool | No | Whether to block the action when flagged |
| `send_to` | string | No | Provider to route flagged actions to |
LLM Evaluator Interface¶
```rust
#[async_trait]
pub trait LlmEvaluator: Send + Sync {
    async fn evaluate(&self, action: &Action) -> Result<LlmGuardrailResponse>;
}

pub struct LlmGuardrailResponse {
    pub allowed: bool,      // Whether the action passes
    pub reasoning: String,  // Explanation of the decision
    pub confidence: f32,    // 0.0 to 1.0
}
```
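A minimal sketch of a custom evaluator, assuming `Action` exposes its payload as a string (the `payload()` accessor and the struct below are illustrative, not the crate's confirmed API):

```rust
// Hypothetical keyword-based evaluator; `action.payload()` is an
// assumed accessor and this type is illustrative, not built in.
pub struct KeywordEvaluator {
    blocked_terms: Vec<String>,
}

#[async_trait]
impl LlmEvaluator for KeywordEvaluator {
    async fn evaluate(&self, action: &Action) -> Result<LlmGuardrailResponse> {
        let payload = action.payload().to_lowercase(); // assumed accessor
        let hit = self.blocked_terms.iter().any(|t| payload.contains(t.as_str()));
        Ok(LlmGuardrailResponse {
            allowed: !hit,
            reasoning: if hit {
                "matched a blocked term".to_string()
            } else {
                "no blocked terms found".to_string()
            },
            confidence: 1.0, // deterministic check, so full confidence
        })
    }
}
```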
Built-in Evaluators¶
| Evaluator | Description |
|---|---|
| `HttpLlmEvaluator` | Calls an OpenAI-compatible API |
| `MockLlmEvaluator` | Always allows (for testing) |
| `CapturingLlmEvaluator` | Captures all calls for test assertions |
| `FailingLlmEvaluator` | Simulates LLM failures |
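In unit tests, the mock evaluators can stand in for a live endpoint. A sketch, assuming `MockLlmEvaluator` and a test `Action` can be built via `Default` (both constructors are assumptions, not documented API):

```rust
#[tokio::test]
async fn guardrail_allows_with_mock() {
    let evaluator = MockLlmEvaluator::default(); // assumed constructor
    let action = Action::default();              // assumed test fixture
    let response = evaluator.evaluate(&action).await.unwrap();
    assert!(response.allowed); // MockLlmEvaluator always allows
}
```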
Policy Modes¶
Block Mode¶
When `policy = "block"`, actions flagged by the LLM are suppressed; a minimal sketch of the relevant setting:
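```toml
[llm_guardrail]
policy = "block"  # flagged actions are suppressed (ActionOutcome::Suppressed)
```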
Flag Mode¶
When `policy = "flag"`, flagged actions continue through the pipeline and are routed to the review queue named by the rule's `send_to`; a minimal sketch:
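```toml
[llm_guardrail]
policy = "flag"  # flagged actions proceed, routed via the rule's send_to
```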
Use Cases¶
Content Moderation¶
Check user-generated content before sending:
```yaml
- name: moderate-messages
  condition:
    field: action.action_type
    eq: "send_user_message"
  action:
    type: llm_guardrail
    evaluator_name: "content-safety"
    block_on_flag: true
```
PII Detection¶
Flag actions containing personally identifiable information:
```yaml
- name: pii-check
  condition:
    field: action.provider
    eq: "external-api"
  action:
    type: llm_guardrail
    evaluator_name: "pii-detector"
    block_on_flag: false
    send_to: "compliance-review"
```
Prompt Injection Prevention¶
Protect LLM-targeted actions from prompt injection:
```yaml
- name: prompt-injection-guard
  condition:
    field: action.provider
    eq: "llm-gateway"
  action:
    type: llm_guardrail
    evaluator_name: "injection-detector"
    block_on_flag: true
```
Failure Handling¶
When the LLM evaluator errors — timeout, HTTP failure, JSON parse error, provider rate limit — the gateway must decide whether to let the action through or block it. The fail_open parameter on [llm_guardrail] controls that choice:
| Mode | Behavior on evaluator error | Counter that increments |
|---|---|---|
| `fail_open = false` (default, fail-closed) | Action is denied (`ActionOutcome::Suppressed`) with reason `"LLM guardrail unavailable: <error>"` | `acteon_llm_guardrail_errors_total` |
| `fail_open = true` (fail-open) | Action proceeds as if the guardrail allowed it | `acteon_llm_guardrail_errors_total` |
Either way, acteon_llm_guardrail_errors_total increments — the counter measures evaluator availability, not the dispatch outcome. The deny vs. allow decision is what fail_open flips.
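Illustrative pseudocode of that decision point (the `Outcome` enum and function shape are simplified stand-ins, not Acteon's actual internals):

```rust
// Simplified sketch of the guardrail decision; names are illustrative.
enum Outcome {
    Continue,
    Suppressed(String),
}

async fn apply_guardrail(
    evaluator: &dyn LlmEvaluator,
    action: &Action,
    fail_open: bool,
) -> Outcome {
    match evaluator.evaluate(action).await {
        // Evaluator reached a verdict: honor it.
        Ok(resp) if resp.allowed => Outcome::Continue,
        Ok(resp) => Outcome::Suppressed(resp.reasoning),
        // Evaluator errored (timeout, HTTP failure, parse error).
        // acteon_llm_guardrail_errors_total increments on both arms;
        // fail_open only decides whether the action still proceeds.
        Err(_) if fail_open => Outcome::Continue,
        Err(e) => Outcome::Suppressed(format!("LLM guardrail unavailable: {e}")),
    }
}
```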
Why fail-closed is the default. Operators who turn the guardrail on are explicitly opting into a content-safety check. If the evaluator is unreachable, the natural mental model is "the check didn't run, so the action should be blocked." Fail-open delivers the opposite: the check silently doesn't run and the action proceeds. An attacker who can force evaluator timeouts (oversized prompts, prompt-injection that triggers a long thought chain, regional outages) gets a free pass past the guard.
When to set fail_open = true. Override the default if the cost of a missed action exceeds the cost of a security bypass — for example, when the guardrail is an advisory check on an internal-only flow that must not stall on third-party LLM availability. Document the choice in your config (with a comment) so a future operator can tell the deviation from default was deliberate.
```toml
[llm_guardrail]
enabled = true
endpoint = "https://api.openai.com/v1/chat/completions"
api_key = "ENC[...]"
policy = "flag"   # Internal advisory only: availability over correctness.
fail_open = true  # Deliberately fail-open: this guardrail is advisory
                  # and must not block the action if the LLM is down.
```
Migration note. Acteon ≤ 0.1.x defaulted `fail_open` to `true`. This release flips the default to `false`, aligning with the principle of least surprise for security-oriented deployments. If you were relying on the implicit fail-open behavior, set `fail_open = true` explicitly in your `acteon.toml` to preserve it.
Monitoring¶
Prometheus Metrics¶
The guardrail emits three counters via GET /metrics/prometheus (and as JSON at GET /metrics):
| Metric | Counted on |
|---|---|
| `acteon_llm_guardrail_allowed_total` | Evaluator returned Allow (action passes through) |
| `acteon_llm_guardrail_denied_total` | Evaluator returned Deny, or Flag with `block_on_flag = true` (action suppressed) |
| `acteon_llm_guardrail_errors_total` | Evaluator errored: timeout, HTTP failure from the LLM, JSON parse error on the response, etc. |
Grafana. The bundled acteon-overview dashboard has an "LLM Guardrail" row with a decisions rate timeseries and a stat panel for the totals.
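The decisions-rate panel can be approximated with one `rate` query per counter (this layout is an illustration, not the dashboard's exact definition):

```promql
# Three queries on one timeseries panel, one per verdict:
rate(acteon_llm_guardrail_allowed_total[5m])
rate(acteon_llm_guardrail_denied_total[5m])
rate(acteon_llm_guardrail_errors_total[5m])
```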
What to alert on. Alerting on errors is the primary security-critical signal, not the deny ratio. In fail-open configurations the guardrail lets actions through when the evaluator errors, so an attacker who can force timeouts (large inputs, upstream LLM slowness) quietly bypasses the guard. The deny ratio in that attack stays flat or even drops because denied doesn't grow while allowed + errors keep going up. Page on errors first:
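A minimal expression (the threshold and window are illustrative; tune them to your traffic):

```promql
# Any sustained evaluator error rate warrants a page, especially fail-open.
rate(acteon_llm_guardrail_errors_total[5m]) > 0
```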
For baseline health — is the evaluator denying about as often as expected, or has something drifted? — compute deny prevalence against all evaluated traffic (include errors in the denominator). The + 1e-9 guards against division-by-zero NaN during quiet periods, which Grafana would otherwise render as "No Data" and hide the alert entirely:
```promql
rate(acteon_llm_guardrail_denied_total[5m])
/
(rate(acteon_llm_guardrail_allowed_total[5m])
  + rate(acteon_llm_guardrail_denied_total[5m])
  + rate(acteon_llm_guardrail_errors_total[5m])
  + 1e-9) > 0.2
```
A sustained non-zero denied rate on rules targeting external input surfaces (public webhooks, customer-facing dispatch) is still worth investigating — prompt-injection attempts that the evaluator successfully catches show up here — but treat it as a secondary signal. The errors alert above is what catches an actual bypass.