LLM Guardrails¶
LLM guardrails use AI-powered evaluation to gate actions based on their content: matching actions are sent to an LLM, and are blocked or flagged based on its assessment, with configurable confidence thresholds.
How It Works¶
```mermaid
flowchart LR
    A[Action] --> B{LLM Guardrail Rule?}
    B -->|Match| C[Send to LLM]
    C --> D{LLM Decision}
    D -->|Allowed| E[Continue Pipeline]
    D -->|Blocked| F[Suppress Action]
    D -->|Flagged| G[Route to Review Queue]
    B -->|No match| E
```

- An action matching an LLM guardrail rule is sent to the configured LLM endpoint.
- The LLM evaluates the action's payload against a system prompt.
- Based on the response (allowed, blocked, or flagged) and the configured policy, the action proceeds, is suppressed, or is routed for review.
Configuration¶
Server Configuration¶
```toml
[llm_guardrail]
endpoint = "https://api.openai.com/v1/chat/completions"
model = "gpt-4"
api_key_env = "OPENAI_API_KEY"  # Read API key from environment
policy = "block"                # "block" or "flag"
temperature = 0.0
max_tokens = 256
```
Rule Configuration¶
```yaml
rules:
  - name: content-safety-check
    priority: 1
    description: "Check message content for policy violations"
    condition:
      field: action.action_type
      eq: "send_message"
    action:
      type: llm_guardrail
      evaluator_name: "content-safety"
      block_on_flag: true
      send_to: "review-queue"
```
Parameters¶
| Parameter | Type | Required | Description |
|---|---|---|---|
| `evaluator_name` | string | Yes | Identifier for the LLM evaluator |
| `block_on_flag` | bool | No | Whether to block the action when flagged |
| `send_to` | string | No | Provider to route flagged actions to |
LLM Evaluator Interface¶
```rust
#[async_trait]
pub trait LlmEvaluator: Send + Sync {
    async fn evaluate(&self, action: &Action) -> Result<LlmGuardrailResponse>;
}

pub struct LlmGuardrailResponse {
    pub allowed: bool,      // Whether the action passes
    pub reasoning: String,  // Explanation of the decision
    pub confidence: f32,    // 0.0 to 1.0
}
```
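A minimal sketch of a custom evaluator, assuming `Action` exposes its payload as a string (the `payload()` accessor and the struct below are illustrative, not the crate's confirmed API):

```rust
// Hypothetical keyword-based evaluator; `action.payload()` is an
// assumed accessor and this type is illustrative, not built in.
pub struct KeywordEvaluator {
    blocked_terms: Vec<String>,
}

#[async_trait]
impl LlmEvaluator for KeywordEvaluator {
    async fn evaluate(&self, action: &Action) -> Result<LlmGuardrailResponse> {
        let payload = action.payload().to_lowercase(); // assumed accessor
        let hit = self.blocked_terms.iter().any(|t| payload.contains(t.as_str()));
        Ok(LlmGuardrailResponse {
            allowed: !hit,
            reasoning: if hit {
                "matched a blocked term".to_string()
            } else {
                "no blocked terms found".to_string()
            },
            confidence: 1.0, // deterministic check, so full confidence
        })
    }
}
```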
Built-in Evaluators¶
| Evaluator | Description |
|---|---|
| `HttpLlmEvaluator` | Calls an OpenAI-compatible API |
| `MockLlmEvaluator` | Always allows (for testing) |
| `CapturingLlmEvaluator` | Captures all calls for test assertions |
| `FailingLlmEvaluator` | Simulates LLM failures |
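In unit tests, the mock evaluators can stand in for a live endpoint. A sketch, assuming `MockLlmEvaluator` and a test `Action` can be built via `Default` (both constructors are assumptions, not documented API):

```rust
#[tokio::test]
async fn guardrail_allows_with_mock() {
    let evaluator = MockLlmEvaluator::default(); // assumed constructor
    let action = Action::default();              // assumed test fixture
    let response = evaluator.evaluate(&action).await.unwrap();
    assert!(response.allowed); // MockLlmEvaluator always allows
}
```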
Policy Modes¶
Block Mode¶
When `policy = "block"`, actions flagged by the LLM are suppressed; a minimal sketch of the relevant setting:
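```toml
[llm_guardrail]
policy = "block"  # flagged actions are suppressed (ActionOutcome::Suppressed)
```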
Flag Mode¶
When `policy = "flag"`, flagged actions continue through the pipeline and are routed to the review queue named by the rule's `send_to`; a minimal sketch:
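```toml
[llm_guardrail]
policy = "flag"  # flagged actions proceed, routed via the rule's send_to
```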
Use Cases¶
Content Moderation¶
Check user-generated content before sending:
```yaml
- name: moderate-messages
  condition:
    field: action.action_type
    eq: "send_user_message"
  action:
    type: llm_guardrail
    evaluator_name: "content-safety"
    block_on_flag: true
```
PII Detection¶
Flag actions containing personally identifiable information:
```yaml
- name: pii-check
  condition:
    field: action.provider
    eq: "external-api"
  action:
    type: llm_guardrail
    evaluator_name: "pii-detector"
    block_on_flag: false
    send_to: "compliance-review"
```
Prompt Injection Prevention¶
Protect LLM-targeted actions from prompt injection:
```yaml
- name: prompt-injection-guard
  condition:
    field: action.provider
    eq: "llm-gateway"
  action:
    type: llm_guardrail
    evaluator_name: "injection-detector"
    block_on_flag: true
```
Failure Handling¶
When the LLM evaluator errors — timeout, HTTP failure, JSON parse error, provider rate limit — the gateway must decide whether to let the action through or block it. The fail_open parameter on [llm_guardrail] controls that choice:
| Mode | Behavior on evaluator error | Counter that increments |
|---|---|---|
| `fail_open = false` (default, fail-closed) | Action is denied (`ActionOutcome::Suppressed`) with reason `"LLM guardrail unavailable: <error>"` | `acteon_llm_guardrail_errors_total` |
| `fail_open = true` (fail-open) | Action proceeds as if the guardrail allowed it | `acteon_llm_guardrail_errors_total` |
Either way, acteon_llm_guardrail_errors_total increments — the counter measures evaluator availability, not the dispatch outcome. The deny vs. allow decision is what fail_open flips.
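Illustrative pseudocode of that decision point (the `Outcome` enum and function shape are simplified stand-ins, not Acteon's actual internals):

```rust
// Simplified sketch of the guardrail decision; names are illustrative.
enum Outcome {
    Continue,
    Suppressed(String),
}

async fn apply_guardrail(
    evaluator: &dyn LlmEvaluator,
    action: &Action,
    fail_open: bool,
) -> Outcome {
    match evaluator.evaluate(action).await {
        // Evaluator reached a verdict: honor it.
        Ok(resp) if resp.allowed => Outcome::Continue,
        Ok(resp) => Outcome::Suppressed(resp.reasoning),
        // Evaluator errored (timeout, HTTP failure, parse error).
        // acteon_llm_guardrail_errors_total increments on both arms;
        // fail_open only decides whether the action still proceeds.
        Err(_) if fail_open => Outcome::Continue,
        Err(e) => Outcome::Suppressed(format!("LLM guardrail unavailable: {e}")),
    }
}
```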
Why fail-closed is the default. Operators who turn the guardrail on are explicitly opting into a content-safety check. If the evaluator is unreachable, the natural mental model is "the check didn't run, so the action should be blocked." Fail-open delivers the opposite: the check silently doesn't run and the action proceeds. An attacker who can force evaluator timeouts (oversized prompts, prompt-injection that triggers a long thought chain, regional outages) gets a free pass past the guard.
When to set fail_open = true. Override the default if the cost of a missed action exceeds the cost of a security bypass — for example, when the guardrail is an advisory check on an internal-only flow that must not stall on third-party LLM availability. Document the choice in your config (with a comment) so a future operator can tell the deviation from default was deliberate.
```toml
[llm_guardrail]
enabled = true
endpoint = "https://api.openai.com/v1/chat/completions"
api_key = "ENC[...]"
policy = "flag"   # Internal advisory only: availability over correctness.
fail_open = true  # Deliberately fail-open: this guardrail is advisory
                  # and must not block the action if the LLM is down.
```
Migration note. Acteon ≤ 0.1.x defaulted `fail_open` to `true`. This release flips the default to `false`, aligning with the principle of least surprise for security-oriented deployments. If you were relying on the implicit fail-open behavior, set `fail_open = true` explicitly in your `acteon.toml` to preserve it.
Monitoring¶
Prometheus Metrics¶
The guardrail emits three counters via GET /metrics/prometheus (and as JSON at GET /metrics):
| Metric | Counted on |
|---|---|
| `acteon_llm_guardrail_allowed_total` | Evaluator returned Allow (action passes through) |
| `acteon_llm_guardrail_denied_total` | Evaluator returned Deny, or Flag with `block_on_flag = true` (action suppressed) |
| `acteon_llm_guardrail_errors_total` | Evaluator errored: timeout, HTTP failure from the LLM, JSON parse error on the response, etc. |
Grafana. The bundled acteon-overview dashboard has an "LLM Guardrail" row with a decisions rate timeseries and a stat panel for the totals.
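The decisions-rate panel can be approximated with one `rate` query per counter (this layout is an illustration, not the dashboard's exact definition):

```promql
# Three queries on one timeseries panel, one per verdict:
rate(acteon_llm_guardrail_allowed_total[5m])
rate(acteon_llm_guardrail_denied_total[5m])
rate(acteon_llm_guardrail_errors_total[5m])
```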
What to alert on. Alerting on errors is the primary security-critical signal, not the deny ratio. In fail-open configurations the guardrail lets actions through when the evaluator errors, so an attacker who can force timeouts (large inputs, upstream LLM slowness) quietly bypasses the guard. The deny ratio in that attack stays flat or even drops because denied doesn't grow while allowed + errors keep going up. Page on errors first:
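A minimal expression (the threshold and window are illustrative; tune them to your traffic):

```promql
# Any sustained evaluator error rate warrants a page, especially fail-open.
rate(acteon_llm_guardrail_errors_total[5m]) > 0
```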
For baseline health — is the evaluator denying about as often as expected, or has something drifted? — compute deny prevalence against all evaluated traffic (include errors in the denominator). The + 1e-9 guards against division-by-zero NaN during quiet periods, which Grafana would otherwise render as "No Data" and hide the alert entirely:
```promql
rate(acteon_llm_guardrail_denied_total[5m])
/
(rate(acteon_llm_guardrail_allowed_total[5m])
  + rate(acteon_llm_guardrail_denied_total[5m])
  + rate(acteon_llm_guardrail_errors_total[5m])
  + 1e-9) > 0.2
```
A sustained non-zero denied rate on rules targeting external input surfaces (public webhooks, customer-facing dispatch) is still worth investigating — prompt-injection attempts that the evaluator successfully catches show up here — but treat it as a secondary signal. The errors alert above is what catches an actual bypass.