2026-01-28
Monitoring AI Agent Decisions in Real Time
Intended Team · Founding Team
Observability for Governance
You have deployed Intended. Your AI agents are governed. Every action is classified, evaluated, and recorded. But are you watching? Can you tell, right now, whether your governance system is healthy? Can you see when something is wrong before it becomes an incident?
Governance observability is the practice of monitoring your AI governance system in real time: understanding what decisions are being made, how they are distributed, and when patterns change. It is the difference between "we have governance" and "we know our governance is working."
The Four Signal Categories
Intended exposes metrics across four categories. Each category tells you something different about the health of your governance system.
Decision Metrics
Decision metrics tell you what the Authority Engine is doing. The primary signals are decision volume (intents per minute, segmented by domain and outcome), decision distribution (the ratio of allow, allow-with-conditions, escalate, and deny outcomes), and decision latency (the time from intent submission to decision return, measured at p50, p95, and p99).
A healthy governance system has stable decision volume (correlated with agent activity), a consistent decision distribution (policy changes will shift this, but unexpected shifts are a signal), and low latency (p99 under 100 milliseconds for standard evaluations).
What to watch for: sudden spikes in deny rate may indicate a policy misconfiguration or a compromised agent. Sudden drops in total volume may indicate that agents have lost connectivity to Intended. Latency increases may indicate infrastructure issues or database contention.
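The latency percentiles above can be computed directly from raw decision timings. As an illustration, here is a minimal Python sketch using nearest-rank percentiles over a batch of hypothetical latency samples (the sample data and function names are ours, not part of Intended's API):

```python
import random

def latency_percentiles(samples_ms):
    """Return (p50, p95, p99) for a list of latency samples in milliseconds,
    using the nearest-rank method on the sorted samples."""
    ordered = sorted(samples_ms)
    def pct(p):
        idx = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
        return ordered[idx]
    return pct(50), pct(95), pct(99)

# Hypothetical decision latencies: mostly fast, with a slow tail.
random.seed(7)
samples = [random.gauss(20, 5) for _ in range(950)] + \
          [random.gauss(200, 50) for _ in range(50)]
p50, p95, p99 = latency_percentiles(samples)
```

With a distribution like this, p50 sits near the bulk of fast evaluations while p99 lands in the slow tail, which is exactly why the p99 target matters: averages hide the tail that your slowest decisions live in.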
Risk Score Metrics
Risk score metrics tell you about the risk profile of your agent operations. The primary signals are risk score distribution (histogram of composite risk scores across all decisions), risk score by domain (average and p95 risk scores per domain), and high-risk action rate (the percentage of decisions with risk scores above your escalation threshold).
A healthy governance system has a risk distribution that matches your expectations. If most of your agents perform routine operations, the distribution should be heavily weighted toward low scores with a thin tail of high-risk actions.
What to watch for: a shift in the risk distribution indicates changing agent behavior. If the median risk score increases over time, agents are performing riskier actions. This might be intentional (new capabilities deployed) or unintentional (scope drift). Either way, it deserves investigation.
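To make the expected shape concrete, here is a small sketch that buckets composite risk scores into a coarse histogram and computes the high-risk action rate against an escalation threshold. The 0.0-1.0 score range, the 0.7 threshold, and the sample scores are assumptions for illustration:

```python
from collections import Counter

def risk_summary(scores, escalation_threshold=0.7):
    """Bucket composite risk scores (assumed 0.0-1.0) into tenths and
    compute the fraction of decisions at or above the escalation threshold."""
    buckets = Counter(min(int(s * 10), 9) / 10 for s in scores)
    high_risk_rate = sum(s >= escalation_threshold for s in scores) / len(scores)
    return dict(sorted(buckets.items())), high_risk_rate

# Hypothetical scores: routine operations weighted low, with a thin high tail.
scores = [0.05, 0.1, 0.12, 0.2, 0.15, 0.08, 0.3, 0.25, 0.72, 0.91]
hist, rate = risk_summary(scores)
```

A healthy distribution in this sketch is heavy in the low buckets with a small `rate`; a growing `rate` or a rightward shift in `hist` over time is the drift signal described above.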
Escalation Metrics
Escalation metrics tell you about the human oversight layer. The primary signals are escalation rate (percentage of decisions that result in escalation), escalation resolution time (median and p95 time from escalation to human resolution), escalation outcomes (the distribution of approve, deny, and modify outcomes for escalated decisions), and escalation timeout rate (percentage of escalations that hit the timeout without human action).
What to watch for: increasing escalation rate may mean policies are too conservative or agent behavior is drifting. Increasing resolution time may mean the review team is overwhelmed. High timeout rates mean escalations are going unhandled: either the routing is wrong or the review team is understaffed.
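The four escalation signals can all be derived from a stream of escalation records. Here is a hedged sketch, assuming a hypothetical record shape with an `outcome` field (where timeouts appear as their own outcome) and a `resolution_minutes` field:

```python
from statistics import median

def escalation_stats(escalations):
    """Summarize escalation records: timeout rate, median resolution time,
    and the outcome distribution for human-resolved escalations.
    Record shape is an assumption, not Intended's actual schema."""
    n = len(escalations)
    resolved = [e for e in escalations if e["outcome"] != "timeout"]
    return {
        "timeout_rate": sum(e["outcome"] == "timeout" for e in escalations) / n,
        "median_resolution_minutes": median(e["resolution_minutes"] for e in resolved),
        "outcomes": {o: sum(e["outcome"] == o for e in escalations) / n
                     for o in ("approve", "deny", "modify")},
    }

# Hypothetical escalation records.
escalations = [
    {"outcome": "approve", "resolution_minutes": 10},
    {"outcome": "approve", "resolution_minutes": 30},
    {"outcome": "deny", "resolution_minutes": 20},
    {"outcome": "modify", "resolution_minutes": 45},
    {"outcome": "timeout", "resolution_minutes": None},
]
stats = escalation_stats(escalations)
```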
Audit Health Metrics
Audit health metrics tell you about the integrity of the audit subsystem. The primary signals are chain length (the current length of the hash chain), chain verification status (whether the most recent verification passed), write latency (time to persist audit records), and storage utilization (how much of the allocated audit storage is in use).
What to watch for: chain verification failures indicate tampering or corruption. Write latency increases indicate a storage bottleneck. Storage approaching capacity requires expansion before it causes write failures.
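Chain verification deserves a moment of unpacking, because it is the signal that underwrites all the others. A hash chain links each audit record to its predecessor, so modifying any historical record breaks every hash downstream of it. The following is a minimal sketch of the idea; the record format and hashing scheme are assumptions, not Intended's actual implementation:

```python
import hashlib
import json

def record_hash(body, prev_hash):
    """Hash a record body together with the previous link's hash."""
    payload = json.dumps(body, sort_keys=True).encode() + prev_hash.encode()
    return hashlib.sha256(payload).hexdigest()

def verify_chain(records):
    """Recompute the chain from a genesis value and report whether every
    stored hash matches. Returns (ok, index_checked_or_broken)."""
    prev = "0" * 64  # genesis hash
    for i, rec in enumerate(records):
        expected = record_hash(rec["body"], prev)
        if rec["hash"] != expected:
            return False, i  # tampering or corruption detected at record i
        prev = expected
    return True, len(records)

# Build a small chain, then tamper with it.
chain, prev = [], "0" * 64
for body in ({"decision": "allow"}, {"decision": "deny"}):
    h = record_hash(body, prev)
    chain.append({"body": body, "hash": h})
    prev = h

ok, n = verify_chain(chain)
chain[1]["body"]["decision"] = "allow"  # simulate tampering
tampered_ok, broken_at = verify_chain(chain)
```

The point of monitoring verification status continuously, rather than only at audit time, is that a failure like `broken_at` above is detected within one verification cycle instead of months later.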
Building Dashboards
Intended exposes all metrics through a Prometheus-compatible endpoint. You can scrape these metrics with Prometheus, Datadog, Grafana Cloud, or any metrics platform that supports the Prometheus exposition format.
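The Prometheus exposition format itself is just labeled name-value lines, which is what makes it so widely supported. As a sketch of what a scrape of governance metrics might look like, here is a tiny stdlib renderer; the metric names shown are hypothetical, not Intended's actual metric names:

```python
def render_prometheus(metrics):
    """Render (name, labels, value) triples in the Prometheus text
    exposition format: name{label="value",...} value"""
    lines = []
    for name, labels, value in metrics:
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

# Hypothetical governance metrics as they might appear in a scrape.
sample = [
    ("intended_decisions_total", {"domain": "payments", "outcome": "allow"}, 1042),
    ("intended_decision_latency_ms", {"quantile": "0.99"}, 87.5),
]
output = render_prometheus(sample)
```

In practice you would not render this by hand; any Prometheus client library does it for you. The sketch is only meant to show that the format your dashboards consume is plain labeled text.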
We recommend three dashboards.
**Operational Dashboard.** This is the real-time view that the platform team monitors during business hours. It shows decision volume (time series), decision distribution (stacked bar), latency (percentile lines), and error rate (time series). The time window is the last 4 hours.
**Risk Dashboard.** This is the view for the security team. It shows risk score distribution (histogram), high-risk actions (table with details), escalation queue (current pending escalations), and top agents by risk score. The time window is the last 24 hours.
**Compliance Dashboard.** This is the view for the compliance team. It shows audit chain health (status indicator), evidence volume (decisions per day), policy coverage (percentage of domains with active policies), and escalation resolution statistics. The time window is the last 30 days.
Alert Configuration
Dashboards are for proactive monitoring. Alerts are for reactive response. Intended recommends the following alerts.
**Critical alerts** (page the on-call engineer):
- Authority Engine p99 latency exceeds 500 milliseconds for 5 minutes
- Audit chain verification failure
- Authority Engine error rate exceeds 1 percent for 5 minutes
- Token Service unavailable for 1 minute
**Warning alerts** (notify the team channel):
- Decision deny rate increases by more than 50 percent over the hourly baseline
- Escalation timeout rate exceeds 10 percent over 1 hour
- Median risk score increases by more than 20 percent over the daily baseline
- Agent behavior anomaly detected (any agent)
**Informational alerts** (daily digest):
- New agent enrolled
- Policy updated or deployed
- Domain pack updated
- Audit storage utilization above 70 percent
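Two patterns recur throughout the alert list above: baseline-relative thresholds ("more than 50 percent over the hourly baseline") and sustained-duration thresholds ("exceeds 500 milliseconds for 5 minutes"). Here is a hedged sketch of both conditions, with all function names and parameters ours for illustration:

```python
def baseline_breach(current, baseline, increase_threshold=0.5):
    """Fire when a rate exceeds its baseline by more than the given
    fraction, e.g. deny rate more than 50 percent over the hourly baseline."""
    if baseline == 0:
        return current > 0
    return (current - baseline) / baseline > increase_threshold

def sustained_breach(samples, threshold, window):
    """Fire only when every sample in the trailing window exceeds the
    threshold, approximating an 'exceeds X for N minutes' condition
    so a single spike does not page anyone."""
    return len(samples) >= window and all(s > threshold for s in samples[-window:])
```

The sustained-duration check is what separates a page-worthy condition from noise: one slow evaluation is normal, five straight minutes of slow evaluations is an incident.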
Anomaly Detection
Static thresholds work for most alerts, but some signals require anomaly detection: identifying when a metric deviates from its expected pattern without knowing in advance what the threshold should be.
Intended includes built-in anomaly detection for three signals. Agent behavioral anomalies compare each agent's recent behavior to its historical baseline using statistical deviation analysis. A sudden change in action types, action volume, or domain distribution triggers an anomaly alert.
Risk score trend anomalies track the moving average of risk scores per domain and flag when the trend shifts significantly. This catches gradual drift that a static threshold would miss.
Escalation pattern anomalies track escalation rates per agent and per domain and flag unusual patterns. If an agent that never triggers escalations suddenly starts triggering them frequently, that is an anomaly worth investigating.
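The common core of all three detectors is statistical deviation from a per-agent or per-domain baseline. A minimal version of that idea is a z-score check; this sketch assumes a simple normal-deviation model, which is one way (not necessarily Intended's way) to implement it:

```python
from statistics import mean, stdev

def behavior_anomaly(history, current, z_threshold=3.0):
    """Flag when a metric (action volume, escalation rate, etc.) deviates
    from its historical baseline by more than z_threshold standard
    deviations. A simple normal-deviation model, assumed for illustration."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > z_threshold

# Hypothetical daily action counts for one agent.
history = [100, 102, 98, 101, 99, 100]
```

An agent posting 100 actions today against that history is unremarkable; an agent posting 300 is flagged. The advantage over a static threshold is that the baseline is the agent's own behavior, so the same detector covers both a chatty automation agent and a quiet reporting agent.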
The Monitoring Flywheel
Good monitoring creates a flywheel. You detect a pattern in the metrics. You investigate and find a policy gap or an agent misconfiguration. You fix the issue and update the policies. You verify the fix in the metrics. Each cycle makes your governance more robust.
Without monitoring, issues accumulate silently. With monitoring, every issue is visible, investigable, and fixable. The governance system improves continuously because you can see what it is doing.
The organizations that get the most value from Intended are the ones that treat governance metrics with the same rigor as their production application metrics. They have dashboards on screens. They have alerts routed to on-call rotations. They review governance metrics in weekly operations meetings.
Governance is not a set-and-forget configuration. It is a living system that needs observation, tuning, and continuous improvement. Real-time monitoring makes that possible.