2026-03-01
The CTO Guide to Evaluating AI Governance Solutions
Intended Team · Founding Team
Your board is asking about AI governance. Your CISO is asking about AI governance. Your compliance team has sent you three different vendor comparison spreadsheets. Everyone agrees you need something. Nobody agrees on what.
This guide provides a 10-point framework for evaluating AI governance solutions. It is opinionated. It reflects what we have learned building Intended and working with organizations that have deployed AI agents in production. Use it as a starting point, not as gospel.
Point 1: Pre-Execution vs. Post-Execution
The single most important distinction in AI governance is when the governance happens. Does the system evaluate actions before they execute, or does it analyze logs after the fact?
Post-execution governance is monitoring. It tells you what happened. It is useful for understanding patterns and identifying issues. It does not prevent anything. By the time you know an agent made an unauthorized purchase, the purchase has been made.
Pre-execution governance is enforcement. It evaluates the action before it happens and either allows, denies, or escalates. The unauthorized purchase never occurs because the authority check blocks it.
Ask vendors: does your system evaluate actions before they execute, or does it analyze them after? If the answer is "after," you are buying a monitoring tool, not a governance tool. Both are valuable. They solve different problems.
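The distinction is easy to see in code. A minimal sketch (all policy thresholds here are hypothetical) of a pre-execution gate, where evaluation happens before the side effect rather than in a log afterward:

```python
from dataclasses import dataclass
from enum import Enum

class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    ESCALATE = "escalate"

@dataclass
class Action:
    category: str
    amount: float = 0.0

def evaluate(action: Action) -> Decision:
    # Hypothetical policy: large transfers are blocked, mid-size ones escalated.
    if action.category == "transfer" and action.amount > 10_000:
        return Decision.DENY
    if action.category == "transfer" and action.amount > 1_000:
        return Decision.ESCALATE
    return Decision.ALLOW

def execute(action: Action) -> str:
    decision = evaluate(action)            # governance happens FIRST
    if decision is not Decision.ALLOW:
        return f"blocked: {decision.value}"
    return "executed"                      # side effect only after ALLOW
```

A post-execution monitor would run `evaluate` on a log entry after `execute` returned; the unauthorized action would already have happened.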
Point 2: Fail-Closed vs. Fail-Open
When the governance system is unavailable, what happens? If the answer is "actions proceed without governance," the system is fail-open. Every outage in your governance layer becomes a window where agents operate without oversight.
Fail-closed means that if the governance system cannot evaluate an action, the action is blocked. This is more disruptive during outages but dramatically safer. An outage in a fail-open system is invisible to your security posture. An outage in a fail-closed system is immediately obvious because agents stop working.
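The two failure modes differ by a single default. A sketch (the evaluator interface is hypothetical) of what happens when the governance service is unreachable:

```python
def evaluate_fail_closed(action, evaluator) -> str:
    """If the governance service cannot be reached, block the action."""
    try:
        return evaluator(action)
    except ConnectionError:
        return "deny"    # fail-closed: no evaluation means no execution

def evaluate_fail_open(action, evaluator) -> str:
    """Same interface, opposite default: an outage is an ungoverned window."""
    try:
        return evaluator(action)
    except ConnectionError:
        return "allow"   # fail-open: the action proceeds unevaluated
```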
Ask vendors: what is the default behavior when your system is unreachable? Is it configurable? What is the recommended production configuration?
Point 3: Intent Awareness
Does the system understand what the agent is trying to do, or does it only see API calls and permissions? A system that sees "POST /api/transfers" knows an API was called. A system that sees "transfer $50,000 to a new international recipient at 2 AM" understands intent.
Intent awareness requires semantic classification of actions. The system needs to map raw tool calls to meaningful categories, extract risk-relevant parameters, and evaluate them in context. This is fundamentally different from checking permissions.
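A sketch of that mapping, assuming a hypothetical intent table and a simple amount-based risk rule: the raw HTTP call is translated into a semantic category, and risk-relevant parameters are extracted and evaluated rather than just permission-checked.

```python
def classify(tool_call: dict) -> dict:
    # Hypothetical mapping from raw API calls to semantic intents.
    intents = {
        ("POST", "/api/transfers"): "funds_transfer",
        ("GET", "/api/accounts"): "account_read",
    }
    intent = intents.get((tool_call["method"], tool_call["path"]), "unknown")

    # Extract risk-relevant parameters and evaluate them, not just the verb.
    amount = tool_call.get("body", {}).get("amount", 0)
    risk = "low"
    if intent == "funds_transfer":
        risk = "high" if amount >= 10_000 else "medium"
    return {"intent": intent, "amount": amount, "risk": risk}
```

A permission check would have answered only "is POST /api/transfers allowed?"; the classification answers "is a $50,000 transfer allowed in this context?".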
Ask vendors: how does your system classify actions? Does it understand the difference between a $100 transfer and a $100,000 transfer? Can it evaluate the same action differently based on context?
Point 4: Latency Budget
If governance adds 500ms to every agent action, your engineering teams will bypass it. They will hardcode exceptions, cache stale decisions, or remove the integration entirely. Governance that slows agents down does not survive contact with production.
The target should be sub-50ms p99 for authority decisions. That is fast enough to be invisible in the context of typical tool call latency (200-2000ms for most API calls). Anything over 100ms will generate friction. Anything over 500ms will be removed.
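When comparing vendor latency claims, measure p99 yourself rather than trusting averages. A minimal stdlib harness for benchmarking any evaluation function:

```python
import statistics
import time

def p99_latency_ms(fn, samples: int = 1000) -> float:
    """Run fn repeatedly and return the 99th-percentile latency in ms."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000)
    # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile
    return statistics.quantiles(timings, n=100)[98]
```

Averages hide tail latency; a 10ms mean with a 400ms p99 will still generate the friction described above.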
Ask vendors: what is your p99 evaluation latency? How is it measured? Can you provide benchmark data from production deployments?
Point 5: Cryptographic Proof
When an auditor asks "prove that this action was authorized," what evidence does the system produce? A log entry is not proof. Log entries can be modified, deleted, or fabricated. Cryptographic proof is a signed token that can be independently verified using a public key.
The proof should include the complete evaluation context: what was evaluated, which policies were applied, what the risk assessment was, and what the decision was. The signature should cover the entire payload so that modifying any field invalidates the proof.
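A sketch of the "signature covers the entire payload" property. This example uses HMAC-SHA256 from the standard library for brevity; a production system would use an asymmetric scheme such as Ed25519 so that verification needs only a public key. The payload fields and key are illustrative.

```python
import hashlib
import hmac
import json

KEY = b"demo-signing-key"  # illustration only; real systems use asymmetric keys

def sign_decision(payload: dict) -> str:
    # Canonicalize so the same payload always produces the same bytes.
    canonical = json.dumps(payload, sort_keys=True).encode()
    return hmac.new(KEY, canonical, hashlib.sha256).hexdigest()

def verify(payload: dict, signature: str) -> bool:
    # Any modified field changes the canonical bytes and invalidates the proof.
    return hmac.compare_digest(sign_decision(payload), signature)
```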
Ask vendors: does your system produce cryptographic proof for every decision? Can the proof be verified independently, without access to your system? What signing algorithm is used?
Point 6: Audit Chain Integrity
An audit trail stored in a mutable database is only as trustworthy as the access controls on that database. A hash-linked audit chain provides mathematical integrity guarantees. Each entry references the hash of the previous entry. Modifying or deleting any entry breaks the chain.
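The mechanism fits in a few lines. A sketch of a hash-linked chain where each entry commits to its predecessor, so tampering with any record breaks verification from that point forward:

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry

def append_entry(chain: list, record: dict) -> None:
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    body = json.dumps(record, sort_keys=True)
    entry_hash = hashlib.sha256((prev_hash + body).encode()).hexdigest()
    chain.append({"prev": prev_hash, "record": record, "hash": entry_hash})

def verify_chain(chain: list) -> bool:
    prev = GENESIS
    for entry in chain:
        body = json.dumps(entry["record"], sort_keys=True)
        expected = hashlib.sha256((prev + body).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False  # a modified or deleted entry breaks the link
        prev = entry["hash"]
    return True
```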
This is not a nice-to-have for regulated industries. SOC 2, HIPAA, and financial regulations increasingly require demonstrable log integrity. A hash chain provides it without depending on operational procedures or access control policies.
Ask vendors: how is audit data stored? Can entries be modified or deleted? Is the integrity of the audit trail cryptographically verifiable?
Point 7: Escalation Workflows
Allow and deny are not sufficient outcomes. Many actions should be held for human review rather than automatically approved or rejected. The governance system needs a built-in escalation workflow that holds actions, notifies reviewers, and records the human decision.
The escalation workflow should support multiple approval patterns: single approver, multi-party approval, delegation chains, and time-bounded approvals. It should integrate with existing communication tools like Slack, Teams, PagerDuty, and email.
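A minimal sketch of escalation as a first-class outcome: the action is held in a pending state, and each human decision is recorded. The multi-party pattern shown here is one of the approval patterns listed above; the field names are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class Escalation:
    action: str
    required_approvals: int = 1
    reviewers: list = field(default_factory=list)  # record the human decisions
    status: str = "pending"                        # held until reviewed

    def approve(self, reviewer: str) -> str:
        if self.status == "pending":
            self.reviewers.append(reviewer)
            if len(self.reviewers) >= self.required_approvals:
                self.status = "approved"
        return self.status

    def deny(self, reviewer: str) -> str:
        if self.status == "pending":
            self.reviewers.append(reviewer)
            self.status = "denied"
        return self.status
```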
Ask vendors: does your system support escalation as a first-class decision outcome? What approval patterns are supported? How does escalation integrate with existing tools?
Point 8: Domain Specificity
A governance system that treats every domain identically will be either too permissive or too restrictive in each domain. Financial operations have different risk profiles than infrastructure operations. Healthcare data has different sensitivity than marketing analytics.
The system should support domain-specific governance models that encode the risk characteristics, normal patterns, and regulatory requirements for each operational domain. These models should be customizable to match your organization's specific needs.
Ask vendors: does your system include domain-specific governance models? Can they be customized? How granular are the risk assessments for different operational domains?
Point 9: Integration Depth
A governance system that requires rewriting your agent code is a governance system that will not be adopted. The integration should be lightweight: a wrapper around existing tools, a middleware in existing pipelines, or a gateway that sits between agents and downstream services.
The integration should preserve the existing developer experience. Tool definitions should not change. Agent logic should not change. The governance layer should be transparent to the agent and visible to operators and auditors.
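A sketch of what "lightweight wrapper" can mean in practice: a decorator that gates an existing tool function without changing its definition or the agent logic that calls it. The `evaluate` policy interface is hypothetical.

```python
import functools

def governed(evaluate):
    """Wrap an existing tool function; the tool and its callers are unchanged."""
    def wrap(tool):
        @functools.wraps(tool)
        def wrapper(*args, **kwargs):
            decision = evaluate(tool.__name__, kwargs)  # pre-execution check
            if decision != "allow":
                raise PermissionError(f"{tool.__name__}: {decision}")
            return tool(*args, **kwargs)
        return wrapper
    return wrap

# Hypothetical policy: small payments allowed, large ones denied.
def policy(tool_name: str, kwargs: dict) -> str:
    return "allow" if kwargs.get("amount", 0) < 1_000 else "deny"

@governed(policy)
def pay(amount: int) -> str:
    return f"paid {amount}"
```

The tool signature, name, and return value are preserved, so the agent sees the same interface while every call passes through the governance check.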
Ask vendors: how many lines of code does the integration require? Does it change existing tool definitions or agent logic? Can it be added to an existing deployment without refactoring?
Point 10: Total Cost of Ownership
The sticker price is the smallest part of the cost. The real costs are integration engineering, policy authoring, operational maintenance, and the opportunity cost of agents that are blocked or slowed by governance.
A system that is cheap to license but expensive to integrate is not cheaper. A system that is fast to integrate but requires a dedicated team to maintain policies is not cheaper. A system that is easy to maintain but adds 500ms of latency, causing teams to bypass it, is not cheaper.
Ask vendors: what is the total integration cost, including engineering time? What is the ongoing operational cost? What is the expected latency impact on agent operations?
The Evaluation Matrix
Use this matrix to score each solution on a 1-5 scale for each of the 10 points:
| Criterion | Weight | Score |
|-----------|--------|-------|
| Pre-execution enforcement | High | |
| Fail-closed default | High | |
| Intent awareness | High | |
| Latency (sub-50ms p99) | High | |
| Cryptographic proof | Medium | |
| Audit chain integrity | Medium | |
| Escalation workflows | Medium | |
| Domain specificity | Medium | |
| Integration depth | High | |
| Total cost of ownership | High | |
The weights reflect our view of what matters most. Adjust them based on your organization's priorities. A regulated financial services firm will weight audit chain integrity higher. A fast-moving startup will weight integration depth and latency higher.
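One way to turn the matrix into a comparable number, assuming hypothetical weight multipliers (High=3, Medium=2, Low=1) and normalizing to a 0-100 scale:

```python
WEIGHTS = {"High": 3, "Medium": 2, "Low": 1}  # multipliers are illustrative

def weighted_score(scores: dict) -> float:
    """scores maps criterion name -> (weight label, 1-5 score)."""
    total = sum(WEIGHTS[w] * s for w, s in scores.values())
    max_total = sum(WEIGHTS[w] * 5 for w, _ in scores.values())
    return round(100 * total / max_total, 1)
```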
One More Thing
Ask every vendor this question: "If your system goes down for an hour during peak traffic, what happens to my AI agents?"
The answer tells you everything you need to know about their architectural philosophy. Fail-open means your agents operate without governance during outages. Fail-closed means your agents stop until governance is restored. The right answer depends on your risk tolerance. But you need to know which one you are buying.
Your AI agents are making decisions that affect your customers, your revenue, and your compliance posture. The governance system you choose will determine whether those decisions are controlled or hoped-for. Choose carefully.