2026-02-08

Scaling to One Million Decisions per Month

Intended Team · Founding Team

The Scale Challenge

A million decisions per month sounds like a lot. It is roughly 23 decisions per minute, sustained 24/7. For a large enterprise with dozens of AI agents operating across multiple domains, this is a realistic baseline. Some of our enterprise customers are already projecting two to five million decisions per month within their first year.

The challenge is not just throughput. It is throughput with latency constraints. Every decision adds latency to the AI agent's operation. If governance takes 500 milliseconds per decision, and an agent makes 10 decisions per workflow, governance adds 5 seconds of overhead. That is noticeable. That is the kind of overhead that makes teams bypass governance to hit performance targets.

Intended targets p99 latency under 100 milliseconds for standard policy evaluations, even at high throughput. Here is how we achieve that.

Stateless Evaluation

The Authority Engine is stateless. Each evaluation request contains everything the engine needs to make a decision: the classified intent, the agent identity, the policy set, and the risk scoring parameters. The engine does not maintain session state between requests.

Stateless design means horizontal scaling is trivial. Need more throughput? Add more Authority Engine instances behind the load balancer. Each instance handles requests independently. There is no coordination overhead, no distributed locks, no consensus protocol.
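Because each request is self-contained, an evaluation can be modeled as a pure function of its input. The sketch below illustrates that shape; the field names and the toy decision logic are hypothetical, not Intended's actual API.

```python
from dataclasses import dataclass

# Hypothetical shape of a self-contained evaluation request; the
# field names are illustrative, not Intended's real schema.
@dataclass(frozen=True)
class EvaluationRequest:
    intent: str              # classified intent, e.g. "deploy.production"
    agent_id: str            # identity of the requesting agent
    policy_set_version: str  # policy set to evaluate against
    risk_params: dict        # risk scoring parameters

def evaluate(request: EvaluationRequest) -> str:
    """A pure function of its input: no session state is read or
    written, so any instance behind the load balancer can serve
    any request. The threshold logic here is a placeholder."""
    return "allow" if request.risk_params.get("score", 0) < 0.5 else "review"
```

Since `evaluate` touches nothing outside its argument, adding capacity is just a matter of running more copies of it.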

In practice, each Authority Engine instance handles approximately 500 evaluations per second. For a million decisions per month (23 per minute), a single instance is more than sufficient. But enterprises do not experience uniform load. They experience bursts: deployment pipelines that trigger hundreds of governance decisions in seconds, incident response workflows that generate spikes, and batch operations that submit thousands of intents simultaneously.

For burst handling, we recommend a minimum of three instances with autoscaling configured to add instances when request queue depth exceeds a threshold. Most enterprise deployments run five to ten instances to handle burst traffic with comfortable headroom.

Connection Pooling

The Authority Engine is stateless, but it needs to read policies and write audit records. Both operations involve database access, and database connections are expensive to establish.

Intended uses connection pooling at two levels. At the application level, each Authority Engine instance maintains a pool of database connections. The pool is sized based on the expected concurrency: typically 20-50 connections per instance. Connections are reused across requests, eliminating the overhead of connection establishment.

At the infrastructure level, we use PgBouncer as a connection pooler between the Authority Engine instances and the PostgreSQL database. PgBouncer multiplexes application connections onto a smaller number of database connections, reducing the total connection count on the database server. This is critical when running many Authority Engine instances, each with its own connection pool.

The combination of application-level and infrastructure-level pooling means the database sees a manageable number of connections regardless of how many Authority Engine instances are running. A deployment with 10 Authority Engine instances, each with a 30-connection pool, does not create 300 database connections. PgBouncer multiplexes them onto 50-100 actual database connections.
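As an illustration of that multiplexing, here is a minimal PgBouncer configuration consistent with the numbers above. The hostnames and database names are placeholders, and the values are a sketch, not Intended's shipped defaults.

```ini
; Illustrative PgBouncer config; hostnames and sizes are examples.
[databases]
intended = host=pg-primary.internal port=5432 dbname=intended

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction      ; multiplex per transaction, not per session
default_pool_size = 80       ; actual server connections (within 50-100)
max_client_conn = 400        ; headroom for 10 instances x ~30 app connections
```

Transaction pooling is what makes the multiplexing effective: a server connection is held only for the duration of a transaction, so 300 application connections can share far fewer database connections.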

Read Replicas for Policy Evaluation

Policy evaluation is a read-heavy operation. The Authority Engine reads the policy set, reads the agent's trust level, reads the domain pack configuration, and reads historical patterns. The only write is the audit record.

Intended separates reads and writes at the database level. Policy data, agent data, and domain pack configurations are served from read replicas. Audit writes go to the primary database. This separation has two benefits.

First, read replicas can be scaled independently of the primary. If policy evaluation is the bottleneck, add more read replicas. If audit writes are the bottleneck, scale the primary.

Second, read replicas can be placed closer to the Authority Engine instances, reducing network latency for the most frequent database operations. In multi-region deployments, each region has local read replicas while audit writes are replicated to the primary region asynchronously.
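The read/write split described above can be sketched as a small routing layer. The DSN strings and the round-robin policy here are illustrative assumptions, not Intended's real topology.

```python
import itertools

class ConnectionRouter:
    """Minimal sketch of read/write separation: reads round-robin
    across replica DSNs, writes always target the primary. DSN
    values and the routing policy are placeholders."""

    def __init__(self, primary_dsn, replica_dsns):
        self.primary_dsn = primary_dsn
        self._replicas = itertools.cycle(replica_dsns)

    def dsn_for(self, operation):
        # Audit writes must hit the primary; policy, agent, and
        # domain pack reads can be served by any replica.
        if operation == "write":
            return self.primary_dsn
        return next(self._replicas)
```

In a multi-region deployment, the replica list would contain only the local region's replicas, keeping the frequent read path off the network's long edges.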

Caching

Not every governance decision requires a full database round-trip. Many inputs to the evaluation are stable over time: policy definitions change infrequently, domain pack configurations are updated rarely, and agent trust levels change gradually.

Intended uses a multi-layer caching strategy. The first layer is an in-process cache within each Authority Engine instance. Policy definitions and domain pack configurations are cached in memory with a configurable TTL (default: 60 seconds). Cache invalidation is triggered by a pub/sub notification when policies are updated.
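The in-process layer behaves roughly like the sketch below: entries expire after the TTL, and the pub/sub listener can invalidate a key early. This is a minimal illustration of the described behavior, not Intended's actual implementation.

```python
import time

class TTLCache:
    """In-process cache sketch with a TTL and explicit invalidation,
    mirroring the behavior described above (60-second default TTL,
    pub/sub-triggered invalidation)."""

    def __init__(self, ttl_seconds=60.0):
        self.ttl = ttl_seconds
        self._entries = {}  # key -> (value, expiry timestamp)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._entries[key]  # expired: force a fresh load
            return None
        return value

    def put(self, key, value):
        self._entries[key] = (value, time.monotonic() + self.ttl)

    def invalidate(self, key):
        # Called from the pub/sub listener when a policy is updated,
        # so changes propagate faster than the TTL alone would allow.
        self._entries.pop(key, None)
```

The TTL bounds staleness even if a pub/sub message is lost; the invalidation path makes the common case (an explicit policy update) take effect almost immediately.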

The second layer is a shared cache (Redis) for data that is expensive to compute but shared across instances. Risk scoring model parameters, agent historical pattern summaries, and compiled policy decision trees are cached at this level.

The third layer is database-level caching. PostgreSQL's buffer cache handles frequently accessed rows. Proper index design ensures that policy lookups and agent data queries hit indexes rather than scanning tables.

With caching, the median policy evaluation performs zero database reads. The policy is in the in-process cache, the agent data is in the shared cache, and the only database operation is the audit write. The p50 evaluation latency drops from 15-20 milliseconds (with database access) to 3-5 milliseconds (cache-only).

Audit Write Optimization

The audit ledger is the write-heavy component. Every decision produces an audit record, and the hash-chained structure requires serializable transactions to maintain chain integrity. This creates a potential bottleneck: the hash chain is inherently sequential because each record depends on the previous record's hash.

Intended addresses this with a batched write strategy. Audit records are collected in an in-memory buffer for a short window (default: 50 milliseconds). At the end of the window, the batch is written to the ledger in a single transaction. The hash chain is computed over the batch: each record in the batch includes the hash of the previous record, and the batch itself is linked to the previous batch.
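The chaining within a batch can be sketched as follows. The record fields and hashing details are illustrative, not Intended's ledger schema; the point is that each record embeds its predecessor's hash, and the batch returns a tail hash for the next batch to link against.

```python
import hashlib
import json

def write_batch(records, prev_hash):
    """Sketch of a batched, hash-chained audit write. Each record
    embeds the previous record's hash; the function returns the
    chained records plus the new chain tail. Field names are
    placeholders."""
    chained = []
    running_hash = prev_hash
    for record in records:
        entry = dict(record, prev_hash=running_hash)
        payload = json.dumps(entry, sort_keys=True).encode()
        running_hash = hashlib.sha256(payload).hexdigest()
        entry["hash"] = running_hash
        chained.append(entry)
    # In production, this list would be persisted in one serializable
    # transaction; here we just return it with the new chain tail.
    return chained, running_hash
```

Because only the final hash of each batch needs to be carried forward, the sequential dependency between individual records stays in memory, and the database sees one transaction per batch instead of one per record.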

Batching reduces the number of database transactions by a factor of 10-50 under load. The benefit scales with throughput: at the baseline of 23 decisions per minute the buffer rarely holds more than one record, but during a burst of 500 evaluations per second a 50-millisecond window collects roughly 25 records into a single transaction. The trade-off is a small increase in audit record latency (up to 50 milliseconds), which is acceptable for audit purposes.

For organizations that require immediate audit persistence (zero buffering), the batch window can be set to 0 milliseconds, reverting to per-record writes. This reduces write throughput but ensures every audit record is persisted immediately.

Horizontal Scaling Playbook

Here is the practical playbook for scaling Intended from startup to enterprise throughput.

**10,000 decisions per month** (startup tier): Single Authority Engine instance, single PostgreSQL instance, no caching layer. Total infrastructure: 2 containers, 1 database. Cost: minimal.

**100,000 decisions per month** (growth tier): Two Authority Engine instances behind a load balancer, PostgreSQL with one read replica, Redis for shared caching. Total infrastructure: 5 containers, 2 database instances. Autoscaling configured for burst handling.

**1,000,000 decisions per month** (enterprise tier): Five to ten Authority Engine instances with autoscaling, PostgreSQL with three read replicas, Redis cluster for shared caching, PgBouncer for connection multiplexing. Total infrastructure: 10-15 containers, 4 database instances. Multi-AZ deployment for high availability.

**10,000,000 decisions per month** (large enterprise): Twenty or more Authority Engine instances across multiple regions, PostgreSQL with regional read replicas, Redis cluster per region, dedicated PgBouncer per region. Audit writes use a write-ahead log with asynchronous replication to secondary regions. Total infrastructure varies by deployment model.

Each tier is a natural evolution of the previous one. You do not need to re-architect to scale. You add instances, add replicas, and adjust pool sizes. The application code is the same at every tier.
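A back-of-envelope check using the article's own figure of roughly 500 evaluations per second per instance shows why burst capacity, not sustained load, drives the instance counts above: even the largest tier's average load is a small fraction of one instance.

```python
PER_INSTANCE_RPS = 500  # per-instance throughput figure from above

def sustained_rps(decisions_per_month, days=30):
    """Average decisions per second, assuming a 30-day month."""
    return decisions_per_month / (days * 24 * 3600)

for tier in (10_000, 100_000, 1_000_000, 10_000_000):
    rps = sustained_rps(tier)
    print(f"{tier:>10,}/month -> {rps:.3f}/s sustained "
          f"({rps / PER_INSTANCE_RPS:.5f} of one instance)")
```

A million decisions per month averages about 0.39 per second; the five-to-ten instance recommendation at that tier exists entirely to absorb bursts and provide availability headroom.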

Measuring Performance

You cannot optimize what you do not measure. Intended exposes detailed performance metrics through a Prometheus-compatible metrics endpoint.

Key metrics to monitor: evaluation latency (p50, p95, p99), audit write latency, cache hit rates (per layer), database connection pool utilization, request queue depth, and error rates by category.
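To make the latency targets concrete, the percentiles in question can be computed from raw samples as below. Prometheus histograms approximate these server-side from buckets; this exact stdlib version is just to illustrate what an alerting threshold on p99 tracks.

```python
import statistics

def latency_percentiles(samples_ms):
    """Compute p50/p95/p99 from raw latency samples in milliseconds.
    Illustration only; Prometheus estimates these from histogram
    buckets rather than raw samples."""
    qs = statistics.quantiles(samples_ms, n=100)  # 99 cut points
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}
```

An alert on p99 crossing 100 milliseconds catches tail degradation that a p50 or average-based alert would miss entirely.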

We publish baseline performance numbers and recommend alerting thresholds based on our operational experience. When a metric approaches a threshold, it is time to scale the relevant component before performance degrades.