operator runbooks

Intended Documentation

Incident Response

Investigate authorization incidents using audit trails, decision token inspection, emergency kill switches, and post-mortem procedures.

advanced7 min readimplemented

Incident Response#

This runbook covers investigating authorization incidents in the Intended runtime: analyzing audit trails, inspecting decision tokens, activating the emergency kill switch, and conducting post-mortems.

Prerequisites#

The Intended CLI installed and authenticated with operator or platform-admin role
Access to the audit log subsystem
Familiarity with the Trust Model

Danger

During an active incident, prioritize containment over investigation. If AI execution is behaving unexpectedly, activate the kill switch first, then investigate.

Incident Severity Levels#

Severity	Description	Response Time	Example
Critical	Unauthorized AI execution in production	Immediate	Policy bypass detected
High	Authorization failures blocking critical services	< 15 min	Mass denials after deploy
Medium	Unexpected decision patterns	< 1 hour	Drift-induced allow/deny flip
Low	Audit anomalies, non-blocking	< 4 hours	Missing audit entries

Step 1: Audit Trail Analysis#

Begin every investigation by querying the audit trail for the affected time window.

$intended audit query

Query the authorization audit trail.

--startstring *

Start time (ISO 8601 or relative: 1h, 6h, 24h)

--endstring

End time (default: now)

--identity(-i)string

Filter by caller identity

--action(-a)string

Filter by action (deploy, invoke, write, etc.)

--result(-r)string

Filter by result: allow, deny, error

--environment(-e)string

Filter by environment

--format(-f)string

Output format: table, json, csv

bash

$ intended audit query \
    --start 2h \
    --environment production \
    --result deny \
    --format table

TIMESTAMP              IDENTITY                    ACTION   RESOURCE                  RESULT  POLICY
2026-03-08T10:32:01Z   svc:data-pipeline-runner    write    env:prod/service-b        deny    restrict-production-writes
2026-03-08T10:31:44Z   svc:data-pipeline-runner    write    env:prod/service-b        deny    restrict-production-writes
2026-03-08T10:28:12Z   svc:batch-processor         deploy   env:prod/batch-service    deny    restrict-production-deploys
2026-03-08T10:27:58Z   user:alice                  deploy   env:prod/service-a        deny    restrict-production-deploys

Identify Patterns#

Look for clusters of denials from the same identity, policy, or time window:

bash

$ intended audit query \
    --start 6h \
    --environment production \
    --result deny \
    --format json \
    --group-by policy

{
  "restrict-production-writes": {
    "count": 147,
    "firstSeen": "2026-03-08T08:15:00Z",
    "lastSeen": "2026-03-08T10:32:01Z",
    "uniqueIdentities": 3
  },
  "restrict-production-deploys": {
    "count": 12,
    "firstSeen": "2026-03-08T10:27:58Z",
    "lastSeen": "2026-03-08T10:28:12Z",
    "uniqueIdentities": 2
  }
}

Tip

A sudden spike in denials from a single policy shortly after a deployment almost always indicates a policy regression. Cross-reference with intended deploy history to confirm.

Step 2: Decision Token Inspection#

Every authorization decision produces a cryptographically signed decision token. Inspect it to see the full evaluation chain.

$intended token inspect <decision-token>

Inspect a decision token to see the full authorization evaluation.

--verifyboolean

Verify the cryptographic signature (default: true)

--verbose(-v)boolean

Show full evaluation trace including intermediate steps

bash

$ intended token inspect dtk_m8x2p4q7 --verbose

DECISION TOKEN INSPECTION
─────────────────────────────────────────────────
Token ID:       dtk_m8x2p4q7
Signature:      VALID (ES256, signed by runtime-prod-01)
Issued at:      2026-03-08T10:32:01Z

REQUEST:
  Identity:     svc:data-pipeline-runner
  Action:       write
  Resource:     env:prod/service-b
  Trust level:  0.85

EVALUATION TRACE:
  1. restrict-production-writes/rules[0]
     Match: action=write, scope=production  →  MATCHED
     Condition: approval(required=1, from=role:data-lead)
     Status: NOT SATISFIED (no approval on record)
     Result: DENY

  2. restrict-production-writes/fallback
     Result: DENY (fallback not reached, rule[0] matched)

FINAL DECISION: DENY
REASON: Approval condition not satisfied

Trace a Chain of Decisions#

For complex incidents involving multiple dependent decisions, trace the full chain:

bash

$ intended audit trace \
    --identity svc:data-pipeline-runner \
    --start 1h \
    --environment production

DECISION CHAIN for svc:data-pipeline-runner
─────────────────────────────────────────────────
10:28:01  read   env:prod/config-store     ALLOW  (trust-level: 0.85)
10:28:03  invoke env:prod/model-a          ALLOW  (trust-level: 0.85)
10:32:01  write  env:prod/service-b        DENY   ← first failure
10:32:15  write  env:prod/service-b        DENY   (retry)
10:32:30  write  env:prod/service-b        DENY   (retry)

Step 3: Emergency Kill Switch#

The kill switch immediately suspends all AI execution authorization for a target scope. Use it when you observe unauthorized or dangerous AI behavior.

Danger

The kill switch is a last-resort control. It will deny ALL authorization decisions in the affected scope, including legitimate operations. Use it only when the risk of continued execution outweighs the impact of a full stop.

$intended emergency kill

Activate the emergency kill switch to suspend all authorization.

--scope(-s)string *

Kill scope: environment, service, or identity pattern

--reasonstring *

Reason for activation (recorded in audit trail)

--duration(-d)string

Auto-expire duration (default: manual lift required)

--notifystring

Notification channels: slack, pagerduty, email

Activate the kill switch

bash

$ intended emergency kill \
    --scope "env:production" \
    --reason "Suspected unauthorized AI execution via policy bypass" \
    --notify slack,pagerduty

EMERGENCY KILL SWITCH ACTIVATED
─────────────────────────────────────────────────
Scope:        env:production (all services, all identities)
Activated by: alice@example.com
Activated at: 2026-03-08T10:45:00Z
Reason:       Suspected unauthorized AI execution via policy bypass
Duration:     indefinite (manual lift required)

Notifications sent: #incident-response (Slack), PagerDuty

Verify kill switch is active

bash

$ intended emergency status

ACTIVE KILL SWITCHES:
  Scope:       env:production
  Since:       2026-03-08T10:45:00Z (12 minutes ago)
  Activated:   alice@example.com
  Decisions blocked: 234 since activation

Lift the kill switch

Once the incident is contained and the root cause addressed:

bash

$ intended emergency lift \
    --scope "env:production" \
    --reason "Root cause identified and remediated. Policy v3 restored."

Kill switch lifted for env:production
Lifted by:  alice@example.com
Lifted at:  2026-03-08T11:30:00Z
Duration:   45 minutes

Step 4: Post-Mortem#

After containment and resolution, conduct a structured post-mortem.

Generate an incident report

bash

$ intended incident report \
    --incident inc_r4t7w2 \
    --include audit-trail,decision-tokens,deploy-history \
    --format markdown \
    --output reports/inc_r4t7w2-postmortem.md

Incident report generated: reports/inc_r4t7w2-postmortem.md
  Audit entries included: 1,247
  Decision tokens included: 42
  Deploy events included: 3

Identify the root cause

Common root causes for authorization incidents:

Root Cause	Indicators	Resolution
Policy regression	Deny spike after deployment	Rollback, fix policy, redeploy
Configuration drift	Runtime differs from repository	Reconcile and redeploy from source
Trust level degradation	Decisions flip without policy change	Investigate trust scoring inputs
Credential compromise	Unauthorized identity in audit trail	Rotate credentials, tighten scope

Define remediation actions

bash

$ intended incident update inc_r4t7w2 \
    --status resolved \
    --root-cause "Policy v4 introduced approval condition not satisfied by svc:data-pipeline-runner" \
    --remediation "Rolled back to v3. Updated v5 with service account exception. Added blast radius check to CI."

Publish and review

Share the post-mortem with the team. Intended tracks incident history for compliance and continuous improvement:

bash

$ intended incident list --status resolved --limit 5

ID           TITLE                                SEVERITY  DURATION   RESOLVED
inc_r4t7w2   Policy v4 rollback — deny rate        medium    45m        2026-03-08
inc_q3k8m1   Staging drift — allow bypass           high     2h 15m     2026-03-01
inc_p2j7n9   Kill switch — batch processor           crit    12m        2026-02-22

Incident Response Checklist#

Use this checklist during any authorization incident:

[ ] Assess severity and notify the on-call team
[ ] If critical: activate the kill switch immediately
[ ] Query audit trail for the affected time window
[ ] Inspect decision tokens for anomalous evaluations
[ ] Cross-reference with recent deployments
[ ] Contain: rollback or kill switch
[ ] Investigate: trace the root cause
[ ] Remediate: fix policy, rotate credentials, or resolve drift
[ ] Post-mortem: document findings and actions
[ ] Follow-up: verify remediation and close the incident

Next Steps#

Deploy and Rollback — rollback procedures for policy issues
Control Center — real-time monitoring to detect incidents early
Simulate Impact — prevent incidents with pre-deploy simulation