Incident Response

Investigate authorization incidents using audit trails, decision token inspection, emergency kill switches, and post-mortem procedures.

This runbook covers investigating authorization incidents in the Intended runtime: analyzing audit trails, inspecting decision tokens, activating the emergency kill switch, and conducting post-mortems.

Prerequisites

  • The Intended CLI (`meritt`) installed and authenticated with the operator or platform-admin role
  • Access to the audit log subsystem
  • Familiarity with the Trust Model

Danger

During an active incident, prioritize containment over investigation. If AI execution is unauthorized or dangerous (a Critical incident), activate the kill switch first, then investigate.

Incident Severity Levels

| Severity | Description | Response time | Example |
|---|---|---|---|
| Critical | Unauthorized AI execution in production | Immediate | Policy bypass detected |
| High | Authorization failures blocking critical services | < 15 min | Mass denials after deploy |
| Medium | Unexpected decision patterns | < 1 hour | Drift-induced allow/deny flip |
| Low | Audit anomalies, non-blocking | < 4 hours | Missing audit entries |
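
If you page or escalate from a script, the response-time column can be encoded as a lookup. The sketch below uses two hypothetical helpers, response_minutes and response_overdue (they are not part of the meritt CLI). It represents "Immediate" as 0 minutes, so a Critical incident always counts as overdue, and it expects lowercase severity names.

```bash
# Hypothetical helper (not part of meritt): maximum response time in minutes
# for a severity level from the table above. "Immediate" is represented as 0.
response_minutes() {
  case "$1" in
    critical) echo 0 ;;
    high)     echo 15 ;;
    medium)   echo 60 ;;
    low)      echo 240 ;;
    *)
      echo "unknown severity: $1" >&2
      return 2
      ;;
  esac
}

# Hypothetical helper: succeeds (exit 0) when ELAPSED_MINUTES has reached the
# response window for SEVERITY. Returns 2 for an unknown severity.
# Usage: response_overdue SEVERITY ELAPSED_MINUTES
response_overdue() {
  local limit
  limit=$(response_minutes "$1") || return 2
  [ "$2" -ge "$limit" ]
}
```

For example, `response_overdue high 20 && echo "escalate"` escalates a High incident that has waited 20 minutes.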

Step 1: Audit Trail Analysis

Begin every investigation by querying the audit trail for the affected time window.

`$ meritt audit query`

Query the authorization audit trail.

| Option | Type | Description |
|---|---|---|
| `--start` (required) | string | Start time (ISO 8601 or relative: 1h, 6h, 24h) |
| `--end` | string | End time (default: now) |
| `--identity`, `-i` | string | Filter by caller identity |
| `--action`, `-a` | string | Filter by action (deploy, invoke, write, etc.) |
| `--result`, `-r` | string | Filter by result: allow, deny, error |
| `--environment`, `-e` | string | Filter by environment |
| `--format`, `-f` | string | Output format: table, json, csv |

```bash
$ meritt audit query \
    --start 2h \
    --environment production \
    --result deny \
    --format table

TIMESTAMP              IDENTITY                    ACTION   RESOURCE                  RESULT  POLICY
2026-03-08T10:32:01Z   svc:data-pipeline-runner    write    env:prod/service-b        deny    restrict-production-writes
2026-03-08T10:31:44Z   svc:data-pipeline-runner    write    env:prod/service-b        deny    restrict-production-writes
2026-03-08T10:28:12Z   svc:batch-processor         deploy   env:prod/batch-service    deny    restrict-production-deploys
2026-03-08T10:27:58Z   user:alice                  deploy   env:prod/service-a        deny    restrict-production-deploys
```
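
When you rerun similar queries during an incident, a small wrapper can catch typos in enumerated values before the CLI is called. This is a hypothetical sketch (audit_query and DRY_RUN are illustrative names, not part of meritt). It uses only the documented --start, --result, --environment, and --format flags, and validates --result and --format against the values listed in the reference above. With DRY_RUN=1 it prints the command, one argument per line, instead of running it.

```bash
# Hypothetical wrapper around `meritt audit query` (not part of the CLI).
# Usage: audit_query START [RESULT] [ENVIRONMENT] [FORMAT]
audit_query() {
  local start="${1:-}" result="${2:-}" environment="${3:-}" format="${4:-table}"

  if [ -z "$start" ]; then
    echo "usage: audit_query START [RESULT] [ENVIRONMENT] [FORMAT]" >&2
    return 2
  fi

  # Reject values the reference does not list before calling the CLI.
  case "$result" in
    "" | allow | deny | error) ;;
    *) echo "invalid --result: $result (expected allow, deny, or error)" >&2; return 2 ;;
  esac
  case "$format" in
    table | json | csv) ;;
    *) echo "invalid --format: $format (expected table, json, or csv)" >&2; return 2 ;;
  esac

  # Build the argument list in the positional parameters so that values
  # containing spaces stay intact.
  set -- meritt audit query --start "$start" --format "$format"
  if [ -n "$result" ]; then set -- "$@" --result "$result"; fi
  if [ -n "$environment" ]; then set -- "$@" --environment "$environment"; fi

  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$@"   # show what would run
  else
    "$@"
  fi
}
```

Running `audit_query 2h deny production table` issues the same query as the example above.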

Identify Patterns

Look for clusters of denials from the same identity, policy, or time window:

```bash
$ meritt audit query \
    --start 6h \
    --environment production \
    --result deny \
    --format json \
    --group-by policy

{
  "restrict-production-writes": {
    "count": 147,
    "firstSeen": "2026-03-08T08:15:00Z",
    "lastSeen": "2026-03-08T10:32:01Z",
    "uniqueIdentities": 3
  },
  "restrict-production-deploys": {
    "count": 12,
    "firstSeen": "2026-03-08T10:27:58Z",
    "lastSeen": "2026-03-08T10:28:12Z",
    "uniqueIdentities": 2
  }
}
```

Tip

A sudden spike in denials from a single policy shortly after a deployment usually indicates a policy regression. Compare the policy's firstSeen timestamp with meritt deploy history to confirm.
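
To test that hypothesis from table-format output, you can split each policy's denials at the deployment timestamp. This hypothetical helper (deny_split is not a meritt command) assumes the column layout shown in the audit query example (TIMESTAMP, IDENTITY, ACTION, RESOURCE, RESULT, POLICY) and UTC timestamps in the same ISO 8601 form, which sort correctly as plain strings. You supply the deploy time yourself, for example from meritt deploy history.

```bash
# Hypothetical helper: counts denials per policy before and after DEPLOY_TS.
# Reads `meritt audit query --format table` output on stdin.
# Assumes columns TIMESTAMP IDENTITY ACTION RESOURCE RESULT POLICY and UTC
# timestamps like 2026-03-08T10:32:01Z, where string order equals time order.
# Usage: deny_split DEPLOY_TS < audit-table.txt
deny_split() {
  awk -v deploy="$1" '
    $1 == "TIMESTAMP" { next }   # skip the header row
    $5 != "deny"      { next }   # skip blank lines and non-deny rows
    {
      policy = $6
      seen[policy] = 1
      if ($1 < deploy) { before[policy]++ } else { after[policy]++ }
    }
    END {
      for (p in seen)
        printf "%s before=%d after=%d\n", p, before[p], after[p]
    }' | sort
}
```

A policy whose denials appear only in the after= count is a strong candidate for the regression.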

Step 2: Decision Token Inspection

Every authorization decision produces a cryptographically signed decision token. Inspect it to see the full evaluation chain.

`$ meritt token inspect <decision-token>`

Inspect a decision token to see the full authorization evaluation.

| Option | Type | Description |
|---|---|---|
| `--verify` | boolean | Verify the cryptographic signature (default: true) |
| `--verbose`, `-v` | boolean | Show full evaluation trace including intermediate steps |

```bash
$ meritt token inspect dtk_m8x2p4q7 --verbose

DECISION TOKEN INSPECTION
─────────────────────────────────────────────────
Token ID:       dtk_m8x2p4q7
Signature:      VALID (ES256, signed by runtime-prod-01)
Issued at:      2026-03-08T10:32:01Z

REQUEST:
  Identity:     svc:data-pipeline-runner
  Action:       write
  Resource:     env:prod/service-b
  Trust level:  0.85

EVALUATION TRACE:
  1. restrict-production-writes/rules[0]
     Match: action=write, scope=production  →  MATCHED
     Condition: approval(required=1, from=role:data-lead)
     Status: NOT SATISFIED (no approval on record)
     Result: DENY

  2. restrict-production-writes/fallback
     Result: DENY (fallback not reached, rule[0] matched)

FINAL DECISION: DENY
REASON: Approval condition not satisfied
```
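
For triage notes, the fields you usually need from this output are the signature status, the final decision, and the reason. The hypothetical filter below (token_summary is not part of meritt) extracts them from the text layout shown above and exits non-zero unless the Signature line reads VALID. It only reads what the CLI already verified and depends on that layout staying stable, so treat it as a convenience rather than an independent signature check.

```bash
# Hypothetical filter for `meritt token inspect` text output (layout as shown
# in this runbook). Prints one summary line; exits 1 unless Signature is VALID.
token_summary() {
  awk '
    /^Signature:/      { sig = $2 }        # e.g. VALID
    /^FINAL DECISION:/ { decision = $3 }   # e.g. DENY
    /^REASON:/         { sub(/^REASON: */, ""); reason = $0 }
    END {
      printf "signature=%s decision=%s reason=%s\n", sig, decision, reason
      exit (sig == "VALID" ? 0 : 1)
    }'
}
```

Piping the example above through it (`meritt token inspect dtk_m8x2p4q7 --verbose | token_summary`) prints `signature=VALID decision=DENY reason=Approval condition not satisfied`.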

Trace a Chain of Decisions

For complex incidents involving multiple dependent decisions, trace the full chain:

```bash
$ meritt audit trace \
    --identity svc:data-pipeline-runner \
    --start 1h \
    --environment production

DECISION CHAIN for svc:data-pipeline-runner
─────────────────────────────────────────────────
10:28:01  read   env:prod/config-store     ALLOW  (trust-level: 0.85)
10:28:03  invoke env:prod/model-a          ALLOW  (trust-level: 0.85)
10:32:01  write  env:prod/service-b        DENY   ← first failure
10:32:15  write  env:prod/service-b        DENY   (retry)
10:32:30  write  env:prod/service-b        DENY   (retry)
```
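
When a trace is long, a short filter can pull out the first denial and the total number of denials (retries included) for the incident timeline. This is a hypothetical sketch (trace_first_deny is not a meritt command) that assumes the column layout shown above (time, action, resource, result, optional annotation) and that rows are listed oldest first.

```bash
# Hypothetical filter for `meritt audit trace` output (layout as shown above).
# Assumes rows are oldest first: TIME ACTION RESOURCE RESULT [annotation].
# Exits 1 if the trace contains no DENY rows.
trace_first_deny() {
  awk '
    $4 == "DENY" {
      denies++
      if (denies == 1) first = $1 " " $2 " " $3   # earliest denial
    }
    END {
      if (denies == 0) { print "no denials"; exit 1 }
      printf "first_deny=%s total_denies=%d\n", first, denies
    }'
}
```

For the trace above it prints `first_deny=10:32:01 write env:prod/service-b total_denies=3`.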

Step 3: Emergency Kill Switch

The kill switch immediately suspends all AI execution authorization for a target scope. Use it when you observe unauthorized or dangerous AI behavior.

Danger

The kill switch is a last-resort control. It will deny ALL authorization decisions in the affected scope, including legitimate operations. Use it only when the risk of continued execution outweighs the impact of a full stop.

`$ meritt emergency kill`

Activate the emergency kill switch to suspend all authorization in the target scope.

| Option | Type | Description |
|---|---|---|
| `--scope`, `-s` (required) | string | Kill scope: environment, service, or identity pattern |
| `--reason` (required) | string | Reason for activation (recorded in the audit trail) |
| `--duration`, `-d` | string | Auto-expire duration (default: manual lift required) |
| `--notify` | string | Comma-separated notification channels: slack, pagerduty, email |

Activate the kill switch

```bash
$ meritt emergency kill \
    --scope "env:production" \
    --reason "Suspected unauthorized AI execution via policy bypass" \
    --notify slack,pagerduty

EMERGENCY KILL SWITCH ACTIVATED
─────────────────────────────────────────────────
Scope:        env:production (all services, all identities)
Activated by: alice@example.com
Activated at: 2026-03-08T10:45:00Z
Reason:       Suspected unauthorized AI execution via policy bypass
Duration:     indefinite (manual lift required)

Notifications sent: #incident-response (Slack), PagerDuty
```
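
Because a mistyped --scope can stop far more than intended, some teams wrap activation in a confirmation step. The sketch below is a hypothetical wrapper (guarded_kill and DRY_RUN are illustrative, not meritt features). It passes only the documented --scope, --reason, and --notify flags, refuses to run without a reason, and requires the operator to retype the scope on stdin. With DRY_RUN=1 it prints the command, one argument per line, instead of executing it.

```bash
# Hypothetical confirmation wrapper around `meritt emergency kill`.
# Usage: guarded_kill SCOPE REASON [NOTIFY]
guarded_kill() {
  local scope="${1:-}" reason="${2:-}" notify="${3:-}" typed=""

  # --scope and --reason are both required by the CLI; fail early if missing.
  if [ -z "$scope" ] || [ -z "$reason" ]; then
    echo "usage: guarded_kill SCOPE REASON [NOTIFY]" >&2
    return 2
  fi

  # Make the operator retype the scope to catch copy/paste mistakes.
  printf 'Retype the scope to confirm (%s): ' "$scope" >&2
  if ! IFS= read -r typed || [ "$typed" != "$scope" ]; then
    echo "confirmation did not match; kill switch NOT activated" >&2
    return 1
  fi

  set -- meritt emergency kill --scope "$scope" --reason "$reason"
  if [ -n "$notify" ]; then set -- "$@" --notify "$notify"; fi

  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$@"   # show what would run
  else
    "$@"
  fi
}
```

For example, `guarded_kill env:production "Suspected unauthorized AI execution via policy bypass" slack,pagerduty` prompts for the scope and then issues the activation shown above.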

Verify kill switch is active

```bash
$ meritt emergency status

ACTIVE KILL SWITCHES:
  Scope:       env:production
  Since:       2026-03-08T10:45:00Z (12 minutes ago)
  Activated:   alice@example.com
  Decisions blocked: 234 since activation
```
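
To script this verification, a hypothetical helper such as kill_active (not a meritt command) can confirm that the scope you activated appears in the status output before you move on. It assumes the "Scope:" line layout shown above and exits non-zero when the scope is not listed.

```bash
# Hypothetical check: succeeds if `meritt emergency status` output on stdin
# lists an active kill switch for SCOPE. Assumes "Scope:  <scope>" lines.
# Usage: meritt emergency status | kill_active SCOPE
kill_active() {
  awk -v want="$1" '
    $1 == "Scope:" && $2 == want { found = 1 }
    END { exit (found ? 0 : 1) }'
}
```

For example: `meritt emergency status | kill_active env:production && echo "kill switch confirmed"`.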

Lift the kill switch

Once the incident is contained and the root cause has been addressed, lift the kill switch:

```bash
$ meritt emergency lift \
    --scope "env:production" \
    --reason "Root cause identified and remediated. Policy v3 restored."

Kill switch lifted for env:production
Lifted by:  alice@example.com
Lifted at:  2026-03-08T11:30:00Z
Duration:   45 minutes
```

Step 4: Post-Mortem

After containment and resolution, conduct a structured post-mortem.

Generate an incident report

```bash
$ meritt incident report \
    --incident inc_r4t7w2 \
    --include audit-trail,decision-tokens,deploy-history \
    --format markdown \
    --output reports/inc_r4t7w2-postmortem.md

Incident report generated: reports/inc_r4t7w2-postmortem.md
  Audit entries included: 1,247
  Decision tokens included: 42
  Deploy events included: 3
```

Identify the root cause

Common root causes for authorization incidents:

| Root cause | Indicators | Resolution |
|---|---|---|
| Policy regression | Deny spike after deployment | Roll back, fix the policy, redeploy |
| Configuration drift | Runtime differs from repository | Reconcile and redeploy from source |
| Trust level degradation | Decisions flip without a policy change | Investigate trust scoring inputs |
| Credential compromise | Unauthorized identity in audit trail | Rotate credentials, tighten scope |

Define remediation actions

```bash
$ meritt incident update inc_r4t7w2 \
    --status resolved \
    --root-cause "Policy v4 introduced approval condition not satisfied by svc:data-pipeline-runner" \
    --remediation "Rolled back to v3. Updated v5 with service account exception. Added blast radius check to CI."
```

Publish and review

Share the post-mortem with the team. Intended tracks incident history for compliance and continuous improvement. To review recently resolved incidents:

```bash
$ meritt incident list --status resolved --limit 5

ID           TITLE                              SEVERITY  DURATION  RESOLVED
inc_r4t7w2   Policy v4 rollback — deny rate     medium    45m       2026-03-08
inc_q3k8m1   Staging drift — allow bypass       high      2h 15m    2026-03-01
inc_p2j7n9   Kill switch — batch processor      crit      12m       2026-02-22
```
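
For the continuous-improvement side, one common metric is mean time to resolve. The sketch below is a hypothetical helper (incident_minutes is not a built-in meritt report) that parses the table layout shown above, reading the DURATION tokens such as 45m or 2h 15m that sit just before the RESOLVED date. It assumes titles never end in a token that looks like a duration.

```bash
# Hypothetical helper: reads `meritt incident list` table output on stdin and
# prints each incident's duration in minutes, then a mean-time-to-resolve line.
# Assumes DURATION is one or more tokens like "45m" or "2h 15m" immediately
# before the RESOLVED date in the last column.
incident_minutes() {
  awk '
    $1 == "ID" || NF == 0 { next }   # skip header and blank lines
    {
      mins = 0
      # Walk left from the column before RESOLVED while tokens look like 2h or 15m.
      for (i = NF - 1; i > 1 && $i ~ /^[0-9]+[hm]$/; i--) {
        n = substr($i, 1, length($i) - 1) + 0
        mins += (substr($i, length($i)) == "h") ? n * 60 : n
      }
      printf "%s %d\n", $1, mins
      total += mins
      count++
    }
    END {
      if (count > 0)
        printf "count=%d total_minutes=%d mean_minutes=%.1f\n", count, total, total / count
    }'
}
```

Applied to the listing above, it reports 45, 135, and 12 minutes, for a mean of 64.0 minutes across the three incidents.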

Incident Response Checklist

Use this checklist during any authorization incident:

  • [ ] Assess severity and notify the on-call team
  • [ ] If critical: activate the kill switch immediately
  • [ ] Query audit trail for the affected time window
  • [ ] Inspect decision tokens for anomalous evaluations
  • [ ] Cross-reference with recent deployments
  • [ ] Contain: roll back the change or activate the kill switch
  • [ ] Investigate: trace the root cause
  • [ ] Remediate: fix policy, rotate credentials, or resolve drift
  • [ ] Post-mortem: document findings and actions
  • [ ] Follow-up: verify remediation and close the incident

Next Steps