# Edge-Verifier Operator Guide (DOC-P8)

Deployment, monitoring, key rotation, and failure-mode handling for the Intended edge verifier in production cells.
Audience: plant operations / IT teams responsible for deploying and maintaining the Intended edge verifier in production cells.
Status: the Rust edge verifier binary (TOK-P1) is not yet shipped. This guide documents the operating model that the binary will deliver so customers can plan deployment, monitoring, and change-control procedures ahead of GA.
## What the edge verifier is
A single signed binary that:
- Verifies Authority Tokens locally with sub-50ms latency.
- Maintains a JWKS cache and a token revocation bloom filter.
- Buffers audit events to local SQLite and uploads in compressed batches (AUD-P1, AUD-P2).
- Operates for up to 24 hours fully offline (TOK-P3).
- Detects local tamper attempts (file modification, time rollback) and fails closed (AUD-P6).
Distribution: Linux x86_64, ARM64, ARMv7 (TOK-P2). Container image and static binary both published. Reproducible build with attestation.
## Deployment topology
Run one verifier per robot or one per cell. Choosing between the options:
| Option | When to use |
|---|---|
| Per-robot (verifier on each robot's controller PC) | Highest isolation; one robot's verifier outage doesn't affect others. Most edge-CPU. Recommended. |
| Per-cell (one verifier serves several robots in a cell over local network) | Resource-constrained cells. Single point of failure within the cell — make the safe-default conservative. |
| Per-line | Discouraged. Network blast radius too wide. Use only for non-safety-critical telemetry agents. |
The verifier MUST be reachable from the controller over a sub-1ms local network (loopback or dedicated VLAN). Do NOT run the verifier across the WAN from the robot it serves.
## Configuration
The verifier reads a single config file (/etc/intended/verifier.toml):
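Since the binary has not shipped, the exact schema is not final. A minimal sketch of the shape such a file might take follows; only policy_allowlist_path, bound_actor_identity, refresh_interval_secs, and max_offline_secs appear elsewhere in this guide, and every other key name and all values are illustrative assumptions.

```toml
# Illustrative sketch only; the TOK-P1 binary has not shipped and the final
# schema may differ. Keys not named elsewhere in this guide are assumptions.
listen_addr  = "127.0.0.1:7400"   # assumed key: controller-facing endpoint
metrics_addr = "0.0.0.0:9400"     # assumed key: Prometheus /metrics (default port 9400)

jwks_url              = "https://cloud.example.invalid/jwks.json"  # assumed key, placeholder URL
refresh_interval_secs = 3600      # JWKS refresh cadence (value illustrative)
max_offline_secs      = 86400     # offline budget (TOK-P3); default 24 h

bound_actor_identity  = "idevid-placeholder"  # robot identity; see "Provisioning identity"
policy_allowlist_path = "/etc/intended/policy_allowlist.json"  # OIL codes this verifier may enforce
```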
policy_allowlist_path is the list of OIL codes this verifier is authorized to enforce — a defense-in-depth mechanism against a mis-issued token. Tokens citing an OIL code outside this list are rejected even if the signature is valid.
## Provisioning identity
Each verifier MUST be bound to one robot's identity. Recommended:
- Provision an IEEE 802.1AR IDevID in the robot's TPM at manufacture.
- The verifier reads the IDevID at startup and uses it as bound_actor_identity.
- Tokens issued by the cloud against this identity are the only ones accepted.
Without IDevID hardware, the verifier accepts a manually-configured identifier — but the safety case can no longer claim hardware-anchored identity binding (DOC-P5 G1.2.2.2).
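A sketch of how the two identity modes might look in verifier.toml; the [identity] table and its key names are hypothetical, not a shipped schema.

```toml
# Hypothetical [identity] table; key names are illustrative.
[identity]
source = "tpm-idevid"   # read the IEEE 802.1AR IDevID from the TPM at startup (recommended)

# Fallback without IDevID hardware; forfeits the hardware-anchored identity
# claim in the safety case (DOC-P5 G1.2.2.2):
# source = "static"
# static_identifier = "cell3-arm-01"
```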
## Operations
### Health checks
The verifier exposes Prometheus metrics on /metrics (default port 9400):
| Metric | Meaning |
|---|---|
| `intended_verifier_verifications_total{result="ok\|reject"}` | Count of verification outcomes, by result |
| `intended_verifier_verification_seconds` | Histogram of verification latency |
| `intended_verifier_jwks_age_seconds` | Age of cached JWKS (alert if > refresh_interval_secs × 3) |
| `intended_verifier_audit_buffer_pending_bytes` | Local audit buffer depth |
| `intended_verifier_audit_upload_failures_total` | Audit upload failures (alert) |
| `intended_verifier_audit_gap_seconds` | Time since last successful audit upload |
| `intended_verifier_revocation_check_total` | Revocation-list check operations |
| `intended_verifier_time_source_offset_seconds` | PTP/NTP offset (alert on drift) |
### Recommended alerts
| Condition | Severity | Action |
|---|---|---|
| `jwks_age_seconds > 7200` | warn | Check verifier→cloud connectivity. |
| `audit_gap_seconds > 1800` | warn | Check the audit upload path. |
| `audit_gap_seconds > 7200` | crit | The verifier will fail closed at this point; investigate immediately. |
| `verification_seconds_p99 > 0.05` | warn | Investigate verifier host load. |
| `time_source_offset_seconds > 0.5` | crit | Time source unreliable; the verifier fails closed. |
### Rolling out a new version
The verifier is a single binary, but it sits on the safety path. Roll out in stages:
1. Canary cell. Deploy to one non-safety-critical cell. Run for ≥7 days with full telemetry.
2. Single safety-critical cell. Deploy to one cell with safety-rated inputs. Run for ≥7 days. Verify no false-rejects in the audit log.
3. Fleet rollout. Stagger across remaining cells in batches of ~10% per day.
Binary updates are not live-reloaded; upgrading requires a process restart. Schedule restarts for planned maintenance windows, or issue a STOP to the controller before restarting.
### Configuration changes
Both policy_allowlist.json and verifier.toml can be reloaded with SIGHUP; the live JWKS cache is preserved across the reload.
Audit-relevant config changes (issuer trust list, allowlist) MUST go through your change-control process. The verifier writes a config-change event to the audit chain on every reload.
### Key rotation
The cloud rotates signing keys quarterly (or on incident). The verifier sees this as new kid values in JWKS responses. As long as JWKS refreshes succeed on schedule (refresh_interval_secs), rotation is invisible to the controller.
If rotation happens during an offline window, the verifier honors in-flight tokens until their expiresAtMs, then rejects tokens signed with the new (still-uncached) key until a JWKS refresh succeeds. This is the documented behavior: fail closed on unverifiable tokens.
For emergency revocation (compromised signing key), the cloud pushes a revocation entry that propagates to the verifier within ≤5 seconds when online (TOK-P7). Offline verifiers cannot honor revocation until they reconnect; this is why the offline budget defaults to 24 hours.
## Failure modes
### Verifier process crashes
The systemd unit (intended-verifier.service) restarts the binary automatically. While the verifier is restarting, the controller MUST treat verification as denied (this is the documented controller behavior; verify it in your DOC-P5 safety case, section G1.2.2.3). Typical restart time: under 1 second.
### Local disk full
Audit buffer can't write → verifier fails closed for new tokens (in-flight tokens continue to be honored until expiry). Alert on audit_buffer_pending_bytes and provision enough disk for at least 24 hours of buffered events; as a sizing illustration, an assumed 50 events/s averaging 1 KiB each comes to about 4.4 GB/day.
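A sketch of how buffer provisioning might surface in configuration, assuming hypothetical key names in the style of the sketch under Configuration.

```toml
# Hypothetical [audit] table; key names are illustrative.
[audit]
buffer_path    = "/var/lib/intended/audit.sqlite3"  # local SQLite buffer (AUD-P1, AUD-P2)
min_free_bytes = 5_000_000_000                      # headroom for >= 24 h of events at your cell's rate
```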
### Time source drift
PTP grandmaster lost or NTP drift beyond threshold → verifier fails closed: expiry checks cannot be trusted without a trustworthy clock. Recovery is automatic when the time source returns within bounds.
### Cloud unreachable
Within max_offline_secs (default 24h): verifier serves tokens normally using cached JWKS. Audit events buffer locally. Revocation cannot propagate.
After max_offline_secs: verifier fails closed for new tokens. This is intentional — past 24 hours, the cloud's word on whether a key is revoked is too stale to be trustworthy.
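The offline budget is therefore a trade-off between availability during outages and revocation freshness. A cell with stricter revocation requirements might shorten it; the key name below is from the illustrative sketch under Configuration.

```toml
# Illustrative: a tighter offline budget for a higher-assurance cell.
max_offline_secs = 14400   # 4 h; fails closed sooner, but bounds revocation staleness to 4 h
```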
### Tamper detection
The verifier's binary is signed; the runtime verifies its own integrity on startup. Audit buffer entries are HMAC'd with a TPM-resident key. On tamper detection, the verifier writes a tamper event to the audit chain, refuses to verify any token, and exits.
Recovery requires a verified reinstall (signed binary + fresh provisioning).
## Backup and DR
The verifier is stateless in any operationally meaningful sense — its only persistent state is the audit buffer, which is uploaded to the cloud. Backups of the verifier itself are unnecessary. To replace a crashed verifier: provision a new host, install, and configure with the same bound_actor_identity and config file. Audit events not yet uploaded from the crashed host are lost; the cloud-side audit chain still records every issuance, so the gap is visible and bounded.
For high-availability deployments, run two verifiers per cell behind a small selector that routes to whichever is healthy. The two verifiers do NOT need to share state.
## Procurement / supply-chain
The shipped binary is reproducibly buildable from public source plus a signed manifest. SBOM published per release. Customers in regulated industries (medical, automotive, aerospace) typically request:
- Signed binary + signed SBOM
- Reproducible-build attestation
- Statement of cryptographic primitives (we use Ring + Rustls; a FIPS-validated build is available for federal customers)
- Penetration test report (annual, third-party)
Available from the Intended security portal once the binary GAs.
## Until the verifier ships
If you need authority gating today, the cloud is the verifier:
- The cloud signs + verifies in one round trip on /v1/physical/authority-tokens.
- Latency: ~80–120ms typical, ~250ms p99.
- Acceptable for non-RT planning loops.
- NOT acceptable for ms-critical control loops — those wait on the edge binary.
When the binary ships, migration is configuration-only: install the binary, point the controller at 127.0.0.1:7400 instead of the cloud URL, swap the verifier library in your firmware. The wire format and JWT claims do not change.
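As an illustration of the controller-side change (the key name and URLs below are placeholders; your controller's configuration format will differ):

```toml
# Hypothetical controller-side setting; the key name and URLs are placeholders.
# Before GA: cloud verification, ~80-120ms typical
# verifier_endpoint = "https://cloud.example.invalid/v1/physical/authority-tokens"

# After GA: local edge verifier, sub-50ms
verifier_endpoint = "http://127.0.0.1:7400"
```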
## See also
- DOC-P4 Rust safety-critical firmware — the verification contract the binary implements
- DOC-P5 safety-case writing — argues the verifier's behavior in formal safety terms
- docs/runbooks/ — operational runbooks for the Intended cloud product