# Edge-Verifier Operator Guide (DOC-P8)

Deployment, monitoring, key rotation, and failure-mode handling for the Intended edge verifier in production cells.
Audience: plant operations / IT teams responsible for deploying and maintaining the Intended edge verifier in production cells.
Status: the Rust edge verifier binary (TOK-P1) is not yet shipped. This guide documents the operating model that the binary will deliver so customers can plan deployment, monitoring, and change-control procedures ahead of GA.
## What the edge verifier is
A single signed binary that:
- Verifies Authority Tokens locally with sub-50ms latency.
- Maintains a JWKS cache and a token revocation bloom filter.
- Buffers audit events to local SQLite and uploads in compressed batches (AUD-P1, AUD-P2).
- Operates for up to 24 hours fully offline (TOK-P3).
- Detects local tamper attempts (file modification, time rollback) and fails closed (AUD-P6).
Distribution: Linux x86_64, ARM64, ARMv7 (TOK-P2). Container image and static binary both published. Reproducible build with attestation.
## Deployment topology
Run one verifier per robot or one per cell. Choosing between the options:
| Option | When to use |
|---|---|
| Per-robot (verifier on each robot's controller PC) | Highest isolation; one robot's verifier outage doesn't affect others. Most edge-CPU. Recommended. |
| Per-cell (one verifier serves several robots in a cell over local network) | Resource-constrained cells. Single point of failure within the cell — make the safe-default conservative. |
| Per-line | Discouraged. Network blast radius too wide. Use only for non-safety-critical telemetry agents. |
The verifier MUST be reachable from the controller over a sub-1ms local network (loopback or dedicated VLAN). Do NOT run the verifier across the WAN from the robot it serves.
## Configuration
The verifier reads a single config file (/etc/intended/verifier.toml):
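Since the binary has not shipped, the exact schema is not final. A minimal sketch of the shape such a file might take follows; only policy_allowlist_path, bound_actor_identity, refresh_interval_secs, and max_offline_secs appear elsewhere in this guide, and every other key name and all values are illustrative assumptions.

```toml
# Illustrative sketch only; the TOK-P1 binary has not shipped and the final
# schema may differ. Keys not named elsewhere in this guide are assumptions.
listen_addr  = "127.0.0.1:7400"   # assumed key: controller-facing endpoint
metrics_addr = "0.0.0.0:9400"     # assumed key: Prometheus /metrics (default port 9400)

jwks_url              = "https://cloud.example.invalid/jwks.json"  # assumed key, placeholder URL
refresh_interval_secs = 3600      # JWKS refresh cadence (value illustrative)
max_offline_secs      = 86400     # offline budget (TOK-P3); default 24 h

bound_actor_identity  = "idevid-placeholder"  # robot identity; see "Provisioning identity"
policy_allowlist_path = "/etc/intended/policy_allowlist.json"  # OIL codes this verifier may enforce
```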
policy_allowlist_path is the list of OIL codes this verifier is authorized to enforce — a defense-in-depth mechanism against a mis-issued token. Tokens citing an OIL code outside this list are rejected even if the signature is valid.
## Provisioning identity
Each verifier MUST be bound to one robot's identity. Recommended:
- Provision an IEEE 802.1AR IDevID in the robot's TPM at manufacture.
- The verifier reads the IDevID at startup and uses it as bound_actor_identity.
- Tokens issued by the cloud against this identity are the only ones accepted.
Without IDevID hardware, the verifier accepts a manually-configured identifier — but the safety case can no longer claim hardware-anchored identity binding (DOC-P5 G1.2.2.2).
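A sketch of how the two identity modes might look in verifier.toml; the [identity] table and its key names are hypothetical, not a shipped schema.

```toml
# Hypothetical [identity] table; key names are illustrative.
[identity]
source = "tpm-idevid"   # read the IEEE 802.1AR IDevID from the TPM at startup (recommended)

# Fallback without IDevID hardware; forfeits the hardware-anchored identity
# claim in the safety case (DOC-P5 G1.2.2.2):
# source = "static"
# static_identifier = "cell3-arm-01"
```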
## Operations
### Health checks
The verifier exposes Prometheus metrics on /metrics (default port 9400):
| Metric | Meaning |
|---|---|
| `intended_verifier_verifications_total{result="ok\|reject"}` | Count of verification outcomes, by result |
| `intended_verifier_verification_seconds` | Histogram of verification latency |
| `intended_verifier_jwks_age_seconds` | Age of cached JWKS (alert if > refresh_interval_secs × 3) |
| `intended_verifier_audit_buffer_pending_bytes` | Local audit buffer depth |
| `intended_verifier_audit_upload_failures_total` | Audit upload failures (alert) |
| `intended_verifier_audit_gap_seconds` | Time since last successful audit upload |
| `intended_verifier_revocation_check_total` | Revocation-list check operations |
| `intended_verifier_time_source_offset_seconds` | PTP/NTP offset (alert on drift) |
### Recommended alerts
| Condition | Severity | Action |
|---|---|---|
| `jwks_age_seconds > 7200` | warn | Check verifier→cloud connectivity. |
| `audit_gap_seconds > 1800` | warn | Check the audit upload path. |
| `audit_gap_seconds > 7200` | crit | The verifier will fail closed at this point; investigate immediately. |
| `verification_seconds_p99 > 0.05` | warn | Investigate verifier host load. |
| `time_source_offset_seconds > 0.5` | crit | Time source unreliable; the verifier fails closed. |
### Rolling out a new version
The verifier is a single binary, but it sits on the safety path. Roll out in stages:
1. Canary cell. Deploy to one non-safety-critical cell. Run for ≥7 days with full telemetry.
2. Single safety-critical cell. Deploy to one cell with safety-rated inputs. Run for ≥7 days. Verify no false-rejects in the audit log.
3. Fleet rollout. Stagger across remaining cells in batches of ~10% per day.
Binary updates are not live-reloaded; upgrading requires a process restart. Schedule restarts for planned maintenance windows, or issue a STOP to the controller before restarting.
### Configuration changes
Both policy_allowlist.json and verifier.toml can be reloaded with SIGHUP; the live JWKS cache is preserved across the reload.
Audit-relevant config changes (issuer trust list, allowlist) MUST go through your change-control process. The verifier writes a config-change event to the audit chain on every reload.
### Key rotation
The cloud rotates signing keys quarterly (or on incident). The verifier sees this as new kid values in JWKS responses. As long as JWKS refreshes succeed on schedule (refresh_interval_secs), rotation is invisible to the controller.
If rotation happens during an offline window, the verifier honors in-flight tokens until their expiresAtMs, then rejects tokens signed with the new (still-uncached) key until a JWKS refresh succeeds. This is the documented behavior: fail closed on unverifiable tokens.
For emergency revocation (compromised signing key), the cloud pushes a revocation entry that propagates to the verifier within ≤5 seconds when online (TOK-P7). Offline verifiers cannot honor revocation until they reconnect; this is why the offline budget defaults to 24 hours.
## Failure modes
### Verifier process crashes
The systemd unit (intended-verifier.service) restarts the binary automatically. While the verifier is restarting, the controller MUST treat verification as denied (this is the documented controller behavior; verify it in your DOC-P5 safety case, section G1.2.2.3). Typical restart time: under 1 second.
### Local disk full
Audit buffer can't write → verifier fails closed for new tokens (in-flight tokens continue to be honored until expiry). Alert on audit_buffer_pending_bytes and provision enough disk for at least 24 hours of buffered events; as a sizing illustration, an assumed 50 events/s averaging 1 KiB each comes to about 4.4 GB/day.
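A sketch of how buffer provisioning might surface in configuration, assuming hypothetical key names in the style of the sketch under Configuration.

```toml
# Hypothetical [audit] table; key names are illustrative.
[audit]
buffer_path    = "/var/lib/intended/audit.sqlite3"  # local SQLite buffer (AUD-P1, AUD-P2)
min_free_bytes = 5_000_000_000                      # headroom for >= 24 h of events at your cell's rate
```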
### Time source drift
PTP grandmaster lost or NTP drift beyond threshold → verifier fails closed: expiry checks cannot be trusted without a trustworthy clock. Recovery is automatic when the time source returns within bounds.
### Cloud unreachable
Within max_offline_secs (default 24h): verifier serves tokens normally using cached JWKS. Audit events buffer locally. Revocation cannot propagate.
After max_offline_secs: verifier fails closed for new tokens. This is intentional — past 24 hours, the cloud's word on whether a key is revoked is too stale to be trustworthy.
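The offline budget is therefore a trade-off between availability during outages and revocation freshness. A cell with stricter revocation requirements might shorten it; the key name below is from the illustrative sketch under Configuration.

```toml
# Illustrative: a tighter offline budget for a higher-assurance cell.
max_offline_secs = 14400   # 4 h; fails closed sooner, but bounds revocation staleness to 4 h
```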
### Tamper detection
The verifier's binary is signed; the runtime verifies its own integrity on startup. Audit buffer entries are HMAC'd with a TPM-resident key. On tamper detection, the verifier writes a tamper event to the audit chain, refuses to verify any token, and exits.
Recovery requires a verified reinstall (signed binary + fresh provisioning).
## Backup and DR
The verifier is stateless in any operationally meaningful sense — its only persistent state is the audit buffer, which is uploaded to the cloud. Backups of the verifier itself are unnecessary. To replace a crashed verifier: provision a new host, install, and configure with the same bound_actor_identity and config file. Audit events not yet uploaded from the crashed host are lost; the cloud-side audit chain still records every issuance, so the gap is visible and bounded.
For high-availability deployments, run two verifiers per cell behind a small selector that routes to whichever is healthy. The two verifiers do NOT need to share state.
## Procurement / supply-chain
The shipped binary is reproducibly buildable from public source plus a signed manifest. SBOM published per release. Customers in regulated industries (medical, automotive, aerospace) typically request:
- Signed binary + signed SBOM
- Reproducible-build attestation
- Statement of cryptographic primitives (we use Ring + Rustls; a FIPS-validated build is available for federal customers)
- Penetration test report (annual, third-party)
Available from the Intended security portal once the binary GAs.
## Until the verifier ships
If you need authority gating today, the cloud is the verifier:
- The cloud signs + verifies in one round trip on /v1/physical/authority-tokens.
- Latency: ~80–120ms typical, ~250ms p99.
- Acceptable for non-RT planning loops.
- NOT acceptable for ms-critical control loops — those wait on the edge binary.
When the binary ships, migration is configuration-only: install the binary, point the controller at 127.0.0.1:7400 instead of the cloud URL, swap the verifier library in your firmware. The wire format and JWT claims do not change.
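As an illustration of the controller-side change (the key name and URLs below are placeholders; your controller's configuration format will differ):

```toml
# Hypothetical controller-side setting; the key name and URLs are placeholders.
# Before GA: cloud verification, ~80-120ms typical
# verifier_endpoint = "https://cloud.example.invalid/v1/physical/authority-tokens"

# After GA: local edge verifier, sub-50ms
verifier_endpoint = "http://127.0.0.1:7400"
```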
## See also
- DOC-P4 Rust safety-critical firmware — the verification contract the binary implements
- DOC-P5 safety-case writing — argues the verifier's behavior in formal safety terms
- docs/runbooks/ — operational runbooks for the Intended cloud product