The flight recorder for AI agents.

A scannable specification of tamper-evident records of agent execution — for CCOs, Heads of AI Risk, Internal Audit, and the platform engineers who’ll implement the capture.

Version 1.0
Skim ~3 min
Read ~14 min
Words ~4,200

If you only read this page.

  1. Your agents are already in production. Within twenty-four months, someone — an auditor, a regulator, opposing counsel — will ask you to reconstruct a specific agent decision end-to-end and prove the log hasn’t been touched.

  2. The tools you bought to ship them were not built for that conversation. Observability is mutable and 14-day. GRC tools are checklists. AI governance is policy-layer. None of them indexes the agent run as evidence.

  3. It’s a data-model gap, not a feature gap. Spans, controls, and policy registries are the wrong primitives. The right primitive is the agent execution graph — prompts, tools, retrievals, approvals, refusals, side effects — hashed, signed, and externally anchored.

  4. Boring primitives, conservatively combined. SHA-256 hash chain · daily Merkle root signed in an HSM · weekly anchor to Sigstore Rekor · storage in S3 Object Lock compliance mode. The auditor verifies offline, with a Go binary we publish.

  5. We lead with the deadlines that are dated and live. DORA since Jan 2025. April 2026 interagency MRM principles. GDPR Article 22. SOX, HIPAA. The EU AI Act high-risk window opens 2 Dec 2027 — now fixed, not standards-conditional. If none of those is within twelve months of you, we’re the wrong product. We’ll say so on the call.

Honest brackets.

Runfile is design-partner-stage. The architecture in §04–§08 is specified, partially implemented, and being validated with a small named set of design partners. Where a component ships, we say so. Where it’s v1.5, we say so.

Three audiences should read this in three different orders. Pick yours.

Chief Compliance Officer
Read sections 01, 02, 07, 09.
About 6 minutes.
Head of Internal Audit
Read sections 02, 07, 08, 09 — section 08 is the load-bearing one.
About 7 minutes.
Platform engineer
Read sections 03 through 08, in order.
About 12 minutes.

Executive summary.

In 2025 and 2026, regulated firms moved AI agents into production. The questions that follow have not changed in twenty years of financial-services audit: who acted, on whose behalf, when, on what basis — and can you prove the record has not been altered.

The tools the engineering teams bought were built for a different question.

01 / Dev observability

Built to debug, not to defend.

Datadog, Langfuse, LangSmith, Arize. 14–15 day retention by default. Mutable spans. No control mapping. Optimised for finding why a prompt regressed last Tuesday.

02 / GRC platforms

Built for posture, not for runs.

Vanta, Drata, OneTrust. Index (control_id, evidence_artifact, owner). Excellent for SOC 2 evidence rooms. Cannot represent an agent run as a first-class object.

03 / AI governance

Built for policy, not for proof.

Credo AI, Holistic AI, ValidMind. Index the AI system as a registered object. Good for the model-risk function. Not the runtime record of what the agent did on Tuesday at 11:42.

04 / Runfile

Built for the auditor’s question.

Indexes the agent execution graph. Hash-chained, signed, externally anchored, control-mapped, retained for the obligation in scope. Verifiable offline, with a binary we publish.

Bottom line

The gap isn’t capability. It’s the data model. Runfile indexes the right object.

The auditor’s question.

A useful way to test any compliance tool: write down the question the auditor will ask, then check whether the tool can answer it.

Take a credit-decisioning agent at a UK retail bank. It pulls a bureau file, computes debt-to-income, runs a policy threshold, and either approves, declines, or escalates. Three months after go-live, a declined customer files with the Ombudsman. Internal Audit is asked to reconstruct.

Show me every action this agent took on behalf of customer 8041 between 11:42 and 11:43 GMT on 14 March 2026. Include the prompt, the model version, the retrieved bureau response, the policy that fired, the human approval if any, and the final decision. Prove the log has not been modified since the action was taken. Map each event to GDPR Article 22, SS1/23, and the Consumer Duty fair-value test.

— Internal Audit reconstruction request, paraphrased

Every element is concrete. No judgement calls. The answers either exist or they do not. The auditor expects a citation, not a dashboard.

The vocabulary changes by regulator. The shape does not. A SOX §404 question on a claims-routing agent. A DORA Article 17 incident reconstruction at an EU asset manager. An FDA 21 CFR Part 11 protocol-amendment audit at a pharma. All ask the same thing of the same object: the agent execution graph, signed, complete, control-mapped, retained, and verifiable.

Bottom line

The auditor’s job is to ask the question. Runfile’s job is to make the answer producible in a form they’ll accept.

Why existing tools do not answer it.

Three adjacent categories each solve a real problem. None of them solves the one in §02. The reason is the shape of their data model, not their capability.

Dev observability · the OTel trace

Datadog, Langfuse, LangSmith, Braintrust, Arize. Data model: (trace_id, span_id, input, output, latency, model, tokens). Three blockers as evidence:

  • Retention. 14–15 day default. SOX wants 7 years. HIPAA wants 6. The EU AI Act sets a 6-month floor, and sector law routinely requires longer.
  • Mutability. Spans are editable by design — debugging tools should let you fix a label. Mutability and integrity are opposed primitives.
  • No control mapping. The OTel schema has no notion of regulatory control. Tags are ad hoc, editable, and don’t travel with the evidence.

We’re not trying to be a better Datadog. We feed off their OTel emissions where they’re already running.

GRC platforms · the control-evidence pair

Vanta hit $300M ARR in April 2026. Drata at $100M. OneTrust at half a billion. Excellent for SOC 2 evidence rooms, where the unit is a state-of-the-world at a point in time. An agent run isn’t a point in time. It’s a sequence. The GRC model cannot represent a sequence as a first-class object.

AI governance · the system posture

Credo AI, Holistic AI, ValidMind. Index the AI system itself — its policies, its risk assessment, its bias evaluation, its registry mapping. The right product for the model-risk function. Not the runtime record.

The three-by-four matrix

Layer | Object indexed | Vendors | Audit-grade for runs?
Engineering debug | OpenTelemetry trace | Datadog · Langfuse · LangSmith · Arize | No
GRC | (control, evidence_artifact) | Vanta · Drata · OneTrust | No
AI governance | AI system posture | Credo AI · Holistic AI · ValidMind | No
Agent assurance | Agent execution graph | Runfile | Yes (built for it)

Bottom line

Three categories. Three indexes. None of them is the auditor’s object.

The agent execution graph.

Every agent invocation is recorded as one bounded object: the run. A run has a beginning, an end, an outcome, and a directed graph of events. The graph is the unit of evidence.

Eight event classes

Within each run, Runfile records the following event classes. Each event is structured, typed, and validated against a schema before it is hashed into the chain.

01 Identity & provenance
The agent identity, expressed as a did:web decentralized identifier; the agent version, the prompt-template version hash, the model and its version and provider, the retrieval index version where relevant, the deployment environment, the tenant, and the principal on whose behalf the agent acts. The principal is typically the human user or the customer whose decision is at stake.
02 LLM calls
Each model invocation records the system prompt hash, the user prompt content (subject to the tokenisation rules in §06), the full response, token usage in and out, the model parameters (temperature, top-p, and seed where supplied), and the wall-clock latency. The seed is critical for reproducibility under SR 11-7-successor model-risk frameworks; we capture it where the model provider exposes it.
03 Tool calls
Each tool invocation records the tool name, the full argument payload, the full response payload, the success or error state, the number of retries, and the elapsed time. Tools include retrievals, external API calls, database writes, and any function exposed to the agent — including those exposed through the Model Context Protocol.
04 Retrieval events
When the agent queries a vector index, document store, or knowledge base, Runfile records the query, the index identifier and version, the retrieved chunks with their document identifiers and versions, and the relevance scores. This is the predicate that lets the auditor reconstruct which document the model was looking at when it produced an output.
05 Approvals & human-in-the-loop
Where the agent’s workflow requires human approval — for high-value transactions, escalations, or anything covered by GDPR Article 22’s human-oversight carve-out — Runfile records the identity of the approver, the timestamp, the decision, and the signed justification. This is the evidence that lifts a decision out of Article 22’s “solely automated” scope.
06 Refusals & guardrail activations
When a policy fires, a guardrail blocks an output, or the agent declines an action, Runfile records which policy fired, what was blocked, the reason, and the policy version. Refusals are first-class events in our model because demonstrating effective oversight — that the controls fired when they should have — is half of what the auditor wants to prove.
07 External side effects
Where the agent writes to an external system — a CRM update, an email send, a payment instruction, a file modification — Runfile records the target system, the operation, the idempotency key, and the response. Side effects are the events that have legal weight. They get extra care.
08 Outcome
Each run ends with an outcome: success, partial completion, escalation, refusal, or error. The user-visible artefact — the credit decision, the email draft, the routing assignment — is hashed, and the hash is recorded with the outcome.
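
As a rough illustration of the eight classes above, the following Python sketch models an event as a typed, immutable record. The field names (`run_id`, `event_id`, `parent_id`, `payload`) and the enum values are assumptions made for this example; they are not the published Runfile schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class EventClass(Enum):
    # One member per event class listed above; the string values are illustrative.
    IDENTITY = "identity_provenance"
    LLM_CALL = "llm_call"
    TOOL_CALL = "tool_call"
    RETRIEVAL = "retrieval"
    APPROVAL = "approval"
    REFUSAL = "refusal"
    SIDE_EFFECT = "side_effect"
    OUTCOME = "outcome"


@dataclass(frozen=True)
class Event:
    """Hypothetical event record: typed, validated, and immutable once built."""
    run_id: str
    event_id: str
    event_class: EventClass
    parent_id: Optional[str] = None  # parent-child edge; None for the run's root
    payload: dict = field(default_factory=dict)

    def __post_init__(self):
        # Minimal validation before the event would be hashed into a chain.
        if not self.run_id or not self.event_id:
            raise ValueError("run_id and event_id are required")
        if not isinstance(self.event_class, EventClass):
            raise ValueError("event_class must be an EventClass member")
        if self.parent_id == self.event_id:
            raise ValueError("an event cannot be its own parent")


root = Event("run-1", "e1", EventClass.IDENTITY)
call = Event("run-1", "e2", EventClass.TOOL_CALL, parent_id="e1",
             payload={"tool": "bureau_lookup", "retries": 0})
```

Rejecting malformed events at construction time keeps invalid records from ever reaching the hashing step.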

Graph, not trace

A tool call may be triggered by an LLM call, which may be triggered by an orchestrator step. Parent-child edges are first-class. The auditor’s question is a graph traversal. A trace is a list; a run is a graph.
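
To make the traversal idea concrete, here is a minimal Python sketch, assuming each event carries an identifier and a parent identifier. The tuple layout and the sample events are hypothetical. Walking parent edges upward answers "what caused this side effect"; walking child edges downward answers "what did this LLM call lead to".

```python
from collections import defaultdict

# Hypothetical, simplified events: (event_id, parent_id, kind).
EVENTS = [
    ("e1", None, "orchestrator_step"),
    ("e2", "e1", "llm_call"),
    ("e3", "e2", "tool_call"),
    ("e4", "e2", "retrieval"),
    ("e5", "e3", "side_effect"),
]


def ancestors(events, event_id):
    """Follow parent edges from an event back to the root of the run."""
    parent_of = {eid: pid for eid, pid, _kind in events}
    path = []
    current = parent_of[event_id]
    while current is not None:
        path.append(current)
        current = parent_of[current]
    return path


def descendants(events, event_id):
    """Collect every event reachable from event_id through child edges."""
    children = defaultdict(list)
    for eid, pid, _kind in events:
        if pid is not None:
            children[pid].append(eid)
    stack, found = list(children[event_id]), []
    while stack:
        node = stack.pop()
        found.append(node)
        stack.extend(children[node])
    return found


print(ancestors(EVENTS, "e5"))            # causal chain behind the side effect
print(sorted(descendants(EVENTS, "e2")))  # everything the LLM call triggered
```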

Schema, versioning, wire format

Schema authored in TypeScript + Zod, single source of truth. Generates JSON Schema (Python/Pydantic), Go structs (event processor, verifier CLI), TS types. Wire format is canonical JSON per RFC 8785 — the precondition for hashing. Auditor-facing exports emit JSON-LD with a stable context. Schema migrations are additive only for the lifetime of the retention obligation.
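
Why canonicalisation is the precondition for hashing can be shown in a few lines of Python. This sketch only approximates RFC 8785: sorted keys and compact separators are enough for payloads made of ASCII-keyed objects, strings, and integers, but a full implementation also has to follow the RFC's rules for number serialisation and key ordering. The function names and the sample event are assumptions for illustration.

```python
import hashlib
import json


def canonical_bytes(event: dict) -> bytes:
    # Approximation of RFC 8785 for simple payloads only (strings, integers,
    # ASCII keys). Not a complete JCS implementation.
    text = json.dumps(event, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)
    return text.encode("utf-8")


def event_digest(event: dict) -> str:
    return hashlib.sha256(canonical_bytes(event)).hexdigest()


# The same logical event, built with different key order.
a = {"kind": "tool_call", "tool": "bureau_lookup", "retries": 0}
b = {"retries": 0, "tool": "bureau_lookup", "kind": "tool_call"}

print(canonical_bytes(a))
print(event_digest(a) == event_digest(b))
```

Without a canonical form, two serialisers could emit different bytes for the same event, and the recomputed hash would not match even though nothing was tampered with.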

Bottom line

Spans are lists. Runs are graphs. You can’t traverse a list to answer the auditor’s question.

Cryptographic chain of custody.

Trust here means cryptographic, not contractual. The auditor should not need to trust Runfile’s policies or our staff. They should be able to verify the data alone, offline, with a binary they downloaded from our public GitHub releases.

The architecture, in one diagram

Per event · At ingest, in milliseconds.
  Event n−1: hash …9c0fd6e4
  Event n: SHA-256 chain entry
  Event n+1: hash 73e5b1a2…
Daily, per tenant · Signed in the HSM, written to storage.
  Merkle root: 8a3f1c9b…
  Signed: KMS HSM · per-tenant key
  Stored: S3 Object Lock · compliance mode
Weekly · Anchored to a public log Runfile cannot rewrite.
  Meta-root: aggregated · signed
  Anchored: Sigstore Rekor · public log
  EU optional: eIDAS QTSP timestamp

Three independent integrity properties

Per-event. Tamper with one payload → its hash changes → every later chain entry diverges → the day’s Merkle root mismatches the signed manifest. Detected from any later event onward.

Daily. The Merkle root is signed by a per-tenant KMS key (FIPS 140-2 Level 3 HSM). Re-checkable offline with the public key we publish.

External. The weekly meta-root is anchored to Sigstore Rekor. Even Runfile, with full database access, structurally cannot alter a past root without producing a meta-root that contradicts the public log. The auditor verifies us without us.
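
The per-event property can be sketched in Python. The exact chain construction is not spelled out here, so this example assumes each entry is SHA-256 over the previous entry's digest followed by the event's serialised bytes, starting from an all-zero genesis value. Those choices, and the sample events, are assumptions for illustration.

```python
import hashlib
import json

GENESIS = "00" * 32  # assumed starting value for the chain


def chain_entries(events):
    """Return one hex digest per event; each digest commits to all prior events."""
    prev = GENESIS
    entries = []
    for event in events:
        payload = json.dumps(event, sort_keys=True,
                             separators=(",", ":")).encode("utf-8")
        prev = hashlib.sha256(bytes.fromhex(prev) + payload).hexdigest()
        entries.append(prev)
    return entries


def first_divergence(recorded, recomputed):
    """Index of the first chain entry that no longer matches, or None."""
    for i, (a, b) in enumerate(zip(recorded, recomputed)):
        if a != b:
            return i
    return None


events = [{"seq": i, "kind": "tool_call", "arg": f"v{i}"} for i in range(4)]
recorded = chain_entries(events)

# Tamper with the payload of event 1 after the fact.
tampered = [dict(e) for e in events]
tampered[1]["arg"] = "edited"
recomputed = chain_entries(tampered)

print(first_divergence(recorded, recomputed))
```

Because every entry folds in its predecessor, the edit at index 1 changes entries 1, 2, and 3, so the mismatch shows up no matter which later entry is checked.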

Storage: S3 Object Lock, compliance mode

Compliance mode is the stricter Object Lock variant. Not even the AWS root account can delete or modify an object within the retention window. Runfile is removed from the threat model. An insider with full root credentials cannot alter a single payload. Verifiable independently via the customer’s CloudTrail.
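
For readers who want to see what a compliance-mode write could look like, the sketch below builds the keyword arguments for an S3 `put_object` call. `ObjectLockMode` and `ObjectLockRetainUntilDate` are real S3 request parameters; the bucket name, object key, and helper functions are hypothetical, and the sketch stops short of calling AWS so that it runs without credentials.

```python
from datetime import datetime, timezone


def retain_until(start: datetime, years: int) -> datetime:
    """Add whole years to a timestamp, handling 29 February."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        # 29 Feb mapped into a non-leap year: fall back to 28 Feb.
        return start.replace(year=start.year + years, day=28)


def object_lock_put_kwargs(bucket, key, body, retention_years, now=None):
    """Arguments that could be passed to an S3 client's put_object call."""
    now = now or datetime.now(timezone.utc)
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ObjectLockMode": "COMPLIANCE",
        "ObjectLockRetainUntilDate": retain_until(now, retention_years),
    }


kwargs = object_lock_put_kwargs(
    bucket="example-evidence-bucket",        # hypothetical bucket name
    key="tenant-a/2026-03-14/manifest.json",  # hypothetical object key
    body=b"{}",
    retention_years=7,                         # e.g. a SOX-style obligation
    now=datetime(2026, 3, 14, tzinfo=timezone.utc),
)
print(kwargs["ObjectLockRetainUntilDate"].isoformat())
```

The bucket itself must have Object Lock enabled for such a request to be accepted.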

A line we will not cross

Customers ask: can we hold the signing key and sign our own logs? No. If the agent runtime can sign its own logs, the logs aren’t credible evidence against the customer — same problem as a defendant signing their own affidavit. The separation between the runtime and the signing infrastructure is the chain-of-custody claim. We won’t break it on request.

Bottom line

Boring primitives, conservatively combined. An evidence package signed today remains verifiable if Runfile ceased to exist tomorrow.

SDK & trust boundary.

The capture SDK is the only Runfile component that runs inside the customer’s environment. Open source, Apache 2.0.

What the SDK does

Python · v1

Full coverage

LangGraph, OpenAI Agents SDK, Anthropic Claude SDK, Model Context Protocol (MCP).

TypeScript · v1

Partial coverage

LangGraph.js, Claude SDK TS. Mastra and Vercel AI SDK land in v1.5.

High-level API

Decorator wrap

One @capture decorates an agent function. Every framework-native event captured automatically.

Passive mode

OTel subscriber

If you already emit OTel GenAI semantic conventions, we ride alongside — no agent code changes.
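
The decorator pattern described above can be sketched generically in Python. This is not the Runfile SDK: the `capture` implementation, the in-memory `CAPTURED` sink, and the record fields are stand-ins invented for this example to show how a wrapper can record a run's start, outcome, and elapsed time without changes to the agent body.

```python
import functools
import time

CAPTURED = []  # stand-in sink; a real SDK would ship events to an ingest endpoint


def capture(func):
    """Hypothetical decorator: records one run record per agent invocation."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        record = {"agent": func.__name__, "started_at": time.time(),
                  "outcome": None}
        try:
            result = func(*args, **kwargs)
            record["outcome"] = "success"
            return result
        except Exception as exc:
            record["outcome"] = "error"
            record["error_type"] = type(exc).__name__
            raise  # never swallow the agent's own failure
        finally:
            record["elapsed_s"] = time.time() - record["started_at"]
            CAPTURED.append(record)
    return wrapper


@capture
def review_application(customer_token: str) -> str:
    # Placeholder agent logic for the example.
    return "escalate"


decision = review_application("tok_customer_8041")
print(decision, CAPTURED[-1]["outcome"])
```

Re-raising inside the wrapper matters: the capture layer observes failures but must not change the agent's behaviour.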

PII never leaves the customer’s environment

Deterministic tokenisation at the SDK boundary. The mapping lives in the Runfile Token Vault — separate service, separate KMS key, separate IAM boundary, separate audit log from the event store. Three properties follow:

  • Our event store never holds cleartext PII. Ingest sees only tokens. Most auditor questions are at the agent-action layer (did the agent call this tool with these arguments) and answer fine from the tokenised store.
  • Reidentification is per-resolution, audit-logged. Rate-limited, scoped, justification required. The auditor sees cleartext when they need it; the platform engineer querying for latency does not.
  • Enterprise customers hold the reidentification key. Even Runfile with full Vault access cannot resolve a token to cleartext without the customer’s cross-account signature.
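
One common way to get deterministic tokens is a keyed HMAC, sketched below in Python. The text above does not say which construction the SDK uses, so HMAC-SHA256, the token format, the field names, and the dictionary standing in for the Token Vault are all assumptions. The point is the property: the same input always yields the same token, and the event that leaves the boundary carries no cleartext.

```python
import hashlib
import hmac


def tokenise(value: str, key: bytes, field: str) -> str:
    """Deterministic token: same key, field, and value give the same token."""
    mac = hmac.new(key, f"{field}:{value}".encode("utf-8"), hashlib.sha256)
    return f"tok_{field}_{mac.hexdigest()[:16]}"


def tokenise_event(event: dict, pii_fields, key: bytes, vault: dict) -> dict:
    """Replace PII fields with tokens; record token -> cleartext in the vault."""
    out = dict(event)
    for field in pii_fields:
        if field in out:
            token = tokenise(str(out[field]), key, field)
            vault[token] = out[field]  # stand-in for a separate vault service
            out[field] = token
    return out


key = b"example-key-held-in-customer-environment"  # hypothetical key material
vault = {}
event = {"tool": "bureau_lookup", "customer_name": "Jane Example", "retries": 0}

safe = tokenise_event(event, ["customer_name"], key, vault)
print(safe["customer_name"].startswith("tok_customer_name_"))
print(vault[safe["customer_name"]])
```

Determinism is what lets an auditor ask "show every run that touched this customer" against the tokenised store without resolving anything to cleartext first.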

The Vault is its own service deliberately. Combining auth, PII reidentification, and signing behind one IAM boundary would concentrate three different secret classes with one blast radius. Different access patterns, different audit requirements, different threat models.

Bottom line

The SDK is open. The PII stays put. The signing key is on our side of the wall by design.

Regulatory mapping.

Some frameworks are dateable. Some are principles-based. We lead the sales motion with the dateable ones and offer the principles-based ones as supported, not promised.

Dateable obligations — we lead with these

Framework | Date | What it demands | Retention
DORA | Live · 17 Jan 2025 | 48-hour ICT incident report; ICT third-party register; audit trail. BaFin (mid-2025) brought AI explicitly into scope; UK CTP regime parallel from 1 Jan 2025. | ~5 years
Fed / FDIC / OCC MRM | Live · 17 Apr 2026 | Principles-based, risk-tiered, technology-neutral interagency guidance. SR 11-7’s three pillars (independent validation, ongoing monitoring, documentation) remain the operating template. | n/a
GDPR Art. 22 & 30 | Live · May 2018 | Human oversight on solely-automated decisions with legal or significant effect; records of processing under Article 30. | Sector law
SOX §404 | Live · 2002 | When an agent triggers or affects a financial control (revenue recognition, journal entries, period close), the agent action becomes SOX-relevant. | 7 years
HIPAA OCR audit logs | Live · 1996 | When an agent is the actor touching PHI, the covered entity needs an identifiable agent principal, the action, the PHI fields touched, and any human approval. | 6 years
EU AI Act Art. 12 & 26(6) | Standalone · 2 Dec 2027; Embedded · 2 Aug 2028 | Automatic recording of events over the lifetime of the system; deployers retain logs. Fixed dates per the May 2026 Digital Omnibus; no longer standards-conditional. | ≥ 6 mo floor
TRAIGA (Texas) | Effective · 1 Jan 2026 | Intentional-misuse focus rather than broad high-risk categorisation. Audit-ready evidence on AG investigation. | n/a

In flux

Colorado AI Act has moved three times in nine months — effective 1 Feb 2026, delayed to 30 Jun 2026 (SB 25B-004), then stayed 27 Apr 2026 by federal magistrate order in the xAI matter. SB 189 would push to 1 Jan 2027 if signed. We support whatever the final Act requires. We don’t market a Colorado date.

Penalty stack — EU AI Act, Article 99

Tier | Trigger | Max fine
1 | Article 5 prohibited practices | €35M / 7% of global turnover
2 | Article 16 breaches (incl. Art. 12 logging) | €15M / 3% of global turnover
3 | Incorrect information to authorities | €7.5M / 1% of global turnover

Principles-based — supported, not led

  • NIST AI RMF 1.0 + GenAI Profile (AI 600-1). De facto US vocabulary. CSA Agentic Profile reframes for agents. Our event-graph feeds the Measure / Manage outputs.
  • ISO/IEC 42001:2023. AI management system. Procurement pull is real. We plan to hold the certification ourselves by H2 2027.
  • UK FCA / PRA. Technology-neutral. We support SS1/23 model-documentation pack generation.
  • Singapore MAS FEAT / APRA / AIDA. Principles-based, lower GTM priority. Supported on request.
  • FINRA Notice 24-09. Existing rules applied to AI-assisted communications. We capture the comms side.

How the mappings are held

YAML repository, version-controlled, signed at release. Public to customers. The mapping version applied to a run is recorded in that run’s manifest, so evidence produced today can be re-verified two years from now against the mapping in force at capture.
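
A small Python sketch shows why pinning the mapping version in the run manifest matters. The version string, the control identifiers, and the event-class-to-control assignments below are all invented for illustration and are not the contents of the real mapping repository; a dictionary stands in for the versioned YAML files.

```python
# Hypothetical stand-in for the versioned mapping repository.
MAPPINGS = {
    "2026.05.0": {
        "approval": ["GDPR-Art22-human-oversight"],
        "refusal": ["EUAIA-Art12-logging"],
        "side_effect": ["SOX-404-financial-control"],
    },
}


def evaluate(manifest: dict, events: list, controls_in_scope: list) -> dict:
    """Apply the mapping version pinned in the manifest, not the latest one."""
    mapping = MAPPINGS[manifest["mapping_version"]]
    evidenced = set()
    for event in events:
        evidenced.update(mapping.get(event["class"], []))
    in_scope = set(controls_in_scope)
    return {
        "satisfied": sorted(evidenced & in_scope),
        "gaps": sorted(in_scope - evidenced),
    }


manifest = {"run_id": "run-1", "mapping_version": "2026.05.0"}
events = [{"class": "llm_call"}, {"class": "approval"}, {"class": "side_effect"}]
scope = ["GDPR-Art22-human-oversight", "EUAIA-Art12-logging"]

print(evaluate(manifest, events, scope))
```

Looking the mapping up by the recorded version means a later revision of the mapping cannot silently change what an old evidence package is claimed to show.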

Bottom line

DORA is live. April 2026 MRM is live. The AI Act high-risk window opens 2 Dec 2027 — fixed, not standards-conditional. The window to be ready is now.

Evidence package & how it’s verified.

A third party — auditor, regulator, opposing counsel — can verify a Runfile evidence package without trusting Runfile. The package is designed for that property from the inside out.

What’s in the package

For the workpaper

Signed PDF cover

Scope, event count, integrity status, controls in scope, Merkle root, Rekor entry #, QR to permanent identifier.

For re-verification

JSON-LD event graph

Full canonical-JSON event graph, one run per file, parent-child edges preserved.

For integrity

Signed manifests + Rekor proofs

Every relevant daily Merkle manifest. The inclusion proof against the public Sigstore Rekor log. The public key.

For traceability

Mapping repo + README

Versioned control-mapping repository as applied. README in plain English on how to verify.

The verifier CLI

Go binary, single statically-linked executable, public GitHub release. Zero dependency on Runfile’s running infrastructure.

$ runfile verify ./acme_credit_review_2026_q1.zip

Six checks:

  1. Parse the package, validate against bundled schema.
  2. Recompute every event’s SHA-256 chain entry from canonical-JSON; compare.
  3. Recompute the daily Merkle root from chain heads; compare to signed manifest.
  4. Verify each daily manifest signature against the bundled public key.
  5. Per week, verify Sigstore Rekor inclusion proof — either online or from the bundle (offline mode).
  6. Apply the bundled control mapping — events in scope, controls satisfied, gaps.

Output: a signed PDF + a JSON detail file. Attach both to the workpaper.

Offline matters

Air-gapped audit environments can’t depend on the internet. Runfile’s verifier doesn’t require it — we ship the Rekor inclusion proofs with the package. The auditor is verifying using the public key, canonical JSON, SHA-256, the Merkle construction (RFC 6962), and the Rekor format — all open standards. Runfile is not in the trust loop.
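
The offline property rests on inclusion proofs being checkable with nothing but a hash function. The Python sketch below follows the RFC 6962 Merkle tree definition (0x00 prefix for leaves, 0x01 for interior nodes, split at the largest power of two below the tree size). It is not the verifier CLI and does not parse the Rekor bundle format; the leaf values and function names are assumptions for illustration.

```python
import hashlib


def leaf_hash(data: bytes) -> bytes:
    return hashlib.sha256(b"\x00" + data).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()


def split_point(n: int) -> int:
    """Largest power of two strictly less than n (n >= 2)."""
    return 1 << ((n - 1).bit_length() - 1)


def merkle_root(leaves) -> bytes:
    n = len(leaves)
    if n == 0:
        return hashlib.sha256(b"").digest()
    if n == 1:
        return leaf_hash(leaves[0])
    k = split_point(n)
    return node_hash(merkle_root(leaves[:k]), merkle_root(leaves[k:]))


def audit_path(index: int, leaves) -> list:
    """Sibling hashes from the leaf up to the root (producer side)."""
    n = len(leaves)
    if n == 1:
        return []
    k = split_point(n)
    if index < k:
        return audit_path(index, leaves[:k]) + [merkle_root(leaves[k:])]
    return audit_path(index - k, leaves[k:]) + [merkle_root(leaves[:k])]


def root_from_path(leaf: bytes, index: int, size: int, path: list) -> bytes:
    """Recompute the root from one leaf and its audit path (verifier side)."""
    if size == 1:
        if path:
            raise ValueError("audit path too long")
        return leaf_hash(leaf)
    if not path:
        raise ValueError("audit path too short")
    k = split_point(size)
    if index < k:
        return node_hash(root_from_path(leaf, index, k, path[:-1]), path[-1])
    return node_hash(path[-1],
                     root_from_path(leaf, index - k, size - k, path[:-1]))


leaves = [f"chain-head-{i}".encode() for i in range(5)]  # hypothetical leaves
root = merkle_root(leaves)
proof = audit_path(3, leaves)
print(root_from_path(leaves[3], 3, len(leaves), proof) == root)
```

The verifier side needs only the leaf, its position, the tree size, and a handful of sibling hashes, which is why the proofs can travel inside the package.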

Bottom line

Internal Audit doesn’t buy the platform because they trust Runfile. They buy it because they don’t need to.

What Runfile is not.

A whitepaper that doesn’t list its disqualifiers shouldn’t be trusted. Here’s where we’re the wrong answer.

Wrong product if…

You need real-time enforcement.

Runfile records; it does not gate. Buy Lakera or Galileo. Their refusal events show up in our chain as first-class evidence of effective oversight.

Wrong product if…

You’re debugging an agent.

High-cardinality, mutable, 14-day spans are the right shape for that job. Datadog, Langfuse, LangSmith, Arize, Braintrust. Runfile rides alongside.

Wrong product if…

You need a SOC 2 evidence room.

Vanta, Drata, OneTrust. Runfile feeds into them. We’re not them.

Wrong product if…

You’re a consumer app, no regulated workload.

Most early-stage AI products do not need Runfile yet. Buy LangSmith, ship the product, come back when the auditor calls.

Wrong product if…

No regulated obligation within 12 months.

The case for paying for tamper-evident evidence today is weaker if the demand-driving regulation is more than a year out. We’ll say so on the call.

Wrong product if…

You want us to AI-generate the audit.

We don’t. The category was burned recently by a startup faking AI-generated evidence. Agents act, we record, auditors verify.

Bottom line

Selling to a buyer for whom Runfile is wrong is worse than not selling at all. The disqualifiers are how we filter our pipeline.

What changes in v1.5.

The v1 architecture is designed to permit v1.5 capabilities as additive changes — not rewrites. Schema fields exist now and are null. IAM boundaries are already in place. Evidence signed today remains verifiable through v1.5 and beyond.

Capability | v1 | v1.5
Nitro Enclave signer | KMS HSM signing | Attestation document with code provenance
Multi-region | One region per tenant | Multi-region replication via Terraform stack
BYO-cloud | Runfile AWS account | Ingest in customer’s AWS account, cross-account roles
TypeScript SDK | Partial (LangGraph.js, Claude SDK TS) | Parity (Mastra, Vercel AI SDK)
Self-service signup | Sales-led provisioning | Starter tier self-service
Auth | Scoped API keys | OAuth2 client credentials, mTLS
SIEM streaming | Runfile audit-of-audit only | Customer-side Splunk / Sentinel / Datadog Security
Two-approver workflows | Single approver | Multi-approver for Enterprise tier

Bottom line

v1 is shippable today. v1.5 is additive, not a rewrite.

The argument, in one paragraph.

The argument: the agent execution graph is the right data object for AI compliance, and nobody ships it today.

It’s contestable. Other models exist — OTel traces, control-evidence pairs, AI-system-as-object — and any of them may turn out to be the right shape too. We’ve made our case; the market will adjudicate.

What is not contestable is the question. The auditor will ask: what did this agent do, on behalf of this customer, at this time, and can you prove the record is intact. The question is twenty years old in financial services. It doesn’t become a different question when the actor is an LLM with a tool belt.

Runfile is what we’re building to answer it. If you’re the CCO, Head of AI Risk, or Head of Internal Audit at a bank, insurer, or asset manager in the US, UK, or EU, we’d like to hear from you. Internal Audit signs off on what we ship; bring them to the call.

Runfile Systems Ltd · registered in England & Wales
EU & UK data residency
hello@runfile.ai · runfile.ai

Whitepaper v1.0 · May 2026. Updated when the regulatory landscape shifts materially or v1.5 ships. Latest at runfile.ai/whitepaper; previous versions archived.