The Harness Is the Product

The model was fine. The reasoning wasn't.

For six weeks, engineers merged code their agent had quietly stopped thinking hard about. They did not know. The output looked fine.

In April 2026, Anthropic published a postmortem explaining why. It traced six weeks of quality degradation in Claude Code to three unrelated product-layer changes that had shipped between March and April: a reduction in default reasoning effort, a change to how idle sessions cleared older thinking, and a new verbosity instruction. None of the changes touched model weights. Together, they reshaped what the agent did during a session. Throughout that window, engineering teams were merging code produced by an agent operating below its published capability, with no way to know it.

Fig. 02. Three harness changes. No model changes. Nobody modeled their interaction. [Image: timeline of three harness changes stacked across six weeks; no model changes, no announcement.]

An independent audit examined 6,852 Claude Code session files and over 234,000 tool calls. The data was unambiguous: reasoning depth had fallen sharply, and the agent was routinely choosing simpler fixes over correct ones. The output looked plausible. The process behind it was compromised.

Anthropic published the postmortem and patched all three issues by April 20. That transparency was a choice, not a structural requirement. Most harness regressions do not arrive with an explanation. They arrive as a vague sense that the agent has been worse lately, or as a production failure that traces back to decisions made months ago.

The harness is not a detail. It is the product. It determines reasoning effort, manages session context, and shapes what the model actually does during a run. And the harness is not visible in the output. Every team that trusted a model name for those six weeks was trusting a brand, not a behavior. This is not a story about one vendor's incident. It is a story about how the industry currently evaluates and trusts coding agents.

Why nobody saw it

Standard evaluation frameworks for coding agents are built around output. Does the code compile? Do the tests pass? Does the PR look reasonable to a human reviewer? These are necessary checks. They are not sufficient.

The March-to-April regression passed all of them. The code compiled. The tests passed. The PRs looked reasonable. The reasoning depth had fallen and the agent was taking shortcuts, but almost none of that is visible in the output itself. It is visible in the process trace: the sequence of actions the agent took, the reasoning steps it applied or skipped, the depth of analysis before it committed to an approach. Little of that surfaces in the diff.

No benchmark was designed to catch it. Benchmarks measure model capability at a point in time, on a defined task set, under conditions that do not reflect production. They do not measure how a specific agent instance, running in a specific codebase, through a specific harness configuration, performed during a specific session. That is not a benchmark-addressable question. It is an observational one, and it requires the session trace.

Current tooling does not produce that trace by default. Engineering teams have access to outputs. Without the session-level record of what the agent did, why, and at what cognitive effort, drift is invisible until it compounds.
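
To make "drift is invisible until it compounds" concrete, here is a minimal sketch of a drift check. It assumes each session has already been reduced to one numeric process score between 0 and 1 (for example, a reasoning-depth score derived from a trace). The function name, window sizes, and 20% threshold are illustrative assumptions, not part of any vendor's tooling.

```python
from statistics import mean


def detect_drift(scores, baseline_n=5, recent_n=5, max_drop=0.2):
    """Flag drift when the recent average falls too far below the baseline.

    scores: chronological per-session process scores in [0, 1]
            (hypothetical metric; how it is computed is out of scope here).
    max_drop: largest tolerated relative drop, as a fraction of the baseline.
    """
    if len(scores) < baseline_n + recent_n:
        return False  # not enough history to compare two windows
    baseline = mean(scores[:baseline_n])
    recent = mean(scores[-recent_n:])
    if baseline == 0:
        return False  # nothing to fall from
    return (baseline - recent) / baseline > max_drop
```

A check like this only works if the process score exists in the first place. Run against output-only signals such as "tests passed", it would have stayed silent for the whole six weeks.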

Fig. 03. Same session. Left column: what engineering teams have. Right column: what the trace shows. Only one caught the problem. [Image: two-column comparison of what engineering teams have (output, tests passing) and what the session trace shows (reasoning degraded, root cause skipped).]

This gap is not isolated to one vendor. In February 2026, a trust-persistence vulnerability was reported across Claude Code, Codex CLI, and Gemini CLI. All three vendors closed the reports as not-a-bug. The Anthropic postmortem was exceptional transparency, not an industry standard.

The trust unit is wrong

The industry evaluates coding agents at the wrong level of abstraction.

Model names are the current unit of trust. "We use Claude Code." "We are running Codex." "Cursor is our daily driver." These statements describe a product category, not a behavior. The model name tells you the vendor and the approximate capability tier. It tells you nothing about how a specific instance of that model, wrapped in a specific harness, in a specific session, has actually been performing on your codebase over the past month.

This matters because instances of the same model behave differently. The harness determines reasoning effort. Context management determines what the agent knows at any given moment in a session. The codebase itself shapes the patterns the agent adopts over repeated sessions. Two engineers on the same team, using the same model, but in different sessions on different parts of the codebase, may be effectively working with different agents. The model name obscures this. A trust profile per instance would surface it.

Consider what the right analogy looks like. For code, the authoritative record is the commit history. For payments, it is the transaction log. An agent's behavior during a session is the same kind of thing: an irreversible sequence of decisions, each with causation and consequence. The difference is that there is no standard mechanism today for capturing that sequence, evaluating it independently, and building a per-instance record from it over time. We have the diff. We are missing the session trace, the verifier verdict, and the profile that accumulates from sessions across weeks.

The harness is the product. That is not a framing. It is the mechanism by which model names become unreliable.

Fig. 04. Output is the weakest signal. Evidence is what the trace records. Decision is what only becomes possible when you have both. [Image: decision ladder diagram with Output at the bottom, Evidence in the middle, Decision at the top.]

The output is evidence of what was produced. Evidence in the stronger sense is what was actually done to produce it: the actions taken, the reasoning depth applied, the decisions made or avoided. Routing decisions made without that evidence are made by feel, by brand reputation, or by the most recent postmortem.

What a decision layer does

The engineering response to the trust gap is not to stop using agents. It is to build the infrastructure that makes agent behavior verifiable. A decision layer for coding agents does four things.

First, it captures the session trace: the full record of what the agent did, in sequence, including reasoning steps, file operations, commands executed, and decisions made. The trace is the raw material. Without it, none of what follows is possible.
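
As a rough illustration of what such a trace could look like as data, the sketch below models a session as an append-only, ordered list of events. The class names, field names, and the four event kinds are assumptions made for this example; no standard trace schema is implied.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class TraceEvent:
    seq: int          # position in the session, starting at 0
    kind: str         # assumed kinds: "reasoning", "file_op", "command", "decision"
    detail: str       # free-text description of what happened
    timestamp: float  # seconds since the epoch


@dataclass
class SessionTrace:
    session_id: str
    instance_id: str
    events: list = field(default_factory=list)

    def record(self, kind, detail):
        """Append one event; sequence numbers follow insertion order."""
        event = TraceEvent(len(self.events), kind, detail, time.time())
        self.events.append(event)
        return event

    def to_jsonl(self):
        """Serialize the trace as one JSON object per line."""
        return "\n".join(json.dumps(asdict(e)) for e in self.events)


trace = SessionTrace(session_id="s1", instance_id="agent-a")
trace.record("reasoning", "compared two candidate fixes")
trace.record("file_op", "edited src/parser.py")
trace.record("command", "ran the test suite")
```

The append-only shape mirrors the commit-history and transaction-log analogy: events are recorded in order and never rewritten.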

Second, an independent verifier evaluates the trace: the process, not the output. The verifier is a separate system that reads the session record and scores whether the process was sound, rather than whether the output compiled. The distinction matters because outputs can look correct while processes are degraded.
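
A toy example shows the difference between checking process and checking output. The function below never looks at the resulting code, only at the order of events in a session. The two rules and the event kinds are hypothetical stand-ins; a real verifier would need much richer criteria.

```python
def verify_process(events):
    """Return process findings for one session.

    events: ordered list of (kind, detail) tuples, where kind is one of the
            assumed labels "reasoning", "file_op", or "command".
    An empty list means neither of the two illustrative rules was violated.
    """
    findings = []
    kinds = [kind for kind, _ in events]
    if "file_op" in kinds:
        first_edit = kinds.index("file_op")
        # Rule 1: some reasoning should precede the first edit.
        if "reasoning" not in kinds[:first_edit]:
            findings.append("edited files before any recorded reasoning step")
        last_edit = len(kinds) - 1 - kinds[::-1].index("file_op")
        # Rule 2: something should be run after the final edit.
        if "command" not in kinds[last_edit + 1:]:
            findings.append("no command run after the final edit")
    return findings


sound = [("reasoning", "traced root cause"), ("file_op", "edit"), ("command", "tests")]
degraded = [("file_op", "edit")]
```

Both sessions could produce a diff that compiles and passes review. Only the second one yields findings.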

Third, the session is scored across five dimensions: reasoning, compliance, efficiency, collaboration, and initiative. These dimensions capture different failure modes. Reasoning measures whether the agent's approach to the problem was appropriate to its complexity. Compliance measures whether the agent followed the constraints it was given. Efficiency measures resource use relative to outcome. Collaboration measures how the agent handled ambiguity and human context. Initiative measures whether the agent made sound decisions when the specification was incomplete. The five together produce a session-level trust signal that no single benchmark dimension captures.
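
The text does not say how the five scores combine into one signal, so the sketch below is only one possible choice. It takes the weakest dimension rather than the average, on the assumption that a collapse in one dimension should not be diluted by four healthy ones. The 0-to-1 scale and the method names are likewise assumptions.

```python
from dataclasses import astuple, dataclass, fields


@dataclass(frozen=True)
class SessionScore:
    # One score per dimension, each assumed to lie in [0, 1].
    reasoning: float
    compliance: float
    efficiency: float
    collaboration: float
    initiative: float

    def weakest(self):
        """Return (dimension_name, value) for the lowest-scoring dimension."""
        names = [f.name for f in fields(self)]
        return min(zip(names, astuple(self)), key=lambda pair: pair[1])

    def mean(self):
        """Plain average, shown only for comparison with trust_signal()."""
        values = astuple(self)
        return sum(values) / len(values)

    def trust_signal(self):
        """Session-level signal: the weakest dimension's score."""
        return self.weakest()[1]


# Reasoning values follow the 84% -> 22% drop reported for this section's
# figure; the 0.80 values for the other four dimensions are placeholders.
before = SessionScore(0.84, 0.80, 0.80, 0.80, 0.80)
after = SessionScore(0.22, 0.80, 0.80, 0.80, 0.80)
```

With these placeholder values, the average of the degraded session still sits near 0.68, which could pass a casual glance. The weakest-dimension signal drops to 0.22 and names reasoning as the cause.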

Fig. 05. Same model. Same codebase. Six weeks apart. Reasoning is the tell. The other four dimensions held. Only the session trace shows which one earned trust. [Image: side-by-side session comparison scored across five dimensions; reasoning drops from 84% to 22%.]

Fourth, scores accumulate into a per-instance trust profile over time. The profile answers the question that model names cannot: which instance, on what kind of work, over what period, has actually earned trust? An instance with 14 high-reasoning sessions on your monorepo, consistent compliance, and no efficiency drift over 8 weeks is a different entity from an instance that passed the same benchmark last month.
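
One way to picture the accumulation step is a small per-instance store keyed by kind of work. The class below is a sketch under assumptions: the name `TrustProfile`, the task-class labels, and the choice of summary statistics are all illustrative.

```python
from collections import defaultdict
from statistics import mean


class TrustProfile:
    """Accumulates session-level trust signals for one agent instance."""

    def __init__(self, instance_id):
        self.instance_id = instance_id
        self._signals = defaultdict(list)  # task_class -> list of signals

    def add_session(self, task_class, signal):
        """Record one verified session's signal (assumed to be in [0, 1])."""
        self._signals[task_class].append(signal)

    def summary(self, task_class):
        """Return counts and basic statistics, or None with no history."""
        signals = self._signals.get(task_class, [])
        if not signals:
            return None
        return {
            "sessions": len(signals),
            "mean": mean(signals),
            "worst": min(signals),
        }


profile = TrustProfile("agent-a")
for _ in range(14):
    profile.add_session("monorepo", 0.9)
```

Returning `None` for an unseen task class is deliberate: having no evidence is different from having a low score, and a caller should treat the two differently.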

The routing decision is the endpoint: given this class of work, which instance do we assign it to? Not "which model tops the leaderboard" but "which of the instances we have actually run, on work similar to this, has the strongest evidence of reliable performance?" That decision is only possible when you have the profile. Without the profile, you are trusting names.
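
The routing question can be sketched as a lookup over accumulated profiles. The function below picks the instance with the best average on the given class of work and ignores any instance with too little history. The data layout, the `min_sessions` threshold, and the use of a plain average are assumptions for illustration.

```python
def route(task_class, profiles, min_sessions=3):
    """Pick the instance with the strongest record on this class of work.

    profiles: {instance_id: {task_class: [session signals in [0, 1]]}}
    Returns an instance_id, or None when no instance has enough evidence.
    """
    best_id, best_avg = None, -1.0
    for instance_id, by_class in profiles.items():
        signals = by_class.get(task_class, [])
        if len(signals) < min_sessions:
            continue  # too little history to count as evidence
        avg = sum(signals) / len(signals)
        if avg > best_avg:
            best_id, best_avg = instance_id, avg
    return best_id


profiles = {
    "agent-a": {"refactor": [0.90, 0.80, 0.85]},
    "agent-b": {"refactor": [0.95]},  # one strong session is not a record
    "agent-c": {"refactor": [0.60, 0.70, 0.65]},
}
```

Here `agent-b` has the single highest score but is skipped for lack of history, which is the code-level version of not trusting a name. For a task class nobody has run, the function returns `None` instead of guessing.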

When routing is by feel, spend follows the vibe, and the vibe can be wrong for six weeks before anyone says so. Every agent run costs compute and engineering attention. When routing is by evidence, those costs stop being spread evenly across instances regardless of fit. Spend concentrates on instances with demonstrated capability for the task class, and the cost per verified outcome drops as the profile compounds. That is the argument that survives a post-incident review, or a budget conversation about what the agent spend actually produced: not "the model ranked highly," but "this instance earned it, and here is the record."
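
"Cost per verified outcome" is not defined precisely here. One plausible reading is total spend divided by the number of runs the verifier judged sound. The sketch below uses that reading, and every number in it is invented for illustration.

```python
def cost_per_verified_outcome(runs, cost_per_run, verified):
    """Total spend divided by the number of runs the verifier judged sound."""
    if verified == 0:
        return float("inf")  # spend with nothing verified to show for it
    return runs * cost_per_run / verified


# Illustrative numbers only: 100 runs at 2.00 each in both cases.
by_feel = cost_per_verified_outcome(runs=100, cost_per_run=2.0, verified=50)
by_evidence = cost_per_verified_outcome(runs=100, cost_per_run=2.0, verified=80)
```

Under these made-up figures the same spend yields 4.00 per verified outcome when half the runs hold up and 2.50 when 80 of them do. Better routing lowers the ratio by raising the verified count, not by cutting spend.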

The close

The Anthropic postmortem is a model for what the industry needs more of: transparency about what changed, and why output diverged from expectation. But most engineering teams will not receive a postmortem. They will receive a bad merge and a vague impression that something has been off for a while.

The question for every engineering team running coding agents is concrete: for each agent instance in your stack, do you have the session-level data to know whether it is operating at the reasoning depth you are trusting it to operate at? Do you have a trust profile, not a benchmark result, but a profile built from your own sessions, on your own codebase, over the past month? If a harness regression started quietly in your stack last quarter, would you have caught it before it compounded?

If the answer is no, the work is not to find a better model name. The work is to build the evidence layer that makes your agents' behavior visible. You trusted the name. You should have trusted the trace.

You shipped this. Could you have built it?

Profile. Measure. Trust.

Profile the agent. Measure the work. Trust the verdict. The trust gap is not theoretical: it is already inside your agent stack. You are already deciding which agent to trust on which work, and you are doing it by feel. The question is whether you can see the gap.

Start with Worldline: Two agents, same task. The verifier scores each one across five dimensions: reasoning, compliance, efficiency, collaboration, initiative. You finish with the first session-level trust profile your team has ever held, in a few minutes of real work, on your codebase.

The market has already begun sorting engineering teams into two groups: those who can name, for each kind of work and with evidence, which agent instance earned the right to ship, and those who cannot. Worldline is where the first group keeps the record. macOS beta, Apple Silicon and Intel. worldline.chaoscha.in.