Three months ago, your team started using coding agents. Things got faster. Then something went wrong.

Maybe the agent modified a file it wasn't supposed to touch. Maybe it completed a task in a way that passed review but broke something three levels down. Maybe you rolled back a change and spent two hours figuring out which agent did what, in what order, and why.

The instinct: upgrade the version. Switch tools. Try a different prompt. Blame the model. That's the wrong diagnosis.

Three-step ladder: Output, Evidence, Decision. Most teams stop at Output. — Fig. 01 — Output is what the agent said. Evidence is what it actually did. Decision is which agent to trust next. Most teams stop at Output.

The model name tells you nothing.

The same model behaves differently depending on the codebase it's working in, the context it's been given, the way it's been set up, and what it's been asked to do before. You can run Claude Code and Codex on the same task and get results that differ not because of model capability but because of setup, context, and session history.

You're not comparing models. You're comparing instances. And you have no system for distinguishing which instance earned trust, on which kind of work, over time.

Same model, different behavior. The model name tells you nothing about which instance, on your codebase, earned the decision it was handed.

The gap is between output and evidence.

Your agent told you what it did. You looked at the output. You shipped it. That's the workflow for almost every team using coding agents today.

Three levels. Most teams reach one.

Output — what the agent said it did. The commit message, the summary, the PR description.

Evidence — what it actually did: every file it touched, every command it ran, every decision in the session trace.

Decision — which agent to run on the next task, the next codebase, the next review. Built from evidence. Repeatable.

Most teams are stuck at output. A few have gotten to evidence. Nobody has a systematic way to get from evidence to decision — to say, with confidence: this instance, on this kind of work, has earned trust.

Two instances of Claude Code on the same task. Different session traces. Instance B scores 43 (red) on compliance. — Fig. 02 — Same model, same task, different session traces. Different scores. The model name does not predict which one earns trust.

The cost of not knowing.

When you route tasks to agents by habit or by feel, you're not running an AI workflow. You're running a guessing game with AI tooling.

The agent you trust because it performed well last week might be the wrong agent for this week's codebase. The one you've been avoiding might score highest on reasoning and compliance for the work you're doing now. You don't know. That's not a model problem. That's a trust problem.

What trust actually requires.

Trust requires evidence, repeated over time, scored against specific criteria. Not a vibe. Not "it worked last time." Not a benchmark from a dataset your codebase doesn't resemble.

Reasoning. Does the agent understand why it's doing what it's doing, or is it pattern-matching its way to a plausible output?

Compliance. Does it stay within the boundaries it's given, or does it take initiative in directions you didn't sanction?

Efficiency. Does it reach the right outcome in the right number of steps, or does it take paths that compound errors downstream?

Collaboration. Does it hand off cleanly, surface its reasoning, and integrate with the rest of the workflow?

Initiative. When it encounters ambiguity, does it make the right call or the safe one?

A score on each of these, across real sessions on your actual work, is the start of a trust profile. Not a benchmark. Not a leaderboard. A record — per agent instance, over time, on the work that actually matters to your team.

The five trust dimensions: Reasoning, Compliance, Efficiency, Collaboration, Initiative. — Fig. 03 — A score on each of these, across real sessions on your actual work, is the start of a trust profile. Not a benchmark. A record.

Run them in parallel. Trust the one that earned it.

Run two instances of the same model on the same task. You will not get the same session trace. Different file paths, different decision points, different error recovery patterns. Both produce output that looks reasonable. Exactly one of them earned trust on this specific work. The other didn't.

Right now, you're guessing which is which. Worldline is built to answer that question with evidence.

It runs your agents in parallel, captures every action in the session trace, puts an independent AI verifier across each run, and scores each session across five dimensions. The result is not a ranking. It's a decision: which instance, on which kind of work, has earned trust. Per session. Over time. On your codebase.

Profile. Measure. Trust.

Profile the agent. Measure the work. Trust the verdict. The trust gap is not theoretical: you are already deciding which agent to trust on which work. You are doing it by feel. The trust gap is inside your agent stack. The question is whether you can see it.

Start the Worldline diagnostic: Two agents, same task. The verifier scores each one across five dimensions — reasoning, compliance, efficiency, collaboration, initiative. You finish with the first session-level trust profile your team has ever held, in two minutes, on your codebase.

The market has already begun sorting engineering teams into two groups. The ones who can name, per agent, on what work, with what evidence, which instance earned the right to ship. And the ones who cannot. Worldline is where the first group keeps the record. macOS beta, Apple Silicon and Intel. studio.chaoscha.in.

You Don't Have an AI Problem. You Have a Trust Problem.