v1.20Grok Build CLI support →

Harness engineering

February 5, 2026 · updated May 29, 2026

The work you do so an agent never makes the same mistake twice, and the system that lets you do it once across every agent and every project.

Where the word comes from

The term harness engineering was coined by Mitchell Hashimoto in his February 5, 2026 essay, My AI Adoption Journey. His definition is the one we use:

Anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again.

The shape of the work is small and recursive. An agent guesses a file path that does not exist; you add a line to AGENTS.md telling it where the file actually lives. An agent writes a React component without checking the design tokens; you add a pre-commit hook that fails the run if the diff imports an unapproved hex. An agent ships a screenshot of the wrong page; you add a tool the agent can call to dump the rendered DOM. Each mistake becomes a constraint. The constraints accumulate. The harness is what is left.

Agent = model + harness

LangChain frames the thing this way: Agent = Model + Harness. The model is the part you pull off a registry. claude, codex, gemini. The harness is everything around it: the instructions it reads at boot, the tools it can call, the feedback it receives from the world, the gates it has to pass before its work counts as done.

A frontier model with a poor harness will repeat the same mistake in three projects. A mid-tier model with a strong harness will catch its own errors before you read them. The variance in agent output across teams in 2026 is dominated by harness quality, not by which model anyone picked.

Birgitta Böckeler, writing for Martin Fowler, extends the equation with a third element she calls a “garbage collector” agent — a smaller agent whose sole job is to prune stale context, archive sessions, and keep the working set inside the model's window. Same idea: the harness is everything you build so the model can do its job.

The two forms of harness

Hashimoto names two forms. The first is better implicit prompting — the rules file the agent reads before it touches a line of code. In this codebase that file is AGENTS.md, with CLAUDE.md and GEMINI.md as symlinks so every agent reads the same source of truth. A representative line:

No mocking. Real services only. Tests must exercise the actual critical path.

The second form is actual programmed tools — scripts the agent can call to learn something it could not otherwise see. A screenshot tool. A filtered test runner. A log tail. The agents browser binary is one such tool: the agent runs agents browser refsand gets back a numbered list of clickable elements on the page instead of guessing CSS selectors from a screenshot.

The two forms compose. The rules file tells the agent which tool to reach for; the tool gives the agent the kind of feedback the rules file presupposes.

The seven surfaces of an agents-cli harness

A harness has more than two surfaces in practice. Here are the seven that agents-cli exposes as first-class objects. Each one has a command. None is hidden behind a config screen.

1. Instructions

The rules file the agent reads first. Canonical name is AGENTS.md; CLAUDE.md and GEMINI.md are symlinks to it, so the same constraints land in every agent regardless of which CLI you launched.

ln -s AGENTS.md CLAUDE.md
ln -s AGENTS.md GEMINI.md

2. Tools

The binaries the agent can invoke from inside a run. The CLI dispatches the agent itself, and the agent in turn calls agents browser, mq, project scripts, and whatever else lives on PATH.

agents run claude "verify the deploy by curling the health endpoint"

3. Feedback

The ways the world talks back to the agent. A screenshot of the rendered UI. The console errors from a Playwright session. A filtered test report. Every feedback surface is a tool the agent can call, and every one returns structured output the model can read.

agents browser screenshot
agents browser console

4. Gates

The checks that fail a run before its output reaches you. Pre-commit hooks, type checks, linters, CI. A gate's job is to be the agent's second reader. If the gate is missing, you are the second reader, and you will be the one to find the regression at 11 p.m.

# .git/hooks/pre-commit
bun run typecheck && bun run lint && bun test

5. Sessions

Every run is captured to a session record. You can list runs, search across them, resume one mid-thought, or export a transcript as a gist when you open a PR. Sessions are how a harness keeps state between runs without leaking state into the codebase.

agents sessions --last 20
agents sessions <id> --resume

6. Secrets

Named bundles of credentials backed by macOS Keychain. Touch-ID gated. The agent never sees plaintext; the secrets are injected into the run's environment for the duration of the command and torn down at exit.

agents secrets exec prod -- bun run scripts/migrate.ts

7. Multi-agent

When the work has more than two independent surfaces, you stop running one agent and start running a team. agents teams enforces boundary contracts: each teammate owns a set of files, must not touch the others, and reports back when its slice is done.

agents teams create refactor-auth
agents teams add refactor-auth claude "Owns: src/auth/*. Not: src/ui/*."
agents teams add refactor-auth codex  "Owns: src/ui/login.tsx. Not: src/auth/*." --after claude
agents teams start refactor-auth --watch

Why meta-harness

Most writing on harness engineering treats the harness as something you build for one project and one agent. You write a CLAUDE.md for the codebase you happen to be in. You wire a screenshot script into the IDE you happen to use. You memorize the prompt that finally got the model to stop importing the wrong logger. None of it travels.

agents-cli is the layer above that. The rules file lives at the repo root and is read by every agent the CLI dispatches. The browser binary is one binary, not one per agent. Secrets live in Keychain and are attached to runs by name, not pasted into each agent's config. Sessions are searchable across agents and across projects from a single command. Version pins are declared once in agents.yaml and respected everywhere.

That is the meaning of meta in the tagline. You author a harness; the CLI runs it across agents, projects, and machines. The same AGENTS.md reaches claude, codex, and gemini without modification. The same secrets bundle decrypts on your laptop and on the build box. The same team contract parallelizes a refactor on Monday and a ship-readiness review on Friday.

The honest framing: a harness per project is fine, and most engineers will end up writing one whether they call it that or not. agents-cli is what you reach for once you have written the third one and noticed you keep solving the same problems.

What this article is not

This is not a ten-minute setup. A harness is not something you finish; it is a thing you accrete over the months you spend running agents against real work. The first week, you will have an AGENTS.md with six lines in it and one screenshot tool. The third month, you will have a hundred lines of rules, a half-dozen tools, a gate that catches the failure mode that bit you twice in March, and a session archive you can search.

This is not a story about AI for everyone. Harness engineering presumes you can read a stack trace, write a shell script, and stand up a pre-commit hook. If the agent is the only programmer in the loop, there is nobody to engineer the harness.

This is not a claim that the harness makes the model unnecessary. A stronger model still helps. The point is that the gap between “agent that frustrates you” and “agent you trust with a Friday afternoon” is, in practice, harness work and not model selection.

Further reading