
Harness Engineering

Previously, I had picked up on Mitchell Hashimoto using the phrase “harness engineering”. That term might be catching on.

Quoting OpenAI’s recent post, entitled “Harness engineering: leveraging Codex in an agent-first world”:

Over the past five months, our team has been running an experiment: building and shipping an internal beta of a software product with 0 lines of manually-written code.

The product has internal daily users and external alpha testers. It ships, deploys, breaks, and gets fixed. What’s different is that every line of code—application logic, tests, CI configuration, documentation, observability, and internal tooling—has been written by Codex. We estimate that we built this in about 1/10th the time it would have taken to write the code by hand.

Birgitta Böckeler, a Distinguished Engineer at Thoughtworks, follows up on the implications of harness engineering.

It was very interesting to read OpenAI’s recent write-up on “Harness engineering” which describes how a team used “no manually typed code at all” as a forcing function to build a harness for maintaining a large application with AI agents. After 5 months, they’ve built a real product that’s now over 1 million lines of code.

The article is titled “Harness engineering: leveraging Codex in an agent-first world”, but only mentions “harness” once in the text. Maybe the term was an afterthought inspired by Mitchell Hashimoto’s recent blog post. Either way, I like “harness” as a word to describe the tooling and practices we can use to keep AI agents in check.

The OpenAI team’s harness components mix deterministic and LLM-based approaches across 3 categories (grouping based on my interpretation):

  1. Context engineering: Continuously enhanced knowledge base in the codebase, plus agent access to dynamic context like observability data and browser navigation
  2. Architectural constraints: Monitored not only by the LLM-based agents, but also deterministic custom linters and structural tests
  3. “Garbage collection”: Agents that run periodically to find inconsistencies in documentation or violations of architectural constraints, fighting entropy and decay
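To make the second category concrete: a deterministic structural test can be as simple as parsing source files and rejecting imports that cross a forbidden layer boundary. The sketch below is purely illustrative; the layer names and rules are my own assumptions, not anything described in the OpenAI post.

```python
# Hypothetical sketch of a deterministic architectural-constraint check.
# The "ui"/"db" layering rule is an invented example.
import ast

# Illustrative rule: "ui" modules may not import from the "db" layer.
FORBIDDEN = {"ui": {"db"}}

def layer_of(module: str) -> str:
    """Treat the top-level package name as the architectural layer."""
    return module.split(".")[0]

def imported_modules(source: str) -> set[str]:
    """Collect every module name imported by a piece of source code."""
    tree = ast.parse(source)
    found = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            found.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module)
    return found

def violations(module: str, source: str) -> list[str]:
    """Return imports that cross a forbidden layer boundary."""
    banned = FORBIDDEN.get(layer_of(module), set())
    return sorted(
        imp for imp in imported_modules(source) if layer_of(imp) in banned
    )

# Example: a UI module reaching straight into the database layer.
bad_source = "from db.models import User\nimport ui.widgets\n"
print(violations("ui.dashboard", bad_source))  # ['db.models']
```

A check like this can run in CI alongside the LLM-based agents, giving the harness a cheap, deterministic backstop that never drifts.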

Assuming OpenAI is reporting in good faith on the final app, which is currently an internal beta, Böckeler draws out implications for production software development. Will harnesses become the new service templates? Does the AI need more constraints on its autonomy, not fewer, to deliver robust software? Is it even possible to apply this approach to pre-AI codebases?

She closes out with:

And finally, for once, I like a term in this space. Though it’s only 2 weeks old — I can probably hold my metaphorical breath until somebody calls their one-prompt, LLM-based code review agent a harness…

Personally, I had been thinking of “the harness” from the perspective of an individual developer. What’s the GUI-based IDE or TUI they use on a daily basis to interact with a codebase and drive agentic coding? What are its affordances, and what constraints does it place on a model’s code generation?

Zooming out a bit to the project or organizational level makes a lot of sense, though. As described by OpenAI, that’s not a single system or tool but a set of processes they discovered on the fly for this effort. With repetition, patterns will emerge, become codified, and eventually reified into new software tools. Then you could have a nice intersection between the project’s harness and a developer’s harness. Sort of like how git is a narrow waist that joins CI/CD approaches and developer environments, based upon a particular perspective on source code version control.

Hold on to your knickers! Prepare for turbulence.

© 2008-2025 C. Ross Jam. Licensed under CC BY-NC-SA 4.0. Built using Pelican. Theme based upon Giulio Fidente’s original svbhack, and slightly modified by crossjam.