We have another guest post today from Patrick Nadeau.
In my last entry, I wrote about resisting, then finally trying agentic coding for the first time.
If agentic coding is the new way, then the new way is going to be fast. I bootstrapped a working emulator — three simulated microchips and a bus, timing included — in 36 hours.
But I have to admit: I didn't write the emulator alone. We wrote it. And there was also a hidden third character in that story: the test oracle.
The speed enabled by coding agents plays out in one of two ways: either the AI converges on a solution quickly or it spins out of control generating reams of unstable code faster than any human can review it.
A test oracle is any mechanism that tells you whether a system's observed behaviour is acceptable. If you think of software design as a search through a vast space of possible implementations, then a test oracle can help ensure that an evolving software system stays on course.
The simplest test oracle is a unit test, but the concept goes further than that. You can have multiple oracles that evaluate competing qualities, such as correctness, performance, memory use, or any other desirable property.
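To make the idea of several oracles judging one run concrete, here is a minimal Rust sketch. Everything in it is an assumption made for illustration: the `Observation` fields, the three checks, and the budgets (16 ms, 1 MiB) are hypothetical and are not taken from the emulator project.

```rust
use std::time::Duration;

/// What one run of the system under test produced (hypothetical fields).
pub struct Observation {
    pub output: Vec<u8>,
    pub elapsed: Duration,
    pub peak_bytes: usize,
}

/// An oracle either accepts an observation or explains why it is unacceptable.
pub type Oracle = fn(&Observation) -> Result<(), String>;

/// An oracle paired with a human-readable name for reporting.
pub struct NamedOracle {
    pub name: &'static str,
    pub check: Oracle,
}

/// Correctness: the output must equal a known-good value (made up here).
pub fn correct_output(obs: &Observation) -> Result<(), String> {
    const EXPECTED: &[u8] = b"OK";
    if obs.output.as_slice() == EXPECTED {
        Ok(())
    } else {
        Err(format!("expected {:?}, got {:?}", EXPECTED, obs.output))
    }
}

/// Performance: the run must finish within an assumed 16 ms budget.
pub fn within_time_budget(obs: &Observation) -> Result<(), String> {
    let budget = Duration::from_millis(16);
    if obs.elapsed <= budget {
        Ok(())
    } else {
        Err(format!("took {:?}, budget is {:?}", obs.elapsed, budget))
    }
}

/// Memory use: peak allocation must stay under an assumed 1 MiB budget.
pub fn within_memory_budget(obs: &Observation) -> Result<(), String> {
    let budget = 1usize << 20;
    if obs.peak_bytes <= budget {
        Ok(())
    } else {
        Err(format!("used {} bytes, budget is {}", obs.peak_bytes, budget))
    }
}

/// The full panel of oracles, each guarding a different quality.
pub const ORACLES: [NamedOracle; 3] = [
    NamedOracle { name: "correctness", check: correct_output },
    NamedOracle { name: "time budget", check: within_time_budget },
    NamedOracle { name: "memory budget", check: within_memory_budget },
];

/// Runs every oracle and returns one message per rejected quality.
pub fn judge(obs: &Observation, oracles: &[NamedOracle]) -> Vec<String> {
    oracles
        .iter()
        .filter_map(|o| (o.check)(obs).err().map(|why| format!("{}: {}", o.name, why)))
        .collect()
}
```

An empty result from `judge(&observation, &ORACLES)` means every oracle accepted the run. Keeping the checks separate means a change that is correct but too slow is reported as exactly that, instead of being lumped into one pass/fail verdict.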
I started out on this journey by implementing the new emulator's CPU by hand. I tried to write unit tests based on the CPU reference manual, only to find that it didn't cover all possible cases. Then I remembered that I already had a system which could tell me if I'd gotten it right: jzintv, the older emulator that I was trying to re-implement.
I wrote new unit tests that executed a CPU instruction in my code, then executed the same instruction by calling into the existing emulator. The tests then compared the resulting registers, memory, and flags produced by the two implementations.
If the new code matched the old emulator's output, I could be reasonably sure I'd gotten it right — or at least that my version was bug-for-bug compatible with the original.
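The shape of such a differential test can be sketched in Rust as follows. This is an illustration under stated assumptions, not the harness from this project: `CpuState`, the `Cpu` trait, the toy ADD instruction and its opcode encoding are all hypothetical, and `Reference` is a pure-Rust stand-in for what would really be a call into jzintv. `Candidate` contains a deliberate bug so the oracle has something to catch.

```rust
/// Architectural state the oracle compares (hypothetical layout).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CpuState {
    pub regs: [u16; 8],
    pub flags: u8, // bit 0 = carry, bit 1 = zero (made-up encoding)
    pub mem: Vec<u16>,
}

/// One instruction step of any CPU implementation, old or new.
pub trait Cpu {
    fn step(&self, state: &CpuState, opcode: u16) -> CpuState;
}

/// Toy decoder: bits 3..=5 select the source register, bits 0..=2 the destination.
fn decode(opcode: u16) -> (usize, usize) {
    (((opcode >> 3) & 7) as usize, (opcode & 7) as usize)
}

/// Stand-in for the trusted reference emulator.
pub struct Reference;
impl Cpu for Reference {
    fn step(&self, state: &CpuState, opcode: u16) -> CpuState {
        let (src, dst) = decode(opcode);
        let (sum, carry) = state.regs[dst].overflowing_add(state.regs[src]);
        let mut next = state.clone();
        next.regs[dst] = sum;
        next.flags = (carry as u8) | (((sum == 0) as u8) << 1);
        next
    }
}

/// Stand-in for the new code under test; it forgets the carry flag on purpose.
pub struct Candidate;
impl Cpu for Candidate {
    fn step(&self, state: &CpuState, opcode: u16) -> CpuState {
        let (src, dst) = decode(opcode);
        let sum = state.regs[dst].wrapping_add(state.regs[src]);
        let mut next = state.clone();
        next.regs[dst] = sum;
        next.flags = ((sum == 0) as u8) << 1;
        next
    }
}

/// A disagreement between the two implementations on one input.
#[derive(Debug, PartialEq, Eq)]
pub struct Mismatch {
    pub input: CpuState,
    pub opcode: u16,
    pub expected: CpuState,
    pub actual: CpuState,
}

/// Runs both implementations on each case and reports the first divergence.
pub fn first_mismatch<R: Cpu, C: Cpu>(
    reference: &R,
    candidate: &C,
    cases: &[(CpuState, u16)],
) -> Option<Mismatch> {
    cases.iter().find_map(|(state, opcode)| {
        let expected = reference.step(state, *opcode);
        let actual = candidate.step(state, *opcode);
        (expected != actual).then(|| Mismatch {
            input: state.clone(),
            opcode: *opcode,
            expected,
            actual,
        })
    })
}

/// Boundary-value inputs for "add R0 into R1" (opcode 0b000_001 in the toy encoding).
pub fn add_cases() -> Vec<(CpuState, u16)> {
    let interesting = [0u16, 1, 0x7FFF, 0xFFFF];
    let mut cases = Vec::new();
    for &a in &interesting {
        for &b in &interesting {
            let mut regs = [0u16; 8];
            regs[0] = a;
            regs[1] = b;
            cases.push((CpuState { regs, flags: 0, mem: vec![0; 4] }, 0b000_001));
        }
    }
    cases
}
```

Calling `first_mismatch(&Reference, &Candidate, &add_cases())` returns the first input on which the two disagree, together with both resulting states. That gives a human or a coding agent a precise, reproducible failure to fix rather than a vague "something is off". The comparison deliberately covers the whole state, so a wrong flag is caught even when the register result is right.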
I originally wrote this test oracle to check my own code, but it became really powerful as a way to guide a coding agent. Once I used it to keep the agent on a short leash, the CPU emulation was finished in under an hour.
In Zero-Degree-of-Freedom LLM Coding using Executable Oracles, John Regehr says that when an LLM is given the option of doing something poorly, "you can't trust the damn thing" to do it properly.
I agree with that, both in substance and in tone. Based on their current design, I don't think we'll ever be able to trust unconstrained LLMs: they simply have too many degrees of freedom.
The solution he proposes is to "[collapse] as many failure-producing degrees of freedom as possible," with zero degrees of freedom as the aspirational limit.
Remember the famous C compiler that Nicholas Carlini's team said they built in two weeks?
What they got right here is not the compiler itself — its authors are quite open about its limitations — but the scaffolding around it: test suites and validation harnesses built on decades of prior compiler work to constrain the search.
They didn't just prompt, "Build me a C compiler." They built a narrow corridor and let the LLM run inside it. Tellingly, the compiler is weak exactly where its scaffolding — or oracle coverage — is thinnest. They also came up with a clever way to let multiple LLMs coordinate their work on a large codebase, which seems like a promising avenue.
Some are also using LLMs to generate oracles. I tried this out and asked the coding agent to write the oracle for the video chip, which is known as the STIC. Then, by providing the agent with the jzintv source code and constraining it with the new STIC oracle, we were able to implement a fully functioning video chip in under an hour.
This worked because the oracle wrapped the reference implementation. Without a spec or a pre-existing implementation, this would become an infinite regress. How are you going to ensure the generated test oracle is correct?
Oracles can enforce correctness, performance, and other measurable properties, but they have no taste for architecture, style, or elegance. If you care about any of those qualities, that's still your problem.
In my case, that meant I had to keep an eye on the agent's rapidly scrolling stream of consciousness and code diffs, so I could stop it whenever it drifted: toward unnecessary dynamic dispatch, toward Python coding style in Rust, toward hand-patching media files instead of finding the right library.
I'm sensing that insisting on craft — or even reading the code at all — is starting to be indulgently tolerated as an archaic habit. (After all, you don't look at the assembly that your compiler produces, do you?)
Once organisations get used to moving this fast, will concern for craft become an externality that programmers have to absorb out of professional pride because companies are no longer willing to pay for it?
Some are even arguing that you should throw the code away and just keep the prompt. I don't think I'll ever be able to do that. That would feel like writing a book and keeping only the blurb on the dust jacket.
In the next part of this series, Patrick will take a look at how instrumentation can give LLMs an inside view of a software system.
