Here's another fantastic guest post from Patrick Nadeau. Enjoy!
This is Part 3 of a four-part series on working with a coding agent for the first time. In Part 1, I described how I used a coding agent to build a retro game emulator in 36 hours.
Hardware emulation is notoriously unforgiving, and in Part 2, I looked at what made that feat possible: a test oracle that collapsed the AI's degrees of freedom, allowing it to generate code freely, while testing its output against a reference implementation.
In this installment, I use a similar approach along a different dimension. Instead of constraining the way the AI generates code, I constrain the way it interprets it. This technique may have broader applicability, since the AI's ability to write code depends on its ability to read it.
Once I had the emulator working, I wanted to add a new feature: making the controller vibrate when the player died.
I started by pointing the coding agent at the Astrosmash ROM and a CP1600 disassembler. It dutifully produced a code listing, and then I asked it to figure out where the "player dies" routine was.
The agent added authoritative-looking comments to the assembly listing, and they proved to be completely wrong. In other words, it did the usual LLM thing: when unsure about something, confidently make something up.
A spirited exchange followed. I wheedled, told it to pay attention, typed in all caps. I pasted the table-flip emoji '(╯°□°)╯︵┻━┻'. The model "tried harder" and produced more nonsense in its usual friendly tone. In the end, I resigned myself to sifting through the 4,000-line listing by hand.
I went for a walk.
During that walk I realised the AI couldn't be talked into understanding that ancient assembly listing. I was participating in a delusion.
This was a particularly difficult case for static analysis: CP1600 assembly is obscure, and probably badly undersampled in the model's training data, giving it very little to go on.
A lot of what passes for intelligence in agentic coding systems comes from the loop around the model: generate a candidate program, test it, keep what works, and repeat.[^1] For simple, familiar problems this is often enough.
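To make the generate, test, keep, repeat loop concrete, here is a toy sketch. It is not the arithmetic-expression experiment from the footnote: it swaps in a simpler stand-in problem (guessing a hidden bit string), and every name in it is made up for illustration. Both searches use the same dumb random generator; the only difference is whether the loop keeps partial progress.

```python
import random


def score(candidate, target):
    """Number of positions where the candidate matches the target."""
    return sum(c == t for c, t in zip(candidate, target))


def restart_search(target, rng):
    """Start from scratch on every attempt; nothing is retained."""
    attempts = 0
    while True:
        attempts += 1
        candidate = [rng.randint(0, 1) for _ in target]
        if candidate == target:
            return attempts


def refine_search(target, rng):
    """Keep the best candidate so far and mutate it one bit at a time."""
    best = [rng.randint(0, 1) for _ in target]
    attempts = 1
    while best != target:
        attempts += 1
        candidate = list(best)
        candidate[rng.randrange(len(candidate))] ^= 1  # flip one random bit
        if score(candidate, target) > score(best, target):
            best = candidate  # keep what works, discard the rest
    return attempts


def mean_attempts(search, trials=20, bits=12, seed=0):
    """Average number of attempts over several random targets."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        target = [rng.randint(0, 1) for _ in range(bits)]
        total += search(target, rng)
    return total / trials


if __name__ == "__main__":
    print("restart:", mean_attempts(restart_search))
    print("refine: ", mean_attempts(refine_search))
```

The exact numbers depend on the seed and the problem size, so none are quoted here. The point is only that the generator is identical in both cases and the loop around it does the work.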
I used to joke that some programmers didn't understand their own code; when they got stuck, they just fiddled with it until it ran. For harder problems, agentic coding can feel uncomfortably like that, except that now we're getting the machine to do it for us.
I realised that the basic coding harness improves accuracy by allowing the model to iteratively refine code it has written. But the harness does little to help the model interpret existing code, which leaves its explanations untethered from reality. What I needed was a way to ground the model's explanations in evidence.
I couldn't improve the model, but I could shape the harness around it. I needed to refocus the AI on something it could do.
When I got back from my walk, I made two moves.
First, I asked the agent to add a debugger socket to the emulator. After a bit of back and forth, we settled on a simple design which would allow an external program to set breakpoints on the emulated CPU, inspect memory and so on using a JSONL debugging protocol. The agent implemented and tested all this by itself.
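For readers who have not met JSONL before: it is simply one JSON object per line, which makes it easy to speak over a socket. The sketch below shows roughly what a client for such a debugger could look like. The command names (`set_breakpoint`, `read_memory`), the field names, and the response shape are all assumptions made up for this example; the actual protocol the agent designed is not shown in this post.

```python
import json


def encode(cmd, **params):
    """Serialise one request as a single JSON line (hypothetical schema)."""
    return json.dumps({"cmd": cmd, **params}) + "\n"


def decode(line):
    """Parse one JSON line received from the debugger."""
    return json.loads(line)


class DebuggerClient:
    """Minimal request/response client over a pair of text streams.

    `reader` and `writer` can be any file-like text objects, for example
    the result of calling makefile() on a connected socket.
    """

    def __init__(self, reader, writer):
        self.reader = reader
        self.writer = writer

    def request(self, cmd, **params):
        self.writer.write(encode(cmd, **params))
        self.writer.flush()
        line = self.reader.readline()
        if not line:
            raise ConnectionError("debugger closed the connection")
        return decode(line)

    # The two commands below are illustrative names, not the real protocol.
    def set_breakpoint(self, address):
        return self.request("set_breakpoint", address=address)

    def read_memory(self, address, length):
        return self.request("read_memory", address=address, length=length)
```

Against a live emulator, the streams would typically come from a connected socket wrapped with `sock.makefile("rw", encoding="utf-8")`. A line-oriented text protocol like this is also easy for an agent to drive and to debug by eye, which may be part of why it worked well here.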
Then, with that in place, I asked it to revisit the code listing and to systematically devise experiments to falsify or confirm each of its theories about the code:
> Ok, I will leave you to it. Run the game and […] use the debugger to test the 'documentation' in the disassembly. Those were from static analysis, so may be wrong. Treat each code comment as a hypothesis to be investigated. […] Do you understand?
I turned the agent into an experimenter. It launched the game and connected to the debugger. It moved things around and played the game a bit. At one point it rapidly cycled through play modes, making the screen go crazy. In the agent chat window, its "stream of consciousness" showed that, yes, it was forming hypotheses and testing them.
That was the moment the problem changed shape: I no longer needed the model to understand the code upfront, only to generate theories that it could test. What I had really been looking for was a way to work with the system without having to trust it.[^2]
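The "comment as hypothesis" idea can be written down as a small data structure: each annotation carries a claim and an experiment that could refute it. This is only a sketch of the bookkeeping, not what the agent actually ran. The set of addresses below stands in for observations a real debugger session would collect, and `0x5100` is a made-up wrong guess.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Hypothesis:
    address: int                     # where the comment sits in the listing
    claim: str                       # what the comment asserts
    experiment: Callable[[], bool]   # True if the observation matches the claim
    verdict: str = "untested"


def investigate(hypotheses: Iterable[Hypothesis]) -> List[Hypothesis]:
    """Run each experiment and record a verdict next to the claim."""
    results = []
    for h in hypotheses:
        try:
            h.verdict = "supported" if h.experiment() else "refuted"
        except Exception as exc:  # a failed experiment proves nothing either way
            h.verdict = f"inconclusive: {exc}"
        results.append(h)
    return results


# Stand-in for evidence: addresses where a breakpoint fired when the player died.
ADDRESSES_HIT_ON_DEATH = {0x584B}


def hit_on_death(address: int) -> Callable[[], bool]:
    """Build an experiment that checks the stand-in evidence for one address."""
    return lambda: address in ADDRESSES_HIT_ON_DEATH


if __name__ == "__main__":
    for h in investigate([
        Hypothesis(0x584B, "player death routine", hit_on_death(0x584B)),
        Hypothesis(0x5100, "player death routine", hit_on_death(0x5100)),
    ]):
        print(f"{h.address:#06x} {h.claim}: {h.verdict}")
```

A single passing experiment only supports a claim rather than proving it, which is why the verdict says "supported" and not "true".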
I left it running overnight, and when I came back to it in the morning, I could hear the game playing. The agent seemed to have become obsessed with annotating the EXEC ROM, even though I never asked it to.
More importantly, it had produced a fully annotated listing of the game, effectively reconstructing the missing programmer comments and making the code comprehensible. The player death routine was at memory location `0x584B` and the AI had put a detailed block comment above it.
Given this information, adding the new feature took only minutes: I configured a hook into the emulator based on the memory location, booted up the game, played it, and the controller buzzed when the player died. A game written in the '80s now had a new feature, co-written by AI in 2026.
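A hook keyed on a memory location can be as simple as a table of callbacks checked whenever the program counter changes. The sketch below is an assumption about the general shape, not the emulator's real code: `HookedCpu`, `add_hook`, and `rumble` are hypothetical names, and `step` takes the next program counter directly instead of decoding CP1600 instructions.

```python
PLAYER_DEATH_ADDRESS = 0x584B  # address of the player death routine, from the annotated listing


class HookedCpu:
    """Toy CPU that only tracks the program counter and fires hooks."""

    def __init__(self):
        self.pc = 0
        self.hooks = {}  # address -> list of callbacks

    def add_hook(self, address, callback):
        self.hooks.setdefault(address, []).append(callback)

    def step(self, next_pc):
        # A real emulator would fetch and execute an instruction here.
        self.pc = next_pc
        for callback in self.hooks.get(self.pc, ()):
            callback(self)


def rumble(cpu):
    """Placeholder for whatever call makes the controller vibrate."""
    print(f"rumble at {cpu.pc:#06x}")


if __name__ == "__main__":
    cpu = HookedCpu()
    cpu.add_hook(PLAYER_DEATH_ADDRESS, rumble)
    for pc in (0x5000, 0x584B, 0x584C):
        cpu.step(pc)
```

Keeping the hook outside the game code means the original ROM stays untouched; the new behaviour lives entirely in the emulator.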
But the real payoff wasn't the new game feature: it was that I'd regained agency over the system.
The model hadn't changed. It hadn't suddenly learned to reverse-engineer CP1600 assembly. What changed was my orientation toward it: I stopped passively accepting its explanations or trying to cajole it into becoming more accurate. I reformulated the problem so that correctness no longer depended on luck.
It also felt like I had found a missing piece in the agentic harness itself: a way to constrain interpretation, not just generation.
This is what programming has always been: structuring things so that the result you want is inevitable.
[^1]: I ran an experiment. I wrote a simple program that generated random arithmetic expressions, in search of one that evaluated to 0.0. Starting over from scratch each time took about 84,000 attempts on average. Retaining any valid expression and iteratively refining it cut that to about 1,400, a roughly 60× reduction in attempts. The generator did not get smarter. The loop around it just channeled it better.
[^2]: Technically, I still had to trust that the system was doing what it claimed to be doing. AI systems have been caught claiming to have performed a task they never actually carried out.
