Baby’s First Harness Engineering

So I told an LLM to “Make me an RTS game”. It made a real piece of shit, but I got completely sucked into improving it, and I’ve been using Codex very heavily for this:

2.1 billion token usage screenshot

Three weeks in, I’m trying to make the art look better for a key new unit, but the LLM is very stubborn about making everything look weird and bad. Each iteration, I ask for changes, it fixes one thing and breaks something else, and the design process ends when I give up and decide it’s good enough.

My guess was that the problem came from the LLM choosing a library called PixiJS to draw everything. I figured there might not be a lot of training data on how to draw stuff with this library, so the output was bad.

But I am not going to edit the visuals by hand. I refuse to do manual labour.

My theory was that LLMs have much more training data for SVGs (an extremely common graphics format that’s text-based) than for PixiJS drawing code. So if we switched to SVG, design performance should improve, and I’ll spend less time saying “nono, undo undo”.

But I didn’t want to start from scratch with SVG. I had already put real effort into some of the existing units. Objectively they’re terrible, but it was a lot of effort to get an LLM to make anything half decent, so I’ve grown attached to them:

RTS unit sprites

The units before the SVG conversion

I first asked the LLM to convert the worker (the cute little pentagon at the top) into SVG. This is what I got:

Old Pixi.js vs LLM SVG

Fuck! If it can’t translate a pentagon and a line into SVG, what hope do I have of translating my sick tank model, where the turret is animated and recoils when you shoot?

So, what to do? I needed to make this task legible to agents, even though LLMs are really bad at visual work.

My first idea, inspired by this old blog post, was to render the existing PixiJS unit on a transparent background, render the SVG version, and compare them pixel by pixel. Then calculate the diff both visually and mathematically, and make the LLM keep working until there was no difference:

Pixel diff comparison
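To make the "visually and mathematically" part concrete, here is a minimal sketch of a pixel diff. It is not the code from the actual harness (the game itself is PixiJS, and this is Python for brevity). It assumes both renders have already been decoded into flat lists of RGBA tuples of the same size. The names `pixel_diff`, `DiffReport`, and `tolerance` are made up for illustration.

```python
from dataclasses import dataclass

Pixel = tuple[int, int, int, int]  # (R, G, B, A), each 0-255


@dataclass
class DiffReport:
    mismatched: int        # pixels whose worst channel delta exceeds the tolerance
    max_delta: int         # largest single-channel difference seen anywhere
    mean_abs_error: float  # average absolute difference per channel
    mask: list[bool]       # True where a pixel mismatches (the "visual" diff)


def pixel_diff(reference: list[Pixel], candidate: list[Pixel],
               tolerance: int = 0) -> DiffReport:
    """Compare two same-sized RGBA images given as flat, row-major pixel lists."""
    if len(reference) != len(candidate):
        raise ValueError("images must have the same number of pixels")
    mask: list[bool] = []
    max_delta = 0
    total = 0
    for ref_px, cand_px in zip(reference, candidate):
        deltas = [abs(r - c) for r, c in zip(ref_px, cand_px)]
        worst = max(deltas)
        max_delta = max(max_delta, worst)
        total += sum(deltas)
        mask.append(worst > tolerance)
    channels = 4 * len(reference)
    mean = total / channels if channels else 0.0
    return DiffReport(sum(mask), max_delta, mean, mask)


# A 2x2 example: the candidate is missing one red pixel.
CLEAR = (0, 0, 0, 0)
RED = (255, 0, 0, 255)
reference = [CLEAR, RED, RED, CLEAR]
candidate = [CLEAR, RED, CLEAR, CLEAR]
report = pixel_diff(reference, candidate)
print(report.mismatched, report.max_delta, report.mask)
```

The mask is what you would paint as the visual diff, and the counts are the numbers the LLM gets told to drive to zero. Including the alpha channel in the comparison matters here because the units are rendered on a transparent background.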

Then I realized that each unit is made out of multiple visual components. The worker unit above, for example, is made out of a pentagon and a stick.

So to make this even easier for the LLM, I could break the existing PixiJS drawing into its individual components, and then have the LLM produce SVG components that were pixel matches:

Body and Facing Tick Component Triads

We break down the worker into its pentagon body and its stick

Once the components all match, we layer everything together and make sure the final image matches.
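Here is a rough sketch of that two-stage check, again in Python and again not the real harness. It assumes each component is already rendered to a flat list of straight-alpha RGBA pixels in the 0.0 to 1.0 range, and it layers them in software with the standard "over" operator, whereas the real thing would presumably let PixiJS and the SVG renderer do the layering. `check_unit`, `composite`, and the layer names are illustrative.

```python
Pixel = tuple[float, float, float, float]  # straight-alpha RGBA, each 0.0-1.0
Image = list[Pixel]
Layers = list[tuple[str, Image]]           # (component name, render), bottom layer first


def over(top: Pixel, bottom: Pixel) -> Pixel:
    """Porter-Duff 'over' for straight (non-premultiplied) alpha pixels."""
    tr, tg, tb, ta = top
    br, bg, bb, ba = bottom
    out_a = ta + ba * (1 - ta)
    if out_a == 0:
        return (0.0, 0.0, 0.0, 0.0)

    def blend(t: float, b: float) -> float:
        return (t * ta + b * ba * (1 - ta)) / out_a

    return (blend(tr, br), blend(tg, bg), blend(tb, bb), out_a)


def composite(images: list[Image]) -> Image:
    """Stack same-sized images, first one at the bottom."""
    if not images:
        return []
    result = list(images[0])
    for layer in images[1:]:
        result = [over(t, b) for t, b in zip(layer, result)]
    return result


def check_unit(reference: Layers, candidate: Layers) -> dict[str, bool]:
    """Stage 1: match each component by name. Stage 2: match the layered result."""
    cand_by_name = dict(candidate)
    report = {name: cand_by_name.get(name) == image for name, image in reference}
    report["composite"] = (composite([img for _, img in reference])
                           == composite([img for _, img in candidate]))
    return report


# Two-pixel example: every part matches, but the candidate layers them in the wrong order.
BLUE, WHITE, CLEAR = (0.0, 0.0, 1.0, 1.0), (1.0, 1.0, 1.0, 1.0), (0.0, 0.0, 0.0, 0.0)
body, facing_tick = [BLUE, BLUE], [WHITE, CLEAR]
reference = [("body", body), ("facing_tick", facing_tick)]
candidate = [("facing_tick", facing_tick), ("body", body)]
print(check_unit(reference, candidate))
```

The example shows why the final layered check is still worth doing after every component passes: here the body ends up drawn on top of the tick, so the parts are right and the unit is still wrong. Exact float equality works for this toy case, but a real harness would likely compare with a small tolerance.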

So I described this method to the LLM, had it create a multi-phase plan¹ to implement this harness, sent it to work, and it Just Worked.
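The core of a harness like this is a loop: ask for a candidate, measure it, feed the score back, repeat until it matches or the budget runs out. A minimal sketch under heavy assumptions: `propose` is a hypothetical stand-in for the LLM call plus rendering, and the names and the round limit are invented here, not taken from the actual plan.

```python
from typing import Callable, Optional

Image = list[tuple[int, int, int, int]]  # flat list of RGBA pixels


def count_mismatches(reference: Image, candidate: Image) -> int:
    """Number of pixels that differ between two same-sized images."""
    if len(reference) != len(candidate):
        raise ValueError("images must have the same number of pixels")
    return sum(1 for ref_px, cand_px in zip(reference, candidate) if ref_px != cand_px)


def refine_until_match(
    reference: Image,
    propose: Callable[[Optional[Image], int], Image],
    max_rounds: int = 10,
) -> tuple[Optional[Image], int]:
    """Call `propose` with the last attempt and its score until the diff is zero.

    Returns (matching image, rounds used), or (None, max_rounds) if it never matched.
    """
    candidate: Optional[Image] = None
    mismatches = len(reference)
    for round_number in range(1, max_rounds + 1):
        candidate = propose(candidate, mismatches)
        mismatches = count_mismatches(reference, candidate)
        if mismatches == 0:
            return candidate, round_number
    return None, max_rounds


# Fake "agent" for demonstration: starts blank, then fixes one wrong pixel per round.
target: Image = [(255, 0, 0, 255)] * 3


def fake_agent(previous: Optional[Image], mismatches: int) -> Image:
    if previous is None:
        return [(0, 0, 0, 0)] * len(target)
    fixed = list(previous)
    for i, (got, want) in enumerate(zip(fixed, target)):
        if got != want:
            fixed[i] = want
            break
    return fixed


result, rounds = refine_until_match(target, fake_agent)
print(rounds, result == target)
```

The stopping condition is a number the agent can't argue with, which is what makes the task legible to it. The round cap is there so a stuck agent fails loudly instead of burning tokens forever.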

This is, I think, what is meant by harness engineering. LLMs currently don’t have the creativity to suggest techniques like this, but careful engineering can arrange situations where the nondeterministic stream of tokens can be leveraged to do useful work quickly and at scale. An LLM is like a very powerful but dumb horse, so for complex tasks you have to build a harness that lets it pull mindlessly in the right direction.

Before and after SVG migration

Was it worth it? Was the LLM actually better at producing SVGs than PixiJS drawings?

I’ve designed one unit so far using the SVG method and it felt a lot better. It’s hard to say for sure whether that’s down to SVG, since I’ve also gotten a lot better at prompting for this stuff (another topic: the importance of evals). But I was able to generate a somewhat complicated unit design relatively quickly, so check out my new wizard character:

Wizard character concept

¹ See Phase 5.1 to 5.4 here. I don’t read these plans, but you could get an LLM to explain them for you, I guess.