
Small-Scale Question Sunday for March 22, 2026

Do you have a dumb question that you're kind of embarrassed to ask in the main thread? Is there something you're just not sure about?

This is your opportunity to ask questions. No question too simple or too silly.

Culture war topics are accepted, and proposals for a better intro post are appreciated.


I'm probably one of the more AI-sceptical people on this board. I don't think the God Machine is going to techno-rapture us to cyber-heaven anytime soon, but I do try to keep an open mind around the idea that it might have domain-specific value, particularly in coding tasks.

I've noted here in the past that I haven't seen much value, even when using frontier models. The responses that I get are:

  • I'm not using a frontier model, even though I've said I'm using what is nominally a frontier model.
  • I'm using the wrong frontier model.
  • I'm not using the right harness and tooling.
  • Even if I am using the right harness and tooling, my lack of success is a personal failure on my part, because I'm clearly just prompting it wrong.

I had a chance at work to try using Codex with GPT-5.4. This is allegedly a top tier stack, and so far as I can tell, as close to the frontier as you can get.

I targeted a fairly straightforward performance issue in our codebase, where some JPA code was generating an inefficient query once two tables each grew about two orders of magnitude larger than usual. This is the kind of thing that would normally take me 30-40 minutes to write a few automated end-to-end tests, then ten minutes to fix.
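The post doesn't say what the actual query looked like, so purely as a hypothetical illustration of this failure class (fine at normal table sizes, quadratic once both tables grow ~100x), here's the generic shape in plain Java: a per-row scan versus an indexed lookup, which is roughly the difference between a badly generated query and a proper join. The names and data are invented for the sketch.

```java
import java.util.*;

// Hypothetical sketch, not the poster's actual code: a per-row lookup
// pattern that works fine on small tables but blows up at scale.
class ScalingSketch {
    // O(n*m): scan every "child" row for every "parent" row -- the kind
    // of work an inefficiently generated query forces on the database.
    static int slowMatch(List<Integer> parents, List<Integer> children) {
        int hits = 0;
        for (int p : parents)
            for (int c : children)
                if (p == c) hits++;
        return hits;
    }

    // O(n+m): build an index once, then probe it -- effectively what a
    // correct join (or an explicitly written query) does instead.
    static int fastMatch(List<Integer> parents, List<Integer> children) {
        Set<Integer> index = new HashSet<>(children);
        int hits = 0;
        for (int p : parents)
            if (index.contains(p)) hits++;
        return hits;
    }

    public static void main(String[] args) {
        List<Integer> parents = new ArrayList<>();
        List<Integer> children = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            parents.add(i);
            children.add(i * 2);
        }
        // Both give the same answer; only the scaling behavior differs.
        System.out.println(slowMatch(parents, children) == fastMatch(parents, children));
    }
}
```

Both versions agree on small inputs; the difference only shows up at the scale the poster describes, which is exactly why this class of bug tends to ship.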

Since I clearly have a problem with Prompting It Wrong, I spent almost two hours working with the planner, describing the problem, the root cause, and where the failing method was used. I described what might be at risk of breaking, and what tests we would need to write to prove out the fix. I described the architecture of the automated test system, and what the tests would need to verify.

After doing all this, I let Codex churn.

It generated tests that verified the wrong thing.

Then it did the fix wrong, in the name of "efficiency".

Then, rather than fixing the issue correctly, it tried to rewrite all the sites that called the method.

After losing most of a day to this, I fixed it my own damned self. I'm starting to think Dijkstra was on to something.

My job is mostly just routine ports of legacy code into a new framework, with a few minor architectural decisions thrown in here and there. It took me some time to set up sufficiently exact instructions, but by now GPT-5.4 can do it pretty well. The end result isn't any better or worse than if I had done it myself (which makes sense, since I'm telling the copilot to use my own methodology), and it's only moderately faster, but notably it can do it while I'm off doing something else, with a handful of corrections along the way.

I don't like it, but it looks like it can do my job, at least.

Wow, love that Dijkstra essay!