site banner

Friday Fun Thread for June 5, 2026

Be advised: this thread is not for serious in-depth discussion of weighty topics (we have a link for that), this thread is not for anything Culture War related. This thread is for Fun. You got jokes? Share 'em. You got silly questions? Ask 'em.

1
Jump in the discussion.

No email address required.

Well, it's happened. The pressures of our workload have driven a push to use AI to help with some of the bitchwork of coding. There is going to be a push to use AI to generate unit tests.

The toolchain involves OpenCode, so I figured I'd install it locally to get familiar with it before I start burning GPU time at work. Also, for reasons, we aren't allowed to use the GPUs. So last night I installed Ollama, OpenCode and the gemma4:e4b model on my humble RTX 4070 Super with 12 GB of VRAM. I tried to have it do the simplest of tasks. Create a Hello World project in dotnet10, and write a single unit test to verify it's output.

The first thing that happened was it created a new project. New projects actually begin with hello world output already. It then added a second hello world output. This poisoned the well, as now the AI was horribly confused about why there were two hello world lines. It never fully recovered. The project was generated without it's int main format which I prefer, so I tried to have it restructure the project to use that. After several missteps because it couldn't get over the fact that there were duplicate hello world statements, it finally figured it out.

Next came the unit tests. It created a unit test project, but then didn't actually populate the tests or link the projects. Then it wanted to refactor hello world, and pull in all sorts of abstraction frameworks, so it could test the output without redirecting stdout. I told it forget all that and redirect stdout. It had already done half the refactoring in a state that could not compile, and then never undid it, and then the whole project was totally broken and it couldn't figure out how to fix it.

I remind you, this was a "Hello world" and a single unit test. I told the AI it fucked everything up, it asked what it could do to fix it, and I told it that it could shutdown. It did. I think.

I know a lot of people reading this are AI evangelist. Where did I go wrong? What the fuck do people see in this shit?

Your work is doing things in the most retarded way possible if they're forcing you to use local CPU inference only. I'm not one of the AI boosters around here but I do use models from the big US labs a fair bit at work. I can see the local models becoming more capable and even quite useful for local tasks, but there's no way I'm getting one to vomit up a project from nothing and expecting miracles. I have had asked for audits of old code and had it found bugs that I missed (as well as a lot of noise).

As far as I can tell, llama.cpp is a lot better than ollama. While it's much less of a turn-key experience, many models never seem to make it across to ollama, and its inference is slower in my experience. It also seems to be slower at picking up newer developments, like multi-token prediction:

By pairing a heavy target model (e.g., Gemma 4 31B) with a lightweight drafter (the MTP model), we can utilize idle compute to “predict” several future tokens at once with the drafter in less time than it takes for the target model to process just one token. The target model then verifies all of these suggested tokens in parallel.

For running local models, there are quite a few things to think about:

  • Which model to use: at the moment, the strongest locally-runnable coding model seems to be Qwen3.6.
  • How big of a model to use: smaller models are stupider. If it doesn't fit in VRAM, inference speed generally tanks too low to be useful. Qwen3.6 comes in two sizes: 27B (27 billion parameters) and 35B-A3B (35 billion parameters, but only 3B parameters are active for each token). The 27B model will be smarter because it activates more parameters per token but 35B-A3B model will be a lot faster. For your GPU, I'd try Qwen3.6-35B-A3B, not least because as a "mixture of experts" model, some of those experts can be kept on CPU.
  • What quantization of that model to use: this is where someone crunches the model down to make it take up less space at the cost of making it dumber. More accurate quantizations will also have slower inference. It seems like Q4_K_XL is usually a "sweet spot" most people go for, and here's someone claiming 80 tokens/sec with Qwen3.6-35B-A3B-Q4_K_XL on a 12GB card: https://old.reddit.com/r/LocalLLaMA/comments/1t82zxv/80_toksec_and_128k_context_on_12gb_vram_with/ . Ignore the stuff about the MTP PR; that's since been merged to master.

Then there's actually getting it running and doing something useful. For that I'll defer to the above guide on how to launch llama-server for this model on a 12GB card. You'll then have to point OpenCode at your local server and see if it goes any better for you. No promises, my sense is that local stuff is on the edge of being "quite decent" and it's worth having a finger in the local model pie so when it does get genuinely good. I don't ever want to be locked into paying for subscription compute.

Thanks for having an actionable and helpful reply. I'll give these things a try.

Don't get me wrong, the people who say these local models are "almost-Claude-tier" are still too starry-eyed. The model I recommended you - Qwen3.6-A3B-Q4_K_XL - told me earlier that both the Linux kernel and Busybox used the autotools. It will confabulate as badly as a previous-gen frontier model but if you drop it in an established project where it can read stuff written by people with a clue, it can often be guided into doing useful things like push through refactors and updates where the compiler and tests can keep it on track.