
Small-Scale Question Sunday for March 15, 2026

Do you have a dumb question that you're kind of embarrassed to ask in the main thread? Is there something you're just not sure about?

This is your opportunity to ask questions. No question too simple or too silly.

Culture war topics are accepted, and proposals for a better intro post are appreciated.


So how did LLMs impress you this week? My case: I had lost the original 3mf, but still had OrcaSlicer's temp folder, which ironically Orca itself can't open. I asked codex-cli to try to reconstruct a proper 3mf. To my astonishment, it did on the second try.
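For context, a .3mf is just a ZIP container (an OPC package holding, at minimum, [Content_Types].xml, _rels/.rels, and 3D/3dmodel.model), so if the temp folder still has those loose parts under their original relative paths, repacking is mechanical. A minimal sketch; `repack_3mf` and the assumed folder layout are my own illustration of what the reconstruction amounts to, not what codex-cli literally wrote:

```python
import zipfile
from pathlib import Path

def repack_3mf(temp_dir: str, out_path: str) -> None:
    """Zip recovered loose parts back into a .3mf container.

    A .3mf file is an ordinary ZIP archive; at minimum it holds
    [Content_Types].xml, _rels/.rels, and 3D/3dmodel.model.
    """
    root = Path(temp_dir)
    out = Path(out_path).resolve()
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zf:
        for f in sorted(root.rglob("*")):
            # Skip the archive itself in case it lives inside temp_dir.
            if f.is_file() and f.resolve() != out:
                # Archive names use forward slashes, relative to the root.
                zf.write(f, f.relative_to(root).as_posix())
```

The hard part in practice is when the model XML itself is damaged; the repacking is the easy 90%.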

Finally got around to testing progress on image generation again.

My go-to test is creating an entire fake Instagram influencer from scratch. That nicely tests consistency between images, spatial understanding of the scene, prompt following on minute details, etc. It keeps me up to date on what the problems are when you fake photos of people (or fake entire people in general). I mostly create women, because that's more fun to me and also because it tests model censorship more effectively - the commercial models are a lot more touchy when creating women than when creating men.

The main result of my most recent session is particularly funny: Nano Banana 2 is another significant step forward on photo-realism, but it is exceedingly difficult to get it to produce images of conventionally beautiful people from scratch. Getting just a portrait of a woman that is above a 7 requires a lot of coaxing. If the major focus of the prompt is on some other detail, it will generate the most mid women you've ever seen. Nano Banana 1 was perfectly happy to just spit out 10s. You could start the prompt with "photo-realistic full body shot of an attractive female college student..." and then focus on scene, clothes, body position, camera equipment, etc., and it only needed minor coaxing for some body types and poses (as long as you kept it SFW). But Nano Banana 2 will often simply ignore instructions that coax other models towards conventional beauty. I wonder why. Peak body positivity seems long past. Did earlier models train predominantly on pictures of influencers on social media (because they post so much), and do photos of the rest of humanity now have a more proportional share of the training data? Or are they trying to stop me, in particular, from creating and monetizing an Instagram e-thot? (I'm not, of course; I've lost interest in image generation, again, very quickly.)

Other than that: prompt following is truly impressive now. You can pick scene, clothes, and body positions (either by describing them or supplying reference photos), and it will usually one-shot them down to the correct head tilt angle. Consistency (same person in different images) requires a bit of care, or ideally tons of reference images. We're not completely out of the uncanny valley for faces created completely from scratch, but this is where I notice the most progress (Nano Banana 1 makes beautiful people, but they look like influencers with the filters maxed out in the best case, and like very good paintings in the median case). Around 1% of images still have extra limbs or other easy tells.

Oh, and making images that help explain a technical concept is still hilariously bad. A straight rip-off of an existing image with a liberal dose of detail errors is the best you can expect. Ah, factual correctness in every detail... the old nemesis of AI still lives on.

I recently had Sonnet make it through an entire session without mixing up VBIL and VBILX. I'm going to call that an improvement.

A lot of the newest hotness has been a little too automated for my tastes, and I haven't had much free time, so I've mostly been screwing around with older configs.

Successes:

  • Writing's still surprising me. The prose quality is still lackluster, and there have been very few times where I haven't wanted to revise whole sections, but I've gotten into the mid-5k-word and low-10k-word ranges with a coherent plot, characterization, and escalating tension.
  • Some of that's smut, with its lower bar (hurr hurr), but some of it isn't.
  • And, perhaps more usefully, that includes criticism of things I've written conventionally. Sometimes pretty biting criticism!
  • Simple webdev stuff has kinda worked. I'm not a webdev guy, and a lot of my requirements are stupid (oh boy aspnet, I sure do love aspnet!) and my use cases simple, but it hasn't really mattered whether I use Grok, Claude, or Qwen for simple one-off stuff that's just meant for short-term use.
  • FRC students have been using it on and off. I try to emphasize the limitations and make sure they understand what the code is doing, and sometimes it's just not capable of handling their goals, but it's been useful as a reference tool in environments where a lot of the info is outdated or outright wrong. Which is weird, given the general code quality of FIRST-specific tools...
  • Been vibe-coding (vibe-building?) a homelab rebuild. My current home server setup is very traditional (installing things without wrapping them in four layers of containerization, like an animal), and I'm probably gonna stick with that, but it's been helpful to see how the other half lives, and a lot less frustrating than trying to get the right docker flags and commands from the normal documentation.

Failures:

  • Very long-form writing is struggling. Took a shot at phailyoor's trial, but while there have definitely been some battles won against the old exponential explosions from context-window scaling, most of the 100B+ param models go from 4 t/s at the start to <1 t/s by 5k words in. Which wouldn't necessarily be a critical problem, since I can just run it overnight, except the models also sometimes go wonky -- either looping around the same few paragraphs repeatedly, or adding tangents -- in ways that make even the most naive attempts at setting up a 'run-and-forget' run unpalatable.
  • Spatial manipulation is Not Doing Great, Bob. I had a problem that was effectively two axes of living hinge, and to be fair that's a weird and uncommon problem, but it's ultimately either calculus or solvable by exhaustion (or Fusion360, which is nearly the same thing); yet even the closed models just panicked over it and tried to send me to completely unrelated tools.
  • Similarly, TRELLIS2 and Hunyuan3D are simultaneously impressive and absolutely useless. Sometimes they fail to produce a usable model at all, and that's mostly understandable (as funny as it is for extrapolated magnets to end up monopoles or video game characters to turn literally Janus-faced), but they can often give nice-looking models... that are absolutely unusable, with complete disconnections, unnecessary duplicated 'layers' of meshes sharing the same texture, random islands of tiny features, and so on.
  • Ironically, either my expectations for smut and fiction are higher than for professional writing, or the LLMs are worse at it specifically. I've beaten the purple prose, em-dashes, not-x-but-y, and weird misplaced details of form-letter-grade business writing out of even pretty dumb LLMs. But sometimes you can get an LLM to make surprisingly detailed conclusions that are pretty far outliers (discount code: knot), and then other times it misses really obvious stuff (including an actual 'how make babies'-level problem, and that was in an M/F attempt!).
  • Weirdly bad at picking out names. Whether for characters, for programs, even individual variables. Not necessarily unimaginative, but repetitive (why does GLM love the name Kael?). Dunno what the hell's going on there.
  • Trying to get something like VideoContext-Engine running. Still screwing it up. Not an LLM problem, just haven't had the time to figure out Yet Another Stupid Cuda Fuckery.
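Those mesh disconnections are at least easy to detect mechanically. A toy sketch (pure stdlib; `mesh_components` is my own hypothetical helper, not part of TRELLIS2 or Hunyuan3D): count connected islands in a triangle mesh with union-find over shared vertices. Libraries like trimesh do the same check on real meshes.

```python
def mesh_components(faces: list[tuple[int, int, int]]) -> int:
    """Count connected components in a triangle mesh given as
    vertex-index triples. Disconnected islands (a common failure
    mode of image-to-3D output) show up as a count greater than 1."""
    parent: dict[int, int] = {}

    def find(v: int) -> int:
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    def union(a: int, b: int) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    # Vertices sharing a face belong to the same component.
    for a, b, c in faces:
        union(a, b)
        union(b, c)
    return len({find(v) for v in parent})
```

Anything above 1 on a model that's supposed to be a single solid is a print-killer before you even get to the duplicated shells and texture islands.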

Try build123d for CAD/CAM.

Not sure if this counts, but I can't get over the fact that Hollywood wordcells, notorious for their poor understanding of science and technology, somehow ended up being right about how computers (will) work:

  • "Zoom! Enhance!"
  • "Computer, in the Holmesian style, create a mystery to confound Data with an opponent who has the ability to defeat him."

I've noticed this too. I've also noticed that the Enterprise's computer says "Acknowledged" instead of "Wow, that command is absolutely chef's kiss and has real Starfleet energy, I'll get on that right away. While we're here, tell me what thoughts you have on Deck Seven?"

I see you haven't watched any of the new Treks, then.

They didn't, but mostly because I have been too busy to use them for anything I didn't expect them to handle by default. Whenever a new and exciting model launches, I stress-test it extensively, but for at least a year now, the models have been good enough for my personal and professional needs. The last time I saw a massive improvement in quality that unlocked entirely new use cases and blew me away was o3; otherwise I tend to feel only slightly impressed.

From memory: GPT-2, 3, and 4; then whichever Claude had just come out; then o1 (from seeing others use it); then R1/o3. Native image gen with a variant of 4o. Those stand out. Everything else falls under "slightly better" in ways that don't stick out.

But I am happy enough with them being good for research or editing my writing, or generating images. If they get significantly better in a manner that is glaringly obvious in normal use, I'm close to worrying (much harder) about losing my job.

Agent was successfully able to submit a public record request via an online portal, given sufficient detail and just a little nudging.

The power this potentially unlocks is quite sizeable, actually.

Using GitHub Copilot, GPT 5.4 seems pretty solid. Far more capable than 5.2, and more robust than 5.3. The only downside is that it seems a tad slower, but that's a trade-off I accept given how much more it ends up doing.

I've been using it privately as well as at work, and right now my main complaint is that it tends to be a little too eager to write more code, when a little less logic would keep the overall codebase a lot more maintainable. But maybe I'll yet be able to evangelize the LLM until it believes in the gospel of clean code.