Small-Scale Question Sunday for March 22, 2026

Do you have a dumb question that you're kind of embarrassed to ask in the main thread? Is there something you're just not sure about?

This is your opportunity to ask questions. No question too simple or too silly.

Culture war topics are accepted, and proposals for a better intro post are appreciated.

I've run GLM-4.5 and GLM-4.6 on a desktop computer that set me back around 1.5k USD, including everything down to the power supply and case, just using an NVIDIA 3090 24GB and 224GB RAM. You take a significant performance penalty going with GPU+RAM compared to a pure GPU run, but it's nowhere near as bad as pure-CPU or CPU+SSD speeds. It's not fast enough for synchronous work like a Codex replacement, by a long shot, but for something where you can set-and-forget a series of prompts to run overnight, it's fine and can churn out 10k-ish words at around 150W power consumption. For GLM-4.6 specifically you do end up needing one of the more heavily-trimmed quants, but you can run it down to 32GB VRAM + 128GB RAM without cutting too hard on context.

((I will caveat that the heavily-trimmed quants give weird failure modes. I'd naively expected quantization to result in typos, logic problems, or looping, and sometimes that happens, but you also get bizarre focuses on certain names, places, or plot points not present in more-precise variants.))
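For anyone wanting to sanity-check why the bigger model needs the trimmed quants, here's a rough back-of-envelope sketch. It only counts weight storage (parameters times bits per weight), and the `reserve_gb` headroom for context, OS, and runtime overhead is a number I made up for illustration, not a measurement. The function names are hypothetical too, and real quant files mix precisions across layers, so treat the output as a ballpark.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8.0


def max_bits_per_weight(params_billion: float, vram_gb: float, ram_gb: float,
                        reserve_gb: float = 16.0) -> float:
    """Largest average bits-per-weight that fits in VRAM + RAM.

    reserve_gb is an assumed allowance for context, the OS, and runtime
    overhead; tune it for your own machine.
    """
    usable_gb = vram_gb + ram_gb - reserve_gb
    return usable_gb * 8.0 / params_billion


if __name__ == "__main__":
    # 357B is the GLM-4.6 parameter count quoted in this thread.
    for vram, ram in [(24, 224), (32, 128)]:
        bpw = max_bits_per_weight(357, vram, ram)
        print(f"{vram}GB VRAM + {ram}GB RAM: up to ~{bpw:.1f} bits/weight")
```

Under those assumptions the 24GB + 224GB box tops out a bit above 5 bits per weight for a 357B model, and the 32GB + 128GB floor lands a little over 3, which lines up with needing the more aggressive quants.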

That said, it cost me 1.5k USD at September 2025 prices, and even then I was making compromises on RAM to keep to budget (hence the bizarre RAM total). Wouldn't recommend it at current prices, since a rough estimate for the same build hits around 3.5k-3.8k USD. Putting more emphasis on VRAM might make more sense... which is a bizarre thing to say.

There are a lot of cases where a faster, lower-parameter model is a better choice, even with this setup. For synchronous work, smaller or MoE-focused models are night-and-day in terms of being able to just throw tokens at a problem. Even for async work, sometimes GLM-4.5-Air (110B, versus GLM-4.6's 357B) is going to save enough time and energy that it's close enough, and something like Cydonia (24B) can handle longer contexts surprisingly well if you prompt carefully. Hell, I've got a few models I've requantized down so I can run shorter prompts at the higher fidelity with all layers on GPU, and then drop down to a 'dumber' variant for long-context operations that would exceed VRAM.
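The "high fidelity for short prompts, dumber variant for long context" switch can be sketched roughly like this. Everything in it is illustrative: the `Variant` class, the variant names, the weight sizes, and the per-token cache cost are placeholder numbers I picked so the example runs, not figures from any real model or from my actual setup.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Variant:
    name: str
    weights_gb: float        # size of the quantized weights (placeholder)
    kv_mb_per_token: float   # assumed context-cache cost per token (placeholder)


def fits_on_gpu(v: Variant, context_tokens: int, vram_gb: float,
                overhead_gb: float = 1.5) -> bool:
    """True if weights + context cache + assumed overhead fit in VRAM."""
    kv_gb = v.kv_mb_per_token * context_tokens / 1000.0
    return v.weights_gb + kv_gb + overhead_gb <= vram_gb


def pick_variant(variants: Sequence[Variant], context_tokens: int,
                 vram_gb: float) -> Optional[Variant]:
    """Return the first variant that fits; list them best-fidelity first."""
    for v in variants:
        if fits_on_gpu(v, context_tokens, vram_gb):
            return v
    return None  # nothing fits fully on GPU; fall back to GPU+RAM offload


# Hypothetical pair of requants of the same 24B-class model.
VARIANTS = [
    Variant("hi-fidelity", weights_gb=19.0, kv_mb_per_token=0.2),
    Variant("lower-bit", weights_gb=13.0, kv_mb_per_token=0.2),
]

if __name__ == "__main__":
    for tokens in (4_000, 32_000, 100_000):
        choice = pick_variant(VARIANTS, tokens, vram_gb=24.0)
        print(tokens, choice.name if choice else "offload to RAM")
```

With those placeholder numbers, a 4k-token prompt stays on the higher-fidelity requant, a 32k one drops to the lower-bit variant, and 100k doesn't fit on the card at all.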