
Small-Scale Question Sunday for March 22, 2026

Do you have a dumb question that you're kind of embarrassed to ask in the main thread? Is there something you're just not sure about?

This is your opportunity to ask questions. No question too simple or too silly.

Culture war topics are accepted, and proposals for a better intro post are appreciated.


I'm sorry, you're using GLM-4.6 on a 12 GB VRAM card?

Are you swapping weights in and out from SSD? I tried it once and it took about 5 min for the first word.

No, I'm using FlareRebellion/WeirdCompound-v1.6-24b. @gattsuru is the person who mentioned (in a comment that I linked above) that he is using GLM.

I'm sorry, I lacked reading comprehension. I thought this was your leaderboard; it lists GLM as 'local', which seems a bit optimistic.

I've run GLM-4.5 and GLM-4.6 on a desktop computer that set me back around 1.5k USD, including everything down to the power supply and case, just using an NVIDIA 3090 24 GB and 224 GB of RAM. You take a significant performance penalty going with GPU+RAM compared to a pure-GPU run, but it's nowhere near as bad as pure-CPU or CPU+SSD speeds. It's not fast enough for synchronous work like a Codex replacement, by a long shot; for something where you can set-and-forget a series of prompts to run overnight, it's fine and can churn out 10k-ish words at around 150 W of power consumption. For GLM-4.6 specifically you do end up needing one of the more heavily-trimmed quants, but you can run it down to 32 GB VRAM + 128 GB RAM without cutting too hard on context.

((I will caveat that the heavily-trimmed quants give weird failure modes. I'd naively expected quantization to result in typos, logic problems, or looping, and sometimes that happens, but you also get bizarre focuses on certain names, places, or plot points not present in more-precise variants.))

That said, it cost me 1.5k USD at September 2025 prices, and even then I was making compromises on RAM to keep to budget (hence the bizarre RAM total). Wouldn't recommend it at current prices, since a rough estimate for the same build hits around 3.5k-3.8k USD. Putting more emphasis on VRAM might make more sense... which is a bizarre thing to say.

There are a lot of cases where a faster, lower-parameter model is a better choice, even with this setup. For synchronous work, smaller or MoE-focused models are night-and-day in terms of being able to just throw tokens at a problem. Even for async work, sometimes GLM-4.5-Air (110B, to GLM-4.6's 357B) is going to save enough time and energy that it's close enough, and something like Cydonia (24B) can handle longer contexts surprisingly well if you prompt carefully. Hell, I've got a few models I've requantized down so I can run shorter prompts at the higher fidelity with all layers on GPU, and then drop down to a 'dumber' variant for long-context operations that would exceed VRAM.
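For anyone who wants to sanity-check the "heavily-trimmed quants" point, here's a rough back-of-envelope sketch. It only uses the parameter counts and memory totals mentioned above; the `reserve_gb` allowance (context, OS, everything else) is a number I made up for illustration, and real quant formats carry per-block overhead that this ignores, so treat the output as an upper bound rather than a recommendation.

```python
# Back-of-envelope: the largest average bits-per-weight that fits in VRAM + RAM.
# reserve_gb is a hypothetical allowance for context, OS, etc.; tune it yourself.

def max_bits_per_weight(params_billion, vram_gb, ram_gb, reserve_gb=16.0):
    usable_gb = vram_gb + ram_gb - reserve_gb
    if usable_gb <= 0:
        raise ValueError("reserve exceeds total memory")
    # GB and billions of parameters share the same 1e9 factor, so it cancels.
    return usable_gb * 8 / params_billion

# Parameter counts (in billions) and memory totals as quoted in the thread.
MODELS = {"GLM-4.6": 357, "GLM-4.5-Air": 110}
BOXES = {
    "24 GB VRAM + 224 GB RAM": (24, 224),
    "32 GB VRAM + 128 GB RAM": (32, 128),
}

for model, params in MODELS.items():
    for box, (vram, ram) in BOXES.items():
        bpw = max_bits_per_weight(params, vram, ram)
        print(f"{model} on {box}: up to ~{bpw:.1f} bits/weight")
```

Under those assumptions the bigger box has room for roughly 5 bits per weight on the 357B model, and the smaller one lands a bit over 3, which lines up with needing the more aggressive quants there.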

It appears that GLM can be run productively (at 4-bit quantization) on a computer that contains two 96-GiB GPUs. That's very expensive but far from impossible.
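The arithmetic behind that is easy to sketch, assuming the 357B parameter count quoted upthread for GLM-4.6. This only counts the weights themselves; KV cache, activations, and quantization metadata are ignored, so the real headroom is smaller than what this prints. The function names are just for illustration.

```python
GIB = 2**30

def weight_gib(params_billion, bits_per_weight):
    """Size of the weights alone, in GiB, at a given average bits per weight."""
    return params_billion * 1e9 * bits_per_weight / 8 / GIB

def fits(params_billion, bits_per_weight, gpus=2, gib_per_gpu=96):
    """True if the weights alone fit in the combined VRAM of the GPUs."""
    return weight_gib(params_billion, bits_per_weight) < gpus * gib_per_gpu

for bits in (4.0, 8.0):
    size = weight_gib(357, bits)
    verdict = "fits" if fits(357, bits) else "does not fit"
    print(f"357B at {bits:.0f}-bit: ~{size:.0f} GiB of weights, {verdict} in 2 x 96 GiB")
```

At 4 bits the weights come in under the 192 GiB total; at 8 bits they don't, which is why the 4-bit figure is the one that matters here.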

You aren’t going to be doing this under your desk. If you’re renting the compute from someone, it’s a cloud service for all intents and purposes.

Personally I would love for a GPU revolution from AMD to make this stuff possible for consumers. Anything below 300B is surprisingly impressive but IMO just not good enough if you care about consistency and detail over 30,000 tokens. Has any interesting new model come out?

You aren't going to be doing this under your desk.

It appears that people indeed are doing so.

Has any interesting new model come out?

I have no idea. I'm just a dabbler.