Do you have a dumb question that you're kind of embarrassed to ask in the main thread? Is there something you're just not sure about?
This is your opportunity to ask questions. No question too simple or too silly.
Culture war topics are accepted, and proposals for a better intro post are appreciated.

How did AI impress you this week?
I have been playing STS2 a lot this week. When I asked the AI to create an advanced savescum script and just pointed it at the save directory, I was not impressed when it succeeded on the first try. When I asked it to check whether anything in the save was editable (that was the literal prompt) and had it create a script to edit a save with more gold and higher max and current XP, I was not impressed. When it had to figure out the Courrier artifact and add it to a new game, I was also not impressed. But when the AI started showing awareness of the game itself, talking about maps and exit nodes, when it tried to help me because I was stuck (if there are no relics to be had, the chests don't generate a circlet, but get stuck), and when it figured out on its own that it should ask me which path to take after the check - well, I was impressed.
Also, the Codex CLI's personality is a lot less annoying than its ChatGPT counterpart.
Several people on this website have already sung the praises of cloud LLMs and large local LLMs (Grok jailbroken (1 2), GLM derestricted (1 2)). IMO, it is also worth pointing out that, if you are neither willing to jump through hoops for cloud providers nor equipped with a multi-kilodollar local GPU setup, even small local LLMs can be surprisingly good at writing. Here is an example (prompt+output, then three separate prompt+output branches with the first prompt+output still in context) generated with my cute little 12-GiB GPU.
The specific model that I used for this example is FlareRebellion/WeirdCompound-v1.6-24b (1 2). According to one leaderboard:
Even the apparently minor difference between CriminalComputingConfig's writing score of 35 and WeirdCompound's writing score of 44 is noticeable.
More example prompts (without outputs)
I'm sorry, you're using GLM-4.6 on a 12 GB VRAM card?
Are you swapping weights in and out from SSD? I tried it once and it took about 5 min for the first word.
No, I'm using FlareRebellion/WeirdCompound-v1.6-24b. @gattsuru is the person who mentioned (in a comment that I linked above) that he is using GLM.
I'm sorry, I lacked reading comprehension. I thought this was your leaderboard - it lists GLM as 'local', which seems a bit optimistic.
I've run GLM-4.5 and GLM-4.6 on a desktop computer that set me back around 1.5k USD, including everything down to the power supply and case, using just an Nvidia 3090 24GB and 224GB RAM. You take a significant performance penalty going with GPU+RAM compared to a pure-GPU run, but it's nowhere near as bad as pure-CPU or CPU+SSD speeds. It's not fast enough for synchronous work like a Codex replacement, by a long shot; but for something you can set-and-forget with a series of prompts to run overnight, it's fine, and it can churn out 10k-ish words at around 150 W power consumption. For GLM-4.6 specifically you do end up needing one of the more heavily trimmed quants, but you can run it down to 32GB VRAM + 128GB RAM without cutting too hard on context.
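The weight math behind that 32GB VRAM + 128GB RAM figure is easy to sanity-check. Here is a rough sketch; the parameter count and bits-per-weight are illustrative assumptions, not measured values for any specific quant file.

```python
# Rough budgeting for a quantized model split across GPU VRAM and
# system RAM. Numbers are back-of-the-envelope assumptions.

def quantized_size_gib(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight size in GiB at a given quantization level."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 2**30

# GLM-4.6 is ~357B parameters; assume a heavily trimmed ~3-bit quant:
total = quantized_size_gib(357, 3.0)
print(f"~{total:.0f} GiB of weights")  # ~125 GiB

# With 32 GiB of weights held on the GPU, the rest spills into RAM:
vram_gib = 32
print(f"~{total - vram_gib:.0f} GiB of system RAM for weights alone")
```

That leaves some headroom in a 128 GB RAM pool for the KV cache and the OS, which matches the "without cutting too hard on context" claim.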
((I will caveat that the heavily-trimmed quants give weird failure modes. I'd naively expected quantization to result in typos, logic problems, or looping, and sometimes that happens, but you also get bizarre focuses on certain names, places, or plot points not present in more-precise variants.))
That said, it cost me 1.5k USD at September 2025 prices, and even then I was making compromises on RAM to keep to budget (hence the bizarre RAM total). Wouldn't recommend it at current prices, since a rough estimate hits around 3.5k-3.8k. Putting more emphasis on VRAM might make more sense... which is a bizarre thing to say.
There are a lot of cases where a faster, lower-parameter model is a better choice, even with this setup. For synchronous work, smaller or MoE-focused models are night-and-day in terms of being able to just throw tokens at a problem. Even for async work, sometimes GLM-4.5-Air (110B, versus GLM-4.6's 357B) is going to save enough time and energy that it's close enough, and something like Cydonia (24B) can handle longer contexts surprisingly well if you prompt carefully. Hell, I've got a few models I've requantized down so I can run shorter prompts at higher fidelity with all layers on the GPU, then drop down to a 'dumber' variant for long-context operations that would exceed VRAM.
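The reason long contexts force that drop to a 'dumber' variant is the KV cache, which grows linearly with context length on top of the weights. A rough sketch; the architecture numbers below (40 layers, 8 KV heads, head dim 128) are assumptions for a generic ~24B dense model, not published specs for Cydonia.

```python
# Back-of-the-envelope KV-cache sizing, to see why long contexts
# push layers off the GPU. Architecture numbers are assumptions.

def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GiB: keys + values, every layer, every token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens / 2**30

# fp16 cache at 32k context for the assumed 24B-class architecture:
print(f"{kv_cache_gib(40, 8, 128, 32768):.1f} GiB")  # 5.0 GiB
```

A few GiB of cache is the difference between all layers fitting in 24 GB VRAM alongside the weights and some layers spilling to RAM, which is why quantizing the weights harder (or the cache itself) buys context.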
It appears that GLM can be run productively (at 4-bit quantization) on a computer that contains two 96-GiB GPUs. That's very expensive but far from impossible.
You aren’t going to be doing this under your desk. If you’re renting the compute from someone, it’s a cloud service for all intents and purposes.
Personally, I would love for a GPU revolution from AMD to make this stuff possible for consumers. Anything below 300B is surprisingly impressive but IMO just not good enough if you care about consistency and detail over 30,000 tokens. Has any interesting new model come out?
It appears that people indeed are doing so.
I have no idea. I'm just a dabbler.