Friday Fun Thread for April 24, 2026

Be advised: this thread is not for serious in-depth discussion of weighty topics (we have a link for that), nor is it for anything Culture War related. This thread is for Fun. You got jokes? Share 'em. You got silly questions? Ask 'em.

Has anyone here ever played around with training their own language model from scratch? I've been led to believe that it's achievable at only minor expense as long as you don't need your creation to be smart or useful. I've been toying with the idea of someday conjuring up some sort of idiot cartoon character as a pet and then giving it memory and a ton of scaffolding to make it periodically wake up and do things.

The thing is, I don't just want an open source model roleplaying as Homer Simpson (or whatever), I want it thinking nothing but Homer thoughts from the ground up. Honestly some enterprising nerd should start selling them like Pokemon or something. Dopey little models that run locally, use a ton of scripting to act as "alive" as possible, maybe a visual avatar, but no expectation that they'll be able to answer trivia questions or write code.
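
To be concrete about the "scaffolding" part, the kind of thing I have in mind is roughly this toy loop (just a sketch, assuming a small local model served through the Hugging Face text-generation pipeline; the model name and the memory.txt file are placeholders I made up):

```python
# Toy "wake up, remember, do something" loop -- illustration only.
# Assumes a small local model loadable by transformers; memory.txt is a made-up flat file.
import time
from pathlib import Path
from transformers import pipeline

generate = pipeline("text-generation", model="distilgpt2")  # stand-in model
memory_file = Path("memory.txt")

while True:
    # Reload whatever the character "remembers" and feed the tail of it back in.
    memory = memory_file.read_text() if memory_file.exists() else ""
    prompt = memory[-1000:] + "\nHomer wakes up and thinks:"
    output = generate(prompt, max_new_tokens=60)[0]["generated_text"]
    thought = output[len(prompt):]
    # Append the new thought to memory so the next wake-up builds on it.
    memory_file.write_text(memory + thought + "\n")
    time.sleep(3600)  # "periodically" = once an hour in this sketch
```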

I don't just want an open source model roleplaying as Homer Simpson (or whatever), I want it thinking nothing but Homer thoughts from the ground up.

The main problem for something like that will be training data. You say "at only minor expense", so I take it you're considering renting GPUs rather than just training on a gaming PC? The most common tutorials at that scale use many hundreds of millions of tokens of training data: training on TinyStories (500M tokens) or English Wikipedia (4B tokens) is common, and books3 (100B tokens) is sometimes used for more ambitious and capable toy models.
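
For a sense of what that from-scratch route involves, it's roughly the sketch below (assuming the Hugging Face transformers/datasets stack and the roneneldan/TinyStories dataset on the Hub; the tiny model config and the training arguments are illustrative guesses, not a tuned recipe):

```python
# Minimal sketch: pretrain a tiny GPT-style model from scratch on TinyStories.
from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling, GPT2Config,
                          GPT2LMHeadModel, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # reuse an existing tokenizer
tokenizer.pad_token = tokenizer.eos_token

dataset = load_dataset("roneneldan/TinyStories", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# A deliberately small config -- a few million parameters, trainable on one rented GPU.
config = GPT2Config(vocab_size=tokenizer.vocab_size, n_layer=4, n_head=4,
                    n_embd=256, n_positions=512)
model = GPT2LMHeadModel(config)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tiny-base", per_device_train_batch_size=16,
                           num_train_epochs=1, fp16=True),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```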

And even if you just take a small open source model (or your own toy model trained on TinyStories) and do some post-training fine-tuning, you'll still need millions of tokens of training data. I'm not sure Homer has had enough lines across the show's almost 40 seasons to even get that far.
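
The fine-tuning route looks more like this (again just a sketch, assuming transformers plus peft for LoRA; homer_lines.txt is a hypothetical corpus you'd still have to assemble, and the hyperparameters are placeholders):

```python
# Minimal sketch of the fine-tuning route: LoRA adapters on a small open model.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "gpt2"  # stand-in; any small causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Wrap the base model so only the low-rank adapter weights get trained.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

# homer_lines.txt: one line of dialogue per row -- hypothetical, you'd have to build it.
corpus = load_dataset("text", data_files={"train": "homer_lines.txt"})["train"]
tokenized = corpus.map(lambda b: tokenizer(b["text"], truncation=True, max_length=256),
                       batched=True, remove_columns=corpus.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="homer-lora", per_device_train_batch_size=8,
                           num_train_epochs=3),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```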

Alas, you're right, and it's not like I can afford to spend tens of thousands having ChatGPT Homerize Wikipedia for me. At least this is one of those ideas that will get easier the longer I procrastinate.

And even if you just take a small open source model (or your own toy model trained on TinyStories) and do some post-training fine-tuning, you'll still need millions of tokens of training data.

This sounds much more manageable, I'll keep it in mind if I ever start planning to take a real poke at this, thanks.