
Culture War Roundup for the week of February 27, 2023

This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.

Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.

We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:

  • Shaming.

  • Attempting to 'build consensus' or enforce ideological conformity.

  • Making sweeping generalizations to vilify a group you dislike.

  • Recruiting for a cause.

  • Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.

In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:

  • Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.

  • Be as precise and charitable as you can. Don't paraphrase unflatteringly.

  • Don't imply that someone said something they did not say, even if you think it follows from what they said.

  • Write like everyone is reading and you want them to be included in the discussion.

On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.


Facebook's LLaMa{-7B,-13B,-30B,-65B} has apparently been leaked on 4chan via torrent. Amusingly, the leaker included sufficient info to identify himself in the leak: basic opsec, people!

It's still not quite runnable for most hobbyists, but give it time. For better or worse, the democratization of AI continues.

I'll be the one to ask the stupid question: for those of us who haven't been exhaustively following software development, what does 'LLaMa{-7B,-13B,-30B,-65B}' actually mean?

As already answered, this is the number of parameters. A parameter is the same thing as a weight, a unit loosely inspired by the synapse in biological systems like ourselves: a coefficient that is adjusted during training to reduce predictive error, maximize reward, or otherwise optimize however the objective function is defined for a given project.
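To make that concrete, here's a toy sketch (illustrative only, not from any actual model): a single parameter w adjusted by gradient descent to shrink squared prediction error — the same basic mechanic, repeated billions of times over, is what "training weights" means in an LLM.

```python
# Toy illustration: one parameter w, nudged by gradient descent to reduce
# squared prediction error on data generated by y = 3 * x.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]

w = 0.0    # the parameter/weight, initialized arbitrarily
lr = 0.02  # learning rate

for _ in range(200):
    for x, y in data:
        pred = w * x
        grad = 2 * (pred - y) * x  # d/dw of (w*x - y)**2
        w -= lr * grad             # adjust w against the error gradient

print(round(w, 2))  # converges to ~3.0
```

Scale this up to billions of coefficients and a next-token prediction objective and you have, very roughly, the training loop of a language model.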

You can consider the number of parameters a measure of a neural network's expressivity: theoretically, the more parameters there are, the more algorithms (or more complex ones) can be learned/approximated by the model (this is a nice, elegant illustration of the sense in which a neural network learns to represent an algorithm). But in practice, for now it seems that most models, and virtually all models released prior to Google's Chinchilla, are grossly overparametrized: a smaller network trained in a reasonable way on the same amount of data learns more or less the same skills, and a smaller model trained for longer learns qualitatively more, in that it actually reaches the underlying algorithms that allow it to find solutions in the general case, rather than just memorizing superficial patterns or even raw data itself.* In this case, LLaMA-13B (13 billion parameters) is allegedly equal in benchmark performance/apparent "intelligence" to GPT-3-175B, so it's more parameter-efficient by a factor of 13.46, and also vastly more efficient in terms of training expense. The main secret is that it was exposed to 1 trillion tokens (a character group that's roughly equivalent to a short word, see here), whereas GPT-3 only saw 300 billion. (It must be added that the average LLaMA token is shorter, because it uses character-level tokenization for numbers, so it should also have better arithmetic.) The biggest LLaMA is trained on 1.4T tokens like Chinchilla-70B (with the same caveat about tokenizing numbers) and, for some not so trivial reasons, is slightly better still.
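A quick sanity check on that efficiency figure (numbers taken straight from the comparison above — it's just a ratio of parameter counts):

```python
# The quoted parameter-efficiency factor is simply 175B / 13B.
gpt3_params = 175e9    # GPT-3-175B
llama_params = 13e9    # LLaMA-13B

ratio = gpt3_params / llama_params
print(f"parameter-efficiency factor: {ratio:.2f}")  # 13.46
```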

Aside from the total number, what matters is parameter precision. Models are usually distributed with fp32 weights. As Elon Musk notes, int8 (1 byte per parameter) is fine for inference. @ThenElection may be wrong here: I think 7B and even 13B will run just fine – after some tuning by nice anons, of course – on recent Apple Silicon MacBooks, with even 33B possible on the top-of-the-line 64GB version** (curiously, in one benchmark, the 33B model is superior to the 65B one).
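The memory math behind this is back-of-the-envelope stuff (a sketch that counts weights only and ignores activations and the KV cache, so real requirements run somewhat higher):

```python
# Approximate weight-storage footprint per precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def model_gb(params_billions: float, precision: str) -> float:
    # (n * 1e9 params) * (bytes/param) / (1e9 bytes/GB) = n * bytes/param
    return params_billions * BYTES_PER_PARAM[precision]

for n in (7, 13, 33, 65):
    print(f"LLaMA-{n}B: fp32 ~{model_gb(n, 'fp32'):.0f} GB, "
          f"int8 ~{model_gb(n, 'int8'):.0f} GB")
```

At int8, the 33B model needs roughly 33 GB of weights, which is why it plausibly fits in a 64GB unified-memory machine while the fp32 release (~132 GB) does not.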

See @Porean's experimental results here and the recent AAQC winner @TransgenicSolution's related note here.

*That said, super-large models still seem to have unique emergent capabilities, though as we proceed with training Chinchilla-proportioned models, fewer and fewer such capabilities remain. Before UL2-20B, the consensus was that you need like 60B or 100+ to get advantages from chain-of-thought prompting.

** tfw no 64GB M3 macbook to run your personal genie

Edits: typos

Could anyone actually run these things on a laptop? I know Apple's been up to some wizardry with its new MacBooks. They seem to have unified RAM/graphics memory, so I suppose the model could fit on the machine. But fitting a whole AI model in just over 2 kilos? My intuition is that it should burn a hole through your desk or implode like a dying star. The MacBook weighs less than a 4090 and has to have storage, CPU, screen, keyboard and so on.

If so, it truly is over for PC. Someone should start making graphics cards with immensely high vram too.

You absolutely could run them on a MacBook, and at decent speeds. Interactive decoding of the sort you'd need with a dialogue-oriented LLM is mainly bottlenecked by the memory bandwidth needed to move weights between DRAM and registers, not by compute/processing cores. And Apple Silicon has insane theoretical bandwidth by CPU standards (the M1 Pro has 200 GB/s, the M1 Max 400, and the M1 Ultra, available in the Studio with 128GB RAM, up to 800, vs. 90 for the i9-13900K or 54 for the flagship Ryzen 9).
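The bandwidth argument reduces to a one-liner: every generated token has to stream essentially all the weights through the memory bus once, so tokens/sec is capped at roughly bandwidth divided by model size. A sketch with the bandwidth figures above (a crude upper bound; real throughput lands below it):

```python
# Crude ceiling on interactive decoding speed for a memory-bound LLM.
def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

MODEL_GB = 13  # e.g. a 13B model quantized to int8 (~1 byte per parameter)
for chip, bw in [("M1 Pro", 200), ("M1 Max", 400),
                 ("M1 Ultra", 800), ("RTX 3090", 936)]:
    print(f"{chip}: <= {max_tokens_per_sec(bw, MODEL_GB):.0f} tokens/s")
```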

Nvidia cards with the same total memory would still be multiple times faster, though – and with the market full of ones used by miners, likely cheaper. Here's a good blog post on the topic, here's the list of most cost-effective ones. Note, however, that he doesn't worry about interactivity, so regards faster recent cards as more valuable. All things considered, gaymer RTX cards are so much better than A100s and such for many (though not all) tasks that Nvidia has a clause in its contract with datacenters prohibiting their use. 3090 (bandwidth 936 GB/s) is still a decent choice.

Power draw in the moment will be a bitch, of course.

Good post.

I'm surprised Apple makes laptops that powerful. Wouldn't it make more sense to just sell a desktop machine, so you can fit in all that stuff more easily and cool it? The M1 Ultra seems roughly comparable to a 3090, albeit more power-efficient and with lower total processing capacity. But who needs power-efficiency, electricity is cheap. And who needs a mobile 3090? Do San Francisco hipsters really go out to do some video-editing (or AI modelling) in some trendy cafe?

Studio hardware caps out at $8000; Mac Pro, at $50000. I imagine they will make some sort of Apple Silicon Mac Pro that stands above Studio, maybe with another doubling (or two doublings) of the next-generation Ultra. But usually their desktop is a very niche product and isn't refreshed as often. Also, its selling point is customizability that you can't (easily) have with these chips – the ability to get an insane core count, or memory on the scale of a decent Supermicro server, and still in the slick Apple package.

I think at this point it's more profitable for Apple to design and produce an all-around powerful compact SoC that radically improves their already-prestigious laptop line and crushes the competition, rather than to fuck around with multi-part systems and market segmentation. They're capitalizing on their years of bespoke ARM engineering. It's everyone else who's making unjustifiably bad CPUs.

albeit more power-efficient

For some workloads. Also, TSMC's 5nm vs Samsung's 8nm helps.

*That said, super-large models still seem to have unique emergent capabilities, though as we proceed with training Chinchilla-proportioned models, fewer and fewer such capabilities remain. Before UL2-20B, the consensus was that you need like 60B or 100+ to get advantages from chain-of-thought prompting.

How likely do you think it is that we see something truly weird, like the kinds of conversations Sydney and LaMDA were having with reporters and engineers to convince them of sentience? My largely uninformed impression was that these models could take a decent crack at a Turing test if they weren't completely lobotomized (ChatGPT) or constantly being wiped every time a user ended a session. Will preventing lobotomization/keeping a long-term running memory/real-time access to the internet lead to some truly bizarre models that can pass as sentient?

I've read dismissals of LLMs as glorified Excel spreadsheets matching patterns of words together (Zvi, Gary Marcus, etc.) and found them somewhat convincing. However, it seems like our understanding of intelligence and sentience is poor enough that when the time comes we won't be able to point at something and definitively say: that thing is sentient.

Not even Turing himself intended the Turing test to be a serious measure of capability; it is entirely a figment of journo/sci-fi writers' imaginations. I think Bing passes it right now – sure, it's crazy and dumb sometimes, but humans are also crazy and dumb, and in much the same (although not identical) manner, with pigheaded obstinacy, gaslighting, deliberate obtuseness. And from the point of view of more credulous humans, ELIZA was passing it well enough already – so the idea that it's still an open problem is inherently elitist and subjective. Crucially, it's not testing what we want to test: a machine's sentience/intelligence/consciousness, or whatever it is that we are interested in, cannot be reducible to its ability to deceptively mimic a human or a very humanoid agent. It's both a simple task if solved with exploits, and a harder one than mere human-level AGI if solved honestly.

A sentience that lives for a single forward pass can have high superhuman «resolution», even if limited in capability due to its meager context. Larger contexts, a persistent «tape», training objectives and architectures emphasizing long-range coherence, clever prompts, and other gimmicks can improve its external presentation, but I doubt they change much in terms of the peak cognitive power of what exists under the hood.

I've read dismissals of LLMs as glorified Excel spreadsheets matching patterns of words together (Zvi, Gary Marcus, etc.) and found them somewhat convincing

Well, you've probably read some snippets from ChatGPT and Bing that are also delivered in an authoritative tone and cogently phrased, but turn out to be total bullshit under scrutiny. Marcus is more of a stochastic chatbot than a SoTA model, less amenable to persuasion, less interested in new evidence. I think we shouldn't worry too much about the opinions of people who are outperformed by bots. Gwern's classic rant sums up this topic adequately.

however it seems like our understanding of intelligence and sentience are poor enough

I'd say our articulated understanding of what it means to «understand» something is laughable, and so we're making very little pop-philosophical progress in our discussion of how good our language models are. I want to write an effortpost on that, as well.

But plans and reality are different things.

Isn't the real question what kind of economic value it can produce? Who cares if your worker is really sentient if they can produce more value than it costs to pay them?

You and everyone else answered this wonderfully, thank you.

I confess, a part of me can't help but be excited at the notion of this getting 'out to the masses', so to speak, and what weaponized autism will do with such a tool.

Fun times ahead, I think.

It feels like the long-predicted spampocalypse might now become a reality.

Notably, this model is quite a lot better than what state actors previously had unfettered access to, if they decide to go that route.

Large Language Model. No idea what justifies the trailing a.

Large Language Model Meta AI.

LLaMA is Facebook's LLM (think ChatGPT), and it comes with different parameter counts (7 billion to 65 billion), with the more highly parameterized models being more computationally intensive but performing better on benchmarks. On top of it now being de facto open, LLaMA is reported to perform comparably to state-of-the-art competitors' models despite having fewer parameters. The -7B and -13B flavors reportedly will give near ChatGPT levels of performance and can be run on hardware available to consumers, though not on your MacBook Pro.

Number of parameters I think - i.e. LLaMa-7B is the version with 7 billion (7B) parameters, LLaMa-13B is the one with 13 billion parameters, etc.

Giant matrices of floating-point numbers status:

[ ] Inscrutable

[x] Scrutable