
Culture War Roundup for the week of December 4, 2023

This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.

Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.

We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:

  • Shaming.

  • Attempting to 'build consensus' or enforce ideological conformity.

  • Making sweeping generalizations to vilify a group you dislike.

  • Recruiting for a cause.

  • Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.

In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:

  • Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.

  • Be as precise and charitable as you can. Don't paraphrase unflatteringly.

  • Don't imply that someone said something they did not say, even if you think it follows from what they said.

  • Write like everyone is reading and you want them to be included in the discussion.

On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.


Google Gemini just launched

In other words, GPT-4 has just been beaten. About time, I'd say: I'm getting used to the pace of progress in AI being blistering, and it was threatening to slow down to just mild rash levels.

However, both my hands-on time with it, and the official benchmarks Google released suggest it's a minor, incremental improvement, one that doesn't stand up to the drastic improvement that GPT-4 represented over 3 or 3.5. [For clarity, I, like the rest of you, can only use Gemini Pro, the second best model]

Which is fine, because for a while now, people have been lambasting Google/Deepmind for being too incompetent to ship, or at least ship a competitive product, given how shitty Bard was when it launched, even after being upgraded once or twice.

However, Bard, now running the Gemini Pro model, seems to be roughly as good as paid GPT-4 on ChatGPT, or the free GPT-4 in Bing Copilot (previously Bing Chat). I have yet to spot any new use case it enables, in the sense that GPT-4 can reliably do tasks that simply had 3.5 flailing about in confusion, or worse, hallucinating incorrect answers, such as more involved questions in coding, medicine and everything else really.

However, Google hasn't yet publicly released the best Gemini model, which is currently undergoing a process analogous to the one GPT-4 and Claude 2 went through, namely more RLHF, red-teaming and safety testing. Pro is the next step down, but it seems pretty good to me, in the sense I would happily use it as an alternative to GPT-4, even if I have no strong opinion on which is better.

There's also a Nano model, which is stripped down to run on mobile devices, and is now being used on the Pixel 8 Pro for a few tasks, potentially silencing the people who claimed its AI-specific computing components were a marketing gimmick, especially since it seemed to offload most AI tasks to the cloud.

Miscellaneous observations:

  1. Bard is fast as fuck compared to GPT-4, in terms of generation speed. It always was, but previously in the "I'm doing 2000 calculations a second in my head, and they're all wrong" sense. (GPT-4, at least before Turbo was released, was always pretty slow compared to the competition. Far more usable, but at the very least I read faster than it can write.)
  2. A quick search suggests all the models have a 32k token context window, i.e. an operating memory of roughly the last 25k words they read and wrote. Good, if not remotely groundbreaking.
  3. This heavily suggests OAI will ship GPT-5 soon, instead of being content to milk 4 when it ran rings around the competition.
  4. It's multimodal, but then again so was GPT-4 from the start, the capability was just cordoned off for a bit.
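For the context window figure in item 2, the conversion from tokens to words is simple arithmetic. A minimal sketch, assuming the common rule of thumb of roughly 0.75 English words per token (that ratio is an assumption, not a number Google published, and it varies by tokenizer and language):

```python
# Assumed average for English text; real tokenizers vary.
WORDS_PER_TOKEN = 0.75

def approx_words(context_tokens, words_per_token=WORDS_PER_TOKEN):
    """Rough number of words that fit in a context window of the given size."""
    return int(context_tokens * words_per_token)

# The 32k window from the post, plus the larger windows mentioned in the replies.
for tokens in (32_000, 128_000, 200_000):
    print(f"{tokens:>7} tokens ~ {approx_words(tokens):>7} words")
```

Under that assumption a 32k window lands at about 24k words, in line with the "about 25k" estimate above.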

To the extent I don't think the next generation (or two) of models after GPT-4 are an existential threat, I'm happy to see them finally arriving. There really isn't much more needed before even the best of us are entirely obsolete, at least for cognitive labor, and something as archaic as GPT-4 was scoring at the 95th percentile in the USMLE, so I'm preparing to explore my competitive advantage in panhandling. *

*This is a joke. For now.

Footnotes to the footnotes:

People on Twitter are correctly pointing out that GPT-4 underwent further post-launch improvements in benchmark scores, some of them pushing it past Gemini's published scores.

Also, just to be clear, the version of Gemini you can use now is not the best one, which may or may not be a modest improvement over GPT-4. Some claim it's more comparable to 3.5, but I haven't used that in ages, not when Bing makes 4 free.*

*Footnote^3 It's probably closer to 3.5. I'm sticking with Bing.

Toe-notes-

So far, it seems that Gemini is "competitive" with GPT-4. It's better at multimodal tasks, but for most people that's a minor fraction of their typical use case. For text, it's somewhere between close and roughly on par.

You can almost feel the desperation in the Deepmind researchers to find any way to massage things so that they come out ahead of GPT-4, from the misleading graphs, an egregious example to be found in a reply, to applying different standards in their inter-model comparisons, such as 5-shot prompting for GPT-4 versus Chain of thought 32 shot prompts for Gemini Ultra. At least the white paper doesn't outright lie, just mislead and prevaricate.

The MMLU is also flawed, with 2-3 percent of the questions simply broken, so a 1 or 2% improvement in score can be a bit questionable, let alone specifying performance to multiple decimal figures.
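To make the point about broken questions concrete, here is a rough worst-case check. It is only a sketch: the function names are made up for illustration, and it treats the broken fraction as a ceiling on how far a score could move if every broken question happened to flip for or against one model.

```python
def worst_case_swing(broken_fraction):
    """Largest score change, in percentage points, if every broken question flipped."""
    return broken_fraction * 100

def gap_is_within_noise(gap_points, broken_fraction):
    """True if a score gap could be explained entirely by broken questions."""
    return gap_points <= worst_case_swing(broken_fraction)

# With 2-3% of questions broken, a 1-2 point lead is not distinguishable
# from noise in the worst case, while a much larger lead still is.
for gap in (1.0, 2.0, 5.0):
    for broken in (0.02, 0.03):
        verdict = "within noise" if gap_is_within_noise(gap, broken) else "outside noise"
        print(f"gap={gap} pts, broken={broken:.0%}: {verdict}")
```

This ignores ordinary sampling error on top of the broken questions, so if anything it understates how shaky a second decimal place is.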

We don't see any comparisons to GPT-4 Turbo, but I don't hold that against them too hard, it just came out a few weeks back, perhaps not in time for them to finish their paper.

If you use the multimodal capabilities of Bard right now, it uses an older version that is pretty shit compared to GPT-4V or Bing.

Overall, the main benefit of Gemini's existence is that it shows Google isn't content to slumber indefinitely, and it can be competitive, better late than never. I expect GPT-5 to spank Gemini Ultra, and to the extent the latter accelerates the release of the former, I'm for it.

Predictions:

GPT-5 before end of 2024 - 90%

GPT-5 is superior to Gemini Ultra for most use cases, at the first point in time both coexist - 80%

A third competitor on par with either exists before 2025 - 60%

An OSS equivalent of GPT-4 comes out before 2025 - 70%

This result shouldn't be underestimated because Gemini-Ultra is merely on par/slightly better in text-based reasoning: it thoroughly beats GPT-4V on MMMU, the multimodal benchmark, including harder subscales; it also plays well with audio. People are for the most part functionally illiterate, so this is huge; and of course they will capitalize on Android and other ecosystem-wide advantages the Alphabet empire has. Multimodal language-like models will obviously be table stakes in 2024. (Bytedance guy even hints that they'll opensource a model on Gemini's level.)

Interesting that one of the people who had worked on aligning early Gemini said they had trouble aligning it – it burned through RLHF reward models, finding exploits and collapsing into gibberish (imagine using actual RLHF in 2023!). Maybe this has delayed the release, as well as the garden variety safetyism it has made more complex.

To be honest I was more excited about the other day's release of Mamba by Albert Gu and the legendary Tri Dao. There are many architectures that I expect will break through the Pareto frontier of a mature Transformer, but this one is the first that feels like an actual Vaswani et al. 2017 level advance. Unlimited context, here we come.

Hmm, it seems like I confused the MMMU and MMLU in my original post, despite knowing the difference. I'll edit accordingly.

The MMMU performance seems far more compelling compared to the latter, especially given Dean's methodology of zero-shotting both models.

As someone who is functionally literate, I certainly care more about text prowess, as I presume would most of the people here. But in terms of mundane value for the rest of the world, that will be handy.

Interesting that one of the people who had worked on aligning early Gemini said they had trouble aligning it – it burned through RLHF reward models, finding exploits and collapsing into gibberish (imagine using actual RLHF in 2023!). Maybe this has delayed the release, as well as the garden variety safetyism it has made more complex.

Interesting/mildly concerning. I haven't heard any claims of such difficulty in early GPT-4 or Claude, but OAI is probably the best at "alignment" in general, while Anthropic gimps their models to hell.

To be honest I was more excited about the other day's release of Mamba by Albert Gu and the legendary Tri Dao. There are many architectures that I expect will break through the Pareto frontier of a mature Transformer, but this one is the first that feels like an actual Vaswani et al. 2017 level advance. Unlimited context, here we come.

I am the wrong person to comment on such architectural concerns, but if people I respect, such as you and some others, do stress its importance, I'm all for it.

Certainly it seems to me that context windows (along with hallucinations) are the biggest impediments in making LLMs useful for more tasks.

I wonder what the deeper implications for human cognition are. I don't think there are people who can keep 25k words in their working memory, that seems to be much smaller, but we certainly don't usually forget the start of a novella by the time we reach the end. Is there a lot of caching and summarization going on?

At any rate, I hope it beats the annoying reality that 128k and 200k context window models begin to severely underperform, especially for data presented in the middle.

How does it stack up to RWKV?

I wonder what the deeper implications for human cognition are. I don't think there are people who can keep 25k words in their working memory, that seems to be much smaller, but we certainly don't usually forget the start of a novella by the time we reach the end. Is there a lot of caching and summarization going on?

Yes, there is in effect a lot of "caching and summarization" going on -- although that's probably our 2023 ooga-booga, not-quite-wrong way of talking about something else. LLMs really only have their context window and its feedback as a short-term memory. Which is fine for text translation, but is asinine if you want anything like a thinking engine. Goldfish with a notebook.

We and LLMs can both compress long stories into gists, but the LLMs just forget about it and repeat the work on every iteration. We remember the gists and use them as context on every iteration.
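The difference between the two strategies can be sketched as a toy. Everything here is hypothetical: `toy_gist` is a stand-in summarizer that just keeps the first few words, and the two reader classes are illustrations, not a description of how any real LLM or brain works. The only point is the amount of repeated work.

```python
def toy_gist(chunk, max_words=5):
    """Stand-in summarizer (an assumption): keep only the first few words."""
    return " ".join(chunk.split()[:max_words])

class StatelessReader:
    """Re-summarizes the whole history on every step, then forgets the gists."""
    def __init__(self):
        self.history = []
        self.summarizer_calls = 0

    def step(self, chunk):
        self.history.append(chunk)
        gists = []
        for past in self.history:
            gists.append(toy_gist(past))
            self.summarizer_calls += 1
        return " | ".join(gists)

class GistReader:
    """Summarizes each chunk once and keeps the gist as context for later steps."""
    def __init__(self):
        self.gists = []
        self.summarizer_calls = 0

    def step(self, chunk):
        self.gists.append(toy_gist(chunk))
        self.summarizer_calls += 1
        return " | ".join(self.gists)

def compare(chunks):
    """Feed the same chunks to both readers; return (same context?, calls, calls)."""
    stateless, remembering = StatelessReader(), GistReader()
    a = b = ""
    for chunk in chunks:
        a = stateless.step(chunk)
        b = remembering.step(chunk)
    return a == b, stateless.summarizer_calls, remembering.summarizer_calls

same, stateless_calls, gist_calls = compare(
    ["chapter one opens", "chapter two", "chapter three twists", "the ending"]
)
print(same, stateless_calls, gist_calls)
```

Both readers end up with the same context, but the stateless one does work that grows with the square of the number of chunks, while the one that remembers its gists does one summarization per chunk.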