This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.
Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.
We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:
- Shaming.
- Attempting to 'build consensus' or enforce ideological conformity.
- Making sweeping generalizations to vilify a group you dislike.
- Recruiting for a cause.
- Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.
In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:
- Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.
- Be as precise and charitable as you can. Don't paraphrase unflatteringly.
- Don't imply that someone said something they did not say, even if you think it follows from what they said.
- Write like everyone is reading and you want them to be included in the discussion.
On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.
Grok 3 just came out, and early reports say it’s shattering rankings.
Now there is always hype around these sorts of releases, but my understanding of the architecture of the compute cluster for Grok 3 makes me think there may be something to these claims. One of the exciting and interesting revelations is that it tends to perform extremely well across a broad range of applications, seemingly showing that if we just throw more compute at an LLM, it will tend to get better in a general way. Not sure what this means for more specifically trained models.
One of the most exciting things to me is that Grok 3 voice allegedly understands tone, pacing, and intention in conversations. I loved OpenAI's voice assistant until it cut me off every time I paused for more than a second. If Grok 3 is truly the first conversational AI, it could be a game changer.
I'm also curious how it compares to DeepSeek, if anyone here knows more than I do.
LMSYS ranking is not nothing, but we are getting to the point where most models are "good enough" from the perspective of the average LMSYS rater, and most of the interesting differentiation between models is happening in benchmarks that test specialized skill and knowledge that isn't necessarily common among LMSYS raters.
I couldn't be bothered to click through the tweets (I don't have a Twitter account) so I don't know if they published other benchmarks too.
Agreed. I'm happy to admit that, post-GPT-4, my ability to rigorously evaluate LLMs has failed me.
At least in medicine, most of the difficulty lies in retaining the enormous amount of knowledge required for a diagnosis; fluid intelligence plays a smaller role in comparison. Not a negligible one, by any means, but the difference between an IQ 120 and an IQ 130 doctor in most scenarios will mostly hinge on who remembered more. Even GPT-4 scored in the 99th percentile on the USMLE. A later study assessed it against several incredibly complicated paediatric cases, and it didn't beat a panel of doctors. But those doctors were professors or highly renowned specialists in their fields; a panel of average paediatricians or random medical professionals would have bombed the same cases. Even then, the authors said GPT-4 performed adequately: while it might not have come up with a perfect diagnosis and management plan, it didn't do anything stupid or dangerous. And GPT-4 is old. [1]
I don't think I could draft a question that I could answer and an LLM couldn't. Not easily, at least. I'd probably have to crack open a textbook, find an obscure disease, and then try to contrive multiple interactions on top of it. The sheer amount of "general knowledge" LLMs have is vastly superhuman. It's still possible to be a domain expert and exceed it in one aspect, but not to be a generalist who knows remotely as much as they do.
[1] My Google-fu fails me. I don't think I'm misremembering the gist of the study, but it turns out humans can be fallible and hallucinate too, not just LLMs.
Do you mean fluid intelligence?
Fluid intelligence is "figuring out a new, unfamiliar problem"; crystallized intelligence is "accumulating enough learned knowledge that you can apply some of it straightforwardly". IMHO the latter is what LLMs are already really good at, while the former is where they're still shaky. I can ask qualitative questions of AIs about my field and get answers that I'd be happy to see from a young grad student, but if I ask questions that require more precise answers and/or symbol manipulation, they still tend to drop or omit terms while confidently stating that they've done no such thing. That confidence is why I'd never use a yes/no question as a test; even if one gets it right, I'd want to see a proof or at least a chain of reasoning to be sure it didn't get it right by accident.
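To make that concrete, here's a minimal sketch of the kind of test I mean: ask for the full working rather than a bare yes/no, then check the final answer separately while keeping the reasoning for human review. The `ask_model` function is a hypothetical placeholder, not any particular vendor's API.

```python
def ask_model(prompt: str) -> str:
    """Placeholder for a call to whatever LLM you're testing; swap in a real client."""
    raise NotImplementedError

def evaluate(question: str, expected_answer: str) -> dict:
    # Force the model to show its work so a lucky guess is distinguishable
    # from a correct derivation.
    prompt = (
        f"{question}\n\n"
        "Work through the problem step by step, then give your final answer "
        "on a separate last line in the form 'ANSWER: <value>'."
    )
    reply = ask_model(prompt)
    last_line = reply.strip().splitlines()[-1]
    final = last_line.removeprefix("ANSWER:").strip()
    return {
        "final_answer_correct": final == expected_answer,
        # The chain of reasoning still needs a human (or a grader) to check;
        # a correct final answer alone proves little.
        "reasoning": reply,
    }
```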
You're right, I'll edit that.
I understand that mathematicians develop and use jargon for eminently sensible reasons. But they do make things difficult for outsiders; for example, I just screwed up while evaluating LLMs on a maths problem that I thought I remembered the answer to. When my cousin with the actual maths PhD walked me through the explanation, it made perfect intuitive sense, but I'll be damned if I go through a topology 101 lesson to try and grok the reasoning involved.
In medicine and psychiatry, I think current LLMs are >99% correct in answering most queries. Quite often, what I perceive as an error turns out to be a misunderstanding on my part. I'm not a senior enough psychiatrist to bust out the real monster topics, but I still expect they'd do a good job. Factual knowledge is more than half the battle.