Culture War Roundup for the week of July 3, 2023

This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.

Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.

We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:

  • Shaming.

  • Attempting to 'build consensus' or enforce ideological conformity.

  • Making sweeping generalizations to vilify a group you dislike.

  • Recruiting for a cause.

  • Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.

In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:

  • Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.

  • Be as precise and charitable as you can. Don't paraphrase unflatteringly.

  • Don't imply that someone said something they did not say, even if you think it follows from what they said.

  • Write like everyone is reading and you want them to be included in the discussion.

On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.


He is likely referring to this from pages 11-12 of the GPT-4 whitepaper:

GPT-4 can also be confidently wrong in its predictions, not taking care to double-check work when it’s likely to make a mistake. Interestingly, the pre-trained model is highly calibrated (its predicted confidence in an answer generally matches the probability of being correct). However, after the post-training process, the calibration is reduced (Figure 8).
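For what "calibrated" means concretely here: bucket a model's answers by its stated confidence and compare each bucket's average confidence to its actual accuracy. A minimal sketch of that measurement (expected calibration error) follows; the `(confidence, correct)` pairs are invented for illustration, not real GPT-4 data.

```python
# Sketch of measuring calibration: bin predictions by stated confidence
# and compare each bin's average confidence to its empirical accuracy.
# A calibrated model has these match; an overconfident one does not.

def expected_calibration_error(preds, n_bins=5):
    bins = [[] for _ in range(n_bins)]
    for conf, correct in preds:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    total = len(preds)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece

# Calibrated: 80%-confidence answers are right ~80% of the time.
calibrated = [(0.8, 1)] * 8 + [(0.8, 0)] * 2
# Overconfident: 95% stated confidence, only 60% accuracy.
overconfident = [(0.95, 1)] * 6 + [(0.95, 0)] * 4
print(expected_calibration_error(calibrated))     # near 0
print(expected_calibration_error(overconfident))  # near 0.35
```

The whitepaper's Figure 8 is this comparison done properly, per confidence bin, on exam questions.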

In any case, the articles you quote are oversimplified and inaccurate. Predicting text (and then satisfying RLHF) is how it was trained, but what it evolved into to best satisfy that training regime is a bunch of incomprehensible weights that clearly have some sort of general reasoning capability buried in there. You don't need to do statistical tests of its calibration to see that, because something that was truly just doing statistical prediction of text, without having developed reasoning or a world-model to help with that task, wouldn't be able to do even the most basic reasoning like this unless it had already appeared in the text it was trained on.

It's like saying "humans can't reason, they're only maximizing the spread of their genes". Yes, if you aren't familiar with the behavior of LLMs/humans, understanding what they evolved to do is important to understanding that behavior. It's better than naively assuming that they're just truth-generators. If you wanted to prove that humans don't reason, you could point out all sorts of cognitive flaws and shortcuts with obvious evolutionary origins and say "look, it's just statistically approximating what causes more gene propagation". Humans will be scared of things like spiders even if they know they're harmless, because they evolved to reproduce, not to reason perfectly, like an LLM failing at Idiot's Monty Hall because it evolved to predict text and similar text showed up a lot. (For that matter, humans make errors based on pattern-matching ideas to something they're familiar with all the time, even without it being a deeply-buried instinct.)

But the capability to reason is much more efficient than trying to memorize every situation that might come up, for both the task "predict text and satisfy RLHF" and the task "reproduce in the ancestral environment", and so they can do that too. They obviously can't reason at the level of a human, and I'd guess that getting there will involve designing something more complicated than just scaling up GPT-4, but they can reason.

Yes, that is the section I had previously read. I let my severe annoyance at Hlynka's uncouthness stop me from doing the work of reading through it again to find it. Good to know it wasn't a fevered hallucination on my part haha

You don't need to do statistical tests of its calibration to see that, because something that was truly just doing statistical prediction of text, without having developed reasoning or a world-model to help with that task, wouldn't be able to do even the most basic reasoning like this unless it had already appeared in the text it was trained on.

I opened up Bing Chat, powered by GPT-4, and I tried that example. I got "The diamond is still inside the thimble inside the coffee cup on the kitchen counter". In fact, I've yet to see a single example of an LLM's supposed ability to reason replicated outside of a screenshot.

Well. I tried Bing Chat just now and got this.

It is worth noting that the settings besides "Creative" tend to have worse performance for these sorts of tasks. You may want to rerun it on that. Personally I don't have any difficulty believing LLMs can perform some semblance of "reasoning" -- even GPT-3 can perform transformations like refactoring a function into multiple smaller functions with descriptive names and explanatory comments (on a codebase it's never seen before, calling an API that didn't exist when its training data was scraped). It is obviously modeling something more general there, whether you want to call it "reasoning" or not.

From following a rather discreet Twitter account belonging to one of the lead devs for Bing Chat, I've learned that Creative mode is the one that most consistently uses GPT-4. All the others use older models, at least most of the time.

Even Creative can apparently relegate queries that another model deems low-complexity to a simpler LLM.

(Running GPT-4 as a practically free public service is expensive)

Despite being based on GPT-4, Bing is apparently well-known for performing dramatically worse. There have been some complaints of GPT-4's performance degrading too, presumably due to some combination of OpenAI trying to make it cheaper to run (with model quantization?) and adding more fine-tuning to stop people from getting it to say offensive things, but hopefully not to the extent that it would consistently fail that sort of world-modeling. (If anyone with a subscription wants to also test older versions of GPT-4, it sounds like they're still accessible in Playground?)

I don't think it's plausible that all the examples of GPT-4 doing that sort of thing are faked, not when anyone shelling out the $20 can try it themselves. And people use it for things like programming, you can't do that without reasoning, just a less familiar form of reasoning than the example I gave.

You don't even need $20, if you're willing to hunt down discord bots that largely use leaked API keys (some are ad-supported).

The ChatGPT subreddit's official discord server has one or two, and while I know better ones that are less legit, I don't broadcast their existence more than I have to, because that only increases the likelihood of losing free access to something I really enjoy.

Bing Chat, even in Creative mode, is only a poor man's GPT-4.

I don't think it's plausible that all the examples of GPT-4 doing that sort of thing are faked, not when anyone shelling out the $20 can try it themselves. And people use it for things like programming, you can't do that without reasoning, just a less familiar form of reasoning than the example I gave.

My problem is, while I'm sure that not all the examples of GPT-4 seeming to get complex reasoning tasks are fake, if they cannot be replicated, what good are they? If GPT-4's ability to "reason" is ephemeral and seemingly random, is it really reasoning, or is it just occasionally getting lucky at ordering abstract tokens for its monkey overlords?

You know, it's funny, I went through the linked whitepaper. Skimmed, mostly. It made few positive, objective claims about GPT-4's ability to reason. It mostly said it could reason "better" than previous iterations, and had been trained on a dataset to encourage mathematical reasoning. Notably, they say:

It can sometimes make simple reasoning errors which do not seem to comport with competence across so many domains

I saw some of the prompts where they asked GPT-4 to explain its reasoning, and was underwhelmed. They were extremely rudimentary mathematical tasks of the 5th-grade word problem sort, and its purported "reasoning" could have easily been imitation of training data. When I saw that, I took a closer look at how it performed in assorted tests, and saw it comprehensively failed the AP English Language and Composition and AP English Literature and Composition tests. Which makes sense to me, because a lot of those tests involve more generalized and flexible reasoning than the sorts of formalized mathematical logic examples it might plausibly be trained to imitate.

When I saw that, I took a closer look at how it performed in assorted tests, and saw it comprehensively failed the AP English Language and Composition and AP English Literature and Composition tests. Which makes sense to me, because a lot of those tests involve more generalized and flexible reasoning than the sorts of formalized mathematical logic examples it might plausibly be trained to imitate.

Come on, most of the UK parliament can't even give the probability of two coins both coming up heads: https://www.bbc.com/news/uk-19801666

Most people can't even imitate intelligence, by your logic.

GPT-4 has vastly superhuman knowledge, superhuman language knowledge, superhuman speed. Its reasoning skills are well above most of humanity's. Most people can't program at all, let alone in every language it knows, or use as much software as it can. These niggling flaws on the AP English exams probably have more to do with the arcane and arbitrary scoring mechanisms of those tests. It can write just fine. Its prose is not amazing and tends to be rather cliche and predictable, but that has a lot to do with the RLHF.
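For reference, the parliament question above is just independent-event multiplication, which can be sketched (and brute-force checked) in a few lines:

```python
from fractions import Fraction
from itertools import product

# Two fair coins are independent events, so the probabilities multiply:
# P(both heads) = 1/2 * 1/2 = 1/4.
p_heads = Fraction(1, 2)
p_both = p_heads * p_heads

# Cross-check by enumerating the four equally likely outcomes:
# HH, HT, TH, TT -- exactly one of which is both heads.
outcomes = list(product("HT", repeat=2))
p_enumerated = Fraction(sum(o == ("H", "H") for o in outcomes), len(outcomes))

print(p_both)        # 1/4
assert p_enumerated == p_both
```

Per the BBC article, only about 40% of the MPs surveyed got this right.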

Come on, most of the UK parliament can't even give the probability of two coins both coming up heads: https://www.bbc.com/news/uk-19801666

That is astonishing.

I wonder how well the average EU parliamentarian or US congressman etc. would do. I can’t imagine that the average Politburo member in China would be this bad.

Well, Britain has been in decline for the last century... I think we could learn a great deal by watching what they've done and what they do, and committing to the opposite. They deliberately crushed Birmingham, for instance: 'too much development in this rich industrial region, stop!' https://unherd.com/2020/09/the-plot-against-mercia/

And apparently NHS maternity incompetence costs twice as much as NHS maternities themselves: https://www.thetimes.co.uk/article/maternity-payouts-twice-cost-of-care-times-health-commission-svdhsjhqk

The Chinese politburo has a fair few members with backgrounds in science and engineering; I made a post about it a while ago, which was contested. Xi at least has an engineering background; other Politburo members have tiny Wikipedia pages. https://www.themotte.org/post/238/what-if-your-entire-worldview-was/44213?context=8#context

The Chinese politburo has a fair few members with backgrounds in science and engineering; I made a post about it a while ago, which was contested. Xi at least has an engineering background; other Politburo members have tiny Wikipedia pages. https://www.themotte.org/post/238/what-if-your-entire-worldview-was/44213?context=8#context

My understanding was that politburo members were, at one point, mostly or completely of an engineering or science background, but this has relaxed recently. Regardless, the point stands.

I think this holds for the other East Asian countries though, even with a pretty low estimation of the Diet or the National Assembly of Korea.

Wonder what it’s like in other countries.

The only field where GPT-4 outright disappoints me is in writing fiction. Better than the average human, maybe, because the average human is absolutely shit (just look at the majority of posts on /r/WritingPrompts), but still not good.

I've tried to lean on it in my own writing, because I'm lazy, but other than some help with brainstorming, this is not a strength.

It's good at sparking ideas, indeed.

My problem is, while I'm sure that not all the examples of GPT-4 seeming to get complex reasoning tasks are fake, if they cannot be replicated, what good are they?

I am saying they can be replicated, just by someone who, unlike you or me, has paid the $20. I suppose it is possible that the supposed degradation in its capabilities has messed up these sorts of questions as well, but probably not.

If GPT-4's ability to "reason" is ephemeral and seemingly random, is it really reasoning, or is it just occasionally getting lucky at ordering abstract tokens for its monkey overlords?

There is a big difference between random guessing and having a capability that sometimes doesn't work, particularly when the chance of randomly getting the right result without understanding is low enough. Text generators based on Markov chains could output something that looked like programming, but they did not output working programs, because that outcome is unlikely enough that creating a novel program is not something you can just randomly stumble upon without some idea of what you're doing. In any case, as far as I know GPT-4 is not that unreliable, especially once you find the prompts that work for the task you want.
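The Markov-chain contrast is easy to make concrete. A toy word-level chain picks each word only from words that followed the previous word in its training text, so its output is locally plausible but has no global structure (nothing guarantees parentheses match or the "program" runs). The tiny corpus below is invented for illustration:

```python
import random
from collections import defaultdict

# Train a word-level Markov chain on a tiny code-like corpus.
corpus = (
    "def add ( a , b ) : return a + b "
    "def sub ( a , b ) : return a - b "
    "def mul ( a , b ) : return a * b"
).split()

transitions = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev].append(nxt)

def generate(start, length, seed=0):
    # Walk the chain: each step only knows the single previous word.
    rng = random.Random(seed)
    word, out = start, [start]
    for _ in range(length):
        if word not in transitions:
            break
        word = rng.choice(transitions[word])
        out.append(word)
    return " ".join(out)

print(generate("def", 12))
```

Every word it emits appeared in training and every bigram is locally valid, yet it will happily stitch together fragments like `return a + b ) :` with no notion of what a function is. That's the failure mode that purely local statistics produce, and the one GPT-4's working programs argue it has moved beyond.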

Which makes sense to me, because a lot of those tests involve more generalized and flexible reasoning than the sorts of formalized mathematical logic examples it might plausibly be trained to imitate.

How well it reasons is a different question from whether it reasons at all. It is by human standards very imbalanced in how much it knows vs. how well it reasons, so yes people who think it is human-level are generally being fooled by its greater knowledge. But the reasoning is there and it's what makes a lot of the rest possible. Give it a programming task and most of what it does might be copying common methods of doing things that it came across in training, but without the capability to reason it would have no idea of how to determine what methods to use and fit them together without memorizing the exact same task from elsewhere. So practical use is generally going to involve a lot of memorized material, but anyone with a subscription can come up with novel questions to test its reasoning capabilities alone.

If GPT-4's ability to "reason" is ephemeral and seemingly random, is it really reasoning, or is it just occasionally getting lucky at ordering abstract tokens for its monkey overlords?

I think the issue is that a human's ability to "reason" is also ephemeral and seemingly random. Just less random, with a lower failure rate, but still fairly random and certainly ephemeral, even for the most reasonable and logical of people. Given that, the difference in ability to reason is one of degree, not of kind. The question remains whether the random failures of reasoning in LLMs can get small/rare enough to be comparable to those of a somewhat competent human.