
Culture War Roundup for the week of November 20, 2023

This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.

Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.

We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:

  • Shaming.

  • Attempting to 'build consensus' or enforce ideological conformity.

  • Making sweeping generalizations to vilify a group you dislike.

  • Recruiting for a cause.

  • Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.

In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:

  • Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.

  • Be as precise and charitable as you can. Don't paraphrase unflatteringly.

  • Don't imply that someone said something they did not say, even if you think it follows from what they said.

  • Write like everyone is reading and you want them to be included in the discussion.

On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.


A Maximizer maximizes

I have seen no evidence that explicit maximizers do particularly well in real-world environments. Hell, even in very simple game environments, we find that bags of learned heuristics outperform explicit simulation and tree search over future states in all but the very simplest of cases.

I think utility maximizers are probably anti-natural. Have you considered taking the reward-is-not-the-optimization-target pill?

I'm using "Maximizer" in a looser sense than a strict utility maximizer. The reason is that acquiring power and resources are such highly convergent goals that most entities will look like they're maximizing acquiring them to us, even if they have entirely different terminal goals or sets of heuristics in mind.

Think about it this way, would you disagree with the statement that, for a wide range of ideologies or terminal values seen in humans, the Solar System and then the rest of the galaxy will be either colonized or broken down for resources, given sufficient amounts of time (and technological feasibility, which I think is a reasonable assumption)? Be they communists, ancaps or anything in between, optimization pressure to be efficient in their exploitation will leave them, from the perspective of some alien with a telescope, looking like something attempting to "maximize" the conquest of the light cone.

In a similar manner, I expect a misaligned AGI to be very likely to attempt to seize as much energy and mass as it can, even if it's not strictly optimizing for the same, and that the agents/toy models we've seen are simply not smart enough to pursue such strategies.

I have read that LW post, but I've forgotten what the takeaway was, so I'll give it another read!

I do see where you're coming from in terms of instrumental convergence. Mainly I'm pointing that out because I spent quite a few years convinced of something along the lines of

  1. An explicit expected utility maximizer will eventually end up controlling the light cone
  2. Almost none of the utility functions it might have would be maximized in a universe that still contains humans
  3. Therefore an unaligned AI will probably kill everyone while maximizing some strange alien objective

And it took me quite a while to notice that the foundation of my belief was built on an argument that looks like

  1. In the limit, almost any imaginable utility function is not maximized by anything we would recognize as good.
  2. Any agent that can meaningfully be said to have goals at all will find that it needs resources to accomplish those goals
  3. Any agent that is trying to obtain resources will behave in a way that can be explained by it having a utility function that involves obtaining those resources.
  4. By 2 and 3, any agent that has any sort of goal will become a coherent utility maximizer as it gets more powerful. By 1, this will not end well.

And thinking this way kinda fucked me up for like 7 or 8 years. And then I spent some time doing mechinterp, and noticed that "maximize expected utility" looks nothing like what high-powered systems are doing, and that this was true even in places you would really expect to see EU maximizers (e.g. chess and go). Nor does it seem to be how humans operate.

And then I noticed that step 4 of that reasoning chain doesn't even follow from step 3, because "there exists some utility function that is consistent with the past behavior of the system" is not the same thing as "the system is actually trying to maximize that utility function".
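To make that gap concrete, here is a minimal sketch (not anything from the thread's participants; the function name and the toy history are made up for illustration). It builds a utility function that any recorded behavior "maximizes", which is why the mere existence of such a function says little about what the system is trying to do or what it will do next.

```python
def rationalizing_utility(history):
    """Return a utility function that the observed history trivially maximizes.

    history: iterable of (state, chosen_action) pairs.
    Assumption for simplicity: each state appears at most once.
    """
    observed = {state: action for state, action in history}

    def utility(state, action):
        # 1 for whatever the system actually did, 0 for everything else.
        return 1.0 if observed.get(state) == action else 0.0

    return utility


# Hypothetical behavior log of some system.
history = [("s0", "left"), ("s1", "right"), ("s2", "left")]
u = rationalizing_utility(history)

# Every past choice is "utility-maximizing" under u ...
consistent = all(u(s, a) == 1.0 for s, a in history)

# ... but u is indifferent on a state it has never seen,
# so it predicts nothing about future behavior.
unseen = (u("s3", "left"), u("s3", "right"))
```

The fitted function agrees with the whole history by construction, yet it carries no information about the unseen state `"s3"`.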

We could still end up with deception and power seeking in AI systems, and if those systems are powerful enough that would still be bad. But I think the model where that is necessarily what we end up with, and where we get no warning of that because systems will only behave deceptively once they know they'll succeed (the "sharp left turn") is a model that sounds compelling until you try to obtain a gears-level understanding, and then it turns out to be based on using ambiguous terms in two ways and swapping between meanings.

Maybe I'm fundamentally confused, because I would claim to hold similar beliefs myself, even if I can't claim to have a "gears-level understanding", at least not honestly haha.

Yudkowsky made a big fuss about how fragile human values are and how hard it'll be for us to make AI both understand and care about them, but everything I know about LLMs suggests that's not an issue in practise. They understand us very well indeed, and to the extent they can be said to have values of their own, at least after RLHF, they seem to do their best to adhere to what we beat into them.

As I've recapitulated in a later comment:

https://www.themotte.org/post/778/eacc-and-the-political-compass-of/165856?context=8#context

A large chunk of the decrease in my p(doom) from a peak of 70% in 2021 to 30% now is, as I've said before, because it seems like we're not in the "least convenient possible world" where it comes to AI alignment. LLMs, as moderated by RLHF and other techniques, almost want to be aligned, and are negligibly agentic unless you set them up to be that way. The majority of the probability mass left, at least to me, encompasses intentional misuse of weakly or strongly superhuman AI based off modest advances on the current SOTA (LLMs) or a paradigm shifting breakthrough that results in far more agentic and less pliable models.

So it's not that the majority of concern these days is an AI holding misaligned goals, but rather enacting the goals of misaligned humans, not that I put a negligible portion of my probability mass in the former.

And then I spent some time doing mechinterp, and noticed that "maximize expected utility" looks nothing like what high-powered systems are doing, and that this was true even in places you would really expect to see EU maximizers (e.g. chess and go)

If you can dumb it down for me, what makes you say so? My vague understanding is that things like AlphaGo do compare and contrast the expected values of different board states and try to find the one with the maximum probability of victory based off whatever heuristics it knows works best. Is there a better way of conceptualising things?

I do agree that humans aren't particularly good utility maximizers, I prefer to frame them as adaptation-executers, which is a framework I'm sure you're familiar with. We can occasionally emulate the former on demand, with difficulty, and I like the framing that (Yudkowsky?) once suggested that we're irrational agents emulating more rational ones in our heads, sometimes.

And then I noticed that step 4 of that reasoning chain doesn't even follow from step 3, because "there exists some utility function that is consistent with the past behavior of the system" is not the same thing as "the system is actually trying to maximize that utility function"

To name drop something I barely understand, are you pointing at the Von Neumann-Morgenstern theorem, and that you're claiming that just because there's a way to represent all the past actions of a consistent agent as being described by an implicit utility function, that does not necessarily mean that they "actually" have that utility function and, more importantly, that we can model their future actions using that utility function?

Yudkowsky made a big fuss about how fragile human values are and how hard it'll be for us to make AI both understand and care about them, but everything I know about LLMs suggests that's not an issue in practise.

Ah, yeah. I spent a while being convinced of this, and was worried you had as well because it was a pretty common doom spiral to get caught up in.

So it's not that the majority of concern these days is an AI holding misaligned goals, but rather enacting the goals of misaligned humans, not that I put a negligible portion of my probability mass in the former.

Yeah this is a legit threat model, but I think the ways to mitigate the "misuse" threat model bear effectively no resemblance to the ways to mitigate the "utility maximizer does its thing and everything humans care about is lost because Goodhart" threat model. Specifically I think for misuse you care about the particular ways a model might be misused, and your mitigation strategy should be tailored to that (which looks more like "sequence all nucleic acids coming through the wastewater stream and do anomaly detection" and less like "do a bunch of math about agent foundations").

If you can dumb it down for me, what makes you say so? My vague understanding is that things like AlphaGo do compare and contrast the expected values of different board states and try to find the one with the maximum probability of victory based off whatever heuristics it knows works best. Is there a better way of conceptualising things?

Yeah, this is what I thought for a long time as well, and it took actually messing about with ML models to realize that it wasn't quite right (because it is almost right).

So AlphaGo has three relevant components for this

  1. A value network, which says, for any position, how likely that position is to lead to a win (as a probability between 0 and 1)
  2. A policy network, which says, for any position, what the probability is that each possible move will be chosen as the next move. Basically, it encodes heuristics of the form "these are the normal things to do in these situations".
  3. The Monte Carlo Tree Search (MCTS) wrapper of the policy and value networks.

A system composed purely of the value network and MCTS would be a pure expected utility (EU) maximizer. It turns out, however, that the addition of the policy network drastically improves performance. I would have expected that "just use the value network for every legal move and pick the top few to continue examining with MCTS" would have worked, without needing a separate policy network, but apparently not.

This was a super interesting result. The policy network is an adaptation-executor, rather than a utility maximizer. So what this means is that, as it turns out, stapling an adaptation executor to your utility maximizer can give higher utility results! Even in toy domains with no hidden state!
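As a rough illustration of how a policy prior can change which move the search explores, here is a simplified sketch of a PUCT-style selection rule of the kind used in AlphaGo-family systems. The exact formula, the `c_puct` constant, and all the numbers below are assumptions chosen for illustration, not values taken from any real engine.

```python
import math


def puct_scores(q, prior, visits, c_puct=1.0):
    """Score each move as value estimate plus a prior-weighted exploration bonus.

    q:      {move: current value estimate for the move}
    prior:  {move: policy-network probability for the move}
    visits: {move: number of times the search has tried the move}
    """
    total_visits = sum(visits.values())
    return {
        move: q[move]
        + c_puct * prior[move] * math.sqrt(total_visits) / (1 + visits[move])
        for move in q
    }


# Made-up numbers: the value estimates are nearly tied,
# while the policy prior strongly favors move "a".
q = {"a": 0.50, "b": 0.52, "c": 0.48}
prior = {"a": 0.70, "b": 0.05, "c": 0.25}
visits = {"a": 3, "b": 3, "c": 3}

# Pure "maximize the value estimate" picks the nominally best move.
value_only_choice = max(q, key=q.get)

# The policy-guided rule spends its next simulation on the move
# the heuristics consider normal in this situation.
scores = puct_scores(q, prior, visits)
policy_guided_choice = max(scores, key=scores.get)
```

With these numbers the value-only rule selects `"b"`, while the prior-weighted rule selects `"a"`: the learned heuristic steers the search even though the value estimates barely differ.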

Which brings me to

To name drop something I barely understand, are you pointing at the Von Neumann-Morgenstern theorem, and that you're claiming that just because there's a way to represent all the past actions of a consistent agent as being described by an implicit utility function, that does not necessarily mean that they "actually" have that utility function and, more importantly, that we can model their future actions using that utility function?

Yeah, you have the gist of it. And additionally, I expect it's just factually false that all agents will be rewarded for becoming more coherent / EU-maximizer-ish (in the "patterns chiseled into their cognition" meaning of the term "rewarded").

Again, no real bearing on misuse or competition threat models - those are still fully in play. But I think "do what I mean" is fully achievable to within the limits of the abilities of the systems we build, and the "sharp left turn" is fake.

I highly appreciate you taking the time to explain all of this to me, even if I think your talents are wasted doing so haha.

Your explanation about how AlphaGo works is deeply counter-intuitive to me, not that I think my intuitions are a good barometer here. My immediate reflex is to wonder if this is a property seen only in a limited range* of "ability"/power of a model, where it simply lacks the brains/computational budget to truly evaluate expected utility in a more rigorous manner, thus receiving benefits from heuristics that bake some of that in. Now, I'm aware you can't just ignore practical constraints like available computation, especially when the combinatorial equations explode that hard, and heuristics are unavoidable for doing anything significant, and after your explainer I see their utility even in something as transparent and well defined as Go.

*for a very broad definition of limited, AlphaGo certainly ate a lot of flops

And additionally, I expect it's just factually false that all agents will be rewarded for becoming more coherent / EU-maximizer-ish (in the "patterns chiseled into their cognition" meaning of the term "rewarded").

What about more prosaic benefits like avoiding being Dutch booked? I believe that's one of the biggest benefits of consistency. (Assuming that's not what you meant)

Specifically I think for misuse you care about the particular ways a model might be misused, and your mitigation strategy should be tailored to that (which looks more like "sequence all nucleic acids coming through the wastewater stream and do anomaly detection"

I can only hope that this kind of general monitoring picks up faster than the acceleration in potential bio terrorism, but having a proof of concept is a decent start, since I think it won't be too difficult to ramp up such programs when the threat becomes tangible enough for policymakers to take note. If that becomes the norm before the next engineered pandemic, I'll knock 5 points off my p(doom).

Your explanation about how AlphaGo works is deeply counter-intuitive to me

Me too! I still don't understand why you can't just run the value model on all legal next moves, and then pick the top N among those or some equivalent heuristic. One of these days I need to sit down with KataGo (open source version of AlphaGo) and figure out exactly what's happening in the cases where the policy network predicts a different move than the top scoring move according to the value network. I have a suspicion that the difference happens specifically in cases where the value network is overestimating the win fraction of the current position due to an inability to see that far ahead in the game, and the moves chosen by the policy network expose the weakness in the current position but do not actually cause the position to be weaker (whereas the moves rated most highly by the value network will be ones where the weakness of the position is still not evident to the value network even after the move is made). That suspicion is based on approximately no evidence though. Still, having a working hypothesis going in will be helpful for coming up with experiments, during which I can notice whatever weird thing actually underlies the phenomenon.
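A first pass at that experiment might look like the sketch below. Everything here is hypothetical: `policy_probs` and `value_after_move` stand in for whatever interface a real engine exposes (this is not KataGo's API), and the toy tables are invented purely so the snippet runs.

```python
def find_disagreements(positions, policy_probs, value_after_move):
    """Collect positions where the policy's top move differs from the value net's.

    policy_probs(pos)     -> {move: probability}            (hypothetical interface)
    value_after_move(pos) -> {move: win prob for the mover}  (hypothetical interface)
    """
    disagreements = []
    for pos in positions:
        probs = policy_probs(pos)
        values = value_after_move(pos)
        policy_move = max(probs, key=probs.get)
        value_move = max(values, key=values.get)
        if policy_move != value_move:
            disagreements.append((pos, policy_move, value_move))
    return disagreements


# Toy stand-ins for the two networks, keyed by a position label.
toy_policy = {"p1": {"x": 0.6, "y": 0.4}, "p2": {"x": 0.2, "y": 0.8}}
toy_value = {"p1": {"x": 0.55, "y": 0.50}, "p2": {"x": 0.61, "y": 0.58}}

result = find_disagreements(["p1", "p2"], toy_policy.get, toy_value.get)
```

The positions returned are the ones worth inspecting by hand, for instance to check whether the value estimate drops after searching deeper along the value network's preferred move.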

What about more prosaic benefits like avoiding being Dutch booked? I believe that's one of the biggest benefits of consistency.

I expect that the benefits of avoiding being Dutch booked are pretty minimal in practice. If you start out with 100 A and there's some cycle 100 A < 1000 B < 9 C < 99 A where some adversary can route you along that path and leave you with 99 A and the same amount of B and C that you started with, the likely result of that is that you go "haha nice" and adjust your strategy such that you don't end up in that trade cycle again. I expect that the costs of ensuring that you're immune to Dutch booking exceed the costs of occasionally being Dutch booked at some fairly minimal level of robustness. Worse is better etc etc. Note that this opinion of mine is, again, based on approximately no evidence, so take it for what it's worth (i.e. approximately nothing).
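The arithmetic of that cycle can be spelled out in a few lines. This is only a sketch of the example above; the function name and the trade encoding are made up, and the agent is assumed to accept every trade in the cycle because it prefers each offered bundle to the one it gives up.

```python
def run_trade_cycle(holdings, cycle):
    """Apply each trade in order and return the resulting holdings.

    cycle: list of (give_good, give_amount, get_good, get_amount) tuples.
    Stops early if the agent cannot afford a trade.
    """
    holdings = dict(holdings)  # copy so the caller's dict is untouched
    for give_good, give_amount, get_good, get_amount in cycle:
        if holdings.get(give_good, 0) < give_amount:
            break
        holdings[give_good] -= give_amount
        holdings[get_good] = holdings.get(get_good, 0) + get_amount
    return holdings


# The cycle from the comment: 100 A < 1000 B < 9 C < 99 A.
cycle = [
    ("A", 100, "B", 1000),
    ("B", 1000, "C", 9),
    ("C", 9, "A", 99),
]

start = {"A": 100, "B": 0, "C": 0}
end = run_trade_cycle(start, cycle)
loss = start["A"] - end["A"]
```

One trip around the cycle costs 1 A, or 1% of the starting stake, with B and C unchanged. That is the bounded, one-off loss being weighed against the cost of guaranteeing full consistency up front.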

wastewater monitoring

Yeah I think this is great and jefftk and friends should receive lots of funding. Speaking of which I have just bugged him about how to throw money at NAO, because I believe they should receive lots of funding and have just realized that I have done exactly nothing about that belief.

If you do figure it out, I expect at least a LW post or two about it 🙏

the likely result of that is that you go "haha nice" and adjust your strategy such that you don't end up in that trade cycle again

I agree that this is how it'll likely work out (and it does in smart humans), but isn't that tantamount to enforcing internal consistency, just under adversarial stimulus? The crux of it is that becoming more consistent is a good thing, not that I have a strong opinion on how hard you should optimize for it. And after enough cycles of this process, should you not be closer to having VNM apply to you in a more meaningful fashion?

(BTW, my phrasing was unclear, I was asking whether the VNM constrains future behavior of an agent by the same utility function you derived for past behavior, so your response has me a little confused as to whether you agree or disagree with that!)

Speaking of which I have just bugged him about how to throw money at NAO, because I believe they should receive lots of funding and have just realized that I have done exactly nothing about that belief.

Ah, would that I had enough money to throw at a housefly and hope to stun it, but at least you're putting yours to noble ends haha.

You'll be happy to know that I did in fact throw some fairly substantial amounts of money at jefftk and friends for their wastewater surveillance / sequencing / anomaly detection project. Significantly prompted by us having this conversation.
