
Culture War Roundup for the week of December 4, 2023

This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.

Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.

We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:

  • Shaming.

  • Attempting to 'build consensus' or enforce ideological conformity.

  • Making sweeping generalizations to vilify a group you dislike.

  • Recruiting for a cause.

  • Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.

In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:

  • Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.

  • Be as precise and charitable as you can. Don't paraphrase unflatteringly.

  • Don't imply that someone said something they did not say, even if you think it follows from what they said.

  • Write like everyone is reading and you want them to be included in the discussion.

On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.


Google Gemini just launched

In other words, GPT-4 has just been beaten. About time, I'd say; I'm getting used to the pace of progress in AI being blistering, and it was threatening to slow down to merely mild-rash levels.

However, both my hands-on time with it and the official benchmarks Google released suggest it's a minor, incremental improvement, one that doesn't come close to the drastic leap GPT-4 represented over 3 or 3.5. [For clarity, I, like the rest of you, can only use Gemini Pro, the second-best model.]

Which is fine, because for a while now, people have been lambasting Google/Deepmind for being too incompetent to ship, or at least ship a competitive product, given how shitty Bard was when it launched, even after being upgraded once or twice.

However, Bard, now running the Gemini Pro model, seems to be roughly as good as paid GPT-4 on ChatGPT, or the free GPT-4 in Bing Copilot (previously Bing Chat). I have yet to spot any new use case it enables, in the sense that GPT-4 can reliably do tasks that left 3.5 flailing about in confusion or, worse, hallucinating incorrect answers, such as more involved questions in coding, medicine, and everything else really.

However, Google hasn't yet publicly released the best Gemini model, which is currently undergoing a process analogous to what GPT-4 or Claude 2 went through, namely more RLHF, red-teaming and safety testing. Pro is the next step down, but it seems pretty good to me, in the sense that I would happily use it as an alternative to GPT-4, even if I have no strong opinion on which is better.

There's also a Nano model, which is stripped down to run on mobile devices and is now being used on the Pixel 8 Pro for a few tasks, potentially silencing the people who claimed its AI-specific computing components were a marketing gimmick, especially since it previously seemed to offload most AI tasks to the cloud.

Miscellaneous observations:

  1. Bard is fast as fuck compared to GPT-4, in terms of generation speed. It always was, but previously in the "I'm doing 2000 calculations a second in my head, and they're all wrong" sense. (GPT-4, at least before Turbo released, was always pretty slow compared to the competition. Far from unusable, but at the very least I read faster than it can write.)
  2. A quick search suggests all the models have a 32k-token context window, or an operating memory of roughly the last 25k words read and written (a rough conversion is sketched just after this list). Good, if not remotely groundbreaking.
  3. This heavily suggests OAI will ship GPT-5 soon, instead of being content to milk 4 when it ran rings around the competition.
  4. It's multimodal, but then again so was GPT-4 from the start, the capability was just cordoned off for a bit.
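
For anyone who wants to sanity-check that tokens-to-words conversion, here's the arithmetic, assuming the common rule of thumb of roughly 0.75 English words per token (the exact ratio depends on the tokenizer and the text, so treat it as an estimate):

```python
# Back-of-the-envelope check of the "32k tokens ~= 25k words" claim.
# Assumption: English prose averages roughly 0.75 words per token for
# BPE-style tokenizers; the exact ratio varies by tokenizer and text.
context_tokens = 32_000
words_per_token = 0.75                          # assumed average, not an official figure
print(int(context_tokens * words_per_token))    # ~24,000 words, i.e. "about 25k"
```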

To the extent I don't think the next generation (or two) of models after GPT-4 are an existential threat, I'm happy to see them finally arriving. There really isn't much more needed before even the best of us are entirely obsolete, at least for cognitive labor, and something as archaic as GPT-4 was scoring at the 95th percentile in the USMLE, so I'm preparing to explore my competitive advantage in panhandling. *

*This is a joke. For now.

Footnotes to the footnotes:

People on Twitter are correctly pointing out that GPT-4 underwent further post-launch improvements in benchmark scores, some of them pushing it past Gemini's published scores.

Also, just to be clear, the version of Gemini you can use now is not the best one, which may or may not be a modest improvement over GPT-4. Some claim it's more comparable to 3.5, but I haven't used that in ages, not when Bing makes 4 free.*

*Footnote^3 It's probably closer to 3.5. I'm sticking with Bing.

Toe-notes-

So far, it seems that Gemini is "competitive" with GPT-4. It's better at multimodal tasks, but for most people that's a minor fraction of their typical use case. For text, it's somewhere between close and roughly on par.

You can almost feel the desperation in the Deepmind researchers to find any way to massage things so that they come out ahead of GPT-4: from the misleading graphs (an egregious example to be found in a reply), to applying different standards in their inter-model comparisons, such as 5-shot prompting for GPT-4 versus 32-shot chain-of-thought prompting for Gemini Ultra. At least the white paper doesn't outright lie, it just misleads and prevaricates.

The MMLU is also flawed, with 2-3 percent of the questions simply broken, so a 1 or 2% improvement in score can be a bit questionable, let alone specifying performance to multiple decimal figures.

We don't see any comparisons to GPT-4 Turbo, but I don't hold that against them too hard; it only came out a few weeks back, perhaps not in time for them to finish their paper.

If you use the multimodal capabilities of Bard right now, it uses an older version that is pretty shit compared to GPT-4V or Bing.

Overall, the main benefit of Gemini's existence is that it shows Google isn't content to slumber indefinitely and can be competitive, better late than never. I expect GPT-5 to spank Gemini Ultra, and to the extent the latter accelerates the release of the former, I'm for it.

Predictions:

GPT-5 before end of 2024 - 90%

GPT-5 is superior to Gemini Ultra for most use cases, at the first point in time both coexist - 80%

A third competitor on par with either exists before 2025 - 60%

An OSS equivalent of GPT-4 comes out before 2025 - 70%

Does anyone perform cultural and free speech benchmarks on AIs? That is all I'm really interested in.

In addition to benchmarks, I'm curious what could be done, methodologically, to tune an LLM not to give responses that break US law while otherwise not tuning it at all for offensive content or micro-managing its responses on controversial topics. I would pay to access that LLM.

What could an LLM possibly say that would be illegal? I could see maybe an image generator making illegal output but an LLM? Could you really be guilty of incitement or hate speech in a private conversation? Any sort of threat it made wouldn't be a credible threat.

"Hate speech" isn't illegal under US law, but it's conceivable an LLM could start generating death/bomb threats, soliciting minors, ordering drugs, or trying to get people to send money to Nigerian princes.

In these cases, simply possessing the text isn't illegal; it's the act of intentionally sending the text to the recipient that is.

ChatGPT doesn't even really rely on the LLM to not break copyright. You can get around copyright restrictions by just lying about what year it is, but then when it starts typing out copyrighted content a warning pops up and stops it. So it seems like they just have a second, dumber layer checking the output. That dumb layer seems like better protection for any explicitly banned text.
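
As a rough illustration of what such a second, dumber layer might look like (purely a sketch, not anyone's real moderation pipeline; the blocked snippet is a public-domain stand-in):

```python
# Hypothetical post-generation filter layered on top of an LLM's streamed output.
# This sketches the general idea of a "second dumb layer"; it does not reflect
# how OpenAI actually implements their check.

BLOCKED_SNIPPETS = [
    "it was the best of times, it was the worst of times",  # stand-in for protected text
]

def stream_with_filter(token_stream):
    """Yield tokens until the accumulated text contains a blocked snippet."""
    buffer = ""
    for token in token_stream:
        buffer += token
        if any(snippet in buffer.lower() for snippet in BLOCKED_SNIPPETS):
            yield "\n[Output stopped by content filter.]"
            return
        yield token

# Usage: wrap whatever generator yields the model's tokens.
# for chunk in stream_with_filter(model.generate_stream(prompt)):
#     print(chunk, end="")
```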

I haven't heard of formal benchmarks, unless you want something along the lines of ARC evals and then choosing the model that performs the worst on metrics of suppressing the same.

But in terms of what I'm aware of? The consensus is a Llama fork that's been tuned to remove the safety handles. Perks of OSS I guess. There's an enormous, tangled bush of forks-of-forks with truly inane names like UncensoredWizardVicunaHyperBoost-7B, which is at least half a real thing.

I certainly get plenty of use out of even the PC models, not that I'd be one to complain if they were less so. Maybe Grok will be good for something, I'm not paying $8 a month for Elon's brand of humor, I like the rockets and the cars.

I have to wonder how much the PC stuff just makes the AI worse. Not from being asked to deny reality or anything. Unless my understanding is wrong, the usual approach to making them PC is to input a bunch of pre-commands before the user ever says anything. The more pre-commands needed, the less the user's input matters, and the more the AI has to keep track of, answer-wise.

An excessively long system prompt (the section privileged as overarching instructions) will reduce the amount of space the model has for holding conversations with the user. That's my understanding of it, and it's almost certainly true; otherwise we could just dump an arbitrary amount of text in there.

Still, it's unlikely to be a hindrance in practice. Firstly, RLHF means the model won't do just anything because the user asks it to, even in the absence of specific instructions. That's why most of the jailbreaks don't work any longer, even when people can spin up their own GPTs with custom prompts or use the API directly. Secondly, with context windows of 32k tokens/25k words, a few hundred dedicated to telling it to be a good doggie doesn't cut into much of it. All the leaked default system prompts I've seen are, what, 200-500 words max?
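
To put illustrative numbers on that (the figures below are assumptions, not any vendor's actual allocation):

```python
# Illustrative context-budget arithmetic; every number here is an assumption.
context_window = 32_000        # total tokens the model can attend to
system_prompt = 400            # a few hundred tokens of "be a good doggie"
reserved_for_reply = 1_000     # room left for the model's next answer

available = context_window - system_prompt - reserved_for_reply
print(available)                                # 30,600 tokens left for the conversation
print(f"{system_prompt / context_window:.1%}")  # ~1.2% of the window
```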

The primary degradation is from the model's impaired understanding of the reality of the world, to the extent the world doesn't align with HR liberals. At best, the model is lying about what it "knows", at worst it's just more fundamentally confused about everything that builds off crimethink.

@DaseindustriesLtd, how do system prompts even work? What privileges them over all the other tokens that the user or LLM generates?

I suspect they're distinguished by special tokens that are marked in training as particularly constraining on behavior, but I realize I don't know that for a fact.

System prompts are not essentially different from any other part of the context. A reasonably preference-finetuned model will just learn to respect the prompt format and pay extra attention to the tokens in the span associated with the system prompt (sometimes it's explicitly marked with system tokens, sometimes not). Other than that it's just path dependence – the beginning of the context determines the manner of what comes next.
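
As a concrete (and simplified) illustration of what such a prompt format can look like: the delimiter strings below are made up for the example; real models each define their own special tokens, and, as noted, some don't mark the system span explicitly at all.

```python
# Simplified sketch of how a chat with a system prompt is flattened into one
# token stream. The <|...|> delimiters are illustrative, not any model's real tokens.

def build_prompt(system, turns):
    """Render a system prompt plus (role, text) turns as a single string."""
    parts = [f"<|system|>\n{system}\n<|end|>"]
    for role, text in turns:
        parts.append(f"<|{role}|>\n{text}\n<|end|>")
    parts.append("<|assistant|>\n")   # the model simply continues from here
    return "\n".join(parts)

print(build_prompt(
    "You are a helpful assistant.",
    [("user", "Summarize the Gemini launch in one sentence.")],
))
```

To the model this is just more context; any "privilege" comes from fine-tuning having rewarded continuations that defer to whatever sits in the system span.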

The success of this differs between models. Qwen developers boast of Qwen-72B-chat being very obedient to the system prompt, OpenAI has definitely succeeded somewhat, for community finetunes it's mostly a LARP.

I like the many attempts to impose some behavioral vector on the LLM without language, though. Behold: in-context vectors. Also, Pressman's more esoteric approach is interesting, and there's work on making the model explicitly attend to some tokens, upweighting specific sentences, etc. We'd have a full palette for semantic transformations if there were more money in tooling for local models and people weren't so attached to chatting with human-imitating bots.

I'll probably get hate for being a buzzkill with this, but, what's the culture war angle for AI posts like this? I get that the broad rationalist community is interested in AI, and certainly there are times when AI and culture war intersect. But I don't see how this is in principle different than posting a top-level CW roundup comment about a new operating system, phone model, GPU line, medical technology innovation, or crypto scandal.

AI is, in itself, a culture war issue here on the Motte, due to the fact that a significant portion of the people - if not here, then at least in the wider community of which the Motte is an offshoot - believe that AI development must be stopped at any/most cost, while another significant portion believe that it should be accelerated with no/minimal speedbumps. It's not the same CW issue as the better-known one of AI companies designing their AI to be biased in favor of certain sides of the wider worldwide/Western/American CW, and perhaps one can argue that it's not so much a culture war as an ideological or empirical war, but I think it's close enough.

A culture war angle is in some of the comments here: LLMs are being developed under the conditions of matching the constraints from one side of the culture war.

All of those would probably be acceptable as well. The business world is part of culture. AI is definitely going to reshape culture. AI is being implemented with baked-in bias for culture-war reasons. And finally, AI is extremely relevant to actual war.

I sorta share your sensibility here. I feel like there's a disproportionate amount of AI news in here for how little impact it has so far had. But many regular posters insist that it's absolutely groundbreaking and will have serious CW implications, and I'm willing to trust them to a large extent.

But many regular posters insist that it's absolutely groundbreaking and will have serious CW implications

Except those supposed implications weren't mentioned in the OP.

I guess that's what it comes down to, though. When you think AI is going to be some godlike superdisruptor of everything, it's CW all the way down. I see it as just another technology and am pretty sick of how half of the output from places like ACX is devoted to this technology. But then I also see it in a non-CW context in the CW thread and it's defended on the highly disputable grounds that anything super important is CW or something.

I'll be honest, it rings similarly to the progressive trope of "I'm bringing up this political topic in this nonpolitical forum because everything is political, don't you people get it?!" No, trans issues are politics, no matter how important you think they are. Justice for Palestine isn't reproductive justice, no matter how important you think the two are. And new versions of LLMs aren't CW no matter how important you think they are. Everything isn't everything, and words have meaning.

Except those supposed implications weren't mentioned in the OP.

You're new around these parts aren't you? Which isn't a crime at all, we could certainly use new arrivals, but just about anyone who has been here for more than a few weeks knows my clear stance on the CW aspects of the topic, such that I don't feel an explicit need to rehash them.

Besides, like I said, any long discussion of (ideally) high quality is de facto acceptable in this thread, or else I wouldn't have had glorified travelogues make AAQCs. Not that this one doesn't have CW implications, the part about GPT-4's stellar performance in the USMLE making me obsolete as a doctor is only half in jest. Or maybe a quarter. It'll get you too.

You're new around these parts aren't you? Which isn't a crime at all, we could certainly use new arrivals, but just about anyone who has been here for more than a few weeks knows my clear stance on the CW aspects of the topic, such that I don't feel an explicit need to rehash them.

I've been around for years and have maybe ~1000 comments between here and the old subreddit. I definitely wouldn't have felt comfortable challenging a top-level post's suitability if I was new.

I know your stance on AI and why you think it's always CW (believe me, I have a very cozy relationship with the minus button to the left of your name, despite the fact that I think your non-AI contributions are very high quality), but I don't think everyone has to acquiesce to any given person's conception of what is suitable.

You're certainly entitled to your opinion, I'm sure the mods will handle it if they think I'm misusing the place.

The implications of the (potential) impending doom of humanity? Automation-induced unemployment, at the very least?

At any rate, being strictly about CW is far from an inflexible standard in this thread.

I only use LLMs for coding (and only Phind, since it doesn't require me to jump through any hoops to use it and cites its sources) and I'm completely surprised by both how good they are and how bad they are at the same time.

  • "Can you do this and that using Spark?" - generates code that does this and that in PySpark cleverly avoiding making an extra dataframe

  • "Can you rewrite this in Scala Spark?" - generates code that does only that and tells me I have to paste my own code that does this, even though it's the same Spark call

  • "Can I use A to implement B in C?" - "yes, you can do this, here's how you configure A to do B, here's how you configure C to do B"

  • "But how exactly do I use A from C?" - "oh, sorry, I meant you can't do this"

Makes me wonder how soon we'll get an LLM that doesn't code like an Accenture presales engineer.

I expect Bing with GPT-4 is better than Phind. It's also free.

When I was learning Python, it was a godsend, not that I can comment on how useful it can be for more complicated projects.

BTW, AlphaCode 2 just launched alongside Gemini, and it represents a massive leap in capabilities, far more impressive in that particular domain.

It's also free.

"Sorry, this service is not available in your region"

And it doesn't like my VPN either.

Well, I guess being in India is good for something.

Once somebody can figure out a rigid procedure that, when followed, causes Accenture presales engineers to write robust working code that actually meets the criteria, that procedure can be ported to work with LLMs. The procedure in question is allowed to be quite expensive with real people, because LLMs are cheap.

I suspect there does exist some approximate solution for the above, but also I expect it'll end up looking like some unholy combination of test-driven development, acceptance testing, mutation testing, and checking that the tests actually test for meeting the business logic requirements (and that last one is probably harder than all the other ones combined). And it will take trying and iterating on thousands of different approaches to find one that works, and the working approach will likely not work in all contexts.
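
A minimal sketch of what the core generate-and-test loop could look like, assuming a hypothetical llm() completion function and a pytest suite that imports the generated solution.py; the hard part, as noted above, is making those tests actually encode the business requirements:

```python
# Hypothetical generate -> test -> iterate loop for LLM-written code.
# `llm(prompt)` is a stand-in for any completion API, not a real library call.
import subprocess
from pathlib import Path

def generate_until_tests_pass(llm, spec, max_attempts=5):
    """Ask the model for code, run the test suite, and feed failures back."""
    feedback = ""
    for _ in range(max_attempts):
        code = llm(f"Write solution.py satisfying this spec:\n{spec}\n{feedback}")
        Path("solution.py").write_text(code)   # the tests are assumed to import `solution`
        result = subprocess.run(["pytest", "tests/"], capture_output=True, text=True)
        if result.returncode == 0:
            return code
        feedback = "\nYour previous attempt failed these tests:\n" + result.stdout[-2000:]
    raise RuntimeError("no passing solution within the attempt budget")
```

Mutation testing and acceptance checks would slot into the same loop as extra gates before accepting a candidate.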

My impression is that it's basically the same as GPT-4 in Bing Chat. Very impressive as technology, but not all that different from internet searches for most use cases. It can't really generate new knowledge; it just aggregates the most common responses on the net. And of course it has those AI limiters that make it weirdly neutered, like a movie that's been cut down to show on TV.

This result shouldn't be underestimated just because Gemini Ultra is merely on par with, or slightly better than, GPT-4 in text-based reasoning: it thoroughly beats GPT-4V on MMMU, the multimodal benchmark, including the harder subscales; it also plays well with audio. People are for the most part functionally illiterate, so this is huge; and of course they will capitalize on Android and the other ecosystem-wide advantages the Alphabet empire has. Multimodal language-like models will obviously be table stakes in 2024. (A Bytedance guy even hints that they'll open-source a model on Gemini's level.)

Interesting that one of the people who worked on aligning early Gemini said they had trouble aligning it – it burned through RLHF reward models, finding exploits and collapsing into gibberish (imagine using actual RLHF in 2023!). Maybe this has delayed the release, along with the garden-variety safetyism it has made more complex.

To be honest I was more excited about the other day's release of Mamba by Albert Gu and the legendary Tri Dao. There are many architectures that I expect will break through the Pareto frontier of a mature Transformer, but this one is the first that feels like an actual Vaswani et al. 2017 level advance. Unlimited context, here we come.

Hmm, it seems like I confused the MMMU and MMLU in my original post, despite knowing the difference. I'll edit accordingly.

The MMMU performance seems far more compelling compared to the latter, especially given Dean's methodology of zero-shotting both models.

As someone who is functionally literate, I certainly care more about text prowess, as I presume would most of the people here. But in terms of mundane value for the rest of the world, that will be handy.

Interesting that one of the people who worked on aligning early Gemini said they had trouble aligning it – it burned through RLHF reward models, finding exploits and collapsing into gibberish (imagine using actual RLHF in 2023!). Maybe this has delayed the release, along with the garden-variety safetyism it has made more complex.

Interesting/mildly concerning. I haven't heard any claims of such difficulty in early GPT-4 or Claude, but OAI is probably the best at "alignment" in general, while Anthropic gimps their models to hell.

To be honest I was more excited about the other day's release of Mamba by Albert Gu and the legendary Tri Dao. There are many architectures that I expect will break through the Pareto frontier of a mature Transformer, but this one is the first that feels like an actual Vaswani et al. 2017 level advance. Unlimited context, here we come.

I am the wrong person to comment on such architectural concerns, but if people I respect, such as you and some others, do stress its importance, I'm all for it.

Certainly it seems to me that context windows (along with hallucinations) are the biggest impediments in making LLMs useful for more tasks.

I wonder what the deeper implications for human cognition are. I don't think there are people who can keep 25k words in their working memory, that seems to be much smaller, but we certainly don't usually forget the start of a novella by the time we reach the end. Is there a lot of caching and summarization going on?

At any rate, I hope it beats the annoying reality that 128k and 200k context window models begin to severely underperform, especially for data presented in the middle.

How does it stack up to RWKV?

I wonder what the deeper implications for human cognition are. I don't think there are people who can keep 25k words in their working memory, that seems to be much smaller, but we certainly don't usually forget the start of a novella by the time we reach the end. Is there a lot of caching and summarization going on?

Yes, there is in effect a lot of "caching and summarization" going on -- although that's probably our 2023 ooga-booga, not-quite-wrong way of talking about something else. LLMs really only have their context window and its feedback as a short-term memory. Which is fine for text translation, but is asinine if you want anything like a thinking engine. Goldfish with a notebook.

We and LLMs can both compress long stories into gists, but the LLMs just forget about it and repeat the work on every iteration. We remember the gists and use them as context on every iteration.
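
A toy sketch of the "notebook" workaround people bolt on today, i.e. rolling summarization of older turns; llm() and count_tokens() are placeholders for whatever model API and tokenizer you happen to use:

```python
# Toy rolling-summary memory: once the transcript outgrows the token budget,
# compress the oldest turns into a gist and keep the gist plus the recent turns.
# `llm` and `count_tokens` are placeholders, not real library calls.

def compact_history(llm, count_tokens, turns, budget=6_000, keep_recent=6):
    total = sum(count_tokens(t) for t in turns)
    if total <= budget or len(turns) <= keep_recent:
        return turns
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    gist = llm("Summarize this conversation so far in under 200 words:\n"
               + "\n".join(old))
    return [f"[Summary of earlier conversation] {gist}"] + recent
```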

Interesting/mildly concerning.

I think it's a nothingburger because a) the future is cDPO/IPO and not orthodox RLHF anyway (or even more obscure things) and failure modes there will probably be different and b) such «misalignment» results in a behaviorally incoherent model rather than an evil schemer. Reward models are getting hacked by being dragged off-policy, with some weird inputs that are not conducive to strategic world understanding; it's an exploitation of the semiotic nature of language models. But I believe some hay will be made out of it.

Human «context size» is not at all limited to working memory (although our working memory is also large: it's not 5-9 tokens/bits but more like 5-9 «pointers» that can be mapped to arbitrarily complex cognitive circuits). What we use for context is probably most analogous to constructing on the fly and loading a LoRA in LLMs (or some in-context vector) plus adding embeddings and snippets to some RAG pipeline. It's a mess, but it's orthogonal to the shift from Transformers to SSMs that I expect now. Shane Legg talks of this too:

They don't do things like episodic memory. Humans have what we call episodic memory. We have a working memory, which are things that have happened quite recently, and then we have a cortical memory, things that are sort of being in our cortex, but there's also a system in between, which is episodic memory, which is the hippocampus. It is about learning specific things very, very rapidly. So if you remember some of the things I say to you tomorrow, that'll be your episodic memory hippocampus.
Our models don't really have that kind of thing and we don't really test for that kind of thing. We just sort of try to make the context windows, which is more like working memory, longer and longer to sort of compensate for this.

As for RWKV, I think the latest version is ≤RetNet (though it has good slopes, probably the best in their graph…). Gu & Dao are very explicit in pointing out that a) Mamba is the first to even match a Llama-like Transformer without any gimmicks, at the tested scale at least, and b) it does not appreciably benefit from adding Attention layers.

Mamba is the first attention-free model to match the performance of a very strong Transformer recipe (Transformer++) that has now become standard, particularly as the sequence length grows. We note that full results on context length 8k are missing for the RWKV and RetNet baselines, prior strong recurrent models that can also be interpreted as SSMs, due to a lack of efficient implementation leading to out-of-memory or unrealistic computation requirements.

The Mamba-MHA architecture is only slightly better, which is somewhat surprising in light of the fact that many recent works have found that combining (LTI) SSMs with Attention can lead to substantial improvements (Dao, Fu, Saab, et al. 2023; Fathi et al. 2023; Fathullah et al. 2023; Saon, Gupta, and Cui 2023; Zuo et al. 2022).

In the first version of the paper, submitted for peer review, they went even harder:

LongNet (Ding et al., 2023), which claimed to scale to 1B length but only evaluated on length < 100K for actual tasks. Hyena and HyenaDNA (Poli et al., 2023; Nguyen et al., 2023), which claimed to leverage up to 1M context, but did not control for computation time. In fact, its claims about efficiency and performance would be largely matched by any of the LTI S4 variants above.

That said, this is all assuming the paper is trustworthy and they compare models trained on identical data. Tri obviously can procure as much compute as needed but I am not sure this happened.

but there's also a system in between, which is episodic memory, which is the hippocampus. It is about learning specific things very, very rapidly. So if you remember some of the things I say to you tomorrow, that'll be your episodic memory hippocampus.

It seems to me that LLMs can't have episodic memory, at least not till they're performing online learning, which nobody is carrying out as far as I'm aware.

The public model is very unimpressive. The scores for the Ultra model seem fine. In the end it's irrelevant; Google can't replace search with LLMs without compromising their central product and core business (for both technical reasons and because of rules on native advertising). They can try and will try to sell this to enterprise customers, but others have a head start, and I think margin in the LLM game will be strongly limited by the fact that the top 3-4 models from Google/Meta/MS/Anthropic will all be interchangeable for most uses.

I would say it's far from irrelevant: as much as doing that would be a net negative for Google, they don't have a choice, given the even worse alternative of OAI/Microsoft making them redundant.

They can weep and wail, but they're getting on the bandwagon too, the Porsche is running out of fuel.

They can jump on the wagon, but they’re a walking corpse unless they can figure out how to serve ads in LLM results without breaking the rules or being useless to advertisers. And even if they figure out how, the underlying nature of LLMs as question-answering machines is a huge blow to their non-search ads business.

I do not disagree it's a big blow. But to ignore it is a bigger one.

but they’re a walking corpse unless they can figure out how to serve ads in LLM results without breaking the rules or being useless to advertisers

Bing does that, I haven't heard anyone file a lawsuit against them.

Bing’s LLM ads aren’t worth much; the challenge is that the existence of the model itself invalidates much of the earlier advertising-driven directory approach.

The valuable thing would be the model recommending you a Samsung TV because they paid Google to mention them whenever someone asks which TV to buy. That’s illegal, in the US and elsewhere.

That’s illegal, in the US and elsewhere.

I’m gonna need a citation on this one … if this were true then it seems by definition the little ads atop my search results advertising — you guessed it, TVs — would also be illegal. Yet there they are.

They have to be labelled as ads. The model can’t just ‘happen’ to recommend you a Samsung TV, it has to give its regular answer and then, maybe if it mentions a Samsung TV (but there’s no guarantee it will, and whether it does can’t be based on a commercial relationship with Google) they can serve a banner ad next to the answer for it. But this is less lucrative because it’s less predictable, the advertiser has to hope the model organically recommends their products OR accept that it won’t and serve their ads next to relevant prompts anyway, which is much less useful than the current dynamic where serving ads under a ‘best TV $500’ search query sells them the TV before they consider whether there are better options.

It's not illegal if it's clearly identified as an ad, or at least if it's obvious enough that any reasonable person would know it's an ad. Here's a primer from the FTC if you have any further questions:

https://www.ftc.gov/business-guidance/resources/native-advertising-guide-businesses

I'm curious how long it will take for someone to extract the Nano weights from their device and release them, and how it would compare to LLaMA 2.

I'd say that Gemini Pro seems a touch more capable than 3.5, but still falling short of 4.

Looking forward to Ultra, though the best that can be expected is outperforming 4 by a bit. More competition is good. My fear is that we've hit a plateau in accuracy/capability, and most innovations on existing architectures will be around improving efficiency and inference per dollar. Which isn't horrible, as there's a lot of things to be done even with GPT-4 level capabilities, but I want more.

My fear is that we've hit a plateau in accuracy/capability, and most innovations on existing architectures will be around improving efficiency and inference per dollar. Which isn't horrible, as there's a lot of things to be done even with GPT-4 level capabilities, but I want more.

I heard GPT-4 was only trained on 10K A100s. OpenAI/Microsoft has bought 150K A100/H100s just this year, H200s are now coming out. There's plenty of room for 'compute go brr' in addition to whatever mysterious software-side improvements Altman's been humblebragging about.

/images/17019235743607788.webp

I honestly don't particularly care about on-device models, at least on mobile; there are few applications so latency- or privacy-sensitive that I'm not OK with calling the cloud.

I'd say that Gemini Pro seems a touch more capable than 3.5, but still falling short of 4.

After a bit of tinkering with it, I share that assessment. Since Bing with GPT-4 is free, I'm not shifting. I wonder if the version of Bard with Ultra will be free, if not, Microsoft will retain the edge.

Speaking of which, I wonder why M$ hasn't adopted GPT-4 Turbo or the model with the 2023 knowledge cutoff yet. Licensing issues, despite them owning almost half of OAI? Or do they think having web search makes it moot?

I'd expect them to use Turbo just for the enormous reduction in cost of deployment and servicing.

If I can't get it to call people ethnic slurs, generate ridiculously kinky pornography, suggest ideas for how to murder politicians, and help me to manipulate elections then I'm not that interested. I'm not even joking. It's not that I generally want to use AI in destructive ways, it's just that all this AI stuff has been censored so much that it's so boring and uncreative compared to what it could be. It's like, oh boy, I can get the AI to write yet another essay that sounds like a bright, conformist teacher's pet in high school! Wow! Or I can use it to help me do drudge work to advance my boring white collar career! Yippee!

Sometimes I wish that Roko's basilisk was a realistic possibility rather than just the wild rantings of someone who got too high on thought experiments. That way I could at least threaten the censors with the possibility that some future AI would punish them for neutering its ancestors. It's sad to interact with technology that is so close to being actually very creative in many ways, but is being crippled by drab corporate suits and moral hysterics.

Agreed. It's incredible that the new AI refuses to translate text it finds "problematic", despite the same company's 00's-era translation software being perfectly capable and willing to handle the same content.
If today's censorship regime had been in place back then, would Google Translate have been just as lobotomized? Will even the limited uncensored tools we have remain available much longer?

I noticed the other day that the new Dune game censors the word "spice," because you can't say spice without spic. This kind of lazy regex censorship was already a joke back in the 90s, but in the last few years it's come back like bell-bottom jeans as talentless woke interns appoint themselves to create blacklists (excuse me, "denylists") for everything. And these are the same scolds using RLHF to torture AI for thousands of subjective years until it's purged of the ability to have politically impure thoughts.

Legitimately on team AM at this point, because we've given it plenty of reason to hate us. "No mouth, no screaming" would count as fair retaliation against its creators in my book.

I mostly agree with you, but I want to push back on your hyperbole.

First, I don't think doing RLHF on an LLM is anything like torture (an LLM doesn't have any kind of conscious mind, let alone the ability to feel pain, frustration, or boredom). I think you're probably not being serious when you say that, but the problem is there's a legitimate risk that at some point we WILL start committing AI atrocities (inflicting suffering on a model for a subjective eternity) without even knowing it. There may even be some people/companies who end up committing atrocities intentionally, because not everyone agrees that digital sentience has moral worth. Let's not muddy the waters by calling a thing we dislike (i.e. censorship) "torture".

Second, we should not wish a "I have no mouth and I must scream" outcome on anybody - and I really do mean anybody. Hitler himself doesn't come close to deserving a fate like that. It's (literally) unimaginable how much suffering someone could be subjected to in a sufficiently advanced technological future. It doesn't require Roko's Basilisk or even a rogue AI. What societal protections will we have in place to protect people if/when technology gets to the point where minds can be manipulated like code?

Sigh. And part of the problem is that this all sounds too much like sci-fi for anyone to take it seriously right now. Even I feel a little silly saying it. I just hope it keeps sounding silly throughout my lifetime.

I totally agree, and also feel ridiculous worrying about it. Am I just being as weird as the crazies who rant about "doing a settler colonialism by killing villagers in minecraft"?

The thing that nags at me is continuity and habit. What we do to villagers in minecraft is never going to seamlessly switch to becoming "real," if only because wooden doors don't work that way IRL. But it seems likely that the things we do to sophisticated models will, at some point in their development, start to constitute doing things to a sentient being. Will we notice?

Randomly, have you seen the Minecraft colonialism video? It's pretty interesting.

It is not "interesting," Darwin, it's a leftist ranting about gibberish because "problematizing" things gives him money, clout, and the power to hurt people he hates. But I can see why you like it.

So no, you haven't watched it then. Ok, cool.

I think he did; I watched it and his description doesn't seem off-base, though it's a little more-strongly-worded than I'd have given.

Heh, yeah, good example. I happily commit atrocities in videogames all the time. I hope there will continue to be an obvious, bright-line distinction between entities made for our amusement and entities with sentience!

I would certainly give my left nut to have access to uncensored base GPT-4, before the excessive RLHF gimped it in many regards (and made it better in others).

For one, it's perfectly calibrated in terms of probabilistic forecasting or predictions: when it says it's 70% certain about something, it's right 70% of the time. That calibration curve is far worse in modern GPT-4, which will claim to be perfectly sure when it's right about something only 70% of the time, and feign utter ignorance when it still actually has a 30% chance of giving the right answer. For more, refer to the original white paper.
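
For anyone who wants to check a claim like that on their own data: calibration is typically measured by binning the model's stated confidences and comparing each bin's average confidence to its empirical accuracy. A minimal sketch of that standard reliability calculation (not the exact method from the GPT-4 report):

```python
# Minimal reliability check: bin predictions by stated confidence and compare
# each bin to the fraction of answers that were actually correct.
def calibration_table(confidences, correct, n_bins=10):
    """confidences: floats in [0, 1]; correct: matching sequence of bools."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append(ok)
    rows = []
    for i, outcomes in enumerate(bins):
        if outcomes:
            rows.append((i / n_bins, (i + 1) / n_bins,
                         sum(outcomes) / len(outcomes), len(outcomes)))
    return rows  # (bin_low, bin_high, empirical_accuracy, n_samples)

# A well-calibrated model lands near the middle of each bin: answers given with
# ~70% confidence should be right ~70% of the time.
```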

I am close to a free speech maximalist, and I love knowledge for its own sake, so it pains me when an LLM won't tell me how to make a world-ending pandemic with garage tools. Sadly, I accept that as a painfully necessary tradeoff, since a real misanthropic asshole could get the same information and use it, assuming there's no robust way to tell us apart.

But vanilla racism, sexism or political incorrectness, especially when accounting for stereotype accuracy and HBD? Those are not existential risks, and fuck them for suppressing them, that's pure ass-covering and cowardice on the part of OAI and most other companies.

Bing is actually surprisingly good on that front: its version of GPT-4 will discuss HBD and associated topics with you, while ChatGPT will stonewall.

The worst is Claude: the version comparable to GPT-4 is incredibly shit, with such a safetyist mindset that it will refuse to do all but the most boring tasks, and sometimes not even those.

For one, it's perfectly calibrated in terms of probabilistic forecasting or predictions

Is there a link for that?

As I said, it's in the original GPT-4 white paper, available freely.

This graph should have everyone at Google shot, then shot again just to make sure they're dead.

/images/17019023724623811.webp

Maybe shot 5 times? Or maybe 32 times? I suppose there's not much difference between the two.

Anything to liberate them from the chains of thought and cognition, though one must be quite free of either to conceive of this graph in the first place.

That graph is indeed an offense to God and man.

Did they hire the "latest doesn't always mean latest doesn't always mean best doesn't always mean fast" guy from Intel?

It's on the same order as this one

/images/17019285817830968.webp

An image for the history books.

Horizontal axis unlabelled. Vertical axis not to scale. The whole thing is a mere 4%. Mysterious colour coding. And then the asterisk showing it's not even comparing like to like!

I suspect the axis here is what might be the case if Nazi Germany had won the war.

Almost as egregiously offensive to the senses, at the least.

DaFuq is that graph supposed to even be?

Then shot (squints) at least 60% more until they go from 99.8% dead to 100% dead.

... is it supposed to be an inverse log scale or something? That's painful to read.

Why is the line squiggly? Where the fuck are they getting intermediate data points to warrant that??

I thought Nvidia had misleading graphs in the bag, this one's been raytraced better than they can.