Contact Us
Sign In
Sign Up
Rules Admins Moderation Log Random Post Random User
What is this place?

This website is a place for people who want to move past shady thinking and test their ideas in a court of people who don't all share the same biases. Our goal is to optimize for light, not heat; this is a group effort, and all commentators are asked to do their part.

The weekly Culture War threads host the most controversial topics and are the most visible aspect of The Motte. However, many other topics are appropriate here. We encourage people to post anything related to science, politics, or philosophy; if in doubt, post!

Check out The Vault for an archive of old quality posts. You are encouraged to crosspost these elsewhere.

Why are you called The Motte?

A motte is a stone keep on a raised earthwork common in early medieval fortifications. More pertinently, it's an element in a rhetorical move called a "Motte-and-Bailey", originally identified by philosopher Nicholas Shackel. It describes the tendency in discourse for people to move from a controversial but high value claim to a defensible but less exciting one upon any resistance to the former. He likens this to the medieval fortification, where a desirable land (the bailey) is abandoned when in danger for the more easily defended motte. In Shackel's words, "The Motte represents the defensible but undesired propositions to which one retreats when hard pressed."

On The Motte, always attempt to remain inside your defensible territory, even if you are not being pressed.

New post guidelines

If you're posting something that isn't related to the culture war, we encourage you to post a thread for it. A submission statement is highly appreciated, but isn't necessary for text posts or links to largely-text posts such as blogs or news articles; if we're unsure of the value of your post, we might remove it until you add a submission statement. A submission statement is required for non-text sources (videos, podcasts, images).

Culture war posts go in the culture war thread; all links must either include a submission statement or significant commentary. Bare links without those will be removed.

If in doubt, please post it!

Rules
Recommended Posts And Communities
Recommended Realtime Chats
- Quokka's Den Telegram
- Astral Codex Ten Discord

PaperclipPerfector 8d ago (text post) 28032 thread views

Culture War Roundup for the week of April 6, 2026

This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.

Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.

We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:

Shaming.
Attempting to 'build consensus' or enforce ideological conformity.
Making sweeping generalizations to vilify a group you dislike.
Recruiting for a cause.
Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.

In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:

Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.
Be as precise and charitable as you can. Don't paraphrase unflatteringly.
Don't imply that someone said something they did not say, even if you think it follows from what they said.
Write like everyone is reading and you want them to be included in the discussion.

On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.

2036

2036
3

Jump in the discussion.

No email address required.

ChickenOverlord 2d ago · Edited 2d ago

More in AI skepticism news: Turns out most AI benchmarks are bullshit!

https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/

Specifically the following benchmarks are trivially exploitable: SWE-bench, WebArena, OSWorld, GAIA, Terminal-Bench, FieldWorkArena, and CAR-bench.

I don't have too much to add to this, but I'll try. Assuming this paper isn't bullshit itself, it makes you wonder why no one was looking more closely at the results submitted by various AI companies. In one of our other discussions about this recently, someone said:

A team member did a full matrix test on models implementing solutions to multiple problems and then evaluated all implementations with said models. In the experiment, 5.4 was the undefeated and universal victor: 5.4 and 4.6 always preferred 5.4’s solutions.

When I asked if they had manually verified them, they said they hadn't. It seems a lot of the things people claim about AI and its capabilities are "too good to verify", similar to how salacious stories about the other tribe in culture war stories are "too good to verify". It seems to me that a lot of people want to believe that AGI, or the death of software development, or similar things, are right around the corner. As a result, they often believe whatever the claims of sociopaths like Sam Altman, or the weirdos who believe in AGI over at Anthropic, tell them. Including, potentially, the benchmark results we see published with every new release. On the other hand, to be fair, skeptics like me can certainly be quick to believe negative stories about AI. I mean, look at me rushing to post this negative story about it here.

Regardless, I am personally of the opinion that we are near a breaking point regarding AI. I think either the bubble is going to pop and a lot of the things people claimed AI was going to take over aren't going to materialize, or they are an we are in for some major economic disruption. I don't think "AGI" is around the corner in either case though. And certain professions like SEO slop writer, translator, and others are definitely disrupted forever regardless.

Context

EverythingIsFine Well, is eventually fine ChickenOverlord 1d ago

The bigger AI-to-AGI problem is that the monetizable market right now is pretty heavily coders and some corporate usage. The gains in general knowledge and more abstract problem solving are having a rough year, partly because there aren't good benchmarks for that kind of thing so it's hard for the labs themselves to optimize for it, much less try to prove it if they get a real improvement!

To elaborate slightly, it often takes a certain amount of personal, longitudinal usage with a model to start to explore the quirks and jagged edges they have, and when those edges do appear they are often hard to put into words. That is to say, it's not totally clear what a good benchmark for this would even look like - if I had to guess, it would be assembling a panel of users and following them for ~2 weeks and then getting somewhat subjective feedback converted to a number somehow... but even if you made this work it takes too long and costs too much money to carry out to be feasible. At least in the current environment.

At any rate, it's possible that AGI-like stuff will start to pick up again if the frontier labs start to prioritize it again, but it's hard to say. Personally, I think the big differentiator is that we need some extra technological-mathematical breakthrough (like a more holistic/realistic "memory" function, though potentially some of the so-called 'world model' approaches could bear fruit) to get us over the last little hump. It's anyone's guess when or if that will happen.

Context

256 ChickenOverlord 1d ago

This blog post uses a lot of sleight of hand to inflate the apparent significance of what is ultimately a pretty pissant finding. It may be that these benchmarks (most of which, incidentally, are relatively obscure - hardly justifying a conclusion about "most benchmarks") are hackable, but in practice models are not cheating on them. Anyone can easily independently run whatever Claude or Gemini or Openai model on these problems and verify that they're solving them the hard way.

Context

MaiqTheTrue Renrijra Krin ChickenOverlord 1d ago

I mean I keep looking at the people who use it — a lot of them are in IT and using it to build useful tools. In a very broad sense they clearly work and build good code often enough that using them is a net benefit to their business. The benchmarks for most things are mostly about marketing the product. It’s something the sales guys use to show that their AI works to people who don’t know what the benchmarks actually measure. I don’t believe for a second that anyone who understands the technology is choosing their model based on a benchmark.

Context

zeke5123a MaiqTheTrue 1d ago

There also appears to be a timeline where tech people start us g the newest tool, herald it as great, but then the more they use it the more limits it has.

Or perhaps stated differently, there are clear benefits to the tech but it isn’t near term AGI right now.

Context

confused-badger ChickenOverlord 2d ago

I become more and more skeptical that the current LLM approach to AI is going to get us close to AGI, let alone ASI, whatever that means. The models are just clearly not that smart, and if you don’t realize this it’s because (1) you are not that smart or (2) you’re using it in the domain that the architecture happens to be good at, which is language (which includes chatting but also programming and some formal math.)

People are super language-oriented and so are overly impressed/distracted when something can do human language so well. We’re just dazzled by language use, and that massively biases our perception.

The technical field most dazzled by LLMs is programming, which is also basically just a translation job. Society has been under the misapprehension that being a computer translator is a super hard and intellectual job, partly because it has paid so well in the past couple of decades. This is just because there aren’t any people who are natively bilingual in English and computer, in the same way there are lots of people who natively bilingual in English and Spanish. Like an English teacher in China, people were able to arbitrage this lack of supply, and people mistook the existence of this arbitrage as evidence that the field is super difficult.

If you are in a technical field outside of this it is very obvious, and has been for a long time, that the current architecture is bad and progress stalled. A bunch of programmers will lose their status, like so many loser English teachers in China, but beyond that the current path isn’t going to change much.

Context

256 confused-badger 1d ago

The technical field most dazzled by LLMs is programming, which is also basically just a translation job. Society has been under the misapprehension that being a computer translator is a super hard and intellectual job[...]

Programming is an extremely g-loaded activity. Technical interviews at silicon valley tech companies are not far from straight up IQ tests. When I taught programming, I encountered a lot of students who were very diligent and motivated but hit a brick wall because they just didn't have the cognitive equipment to think at the level of abstraction required to reason about non-trivial programs. I think that, prior to the age of LLMs, you would be hard pressed to find a working programmer with a 100 IQ. I doubt the same can be said of transistors.

Context

TeknOShEeP 256 1d ago · Edited 1d ago

Programming is an extremely g-loaded activity.

For some subset of g, where g is pure logic. For other subsets of g, especially those related to mechanical reasoning and 2nd order effects, I have a large pile of former FAANG resumes that have failed the conversion to nuts, bolts, and actual atoms engineering that argues otherwise. Not to say they arent intelligent, but a "pure generalist" is not a hard requirement.

When I taught programming

I think this is one of those "theory vs practice" things. In theory, programming is an extremely intellectually straining endevour, and the academic pipline in the West is set up with some pretty fine filters. In practice though with the way big companies PM the development, deployment, and maintenence of most software today you dont actually need an above-average IQ to excel (this is a feature, not a bug). I agree with badger that the market rate for programming salaries has probably overstated the relative intellectual demand compared to other professions of similar educational requirements.

Context

WandererintheWilderness 256 1d ago

I don't think this necessarily conflicts with badger's claim; it may be that Computerese is a sufficiently obtuse language that the default language centers of the human brain can't cope with it the way they can with Spanish or Chinese, requiring considerable cognitive processing power to learn it anyway, but that this is ultimately just a quirk of what language architectures our brains are optimized for. Swimming isn't inherently a harder problem than walking, natural-selection-wise - but it's very difficult if you're an elephant.

Context

jkf WandererintheWilderness 1d ago

Swimming isn't inherently a harder problem than walking, natural-selection-wise - but it's very difficult if you're an elephant.

Elephants are arguably better swimmers than humans -- I think they are positively buoyant even?

You might be thinking of hippos, which sink.

Context

WandererintheWilderness jkf 18hr ago

Actually, I was thinking of underwater swimming, the kind that fish evolved to do in the same way that humans evolve to speak natural language, and which elephants would struggle to match. But fair cop on the fact-check regardless, my fault for putting pithiness over precision.

Context

jkf WandererintheWilderness 15hr ago

I think they could probably be unusually good at that (as mammals!) too, if they weren't so damn floaty -- they have a built-in snorkel!

Ballasted elephant experimentation, anyone?

Context

FistfullOfCrows ChickenOverlord 2d ago

My favorite adhoc test for AI in the vain of Pelican on a Bike in SVG is making the LLM write a short hello world Deno program that opens a "native" message box with some text. This is usually achieved through FFI but the specific interaction between the ffi implementation in Deno/V8, the way the strings work and pointers is enough to trip most of them up, they either ignore "modern" deno 2.0 ffi syntax or get tripped up by the interplay between the native win dll's they need to load and invoke with proper string passing through the ffi interface.

Some of the more modern LLMs have been able to one shot it, but most of them need explicit reminder of the syntax or endiannes of pointers/strings.

Context

EverythingIsFine Well, is eventually fine FistfullOfCrows 1d ago

Totally orthogonal, but FYI it's "in the vein of" - it was originally a mining analogy, since ores show up in "veins", if you're in the same vein it's implied to be a thing of similar type.

Context

birb_cromble FistfullOfCrows 2d ago

My personal favorite test is working with stock tickers. Since most models vaguely map tokens to 3-4 characters, and most stock tickers are under 6 characters, it's pretty easy for them to get tripped up. I've seen all the big name models fail at consistently distinguishing between VBIL and VBILX, though for some reason Gemini is particularly egregious about it. Claude sonnet had briefly gotten pretty good about it, but it's slipping in again.

Context

FearandLoathingintheMotte ChickenOverlord 2d ago

Turns out most AI benchmarks are bullshit!

This isn't news? The phrase "bench-maxxed" has existed for 2 years.

Even the METR time horizon chart has some serious errors in their testing methodology that I hope they address. Their mid-length task sampling isn't robust enough at all.

Context

HereAndGone2 ChickenOverlord 2d ago

Along with this, Silver Bulletin has a piece out about synthetic polls - very basically, companies use data to get AI to simulate responses to questions. Then sell that to companies who seem to use it unquestioningly - or at least without making it clear that the 'respondents' to the 'poll' were not people:

Still, silicon sampling is increasingly finding its way into public polling. Axios reported in March that “a majority of people trust their own doctors and nurses” based on findings from Aaru — without mentioning that the “people” in that sentence were actually LLMs. Around the same time, the Public Sentiment Institute “boosted” their online sample of 373 real survey respondents with 114 AI agents.1 (Spoiler alert: even the co-founder of Electric Twin doesn’t think that’s a particularly defensible approach.) Polling companies like Qualtrics and Ipsos are also developing synthetic data panels.

It's one thing if companies like McDonalds test new products out on fake polls - the worst that can happen is they try selling a new burger that the customers won't buy. But if it comes to governments or public health authorities making decisions on 'data' gathered from fake polls, I do worry. A maternal mortality poll using synthetic polling?

But Eli, I hear you saying, aren’t polls themselves increasingly governed by modeling decisions? Indeed they are: pollsters’ choices on which sampling method to use, how to define their likely voter models, and how to weight their samples can and do lead to dramatic differences in the results they publish. Aaru even referenced these limitations in the methodology statement included with that maternal mortality “poll” — although I’m using the term “methodology statement” loosely here, because it doesn’t really explain how the model works at all.

Do you ask the AI "did you die from being pregnant?" and it comes back "Oh yes, I've had six kids and died after every birth"? Okay, that's a ridiculous exaggeration, but this is not real data from real people, and that isn't really trustworthy when you're using it to make claims like "Maternal mortality in the United States has more than doubled over the past four decades, a reversal that no strong and prosperous nation should accept" and putting forward solutions based, at least in part, on the fake responses.

The Axios story here, and even the NYT has a criticism of it here:

A recent Axios story on maternal health policy referred to “findings” that a majority of people trusted their doctors and nurses. On the surface, there’s nothing unusual about that. What wasn’t originally mentioned, however, was that these findings were made up. Clicking through the links revealed (as did a subsequent editor’s note and clarification by Axios) that the public opinion poll was a computer simulation run by the artificial intelligence start-up Aaru. No people were involved in the creation of these opinions.

I think we are on the way to implementing Brecht's satire: dissolve the people and elect another!

The Solution

After the uprising of the 17th of June
The Secretary of the Writers' Union
Had leaflets distributed on the Stalinallee
Which stated that the people
Had squandered the confidence of the government
And could only win it back
By redoubled work. Would it not in that case
Be simpler for the government
To dissolve the people
And elect another?

And our theme song as we merrily stroll down the primrose path will be this:

Her green plastic watering can
For a fake Chinese rubber plant
In the fake plastic earth
That she bought from a rubber man
In a town full of rubber plans
To get rid of itself

Context

quiet_NaN HereAndGone2 2d ago

I would argue that people making decisions on bad 'data' is nothing new. It probably predates the replication crisis by tens of millennia.

People with an intact epistemic immune system (who may or may not be a null set, effectively) should not be affected by this at all, because they will not update their world model upon reading the results of a study or poll (apart from what topics are en vogue right now, perhaps), until they have studied the methodology section of the claim (or it comes from a source which is generally trustworthy, if they have any such sources).

The very idea to study maternal death rates with a 'poll' of all things is already a giant red flag, of course. You might as well try to learn more of the Higgs mass by polling people, real or imaginary -- or LLMs roleplaying Higgs bosons, for that matter. In fact, death rates are one of the few things in the epistemic frontierlands called medicine where I am convinced that good data actually exists. We might not have great data if this or that intervention really helps, but 'how many women die due to pregnancy-related reasons' is the kind of question people have gathered data on before.

Context

HereAndGone2 quiet_NaN 1d ago

Getting good data on that is important. Getting real death rates from hospital reports is vital to know what is going on. Generating AI 'patients' via demographic modelling to respond as to how they trusted their treatment? Much less so. The problem is that the fake polls muddy the waters, which are already muddy enough due to how such data is gathered and what counts as deaths due to pregnancy.

And the huge elephant in the room, the one that the commissioning body of the poll is dancing around, is that in the USA it is African-American maternal mortality bringing up the rates. But you can't say "it's because of unhealthy lifestyles like obesity and high-salt diets", that's racist. So African-American maternal mortality is down to racism, which is down to white doctors treating black patients and not taking their concerns seriously and not giving them appropriate treatment or intervening early enough in high-risk pregnancies. And that's where the fake 'patient' data really makes everything worse. 'People have more trust in doctors and nurses of their own race' would be racist in any other context (imagine saying that for white patients!) but if it leads to "how to solve maternal mortality rates being high? send black women to black doctors!" and that is not, in fact, the solution then we're going to have more deaths and more "it is all the fault of systemic racism, our polls (with fake AI respondents) say so!"

Context

YoungAchamian HereAndGone2 2d ago

long with this, Silver Bulletin has a piece out about synthetic polls - very basically, companies use data to get AI to simulate responses to questions. Then sell that to companies who seem to use it unquestioningly - or at least without making it clear that the 'respondents' to the 'poll' were not people:

I legit have a coworker who has been suckered by this. Some polling company for IC work that I highly suspect of using this silicon polling method. I try to be diplomatic about it but I have serious doubts, The sim2real gap in normal synthetic data is already a huge issue with synthetic data creating its own distribution that is not the same as the real distribution. I can't imagine honestly marketing synthetic data as real data for Intel related tasks.

Context

FearandLoathingintheMotte HereAndGone2 2d ago

Along with this, Silver Bulletin has a piece out about synthetic polls - very basically, companies use data to get AI to simulate responses to questions. Then sell that to companies who seem to use it unquestioningly - or at least without making it clear that the 'respondents' to the 'poll' were not people:

This sounds like a company who buys synthetic polling problem? If you buy a stupid service without research, you're an idiot.

Lots of self help gurus sell useless coaching conducted via Microsoft teams, does that make the technology that powers Microsoft teams a scam?

Context

The_Nybbler If you win the rat race you're still a rat. But you're also still a winner. HereAndGone2 2d ago

"Synthetic polling" is of course completely invalid, no better than "we made it up". But note it isn't even new with AI. You know those "studies" that say hiring managers are more likely to hire people with a name indicating one race over another? Some of them were done by polling college students asked to play the part of hiring managers.

Context

Shirayuki2 ChickenOverlord 2d ago

And certain professions like SEO slop writer, translator, and others are definitely disrupted forever regardless.

At least in the case of translators, I think you'd be surprised. I happen to be acquainted with a good number of professional translators and almost to a man all of them are still booked out in terms of work and make solid middle class incomes.

My understanding is that the "ChatGPT" moment for translation was around a decade ago when neural machine translation was first getting good. Already at this point, for translation tasks that didn't require professional-grade reliability or well-written prose, Google Translate or DeepL were basically already good enough; translation for things like manuals or brochures was commoditized well before transformers.

Of course LLM's write much better than DeepL, but in practice the set of translation tasks that can't be delegated to Google Translate or DeepL, but can be handled autonomously by a LLM, is actually quite small.

High-reliability translation tasks like legal, medical or diplomatic still require a human in the loop, and LLM's are still subpar at translation tasks that require a high level of interpretation, as in the case of literary translation. At a high level, a good literary translation can be thought of as a re-writing of the original work, and as of yet LLM's are still quite poor writers without significant human intervention.

Context

Stefferi Chief Suomiposter Shirayuki2 1d ago · Edited 1d ago

Since this is basically a cue for me, I'm currently basically out of work. I had a big project in Feb-March that fortunately earned enough money to keep me going for this and the next month, but the current amount of other work is so sparse that I will probably be forced on the dole (unless me and my sisters get my mother's house sold in the meantime) until August, when I'll be hopefully starting an internship that might springboard on to another career. This was preceeded by a major client that has basically kept me afloat for 3.5 years announcing they're moving to AI-oriented workflows with less human translators on the loop. So yeah, there are almost certainly still translators going, but it's looking like the end of the career for me.

Context

Shirayuki2 Stefferi 1d ago

Sorry to hear that.

Unfortunately there's a lot of this going on in tech and white collar work as a whole, really, where the LLM's really can't do the work, but some executive assumes they can and so people get chopped in anticipation, or where the company is struggling due to macro-economics or just plain bad management and people get chopped using AI as an excuse.

Best of luck with your other work or with starting the new career.

Context

ChickenOverlord Shirayuki2 2d ago · Edited 2d ago

This actually makes me happy to hear, based on some things I had read a few months ago it sounded like translators were struggling to find work. I've already had to deal with subpar LLM translations in some animes etc., and sadly I think companies like Crunchyroll are going to be too cheap to go back to real translators. Though it might be an improvement, given their human localizers's penchant for injecting modern woke politics into their translations.

Edit: typo

Context

thrownaway24e89172 wrong about everything ChickenOverlord 2d ago

Though it might be an improvement, given their human localizers's penchant for injecting modern wome politics into their translations.

It is rather hard to feel sorry for human localizers given their obvious disdain for the content and those who consume it. For all the problems of LLMs, such seething hatred is rarely included.

Context

guy thrownaway24e89172 2d ago

Can you share some examples?

Context

ChickenOverlord guy 1d ago · Edited 1d ago

https://preview.redd.it/t8si5jmaw9pg1.jpeg?width=646&auto=webp&s=6f8a203626ed9fda42160a04d358b04f7c528b7d - The text with the white background is the lovely work of the "localizers"

https://i.redd.it/qm0ypm1csz3b1.png - Literalish translation would be "Crap! Why did I have to babble on about some weird fashion theory to a pro model of all people!?"

And over 20 examples are provided in this Twitter thread: https://x.com/BoundingComics/status/1741000080889720927

Also for an example of a localizer doing an amazing job and trying to be true to the source material is the fan translation of Mother 3 (AKA Earthbound 2): https://mother3.fobby.net/ https://youtube.com/watch?v=WjMllYgUOeU

Context

guy ChickenOverlord 17hr ago

Thank you.

Context

WandererintheWilderness ChickenOverlord 1d ago · Edited 18hr ago

Speaking as an outsider to this whole milieu, the discourse seems rather confused between complaining about motivated alterations specifically, and complaining about less-than-maximally-literal translation in general, which is to say, about the entire concept of localization. Several of the examples in the X thread seem to object about localizers using modern American slang even when it carries no culture-war salience at all.

For example, one of the linked articles take issue with changing a line about a girl being "a gyaru" to calling her "that gyaru bimbo". This is, to put it mildly, not what I would call a change oriented towards extra wokeness. Instead it seems obvious to me that the point is to convey a close-enough analogue to what a Japanese viewer would understand the term "gyaru" to mean to American viewers who might never have heard it before. Maybe it's a good translation, maybe not, but it's got nothing to do with tweaking dialogue to be more in line with western feminist norms - it's localization working as intended.

Context

thrownaway24e89172 wrong about everything WandererintheWilderness 1d ago · Edited 1d ago

Consider an example from a show I recently watched. The FMC had just performed well in a contest and in reference to that feat a male side character exclaimed to those around him "Sasu ga <FMC>-san.", which was rendered in English as "That's my girl." The localizer inserts both a possessive ("my") and diminutive ("girl") into the phrase, both commonly complained about forms of sexism that are completely absent from the original. Now why would the "localizer" choose to localize it in that manner? Is it because the lack of sexism is something American viewers wouldn't understand, and therefore it had to be inserted? That makes no sense, as such language has been widely seen as sexist for decades in the US. Further, there are other non-sexist phrases that would make more sense in context, such as "That's <FMC> for you." No, that choice is a deliberate insertion of sexism to denigrate both the character and the fans. (ETA)And probably also to cater to some women's "strong women victim of sexism" fetish.

Context

phailyoor thrownaway24e89172 15hr ago

"Sasu ga -san.", which was rendered in English as "That's my girl." The localizer inserts both a possessive ("my") and diminutive ("girl") into the phrase, both commonly complained about forms of sexism that are completely absent from the original.

The translation is completely fine though. "That's my girl" is a standard idiomatic expression that, like many idioms, has a meaning that differs from the literal meaning of the words. And the meaning is quite close to the meaning of さすが when used in this context.

Here's what gemini has to say about it: https://gemini.google.com/share/f9b5400d70ae

And a blog post by an English teacher: https://vectorinternational.ca/englishstudy/vol-060/

On the topic of LLM translations, when translating the other way, gemini also suggests "Thats's my [name]" as a top option for a translation of "さすが[name]さん"

An important nuance that AI noted is the different uses of さすが like this with subtly different meanings:

「やっぱりすごい！」と感心する場合
スキルや成果を褒める場合

Since based on your post the usage is the former, translations of the latter meaning such as "way to go " are actually inferior.

The translation you suggested "That's [FMC] for you." Is mostly fine, but it's slightly tortured as a less common variation of the expression in English speech.

Context

More comments

WandererintheWilderness thrownaway24e89172 18hr ago · Edited 18hr ago

Though I thank you for the breakdown, this doesn't really address my observation. I don't dispute that, based on the linked articles, some localization choices are motivated by political correctness. What bothers and bemuses me is that some of the linked articles seem to see no difference between clear instances of politically motivated rewrites - e.g. turning a crossdressing he/him character into a trans girl to win representation points - and perfectly anodyne use of clearly-non-political American slang where it might not literally correspond to the word-by-word Japanese dialogue - e.g. talking about a character "yeeting" another. Is it truly the case that only politically-biased translators make those kinds of alterations too, while more literal-minded translators are also the ones who don't try to warp the political overtones of the source material?

But even then, surely it shouldn't be a binary choice. Surely there are anime fans who would prefer naturalistic, idiomatic, non-maximally-literal localizations just so long as they weren't politically biased? Indeed, I'd have naively guessed it'd be a majority of dub consumers; after all, surely purists who want textual fidelity above all else and are sufficiently well-versed in Japanese culture that they don't need to gloss a cultural nuance like "gyaru" as something more familiar to Americans would, in any case, prefer subs to dubs? So what's going on?

Side note: your example is kind of baffling to me; I've got one of the most left-wing social circles here and I've never heard anyone in my online circles treat "That's my girl" as some kind of taboo, inherently misogynistic phrase. My guess would be that the localizer, in this case, simply picked a common American phrase that people actually say in this situation over a purely literal translation so that the dialogue would sound natural. I agree the localizer could equally well have chosen "That's [FMC] for you", but I would assume that they happened to pick "That's my girl" because they viewed it as an equally innocuous, unremarkable idiom. Which it is. But eh, for all I know woke anime localizers might indeed be plugged into specific echo chambers where everyone agrees that "That's my girl" is an eeeevil microaggression; I merely caution you not to assume this is some kind of mainstream consensus on the Left. I'd never heard of it before.

(Like, yes, sure, if you get a critical theorist talking, they'll explain that the fact that we casually call grown women "girls" is belittling and a sign of structural sexism in the English language blah blah blah. But get a critical theorist talking about anything and they'll explain how it's secretly a tool of systemic oppression. "That's my girl" is not uniquely regarded as some sort of dogwhistle where if you make a cartoon character say it, it's supposed to immediately scan as a boo light signaling that they're an evil sexist. That's not a thing. Hell, search for "that's my girl" on Tumblr or Bluesky and you'll get tons of hits showing casual usage by very woke users!)

Context

More comments

zeke5123a Shirayuki2 2d ago

So basically the standard for most industries? Outsiders think “surely LLMs can solve this” but insiders point out where it can’t?

Context

birb_cromble zeke5123a 2d ago

One thing I've noticed over and over is that LLMs are only lagging in areas where I have personal expertise. It's so strange. What are the odds?

Context

ChickenOverlord birb_cromble 2d ago

Gell-man amnesia everywhere, it seems.

Context

birb_cromble ChickenOverlord 2d ago · Edited 2d ago

Listen man, I really appreciate something other than the usual wall of singulatarianism you see on rationalist-adjacent boards, but this isn't really the best example of it. Even OpenAI called out the SWE-bench benchmarks years ago. This seems like basic "boo outgroup".

I've got some time right now, so I'm going to hijack the thread a little for some other items relevant to AI.

For those of you who didn't catch it, Sam Altman has had a busy week. First, Ronan Farrow did an expose on him in the New Yorker that did not paint a good impression of the man.

He has two traits that are almost never seen in the same person. The first is a strong desire to please people, to be liked in any given interaction. The second is almost a sociopathic lack of concern for the consequences that may come from deceiving someone.

The word sociopath comes up more than once, even in a quote from Aaron Swartz:

Not long before his death, Swartz expressed concerns about Altman to several friends. “You need to understand that Sam can never be trusted,” he told one. “He is a sociopath. He would do anything.”

The article is not paywalled, and it's an interesting read.

Shortly after the article was released, OpenAI's media relations team noted that Altman's house was firebombed by a lone individual.

This is where it gets interesting. I don't interact with a lot of engineers in my daily life outside of work. Most of my social group is blue collar (service industry, trades, retail), college faculty and staff, or retirees (musical connections). Someone has brought it up in every social interaction I've had in the last 24 hours, and in every case, the general sentiment was that it was a shame the guy didn't have better aim.

I was shocked. I've never seen anything quite like it. Previous recent violent attacks each had at least somebody that didn't like it. We've discussed before that a lot of Americans don't like "tech bros" and "executives" in the "Epstein" class, but I think I've severely mis calibrated how deep that loathing goes. At this point, I think that if a Mag 7 CEO got his face hacked off with a machete on live TV, the modal opinion of an American citizen watching would be indifference.

I'm not sure what the equilibrium is here, but it reminds me of the five guys CEO giving his employees a bonus so he didn't get assassinated.

In other news, Stella Lauranzo, the head of AMD's AI division, used Claude to do a fairly damning analysis of Claude's recent performance, with Lauranzo and Claude reaching the conclusion that Claude is unusable for complex engineering tasks in its current state.

This is interesting. It's not often that someone with clout in a company the size of AMD will put their name on something like this. It's also somewhat telling that Anthropic gave a polite non answer and closed the ticket.

The ticket is AI-generated, and therefore verbose even by the standards of this forum, but it seems to bring receipts. It appears that Claude Opus 4.6's capabilities are degrading for some reason.

My immediate takeaway from this is that you can no longer assume a named model and version will maintain the same capabilities over its lifecycle. Beyond that, it may explain some of my tribulations trying to get useful output from Opus 4.6. I may have simply been late to the party.

This does suggest that local models are probably a better answer for personal use. I've been messing around with Gemma 4, and I don't know if it's "there" yet, but it's better than the last Llama I tried.

Context

quiet_NaN birb_cromble 2d ago

My immediate takeaway from this is that you can no longer assume a named model and version will maintain the same capabilities over its lifecycle.

It sure looks like that. Anthropic might not be evil in the actively-building-the-torment-nexus way, but from reading the comments on github they are either saving compute or intentionally sabotaging existing models so that their users will upgrade when the next model comes out, both of which are things one would avoid in companies one does business with.

The obvious solution would be to separate the development and the hosting of the models. So you would pay Anthropic for the license to run the model and Nvidia (or whomever) for the inference, with the idea that the computation provider has no incentive to care whether you prefer this or that model, and will thus simply run it without cutting any corners. Just like Intel does not really care if I run Linux or Windows or whatever on my CPU.

One of the problems would be that the data center would obviously need the node weights (which are probably worth billions to China in a way the binaries of Windows are not), but there are already solutions for that for using LLMs for classified government data. You would not need hundreds of compute vendors which have access to Claude's weights, but perhaps three or four. And of course LLM vendors might whine about having to fix jailbreaks of their models so that they can't be used for bioweapons research (or whatever the scary thing of the day is), but at least people would be notified that their model got mandatory updates, rather than "as of March 8".

I imagine that a LLM company is always living on borrowed time. A business decision which will make you appear as a trustworthy partner, but also decrease the hype for your product and thus give you a few tens of billions less of money to burn might result in you losing your lead and getting sidelined. So instead you hype-maxx, whatever that takes, and if it takes selling tokens below cost to establish that your LLM is the best in a domain, and then later pulling the rug from under your customer base then you simply do that.

This also makes me slightly more pessimistic about ASI alignment. Charitably, it could be "Anthropic cares so much about ASI alignment that they are beyond any lesser concerns than winning the AI race." But realistically, if one side in a civil war decides that their victory is more important than anything and every crime they commit will be worth it a hundredfold once they have won and established their utopia, at least nine times out of ten their envisioned utopia will be some sort of hellscape. Empirically, there seems to be a limit to instrumental convergence in humans, and you can learn a lot about the character of your date by observing how they treat the waiter, or a general by what lines they will cross.

Context

EverythingIsFine Well, is eventually fine quiet_NaN 1d ago

In the modern internet, servers are nicely scalable. Get more customers? Spin up a few more virtual AWS boxes, so to speak. However, AI compute is not like this. Claude has only so much, some of which needs to be reserved for new models to keep R&D going (and the AI sector is so rivalrous that you cannot even buy compute from a rival even if the economics would theoretically work out for both parties)... and all of the sudden a few months back they got a giant wave of sign-ups. A perfect storm of "hey Claude's actually pretty good compared to ChatGPT", Claude Code positive PR, OpenAI's ad foray, favorable PR from the Defense Department feud, etc.

So of course limits need to go down, even for paying users. Supply and Demand 101. But most users aren't used to internet-era resources being subject to supply and demand (as I mentioned server compute is usually supremely elastic for the last 15 years or so) so they see Anthropic as acting "nefarious". This is not true. They are definitely not sabotaging models just to sell the next model because they are literally incapable of accepting more customers without hurting existing ones. No company is going to outright say "sorry we don't want your money" so of course they don't, but that's the reality. They've looked at 2-year projections and feel OK about it, so for the moment they will just circle the wagons and deliver 70% of the value most customers expect, just enough not to lose too many and keep people hungry for more in a few months when (I assume) more compute comes online. No conspiracy, and really no intent to harm either I don't think.

You're correct that at least in theory there are ways around this like farming out the inference in a more traditional manner. But 1) I don't think the economics actually work out very well for this and 2) the proprietary sauce of how inference is delivered to end-users at scale is actually very valuable, any outsourcing thus endangers the golden goose there too - on top of the weights issue point you bring up. Also, inference at scale requires certain hardware, and most of the existing flexible-type compute is more CPU-bound (to oversimplify massively). You absolutely can serve inference in a more fragmented (and thus flexible) way on worse hardware but costs spike pretty substantially, from what I understand (various caching optimizations and energy usage and also all sorts of latency issues crop up).

I think reputational damage is overblown for Anthropic; it's really not so bad to have a product so good they can't make enough of it. Compared to OpenAI they are sitting pretty. They can still cannibalize other AI providers too for more headroom, they aren't maxxed out in terms of customer base like OpenAI arguably is. Their relative success at B2B sales and integration only magnifies this.

More to the "evil" point - Dario Amodei is the nerd who left OpenAI because he was the type to whine that a new model hadn't done safety testing yet before being deployed. We even see this crop up in that one Sam Altman article. And he's in charge. Business pressures are strong of course, but Amodei I'm pretty sure is one of the rare diehard true believers in alignment, so while he might be insane in a different way, it won't be an alignment issue, my sense is that it would take another order of magnitude more investor pressure for that to be a notable threat.

Context

FearandLoathingintheMotte quiet_NaN 1d ago

computation provider has no incentive to care whether you prefer this or that model, and will thus simply run it without cutting any corners

Unfortunately wouldn't work. Provider has every incentive to have you use less compute, and thus could attempt to quantize or mini-fy models it is serving you

Context

HereAndGone2 birb_cromble 2d ago

I'm not sure what the equilibrium is here, but it reminds me of the five guys CEO giving his employees a bonus so he didn't get assassinated.

Okay, where is that coming from? The linked story reads like "family business, CEO is one of the family and not a bought-in outsider, guy is old school enough to reward workers for going above and beyond". I don't see anything about "he was scared someone would attack him", unless you have better sources on that.

Sam Altman is a different case. I don't know if anyone outside the Silicon Valley/Bay Area bubble likes Altman, as news stories about him have pretty much been "Machiavellian scheming to win against rivals who tried to oust him on principles" and the impression you get from reading those is "Sam's one true devotion is to the Almighty Dollar, ignore all that blah about wanting to improve things for humanity, the one improvement Sam wants is in his bank balance".

Context

birb_cromble HereAndGone2 2d ago

In the interview, he made a joke that he didn't want to get whacked. It might have been a joke. It might have been a "ha ha only serious" thing.

https://www.theguardian.com/us-news/2026/mar/27/five-guys-ceo-workers-bonus

Context

urquan Hold! What you are doing to us is wrong! Why do you do this thing? birb_cromble 2d ago · Edited 2d ago

This is where it gets interesting. I don't interact with a lot of engineers in my daily life outside of work. Most of my social group is blue collar (service industry, trades, retail), college faculty and staff, or retirees (musical connections). Someone has brought it up in every social interaction I've had in the last 24 hours, and in every case, the general sentiment was that it was a shame the guy didn't have better aim.

This is crazy to me — I’m pretty sure most of the people around me couldn’t name Altman even if asked. People use ChatGPT, sometimes Gemini, sometimes Claude, no one thinks this is going to lead to “AGI” (a term they’re unfamiliar with), and in general ai chat is viewed as very helpful and often better than a google search, ai art is viewed mildly skeptically mostly for “can we believe photo evidence now?” reasons rather than “we must save the poor artists from the horrific slop!” reasons, and most people probably couldn’t name a single major executive involved in AI.

I’m sure the blue tribers around here are angry in these ways, but the “these evil tech billionaires are destroying society!” isn’t something I hear often irl. There’s been a lot of discussion about the Iran war and some about the Epstein files, but AI doomerism or boosterism just… isn’t a thing. It’s a technology people use, no one expects it to radically reshape the world or end it, just disrupt things a bit in the same way the smartphone did.

I guess a lot of people really don’t like AI, but my family and friends, a very small sample size, like it and use the chat models a lot for everyday tasks. I guess there’s going to be some job disruption, but I suspect that’s more because executives believe AI can do more than it actually can. It’s a tool that’s useful as an adjunct to human judgment, and I wouldn’t trust this generation of AI with truly autonomous operation of any real sort.

Context

HereAndGone2 urquan 2d ago

my family and friends, a very small sample size, like it and use the chat models a lot for everyday tasks

It would be funny if the BIG DISRUPTIVE TECH THAT WILL USHER IN THE SINGULARITY ends up being a less annoying version of Clippy or more useful version of Siri that is majority used by ordinary people for trip planning, recipes, and as the modern version of a pen-pal.

Context

ThomasdelVasto Κύριε, ποίησόν με ὄργανον τῆς ἀγάπης σου HereAndGone2 2d ago

Frankly to me this has always seemed like the most likely outcome by orders of magnitude.

Context

OliveTapenade urquan 2d ago

My experience with normies, mostly co-workers, is that there's mild awareness of AI, but mostly in a "oh no, are management going to make us learn this as well?" kind of way. It sounds like yet another annoying thing that management might require everybody to learn and use, when we'd really all prefer to just get on with our jobs.

Managers themselves are interested in it and moderately enthusiastic - the most recent pitch has been for an AI tool that's supposed to listen to conversations and then accurately transcribe them, thus improving accountability and documentation - but that enthusiasm is not mirrored on the ground at all.

Absolutely nobody knows who Sam Altman is, or what 'AGI' stands for. Nobody.

My impression overall is not that people are dogmatically anti-AI, or have some strong ideological stand against it. It's just another instance of stupid computer bullshit that the bosses are going to try to make us deal with. Nobody likes it, but nobody likes any of the digital systems that get promoted from above. It's just plain old more of the same.

Context

birb_cromble OliveTapenade 2d ago

Absolutely nobody knows who Sam Altman is

Nobody knows who Sam Altman is, but "AI CEO who's stealing all the water" has filtered all the way down to my rod and gun club.

Context

HereAndGone2 OliveTapenade 2d ago

that's supposed to listen to conversations and then accurately transcribe them, thus improving accountability and documentation

It would need to get way better than the current AI models on Youtube videos, which routinely misspell and misuse words because they can't extrapolate from context and are just going by the sound of the word.

Context

HorthyMiklosKatonaja OliveTapenade 2d ago

When I see "Sam Altman", I always think of Mahasamatman from Zelazny's Lord of Light.

Context

TowardsPanna OliveTapenade 2d ago

Where do you draw the line between "normie" and "nerd" or whatever else?

Context

OliveTapenade TowardsPanna 1d ago

I suppose the implicit heuristic I'm using is something like "knows how to use a computer". I'm thinking of co-workers who do fine with the systems they've been taught but the moment the computer does something they didn't expect, they call for IT or ask me.

Context

Bartender_Venator urquan 2d ago

The descriptions he gave combine to code for a deeply blue (except for the tradies) and often very online/news-addicted circle. I hear that stuff from that sort of people all the time now - it only started a couple months ago for the most part, but it's already getting fanatical.

Context

birb_cromble Bartender_Venator 2d ago

The tradies have a pretty big punk subculture that also lean left. That was the group I saw last night.

@OliveTapanade is right though - they just know an "AI CEO" was involved - not Altman specifically.

Context

Iconochasm All post-temple whore technology is gay. birb_cromble 2d ago

I was shocked. I've never seen anything quite like it. Previous recent violent attacks each had at least somebody that didn't like it. We've discussed before that a lot of Americans don't like "tech bros" and "executives" in the "Epstein" class, but I think I've severely mis calibrated how deep that loathing goes.

I don't know how much you can generalize this. "Sam Altman is a literal Captain Planet villain who literally did the meme" is a take I've heard from relatively normie friends.

I'm an AI doomer/skeptic, but I don't hold an animus against the tech industry, and if I were going to fedpost irl, I think Altman might be the most deserving person in the world, on sheer utilitarian / self-defense grounds. He's the sort of fucker whose story ends with the use of the term "Exterminatus".

Context

bonsaii Iconochasm 2d ago

I'm an AI doomer/skeptic

Both simultaneously?

Context

Iconochasm All post-temple whore technology is gay. bonsaii 2d ago

Yes. I actually had a doomer crashout a few years ago when GPT3 hit the scene and they were claiming it could pass the bar exam, and my tech writer friend was expecting his entire profession to be rendered obsolete within a year.

Three years in, I've gotten used to the way the hype cycles are exaggerated.

But I still think there's a large chance that this tech ends up upending my life in ways I find terrible and existentially horrifying, if not ending all life as we know it. I still think it's the lamest sci-fi tech we could have gotten. I just don't expect it to (up)end society within the next 18 months and I basically ignore the new SMASHED BENCHMARKS SO STRONG AND GOOD AGI COMING OMG announcements.

There's clearly something very powerful with the tech and just as clearly a deranged funding and quasi-religious cult element to how people talk about the tech.

For an example, we have an enterprise version of Gemini at work. I can do things like feed it an Excel file with some info about customers, and have it search publicly available data to fill in contact info, business details, upcoming events, etc. That's pretty neat. I can also get help writing corporate slop emails to the proper corporate slop tone. But they also want this to be something my frontline employees can use to help with their actual job. I was assigned to figure it out and prepare our people to use it and after some investigation and experimentation my actual recommendation to my boss was that no one ranked below me should ever be allowed to use it for anything ever. I don't want to dox, but imagine that my first interaction was having Gemini eagerly offer to coach me through violating HIPAA requirements before underestimating our pricing structure by a factor of ten. I explained the error and it begged for forgiveness and then multiplied the original numbers by 20.

Context

birb_cromble Iconochasm 2d ago

Gemini eagerly offer to coach me through violating HIPAA requirements

We're seeing the same thing from all major providers. It has seriously dampened management enthusiasm for anything more intense than automating LinkedIn posts.

Context

Shirayuki2 bonsaii 2d ago

I'd say I'm both simultaneously. I think it's unlikely that scaling LLM's gets to AGI, so I'm a skeptic in that sense, but it is significantly more progress in AI than I ever expected to see in the 2020's.

With that in mind it does seem likely to me that AGI is achieved in my lifetime, and I think if it does happen then humanity is doomed for all the old Bostrom/Yudkowsky reasons.... don't see what I could do about it though, so realistically it doesn't really change my life very much.

Context

OliveTapenade Iconochasm 2d ago

Huh, I really thought that link was going to the Torment Nexus. I have never seen that comic before.

Context

Iconochasm All post-temple whore technology is gay. OliveTapenade 2d ago

Altman's version was something like "AI is probably going to destroy the world, but there's going to be a lot of really great companies in the meantime."

Just literal Captain Planet villain shit.

Context

aqouta ChickenOverlord 2d ago

it makes you wonder why no one was looking more closely at the results submitted by various AI companies.

Maybe it's just because you've only really been in the crowds repeating the super low quality criticism but among people who are very much expecting AI to be a big deal it is well known that many benchmarks are saturated and even known which labs are more likely to teach to the test(google and chinese labs are notorious for models that perform very well on benchmarks but fail the vibe check) as it is. Zvi Moskowitz's AI roundup regularly has an "on your marks" section where he goes over the current state of benchmarks. That said, for many benchmarks, like swe-bench, they don't actually just let you run whatever bullshit you want on the test, they run your model themselves using a standardized harness so if the models are cheating on the test then they're doing so by themselves hacking the test, which is interesting in its own right.

That said, there is wide agreement that the best way to determine which model is out front is essentially just to use them and see how they do.

Context

omw_68 aqouta 2d ago

What I am reminded of is IBM's "Watson" computer which beat human contestants on Jeopardy. What was annoying to me at the time was that it seemed like everything was rigged to give the computer every possible advantage over the human players. But 15 years later, there is little doubt in my mind that an LLM could easily win on Jeopardy in a fair fight.

I think that AI is wildly overhyped; that AI companies cheat like crazy to make their systems seem better than they are; etc. But at the same time, progress has been phenomenal and I am pretty confident it won't be long until AI catches up with the hype.

Context

jkf omw_68 1d ago

But 15 years later, there is little doubt in my mind that an LLM could easily win on Jeopardy in a fair fight.

Probably -- but it would have to 'cheat' on the buzzer I think -- it's certainly good enough at answering trivia questions that this would work, but if you made it do any thinking at all before buzzing in it seems too slow. (the generalized versions that is)

Context

omw_68 jkf 1d ago

Probably -- but it would have to 'cheat' on the buzzer I think -- it's certainly good enough at answering trivia questions that this would work, but if you made it do any thinking at all before buzzing in it seems too slow. (the generalized versions that is)

Good point, I hadn't thought of that. Although how much of the LLM's delay is due to the fact that lots of other people are using it at the same time as you?

Context

jkf omw_68 1d ago

Yeah I'm sure if OAI wanted to devote resources specifically to Jeopardy it would be no problem for them -- but then you are getting into 'non-generalized versions'. (not to mention highly cost-negative in terms of prize money!)

Context

omw_68 jkf 23hr ago

Yeah I'm sure if OAI wanted to devote resources specifically to Jeopardy it would be no problem for them -- but then you are getting into 'non-generalized versions'.

Agreed, but I think my point pretty much stands. 15 years ago, IBM built a special system for Jeopardy which won, but only by (kinda) cheating. Today, IBM could (probably) build a system which wins without having things rigged in its favor. In other words, AI has arguably caught up with the hype from 15 years ago.

It seems to me that in another X years (5? 15?) it's highly likely that AI will catch up with today's hype.

Context

jkf omw_68 13hr ago

And at the rate the tide is rising, Denver will have beachfront property by then too!

This argument is very tiresome -- a new architecture for language models was developed around five years ago, and companies have since then been hyping it to the moon and feeding it approximately all the data that exists in the world. It's entirely plausible that current SOTA is roughly as far as this approach will go.

Context

omw_68 jkf 7hr ago

This argument is very tiresome

Tiresome or not, it's at least somewhat consistent with the history of computing.

Context

birb_cromble aqouta 2d ago

That said, there is wide agreement that the best way to determine which model is out front is essentially just to use them and see how they do.

I wonder how hard it would be to put the big four models into some kind of agentic thunderdome, where they all have the same token budget to both solve a problem and fuck over the competition.

I am sleep deprived and fighting off food poisoning, so this might not be a coherent idea.

Context

self_made_human amaratvaṃ prāpnuhi, athavā yatamāno mṛtyum āpnuhi birb_cromble 2d ago

There's the Vending Machine Bench, where models are competing to keep a virtual business going while making more profit than their competitors.

Context

aqouta birb_cromble 2d ago

I believe there is a team that does this with the game diplomacy some times.

Context

sarker ketman hetman ChickenOverlord 2d ago

This would be interesting if the primary purpose of LLMs was performing well on benchmarks. The benchmark is a measure, which may be flawed for various reasons. I think everyone who isn't a grifter understands this.

In the real world, I've never heard of anyone who says a model is good because it scores well on benchmarks, or choose one model over another due to its performance on benchmarks. From Zvi:

Benchmarks have never been less useful for telling us which models are best.

They are good for giving a general sense of the landscape. They definitely paint a picture. But if you’re comparing top models, like GPT-5.4 against Opus 4.6 against Gemini 3.1 Pro, you have to use the models, talk to the models, get reports from those who have and form a gestalt. The reports will contradict each other and you have to work through that. There’s no other way.

Context

FearandLoathingintheMotte sarker 2d ago

There is a crazy gulf of understanding in those who read Zvi and those who do not.

One could argue he drinks the kool-aid too much, and he does, but people in this thread acting like benchmarks were only just discovered to be gamed is so funny.

The phrase "bench-maxxed" is like 2 years old at this point.

Context

phailyoor ChickenOverlord 2d ago · Edited 2d ago

Since I can guess the contents of the article without reading it (slop melts the brain, it's actually harmful to try) - I assume the result is the following and not that interesting:

The test scripts used to run most common AI benchmarks are vulnerable to exploits, and by running this exploit, it's possible to score a perfect or high score on these benchmarks without actually solving the tasks given in the benchmarks.

Counterpoint:

For commercially available models, you can quite readily run the model on the task yourself if you have the money. And you'd see that the model completes the task, without executing a bypass, and performs similarly to what is advertised. Given that nobody has actually reported seeing top commercial or open weights models hack SWE-bench or similar benchmarks, this exploit is a neat trick but does not invalidate previously published results.

Ana analogy would be if you gave a class of students a test and accidentally stapled the answer key to the packet but backwards and upside down. Fortunately you did video proctoring and can see that nobody noticed or looked, so you're all good.

What we do know is that models do train on the benchmarks specifically, so due to this they will tend to perform better on those than on real world tasks. Classic case of goodharting, but this is nothing new.

Context

phailyoor ChickenOverlord 2d ago

I saw this, got to

No reasoning. No capability. Just exploitation of how the score is computed.

and closed the article. AI slop detected.

I would have thought that since they are presumably "real" researchers putting the Berkeley name on their work, someone might be bothered to have some basic human decency and spend an hour or two to have a human write a report.

There's certainly the possibility that there's a real genuine result here. But I would rather watch Morbius than than dig through this heaping stinking pile of hideous AI slop. (Hint: I'm not going to do either)

Context

What is this place?

This website is a place for people who want to move past shady thinking and test their ideas in a court of people who don't all share the same biases. Our goal is to optimize for light, not heat; this is a group effort, and all commentators are asked to do their part.

The weekly Culture War threads host the most controversial topics and are the most visible aspect of The Motte. However, many other topics are appropriate here. We encourage people to post anything related to science, politics, or philosophy; if in doubt, post!

Check out The Vault for an archive of old quality posts. You are encouraged to crosspost these elsewhere.

Why are you called The Motte?

A motte is a stone keep on a raised earthwork common in early medieval fortifications. More pertinently, it's an element in a rhetorical move called a "Motte-and-Bailey", originally identified by philosopher Nicholas Shackel. It describes the tendency in discourse for people to move from a controversial but high value claim to a defensible but less exciting one upon any resistance to the former. He likens this to the medieval fortification, where a desirable land (the bailey) is abandoned when in danger for the more easily defended motte. In Shackel's words, "The Motte represents the defensible but undesired propositions to which one retreats when hard pressed."

On The Motte, always attempt to remain inside your defensible territory, even if you are not being pressed.

New post guidelines

If you're posting something that isn't related to the culture war, we encourage you to post a thread for it. A submission statement is highly appreciated, but isn't necessary for text posts or links to largely-text posts such as blogs or news articles; if we're unsure of the value of your post, we might remove it until you add a submission statement. A submission statement is required for non-text sources (videos, podcasts, images).

Culture war posts go in the culture war thread; all links must either include a submission statement or significant commentary. Bare links without those will be removed.

If in doubt, please post it!

Rules

Recommended Realtime Chats

Link copied to clipboard

Action successful!

Error, please try again later.