Culture War Roundup for the week of February 23, 2026

This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.

Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.

We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:

  • Shaming.

  • Attempting to 'build consensus' or enforce ideological conformity.

  • Making sweeping generalizations to vilify a group you dislike.

  • Recruiting for a cause.

  • Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.

In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:

  • Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.

  • Be as precise and charitable as you can. Don't paraphrase unflatteringly.

  • Don't imply that someone said something they did not say, even if you think it follows from what they said.

  • Write like everyone is reading and you want them to be included in the discussion.

On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.


Netflix ditches deal for Warner Bros. Discovery after Paramount’s offer is deemed superior

In a development that was predicted by many, Paramount Skydance owned by the Ellison family has derailed the Netflix acquisition of Warner Bros and will acquire the company.

The Ellison family are close, long-time personal friends and business partners with Netanyahu. For example, in 2020 Haaretz reported that Ellison intervened to secure a lawyer to represent Netanyahu in his corruption case. In 2021 Larry Ellison offered Netanyahu a paid role on Oracle’s board. Larry Ellison has widely been described as the largest private donor to the IDF.

With Paramount Skydance now poised to acquire Warner Bros, here is an incomplete list of the properties that now have or will shortly come under ownership of the Ellison Family.

August 2025 - Hostile takeover of Paramount:

  • Paramount Pictures
  • Paramount Television
  • CBS
  • CBS News
  • CBS Sports
  • Nickelodeon
  • MTV
  • BET
  • Comedy Central
  • Showtime
  • Paramount+

We saw the immediate fallout from the Ellison family acquisition of CBS News with the installation of Bari Weiss to lead the editorial direction of the company.

Only last month in January 2026 the TikTok algorithm came under the ownership and control of Ellison-owned Oracle, so for practical purposes TikTok should be added to this list:

  • TikTok

Now with the impending takeover by Paramount of Warner Bros, these properties will come under the control of the Ellison family:

  • Warner Bros. Motion Picture Group
  • CNN / CNN International
  • HBO / HBO Max
  • Discovery+
  • Discovery Channel
  • DC/DC Studios
  • TLC
  • TNT
  • TBS
  • Turner Classic Movies (TCM)
  • Animal Planet
  • Science Channel
  • Travel Channel
  • American Heroes Channel (formerly the Military Channel)
  • Cartoon Network / AdultSwim

This is the consolidation of American Media under the Ellison family in just six months.

Netanyahu has described International Media as the "eighth front" of Israel's wars, it is wildly intolerable to me that the Trump Administration would permit the consolidation of this breadth of news media into a family with foreign loyalties and foreign ties. The reforms at CBS News and elevation of Bari Weiss also prove a political motive and intent to fight this Eighth Front defined by Netanyahu, which is a front Israel is fighting against the minds of the American People.

At what point do "conspiracies" stop becoming conspiracies and start just becoming recognized as plain reality? All the "conspiracy theorists" predicted from the beginning the TikTok algorithm would be handed to an apparatus loyal to Israel, and that Paramount Skydance would not allow Netflix to win Warner Bros, because they are fighting a geopolitical war and not optimizing dollars and cents.

All of this while the United States mobilizes to fight another war for Israel... but deporting Somalis from Minnesota is just impossible we are told.

This is like the joke where the Jew reads Nazi propaganda because in the real world, the Jews are suffering, but the propaganda says that Jews are all rich doctors, lawyers, and bankers and he wants to read some good news.

I wish the media were controlled by some pro-Israel group. Most of the media hates Israel and tends to uncritically publish anti-Israel propaganda. Ellison is not going to be sending directives to all the companies he owns to not publish particular things. Company ownership doesn't work that way. And when he does exercise control, being nice to Israel will be a minor item in the list. People online can be single issue posters; billionaires are not single issue spenders.

Most of the media hates Israel and tends to uncritically publish anti-Israel propaganda.

What do you base that assertion on? My observational experience has been the polar opposite.

but deporting Somalis from Minnesota is just impossible we are told.

Over 90% of Somalis in Minnesota are US Citizens, and most of the rest are legal permanent residents. I'm unaware of any legal mechanism for deportation of US citizens. About half of them are naturalized and could theoretically be stripped of citizenship, but that would require finding fraud in the naturalization process, which would require actual evidence.

I don't care, we can remove citizenship under existing laws, and we can simply change the laws if we wish.

Deport all foreigners, especially the paper citizens, especially the anchor babies.

We can't remove citizenship just because you "don't care." There has to be a legal justification, not just an emotional response.

If you get the revolutionary government you would like, I'm sure that will be on the table. But for now, you're stuck with the actual laws of the United States.

Appealing to things I don't care about will never change my mind.

I don't have any illusion your mind is changeable.

The legal justification can be a judge saying "That amendment I don't like doesn't exist in my courtroom.".

If I may channel my inner FC: laws only matter to the extent that armed men are willing to kill and die for them. All it takes is a judge saying "I don't care, get rid of 'em.", armed men taking action, and no critical mass of armed men willing to resist the first group with blood and death, and just like that, then the laws are just so.

Well, you're not wrong that armed men unopposed by other armed men can render laws meaningless.

I'm not in favor of this method of conflict resolution.

especially the anchor babies

Would take a Constitutional amendment or the Supreme Court reversing its 19th century decision that Wong Kim Ark is a US citizen since he was born in the US and his parents were certainly subject to the jurisdiction of the US government.

You, like the rest of the liberal commentariat, are misunderstanding Wong Kim Ark.

Ark died in China, because he was Chinese, not American.

And you clearly misunderstand the preamble to the Constitution. You see, when they said "ourselves and our posterity", they were clearly only referring to a select group of people, i.e. those whose signatures appear at the end of the document. If you happen to be directly descended from one of those fine gentlemen, you're welcome to stay. If not, THEN GET OUT.

We the People of the United States do ordain and establish this Constitution for the United States of America.

It's We the People of the United States that are the subject of the other clauses. That's simple reading comprehension.

You'd be right if it read We the Undersigned, but unfortunately for you it doesn't.

Is this the liberal commentariat or the federal courts promoting this understanding of the 14th Amendment? To the degree you can call this a particular view, it is the officially correct one as explained by the Supreme Court.

It is wildly intolerable to you that a jewish person own a slate of failing media companies and niche AWFL providers? Hamas has a huge western media slate. So do the UAE, and Saudi Arabia, and the Chinese and Koreans and Germans and Russians etc. etc. Basically everyone with a billion dollars to burn is using it to try to get five billion dollars out of the US government.

Rishi Sunak, former British PM, is now on the board of Microsoft and Anthropic, and an advisor to Goldman Sachs. Is that perfidious Albion?

Gerhard Schröder, former German Chancellor, is now on the board of Russian energy monopoly Rosneft.

God only knows how many corporate boards all the various pardoned Bidens sit on.

If this state of affairs is wildly intolerable to you, I suggest you go through the groups/countries alphabetically. That way we can see how many you left out before you get down to "J".

Is that perfidious Albion?

He's not English, so of course not. It's further Indian infiltration.

Why do so alphabetically and not attempt a good faith ranking of foreign influence? Surely whoever has more control, and specifically more control over military policy, is of greater concern to the American Nationalist?

I think if one looks seriously it's going to be shaped more like intelligence and influence networks than along any national lines, but why refuse to look?

I'm happy to stipulate that the Israelis and a distinct population of western jews both punch above their weight class in terms of media and political lobbying. But it's a big world, and there's a lot of weight classes. The motte is that jews are disproportionately influential. The bailey is that they outweigh realpolitik involving actual heavyweights like China.

There's also significant disagreement even among Israelis about policy, and obviously between western jews and Israelis, so the effects of their influence and lobbying are a bit muddled.

A good faith ranking at least conceivably puts the Anglosphere on top ("Murdoch was a long con by the Queen" is probably a bit far, though), followed by the rest of NATO. The French have gotten us into multiple wars (Vietnam, Libya, arguably other conflicts in Africa and the Middle East) with higher casualty counts than Israel even if you accept conspiracy theories. At the time, both World Wars had relatively strong isolationist movements who would argue our involvement there was similarly "serving foreign interests".

And I'd give honorable mention to Japan and Korea, which punch well above their weight culturally and (arguably) the US keeps getting yanked along to serve their "contain China" interests, even if many argue it's also in American interests. In fact, that's probably how most of these examples work anyway: convincing people that "$country interests are American interests".

Last I checked France and Britain hadn't received a hundred billion in direct aid, nor did the US expend a fifth of the global missile interceptor supply on their behalf. Any intervention on their behalf involved the (theoretical) prospect of American benefit and when it didn't America was just as happy to crush them as in the Suez Canal Crisis.

and before you ask, Lend Lease was just 33 billion and that was, as the name implies, a loan.

AFAIK the US didn't provide direct financial aid to the French in Vietnam or Libya (although it did provide aid to South Vietnam), but did get dragged into a French-started conflict in ways that involved active service members, which is in many ways worse than merely providing materiel.

If direct aid is your metric, Ukraine seems quite relevant: they've gotten hundreds of billions in materiel aid, including no small number of ABM interceptors.

And one of the men who killed Kennedy is rumored to have been a French soldier and OAS member.

Oswald? Or one of the grassy knoll guys?

Grassy knoll (or the Dallas Textiles building)

Why do so alphabetically and not attempt a good faith ranking of foreign influence?

To better hide the Zimbabwean puppetmasters, of course!

Good lord! Their devious perfidy goes back all the way to moving "Zeta" from the 6th position of the Greek alphabet to the last of the Latin! Now THAT'S true control of the media!!

This whole "it's the jews" schtick is tiring. No it's not the jews, Larry Ellison just wants to build his dynastic empire and also hand another big flashy role to his son.

Hollywood is already insanely jewed, so this latest chapter in consolidation isn't some crazy plot. If you checked you would also realize that Netflix is also jewed. And we can talk about how this affects the media coming out of Hollywood, but still this acquisition is a nothingburger.

I don’t think Larry “we will do everything we can to support the country of Israel” Ellison, who says regarding Israel “we have a country we can call our own”, who has donated $26mil to the IDF, should have so much influence on American culture and politics. He is not fully allegiant to America. It’s not difficult to find people who are fully allegiant. Those with pure allegiance should be the ones owning the biggest social media algorithm + the biggest movie company + the biggest streaming services + CNN. When you type it all out, doesn’t it seem reasonable? “The guy who is in ultimate control over how the masses think — I don’t want that guy being loyal to a country that is considered one of America’s top espionage and influence threats”. The cost isn’t trivial, it is tens of thousands of lives in a war with Iran. We can easily fall into an Iraq again due to media and political influence, and the cost of that was genuinely insane. I think the 3 trillion dollar price tag alone is insane.

It’s not strictly-speaking a Jewish thing; someone like Michael Dell or Mark Zuckerberg is fully allegiant to America while being Jewish. You just have to pick the guy who doesn’t say “I will do everything to support Israel, our homeland” while donating millions to their army. If a Palestinian businessman said this in Israel he would be thrown in jail on false charges within the hour, and then probably tortured.

I don’t want that guy being loyal to a country that is considered one of America’s top espionage and influence threats

Weasel words. "Considered" by whom?

The NSA: https://www.newsweek.com/israel-flagged-top-spy-threat-us-new-snowdennsa-document-262991

The term "manipulation/influence operations" refers to covert attempts by Israel to sway U.S. public opinion in its favor. In this, Israel has dubious company, according to the NSA: Other leading threats were listed as China, Russia, Cuba, Iran, Pakistan, North Korea, France, Venezuela and South Korea.

Also, American-allegiant scholars like John Mearsheimer and Jeffrey Sachs, and John Kiriakou the CIA insider

Fair enough, although I can't help but note your source is a decade old and based on a report published nine years prior.

You think things have gotten better since?

I think if Israel currently posed as grave a threat to American national security as claimed, it shouldn't be hard to find a credible source arguing as much within the last decade.

Not sure what you're asking for here. You want someone to leak a more recent NSA Strategic Mission List document? Who else could give you a recent picture of state-level espionage operations and why would they?

I work for a DIB company. We've written internal policies that prohibit using any Israeli hardware or software products due to security concerns. Speaking with colleagues in other business units and partner organizations they are working under the same or similar restrictions.

Netflix has plenty of Jewish employees, obviously, but it’s controlled, run and was founded by Reed Hastings, who is of Boston Brahmin Mayflower arch-WASP pedigree.

Warner Brothers and Paramount were both founded by Jews. I don't see how this changes much.

All of this while the United States mobilizes to fight another war for Israel... but deporting Somalis from Minnesota is just impossible we are told.

I'm not sure if the implication is that this contingent approves of Somali muslims ripping off US/MN welfare programs, a good fraction of that cash ending up being sent as remittances which end up in the hands of groups not exactly known for their love of Israel.

If this is your conspiracy, sure. But quite frankly it doesn't make a ton of sense.

Jews want to live in a multicultural state. Flooding Minnesota with minorities is in line with what they want.

A few Somalis are far less of a threat to jewery than a homogenous society is.

Jews want to live in a multicultural state and so they decided to import a group of (checks notes) virulent antisemites.

That’s your grand theory? Really?

Much rather than live in a homogenous state where their culture and ethnic background sticks out. They are a naturally cosmopolitan people and naturally gravitate to the most diverse parts of the US and Europe. They are mainly afraid of white homogenous societies, not individual immigrants who might go a bit bonkers. The diversity has never really done any serious harm to jews.

Do you think the modal member of a homogenous white US society would be more or less antisemitic than the modal Somali immigrant?

Do I think a homogenous society is more difficult for ethnic minorities than diverse societies? Absolutely.

Also the worst times for jews have been in all white societies.

That wasn't my question. I asked about the specific muslim somalis that were germane to the discussion.

And to hazard my answer to the question: I think the median WASP is far less antisemitic than even the 95th percentile most tolerant somali. It's not even close.

Various Somalis are in no way an existential threat to Jews and they would rather live in an area with some Somalis than an all white area.

Also the worst times for jews have been in all white societies.

What does Purim celebrate again?

A genocide of persians.

Didn’t a Jew, Jacob Frey, become mayor of Minneapolis by exploiting that group’s inter-tribal conflict? So, yes, absolutely.

Jews want to live in a multicultural state and so they decided to import a group of (checks notes) virulent antisemites.

Well, supposedly it'll be fine when whites become a minority in America (if we aren't already) despite importing groups of Mestizos, Somalis, Indians, etc. who if not already virulent anti-whites will become so once sufficiently marinated in the elite ideology. Because, the argument goes, instead of being united by hatred of whitey, and using their growing strength — and the white minority's growing weakness — to intensify the anti-white policies already in place, the moment the white population fraction hits 49.9%, a switch will magically flip in the brains of the non-white population, such that they'll suddenly switch over to fighting one another, and they'll be so busy fighting one another, they'll completely forget their hatred of the white minority, so much that they'll supposedly roll back the extant anti-white programs For Reasons.

Now swap that white minority with a Jewish one. (I note that when one does this substitution — in either direction — quite a few people suddenly change their opinion on whether or not a diverse majority is better than a homogenous majority for an unpopular minority.)

I'm not much for Jewposting but...the explanation for why progressive feminists and pro-LGBT people support it is basically the same and it's not any more flattering to them either. If they can be accused of being short-sighted, why not Jews?

Of course Jews can be accused of anything. But on the one hand there is a claim that they are orchestrating an entire war and at the same time that they are short-sighted enough that they want to willingly add 300K muslims whose policy rounds to "death to the jews".

One can see the "LGBT for Palestine" organizers in a parade and think they are buffoons. But you can't also see them as sophisticated operators playing geopolitical 4D chess. Pick one or the other.

That doesn't make sense. Western countries post-WW2 were much less hostile to Jews than Western countries now. Multiculturalism means a) more Muslims and b) more other foreigners who don't feel post-war guilt about the Holocaust. Aliyah from countries like France is going up as they become more multicultural, as French Jews flee their new Muslim neighbours.

After the establishment of Israel, Middle Eastern Jews fled (or were expelled from) ethnically and religiously diverse countries in order to move to Israel. Diversity means more ethnic conflict overall, which means more ethnic scapegoating of rich groups (i.e. Ashkenazim). Whereas a tiny Jewish population in a homogenous country are much less of a threat.

It's a mistake to overinterpret what happened in Germany. Hitler's rise was driven by Germany's humiliation in WW1, the Treaty of Versailles, and the growing threat of Communism. Other fascist regimes like Italy or Spain were fine with the Jews. Antisemitism was just an idiosyncrasy of Hitler, not a law of history.

Antisemitism was just an idiosyncrasy of Hitler, not a law of history.

Can anyone point to a historical (right- or left-) populist movement in a culturally Christian country that didn't eventually turn anti-semitic? I suppose there is a colourable argument that Disraeli's OG One Nation Conservatism counts as right-populist, but it isn't a central example.

Jews really are over-represented (by somewhere around 10x, as you would expect given high-end estimates of Ashkenazi IQ if we do live in an IQ-meritocracy) in the allegedly-meritocratic elite, so if you think that the allegedly-meritocratic elite is a conspiracy, then it is a Jewish conspiracy.

There is a long history of homogeneous societies turning on Jews because domestic politics required a scapegoat. Admittedly it hasn't happened in an English-speaking Christian society since the late 18th century, and not in a murderous way since 1290.

There is a history of homogenous societies turning on Jews but there was also plenty of antisemitism in corners of diverse empires like the Russian and Ottoman Empires (not so much in the capitals, at least most of the time, but certainly in many of the provinces). In 1980 America was far more diverse than Western Europe and yet had little antisemitism.

In general the “Jews want diversity because Jews do better in diverse countries” point is extremely contrived, it’s gained currency only because it’s promoted both by Jewish progressives who want to defend multiculturalism in a weekly Reform temple sermon and by far-right antisemites who want to ‘explain’ the motive for why Jews supposedly want to destroy formerly-homogenous white countries with mass immigration. There isn’t much evidence for it or against it. Some Jews supported mass immigration, but so did plenty of powerful indigenous Europeans both in Europe and in North America. Jews were more progressive than many other groups on immigration in the mid-20th century of course, but they were also more progressive on economic and other issues (eg being very overrepresented in economic leftist movements), which suggests it wasn’t an immigration-specific thing.

Given that the Ivies famously had a cap on Jewish enrollment, if anything it would be an anti-meritocratic conspiracy on the other side.

Can anyone point to a historical (right- or left-) populist movement in a culturally Christian country that didn't eventually turn anti-semitic?

Do the mid-century fascists count as populist? Because if so, the Italian and Spanish fascists weren't antisemitic, as I mentioned. But why are we limiting ourselves to the past? How about basically all the national populist parties in Europe right now? Reform UK isn't antisemitic. The National Rally in France isn't antisemitic. Fidesz in Hungary isn't. Not Brothers of Italy. Nor, of course, is the MAGA movement.

by somewhere around 10x

I would love to see your workings-out for this claim.

There is a long history of homogeneous societies turning on Jews because domestic politics required a scapegoat.

That's pretty much it. They are a market-dominant minority that are distinctive enough to be considered an outgroup but not so distinctive to be considered a fargroup.

Can anyone point to a historical (right- or left-) populist movement in a culturally Christian country that didn't eventually turn anti-semitic?

While if you look hard enough you can find anti-Semitism anywhere, I don't think the American Populist Party of the 1890s ever gained strong such associations. And I'm less familiar with them, but most other populists and even strongmen I can think of in the Western Hemisphere don't IIRC have strong such associations either (Peron, Chavez, even Pinochet).

Gaullism?

Good answer, although De Gaulle turned against Israel in the 1960s in a way which would be considered anti-Semitic by 21st century American standards.

decisively proving that the 21st century definition of "antisemite" is "one disliked by Jews" rather than the inverse

and of course by that standard any "populist movement" is going to rub against the ruling establishment and Jews in or adjacent to said ruling establishment will identify it as "antisemitic"

Jews have been kicked out of 110 countries. They have been in conflict with Europeans since the roman times. The jews were not happy in catholic spain and worked with the muslims. Jewish communists killed millions of christians in Eastern Europe. They were regularly banned from various European countries.

They do prefer a jewish ethnostate to diversity.

There was just as much reciprocal persecution of Jews in Arab and Persian lands, in fact 18th century Western European Jewish travelers to Jewish communities in Persia lamented the extreme persecution of the local Jewish community, which was worse than anything in Europe at the time. A specific animus against ‘European Christians’ seems unfounded.

The difference is that outright expulsion was rarer in the Middle East because those were already relatively confessionally diverse societies with various random leftover minority groups, both ethnic and religious, from Assyrian Christians to Samaritans to Zoroastrians to Alawites to Eastern Catholics etc etc. The ruling elite might viciously oppress minority groups but these societies didn’t (until the late 19th century) generally consider the possibility of outright ethnic cleansing. It is unclear to me that being expelled is actually worse than being extremely cruelly oppressed by the way.

By contrast, European Christian societies were more exclusive, regularly fighting wars of religion that were explicitly designed to cleanse territories of other flavors of Christianity until comparatively recently, and Jews were often caught up in that fervor. Ethnically and linguistically they were often still very much diverse, but religiously they were more exclusive, and they fought more internal wars, which again have a habit of leading to ethnic cleansing regardless. In addition, and this is again rarely noted even though it’s obvious, the “110 countries” are mostly European because a large swathe of Mitteleuropa consisted of hundreds of tiny micro states for centuries, whereas the Middle East was mostly divided (at the macro level at least) into much larger polities like the grand caliphates and later Ottoman Empire.

I think this may be another funny case of Simpson's paradox. What you're missing, is that Jews are not a monolith.

If you are elite and an allogene, multicultural societies are better for you because they minimize the prestige of being part of the indigenous polity, and thus create more opportunities for you. They're "meritocratic" as opposed to "aristocratic".

This comes with drawbacks, including sectarian violence and the associated public safety problems. But if you're an elite you don't care because you're insulated from such things anyways.

Who cares if your coethnics are being gunned down in the streets? If anything that's more reason people shouldn't be allowed to sneer at you in the country club under any circumstances.

Lebanon might pay a very high price for being such a divided society, but for the one guy that gets a guaranteed government office because he's a specific minority, it's a pretty good deal.

Lebanon might pay a very high price for being such a divided society, but for the one guy that gets a guaranteed government office because he's a specific minority, it's a pretty good deal.

Except there aren't any countries, past or current, where Jews have benefitted from the kind of ethno-religious power sharing that we see in Lebanon or Singapore.

Israel and Russia exist.

In Singapore the power distribution is relatively ‘fairly’ distributed by proportion; the Chinese are in charge, obviously, but they are also the majority. In Lebanon the outcome of the Civil War was an arrangement that is tripartite and so not strictly proportionate but certainly moreso than it was before (and part of the instability is precisely that the Maronites have fewer children but cling onto the power they have, even still). There aren’t many historic examples of states where a large minority of the population (more than 25%, say) have been Jewish. The highest it got in prewar Eastern Europe was 6-10% really, the latter in Poland on the eve of the Second World War. There just aren’t that many Jews.

Used to happen in New York, and I think in many big US cities. Jews, Irish, Italians all had their well-recognised machines.

Lebanon is only good for the elite because they can flee with their ill gotten gains to Switzerland or Dubai after leaving office. If they had to stay in Lebanon the incentives would be very different.

Partially because Lebanese folks, once removed from Lebanon, are truly exceptionally nice people to live and work with. Never met a single one that was an asshole.

Remarkable really how that works.

Most Lebanese expats in America are Christian Maronite elites, many of them even left before the civil war blew everything up. Places that received more Muslim immigrants - especially Australia where about 40% of the Australian Lebanese population are hardline Sunni Muslims - have problems with them.

That is probably correct.

Haven’t the actual leaders mostly stayed in Lebanon? I know Samir Geagea has, despite assassination attempts and prison sentences.

The patriarchs themselves generally remain, but their large extended clans, daughters, grandchildren, many sons, cousins and so on often spend substantial time abroad and have European citizenship.

True, but that's also the general criticism people make of diaspora minorities, that when the going gets tough, all the "elite human capital" just flees and the autochtones have to deal with it because they can't go anywhere else.

I think it's as fair to generalize it to all manners of cosmopolitans. It creates the same bad incentives where you go from place to place extracting and moving away when there's nothing more to extract.

This problem is something Nationalism purports to solve by forcibly integrating the nation and punishing the disloyalty of people who would use ethnicity, wealth or other escape hatches. Whether it's justified in doing so or whether it's successful are ancillary questions, but it's why people do what they do.

It makes enough sense to the jews to have them advocate for it in their own words.

What does it even mean for 'the Jews' to advocate for it? Do they have an international spokesman? Did they take a vote?

Unless of course you mean that you can cherry-pick some examples of left-wing American Jews and conclude that all leftist politics is an invention of the ethnic group you hate?

Sure you won’t find an official top spokesman for pointy hat people, but the ADL is/was fairly close to that position as the official Jewish voice. We can motte/bailey it that they weren’t “official”. The Israeli government didn’t declare them a state actor.

But the ADL fulfilled about 50–70% of being their official spokesman in America for decades.

I'm not sure if the implication is that this contingent approves of Somali muslims ripping off US/MN welfare programs,

Why would it be? He's pointing out that a tiny group of people can get the US government to commit drastic action, even when it's unpopular with most people, while things that are popular with the voting base of the current government somehow never get done.

It's a conspiracy, because every time someone points to the disproportionate influence of the tiny group, they get called a conspiracy theorist.

But that very same group can’t dislodge people fraudulently extracting billions of dollars and sending it to their sworn enemies?

And we know for a fact that they share your assessment of the threat the Somalis represent? Also "sworn enemies" seems like a stretch, didn't Israel formally recognize Somaliland recently?

I feel like one doesn't need to reach hard to think that people chanting "Death to the X" are a threat to X.

Was everyone posting #KillAllMen a threat to men?

Sure, such chants probably mean they don't like them very much, but even murderous hate is not the same thing as being a threat.

Murderous hate seems like a fairly good barometer of being a threat, especially when used in comparison to a different group of people (say, WASPs) who are observed to have a lot less murderous hate.

So the answer to my #KillAllMen question is "yes"?

Somalis often consider Somaliland a Jewish conspiracy lol.

So much tension could be relieved, if all the other conspiracy theories went like this:

- There's a part of our country that's actually functional

- It must be the Jews!

IIRC the Somalis in Minnesota are not fans of the idea of an independent Somaliland.

"Our lands were taken from us before, and God willing, we may one day seek them." - Rep. Ilhan Omar

The idea that they should retake Somaliland is actually the most charitable interpretation of that speech; the uncharitable interpretation is that she was suggesting that they should retake all of "Greater Somalia", including parts of modern Ethiopia and Kenya.

Even if I take your view that “the Jews are buying all media”, I still prefer this result to Netflix buying WBD. Now you still have an independent streamer of scale. I’d rather have Ellison own WBD than have the global corporations Apple, Netflix, and Amazon own the big players.

The Ellison Family has already been shown to be conspiring with Israel proper for the purposes of influencing America: https://jackpoulson.substack.com/p/david-ellis-benny-gantz-anti-bds-black-cube . The “12 wealthy donors meeting in a council for secret influence” is similar to the structure of Wexner’s Mega group.

”Benny meet david. David meet Benny," opened a purported email from Berkowitz on December 23, 2015, before Berkowitz added that he "told david briefly about your [Gantz's] 12 tribe idea which you can expound on to him which he seemed very interested in." Other slide decks and text documents in the leaked Handala archive include proposals from Gantz from a similar time period for a "Counter-BDS Initiative," which he hoped to have funded through 12 prominent Jewish donors.

"Funding for this initiative shall be provided by an exclusive group of the twelve most influential Jewish philantropists [sic, emphasis theirs], symbolizing the twelve Jewish tribes; Israel's government shall act as a thirteenth, facilitating 'tribe'," read the June 10, 2015 proposal purportedly drafted by Gantz's spokesperson, Melody Sucharewicz. The two primary slide deck summaries of the CBI concept coupled the donor analogy with a popular pictorial representation of the 12 tribes, with each of the 12 lead donors being asked to commit $1 million over a five year period.

The slide deck for the Counter-BDS Initiative promised that the effort would be "unprecedented in the use of state-of-the-art cyber technology as a soft weapon against BDS & Co," including through employment of "former IDF intel & cyber officers."

Institutional Judaism’s greatest strength is the seemingly bottomless naïveté and gullibility of the goyim, or at least the eldergoyim, who fail to comprehend that the people who behaved as a cohesive tribe to pursue their self-interest for 3000 years would continue doing so into the future. They fail to learn from their own culture’s history, where they would find the clauses of the Magna Carta that specifically curtailed the influence of Jews who had aligned with the ruling dynasty of England against the common people, and may notice the resemblance to the way Jews operate in America vis-a-vis Trump and Ruling Party. They fail to learn from the opening of the Gates of Toledo, and in fact they do not even believe it when they read it, that Sephardic Jews opened the city walls to allow foreigners to genocide the Germanic ruling families, and so they have no connection to draw when they witness the Sephardic Jew Mayorkas smile next to Joe Biden as he promotes an “unrelenting stream of immigration, nonstop, nonstop” with the purpose of ending the White majority, with the key enemy of Mayorkas’ activism being “white supremacy”. They will read about Epstein trafficking hundreds of gentile girls and women (having no memory of the centuries where the Jews trafficked Slavic Christian slave girls), being funded by the world’s most influential leader of Jewish life, whose Foundation literally selects the elites of Israel, and his innumerable meeting with the former intelligence chief and PM of Israel, at least 60 times in-person meetings within a decade, while working as an agent of the Rothschilds and demeaning the “goyim” dozens of times in his emails (sprinkled with Yiddish), and they will conclude that Epstein was a lone actor. All the while, a Jewish journalist at CNN with no such memory loss writes —

The timing of the US and Israeli attack on Iran bears symbolic meaning in Judaism. Ahead of the upcoming Jewish holiday of Purim, worshippers read the specific portion from the Old Testament, known as Zachor. The passage from the book of Deuteronomy commands the ancient Israelites to remember an unprovoked attack by the nation of Amalek and to eradicate the memory of Amalek once the Israelites are settled in their land. The passage is read publicly before Purim to fulfil the mitzvah of remembering Amalek as Israel’s achetypical enemy.

In a phrase: the goyim have gone insane. They have no immune system, no cultural memory, no threat detection. They have forgotten everything that their ancestors painstakingly wrote down for them. They cannot conceive of the possibility of people putting tribe first, and if you told them what is in the authoritative Jewish holy books, or that the Jews pray thrice a day for community informers to be slain or thank Hashem each morning “for not making me a goy”, they would accuse you (!) of storing hate in your heart.

Seems like a failson wannabe Hollywood mogul centibillionaire heir wasting a huge amount of his father’s money (and some dumb money from outside investors of a similar caliber) on an industry that’s about to be completely upended by AI.

Smart move for Netflix to walk away, say what you will, these assets won’t be worth $100bn in 5 years.

As to your last point, the person doing more than anyone else to deport Somalis from Minnesota (including almost all gentile GOP politicians) is Jewish.

It actually doesn't seem like that at all, it literally seems like Israeli agents acquiring property to make sure American Media is pro-Israel and pro-Jewish. You can't say "oh they just want to waste their money on clout" to explain Bari Weiss being installed at CBS New, or the political sequence of events that led TikTok algorithm into falling in the lap of Ellison (hundreds of Jewish lobbies lobby for the TikTok ban because TikTok is antisemitic, then the ban happens and Ellison gets the algorithm, that's not "bored billionaires")- which Netanyahu directly identified as the most important development in Israel's eighth-front war. That's not "bored billionaires trying to buy everything" it's something else.

ADL chief Jonathan Greenblatt was heard on leaked call demanding something must be done about TikTok due to declining support for Israel among young people, hundreds of Jewish organizations throw their weight behind a Tik Tok ban, a week ago Time publishes in article by Anthony Goldbloom titled Why TikTok Needs to be Sold or Banned Before the 2024 Election which hardly mentions anything about some national security threat from CCP, and instead under the heading "Why it Matters" complains about the portion of pro-Palestinian hashtags on the platform and the spread of antisemitism...

TikTok says users decide whether to post and engage with content on #FreePalestine rather than #StandWithIsrael. But, content moderation decides what posts stay up, what gets taken down, and what accounts get banned from the platform. And it’s TikTok’s algorithm that decides what circulates and what doesn’t.

For anyone who doubts the causal link between TikTok and the rise in antisemitic incidents we’ve seen on U.S. campuses: a November 2023 study conducted by Generation Lab, which I helped to organize, showed that people who spend 30 minutes per day on TikTok are 17% more likely to agree with anti-semitic statements like "Jewish people chase money more than other people do."

They want control over the moderation and algo, as ADL has control over the moderation of Reddit and nearly every platform except X only since Musk's takeover.

It actually doesn't seem like that at all, it literally seems like Israeli agents acquiring property to make sure American Media is pro-Israel and pro-Jewish.

The word 'seems' is doing all the work here.

Does it seem like that? I believe it seems like that to you. But when you spill the milk, it seems to you like it was the dastardly Jews making the handle slippery.

Thus in this case, some Jewish organisations hate TikTok. Okay. But Christian organisations also hate TikTok. Muslim organisations hate TikTok. Secular groups hate TikTok. I hate TikTok. If you cherry-pick everything involving a Jew and then strategically ignore the rest (or assume the rest to have been conned into it by those ever-so-sneaky Jews), it is possible to curate the illusion that Jews are the sole movers of public policy. The ADL hates TikTok too? Get in line.

The common ingredient here isn't Jews so much as it is you. The seeming is in your head, not in the world.

The word 'seems' is doing all the work here.

"Seems" was a tongue-in-cheek mirroring of the comment I was responding to. The means, motives, and opportunities are clear as fucking day, and people attributing it to motives like "spoiled heir" are just obviously wrong. It was a sarcastic qualification to avoid the banhammer for building consensus, but to avoid ambiguity: it doesn't seem that way, it is obviously what is happening and I don't think rational people have any more leeway to attribute other explanations like "bored, spoiled heir" at this point.

What's missing here is any reason to think that 'the Jews' are an operative factor. I know that you think they're behind everything, but if all you have to say is "it's clear as day and if you disagree you're irrational", I don't think that's much of a contribution.

Thus in this case, some Jewish organisations hate TikTok. Okay. But Christian organisations also hate TikTok. Muslim organisations hate TikTok. Secular groups hate TikTok. I hate TikTok.

And among those, doesn't "Tiktok is bad because it's antisemitic" stand out like a sore thumb? When they bought Tiktok, did they change it to stop it from being brainrot, or were they completely happy to keep the brainrot as long as it wasn't "antisemitic"?

As far as I'm aware Jewish criticism of TikTok has frequently featured anti-semitism as a theme, but I don't view it as particularly sinister for Jews to be worried about anti-semitism. I think every group is allowed to be worried about people hating them. In this regard as far as I can tell there is no particular difference between Jews and other groups that are often hated on social media. It's true that the Christian and Islamic responses focus more on TikTok and social media being psychologically harmful in a general sense, and personally I think those responses are better. I think the Christians and the Muslims have one up on the Jews in this particular case. (I think the specific Pakistani fatwa against TikTok has a misaimed criticism, and overall I think the Christian responses are the best and most nuanced - but then, I am a Christian, so I may well be biased.)

I should also clarify that I'm not saying that Jewish groups are necessarily right in terms of policy direction - for instance, recently in Australia, Jewish organisations have been strongly in support of new hate speech laws after Bondi, whereas other religious and secular groups have, in my opinion correctly, raised concerns that these laws are rushed and will not help.

But I suppose my position is that insofar as there is lots of anti-semitism on TikTok, I think it is understandable and non-conspiratorial that Jews dislike this and want something to be done about it. I think there is a meaningful difference between that and positing backroom deals or conspiracies among Jewish elites to acquire TikTok.

But I suppose my position is that insofar as there is lots of anti-semitism on TikTok, I think it is understandable and non-conspiratorial that Jews dislike this and want something to be done about it. I think there is a meaningful difference between that and positing backroom deals or conspiracies among Jewish elites to acquire TikTok

Even if it is so, it does show that your earlier "there's nothing special about Jews hating TikTok, Christians, Muslims, and Atheists hate it too" is wrong. The latter hate it for universal good reason, the former only for self-interested reasons.

I don't know if I agree that it's understandable either. Between them having a habit of framing any form of criticism as bigotry, and the sort of attacks that they're complaining about being seen as pretty mundane when they're done to other groups, I really don't see why they shouldn't just learn to deal with it like everyone else did.

Even if it was understandable, that doesn't make it non-conspiratorial. They didn't just so happen to offer to buy TikTok, and the Chinese didn't just so happen to accept the offer, this was all a result of pressure from the American government. Once the US forced the sale, they also weren't the ones that just so happened to offer the highest bid in an open auction. This is almost certainly a result of shady elites picking and choosing favorites.

I'd suggest that if you have both general and particular reasons to hate something, you are more likely to bring up the particular reasons, if only because the general reasons are more likely to be shared and brought up by others. At any rate, my original statement did not distinguish motives? Of the examples given, the Jewish source hates TikTok because it spreads anti-semitism, the Christian sources hate TikTok because it's addictive and psychologically harmful, and the Islamic source hates TikTok because it promotes immorality and obscenity. These are all different reasons.

I take it as obviously acceptable for a member of a given demographic to be concerned about hatred of that demographic. A Jew can care about anti-Jewish hate, a white person about anti-white hate, a gay person about anti-gay hate, a Hindu about anti-Hindu hate, and so on. Insofar as anti-semitism exists, which it undoubtedly does, it is, as a starting point, reasonable for Jews to say, "we don't like this".

Now there is a separate question about what's reasonable in terms of policy response, and as I just indicated with the Australian hate speech legislation, I do think Jewish groups, at least here, have supported unwise policy. I think I am able to say, "I think you are responding badly to a legitimate fear". I don't need to suggest a nefarious motive, or suggest that the thing they're afraid of isn't a problem.

On the specifics of this deal, I think the word 'they' is doing all the work for you there, and allows you to smuggle in scary implications. 'They' didn't offer to buy TikTok. 'They' didn't make an offer for Warner Bros. In neither case was an organisation that can credibly claim to represent Jews involved with the offer. What's the argument here? David Ellison owns too much media? Okay, sure. I'm happy to grant that. But you can't get from a specific person, David Ellison, to the spooky 'they', meaning Jews as a whole. Supposing that shady backroom deals are what resulted in Ellison winning some of these bids (which does not appear to be in evidence, though I grant that Ellison's connections with a famously corrupt and transactional president suggest the worst), even that shows only that a wealthy person has tried to acquire a lot of media.

Of the examples given, the Jewish source hates TikTok because it spreads anti-semitism, the Christian sources hate TikTok because it's addictive and psychologically harmful, and the Islamic source hates TikTok because it promotes immorality and obscenity. These are all different reasons.

Right, and my point is that of the 3, the "it spreads anti-semitism" reason is clearly self-serving, while the others, even the Muslim one, are concerned with the general good.

I take it as obviously acceptable for a member of a given demographic to be concerned about hatred of that demographic.

Generally speaking, yes, but I was pointing out that I have issues with the particular concerns raised about "anti-semitism".

First, as I already said, other groups are expected to just deal with it to some extent. TikTok, and all the other social media, even mainstream media, also spread anti-Christian hatred, but Christians barely bother complaining about it anymore because no one takes it seriously. I don't see why concerns over antisemitism should be taken more seriously.

Secondly, many of the specific issues raised just plainly aren't instances of hatred. Not liking the state of Israel is not hatred. Neither is covering the creepy Orthodox communities in New Jersey, to take a more recent drama.

On the specifics of this deal, I think the word 'they' is doing all the work for you there, and allows you to smuggle in scary implications. 'They' didn't offer to buy TikTok. 'They' didn't make an offer for Warner Bros. In neither case was an organisation that can credibly claim to represent Jews involved with the offer.

I agree this is at the core of the issue. Look, I get that methodological individualism is a thing, I can even respect it as an intellectual framework, but I'm a bit bothered that people subscribed to it treat it as the null hypothesis, and are free to reject an alternative hypothesis that doesn't meet an arbitrarily set burden of proof, while never having to prove anything themselves.

People don't need a formal organization representing a group, in order to act as a group. We have language, a whole lot of various communications technologies, and status hierarchies, to help us coordinate collective actions. Jewish people are probably better at it than the average group, and it's something I admire about them, but the "it's all just individual preferences, all the way down" denials drive me a bit crazy.

What's the argument here? David Ellison owns too much media?

Only if you think that David Ellison was able to convince the US Senate to force the sale of TikTok further pick him as the purchaser all on his lonesome.

But you can't get from a specific person, David Ellison, to the spooky 'they', meaning Jews as a whole.

I think I can get to there when I consider how many people get deplatformed, demonetized, or outright debanked for crossing them, vs. how rarely it happens when someone crosses any other group. Or how the anti-DEI drama in universities just resulted in whites being discriminated against even harder, while the anti-semitism drama resulted in students being deported for perfectly legal speech. These things, especially in aggregate, don't just happen as a result of random individuals following their preferences.

Bari Weiss being installed at CBS News

That was pretty clearly a "fuck you" to CBS News (and the legacy media in general) and if she weren't Jewish you'd be laughing as hard as I did when it happened.

Oh yeah and the Ellison's getting TikTok algo was also just a "fuck you" to PRC, nothing to do with the stated motives pushed by the Jewish lobby for forcing the TikTok divestment. Bored billionaires, burning cash because they are spoiled heirs, just sending "fuck yous", Weiss's pro-Israel allegiance and steering of the network is just incidental- none of those explanations are plausible given this pattern of facts.

Why did Netanyahu support the sale of TikTok and acquisition of the algo by Ellison? To send a "fuck you" to China?

If Ellison wanted a liberal who was also a Zionist there were thousands of people in the NYC media business (and indeed already at CBS News) he could have picked to run the news business, it would be a remarkable turnaround in your view on the ethnocentrism of the Jewish community if you disagreed with this.

So the fact he picked Weiss transparently has much more to do with her “anti-woke” views and perceived sympathy or at least neutrality toward Trump (whose administration ultimately has to approve his empire building) than policy vis a vis Israel (which a ‘traditional’ candidate from TV news could also have).

Well it doesn't seem to be helping them much in the struggle for public opinion. It's remarkable how anti-Israel some of my female friends have gotten, despite being otherwise very much normie. One cancelled a newspaper subscription to 'The Australian' because of the amount of Israeli propaganda in it.

I know they've been doing all kinds of work to go after the algorithms and manipulate AIs but that doesn't seem to be working out so well? I just see people posting like this: https://x.com/JohnDoe1465199/status/2021247139616092212

What kind of idiot gets their volatile, strongly held political opinions from a chatbot or is prepared to admit they've lost an argument with AI? There are plenty of people dumber than AIs but I think few would admit this. The clumsy way that they try and manipulate these AIs is also not particularly effective. When it comes to deception, humans are still the masters.

Perhaps the media system is like how a dam needs to be at 100%. If the dam is at 99% then it's broken and the water slices through. Twitter and 4chan have considerable influence amongst the young, many older people remember that Iran was supposedly 6 months from a nuke 20 years ago...

On the other hand it's not totally like a dam. There are also plenty of people who are just set in their ways and go on and on about Judeo-Christian values or, in the case of Trump, just openly bemoan that Israel doesn't control the US congress like they used to. That's the boomer and old-media class who run governments and order strikes.

The natural state is that the vast majority of goyim hate them. The fact that some of the boomers actually like them speaks volumes to the effectiveness of propaganda. Without the propaganda the US would probably have had bouts of pogroms. Christians and jews are naturally incompatible. Getting sections of Christians to like jews is an impressive feat of propaganda.

Christians and jews are naturally incompatible

But Christians worship a Jew as the son of God? There's a certain kind of esoteric 'And did those feet in ancient time, Walk upon Englands mountains green' Christianity but in terms of base elements, Christianity is pretty Jewish.

But Christians worship a Jew as the son of God? There's a certain kind of esoteric 'And did those feet in ancient time, Walk upon Englands mountains green' Christianity but in terms of base elements, Christianity is pretty Jewish.

A Jew who the Pharisees conspired with Pilate to have crucified because they considered him a false prophet. And modern rabbinic Judaism looks like the intellectual descendant of the Pharisees, not whatever kind of Judaism the historical Jesus practiced. "God the Father sent the Son to the Jews as the Messiah in fulfilment of the old covenant, but the Jews rejected him and then suffered divine punishment and exile in much the same way as when they rejected God on previous occasions" is the simple, obvious interpretation of the Gospel story.

But that is about history and tradition, not religious practice. And lex orandi, lex credendi. At the level of day-to-day religious practice, Christianity is focussed on right belief to the near-total exclusion of ritual purity, whereas Orthodox Judaism is the modern religion that is most focussed on ritual purity.

Is Islam pagan because its origin involved pagan elements and pagan characters? Or is it Jewish because it worships Moses and Abraham?

Jesus Christ is defined by his opposition to the Pharisees who are essentially the progenitors of Rabbinic Judaism, which is what we call Jewish today.

All that remains is the ethnic element, which frankly is tenuous given the distance between the Levantine people of the time and most Jews today.

There's a common cultural ancestry between Abrahamisms, that's really about it.

The Big Guy of Islam is an Arab, not a Jew. The Big Guy of Christianity is a Jewish rabbi, a Jewish fundamentalist populist who (in the doctrine compiled largely by his Jewish disciples) claims that he fulfilled a Jewish prophecy.

Islam has pagan elements with all the djinn and weirdness but it's explicitly monotheistic and denounces paganism (Judaism and Christianity too, to a lesser extent). Thus we can judge that on balance it's not pagan. Islam is not keen on pagans at all, they despise pagans and oppress them whenever they have the chance.

It's silly to say that a religion that worships a Jew first and foremost is not Jewish. If the predominant Christian stance today was 'fuck Jews', then it would be a different matter. Instead, it's Christian countries who've been most supportive of Jews, most supportive of Israel.

What's a Jew exactly, in this respect?

There's a sense in which Jesus is a Jew, even a Rabbi, depending on how you play with definitions, but that's not the sense in which he's any part of Rabbinic Judaism.

In the context of the Bible, the core dogma of Christianity, is there any serious debate as to whether Jesus is a Jew or not?

In John 4:22, Jesus says, “You worship what you do not know; we worship what we know, for salvation is from the Jews.” The use of “we” clearly includes Himself among the Jewish people.

Matthew 1:1 states, “The book of the genealogy of Jesus Christ, the son of David, the son of Abraham,” establishing His lineage within the Jewish people.

Luke 1:32–33 affirms that Jesus would inherit “the throne of his father David” and reign over “the house of Jacob forever,” linking Him to the Davidic and Jewish royal line.

Hebrews 7:14 confirms, “It is manifest that our Lord came from the tribe of Judah,” underscoring His Jewish tribal identity.

Sounds pretty Jewish to me! Many modern Jewish people have swapped out a lot of haplogroups but they clearly identify as Jews. Others identify them as Jews. Christians identify them as Jews. Lots of peoples have moved back and forth, changed over time. The label fits the bottle. Jesus was Jewish and Christianity worships a Jew, their dogma emphasises this in pretty clear terms.

You didn't answer my question. What is a Jew?

Is Islam pagan because its origin involved pagan elements and pagan characters?

People who believe those are pagan elements (as opposed to elements corrupted by paganism*) absolutely believe that about Islam. It is one of the most common attacks used against Islam actually.

* In Islam monotheism is ancient and paganism is the degeneration. The Kaaba is supposedly a house built by Abraham and Ishmael and then corrupted by pagan worship and restored by Mohammed. If you don't believe this, the Hajj (which predated Islam) really is just a pagan ceremony that was given a face-lift. There's a reason Muslims insist on it.

Many critics of Islam consider things like the Hajj (which most scholars believe predates Islam) and ritual circling of the meteorite stone embedded in the Kaaba to be pagan, sure. And of course many both Jewish and Muslim critics of Christianity consider aspects of the Trinity to be polytheistic / shirk / etc in character.

That's a whole bunch of inflammatory claims for which you have provided zero evidence.

You hate Jews, consider them incompatible with you, and would like to see pogroms. That does not mean you are knowledgeable enough to speak for all other "goyim."

You've been warned about this many times. I know Jew-hating is your shtick, and you do it without even being as interesting as SS. It's just out there, raw repulsive unfiltered vitriol. The thread is obviously meant to engender this kind of discourse but you're responsible for not even bothering to put a veneer of argumentation over your antipathy. I'm not banning you this time because there will be whining about "viewpoint suppression" and really, you differ from SS only in being less intelligent and less cunning. But you're getting an undeserved break here.

"Natural state" as I understand it would require that the "goyim" know absolutely nothing about jews from things such as cultural background, religion, etc. and form their opinion solely on living alongside them. I am not aware of any such social experiments.

Looking at history it is not difficult for explicitly religious Christians to hate anyone at all who is not Christian. Jews aren't particularly special in that regard.

"Natural state" as I understand it would require that the "goyim" know absolutely nothing about jews from things such as cultural background, religion, etc. and form their opinion solely on living alongside them. I am not aware of any such social experiments.

In medieval times, these social experiments were called China and India. There, where no one had heard of the Abrahamic faiths, Jews were seen as just another group of foreign barbarians with strange and ridiculous habits, or just another caste that kept its customs and rituals to itself.

In modern times, these experiments are called Thailand, Laos, Cambodia etc. - countries where few people care about Abrahamic scripture or Middle Eastern issues, but know well tourists and travellers from all over developed world. Ask there what tourists are the least welcomed, the answer would be tourists from Asian country named on I (and it is not Iran, Iraq or Indonesia).

Interesting. Specifically Israeli tourists, as opposed to "American with sidelocks"?

I have heard this too. For corroborating/conflicting information, you can go to /r/Israel and read stories by Israelis of how terribly they were treated by the locals on vacation.

Few "Americans with sidelocks" are likely to go backpacking through SE Asia. It is specifically Israeli tourists who have well earned reputation of super-hyper-duper assholes (and not only in this part of the world).

The reason Israeli tourists are disliked in non-Muslim parts of Southeast Asia and Latin America and places like Cyprus is simply that they’re largely boorish, annoying, plebeian men fresh with money from their military service and looking to get wasted, laid and cause trouble.

It’s like asking why British tourists have a much worse reputation in Spain and Croatia than in Japan and America; the former are of a very different class and standard of behavior. American working class soldiers have a very bad reputation in eg Okinawa and parts of the Philippines for harassing women etc.

As an Italian I often find I have more in common with them than Anglos. They look like us. We’ve been cross breeding for 2k years. I even find I can pass as Jewish fairly easily.

I have a lot of problems with institutional Judaism, but I probably would find it easier to make friends in Jewish society than any other culture.

Secular Westerners and Jews are not incompatible, though. In fact, it seems to me that many secular westerners cannot even identify jews as such. Maybe post-WW2 Germany is a little extreme an example, in that the Jews did their best to not stand out and the Germans did their best not to notice them anymore and we're a bunch of ethnic mongrels in the first place, but by now most Germans wouldn't even know how to spot a Jew if they don't see him coming out of the synagogue with a kippa on his head.

In a homogenously Christian society Jews are easy to spot and are incompatible with the Christians. In a secular, atomized and individualist society, it is otherwise.

Yeah but a lot of the Palestine narrative is built around lending sympathy to fairly radical islamics who should be less preferable than Jews.

They would be, if we weren't decades downstream of anti-imperialist, anti-colonialist education. People know just enough to feel they can slot each group into the appropriate ethical slots.

The Jews are Jim Crow Southerners and the Palestinians at worst are the Nat Turners of the world: righteous freedom fighters (ignore any nasty stuff of course) driven to evil by oppression. Any complaint by Jews that they'll be murdered if they ever give the concessions the other side wants are obviously just a repeat of what the Southern slavers said.

The current generation of Jewish leadership is clearly a lot dumber and more short sighted than their forefathers. Or put another way,

"Hard Times create Strong Jews

Strong Jews create Good Times

Good Times create Weak Jews

Weak Jews create Hard Times" <- You are here

The funny part is that a large part of the funding is from the Saudis.

Anyways, it will probably work itself out. Paramount was financially struggling, and massively overpaying for Warner Bros famously hasn't worked out for much healthier companies.

The Saudis (and pretty much all the Gulf sheikdoms except Qatar) have been collaborating with the international Jewish conspiracy (if it exists) or allied with Israel and the US against Iran based on shared interests (if it doesn't) since well before the Abraham Accords formalised the situation. In the world we live in, there is nothing at all odd about the Saudis investing on commercial terms in Hasbara Inc.

The Saudis (and pretty much all the Gulf sheikdoms except Qatar) have been collaborating with the international Jewish conspiracy (if it exists) or allied with Israel and the US against Iran based on shared interests

As illustration, see one minor detail from Epstein saga.

Piece of covering of Kaaba in Mecca was sent by Saudis to their good friend Jeffrey to use as decorative carpets.

If you take Islam only slightly seriously, this is bad. As bad as drawing unflattering picture of the Prophet, only about 100,000x worse. As analogy, imagine that Jeffrey had good friend in Vatican who sent him old vestments from St Peter to redecorate his bedroom (well, if this also happened would not surprise me overall).

And... nothing. This news went out months ago, and no protests, no riots, nothing. As if there never had been anything organic about these happenings.

(well, if this also happened would not surprise me overall)

Jeffrey was into under-aged girls. Very, very different to corrupt Vatican officials.

The Kiswah is a massive piece of cloth that is replaced annually and routinely cut up into little pieces to be given away as gifts to visiting dignitaries, museums, prominent businessmen, etc. This is more like being given a flag that flew over the Capitol for a day last year than it is being bribed with a priceless relic. The description of the piece's religious significance is basically just the Saudi government hyping up how cool its goodie bags are.

In part that’s because the Israel - Iran conflict is merely part of a larger web of interconnected Middle Eastern conflicts that exist above and beyond it, the most significant being about whether Saudi Arabia or Iran is perceived by the Ummah as the more Islamic government, the legitimacy of the guardianship of Mecca and Medina by the Al-Sauds, the millennium-old Sunni-Shia split, the war in Yemen (now tripartite between the Saudis, Iranians and UAE) etc.

People overfocus on the Israeli-Palestinian conflict but there would be plenty of drama in the region without it.

[list of channels]

I think the last time I watched any of these was over a decade ago, many of them two decades.

Buying up news and media properties does seem to be the cool thing for billionaires to do these days. Musk bought twitter, Bezos bought the Washington Post, Rupert Murdoch was ahead of the curve owning Fox. Now Ellison gets some of the crappy leftovers.

I'm honestly more worried that google owns YouTube and there are no major competitors to YouTube. There are singular YouTube channels that have more views than all of these media properties combined. And there are definitely many YouTube networks that reach similar viewership numbers.

Buying up news and media properties does seem to be the cool thing for billionaires to do these days. Musk bought twitter, Bezos bought the Washington Post, Rupert Murdoch was ahead of the curve owning Fox. Now Ellison gets some of the crappy leftovers.

Rupert Murdoch is a different animal to the others in that he made his money in newspapers, so buying other media companies is a commercially plausible decision to diversify rather than a vanity purchase (and he made a lot of money off Fox, whereas Musk probably lost money on Twitter and Bezos definitely lost money on Wapo).

American Heroes Channel (formerly the Military Channel)

Formerly Discovery Wings. God I feel old.

Animal Planet

I had fun thinking of ways this one could be propagandized. "Much like the Chosen People, the Nubian ibex has resided continuously in the land of Israel for over 3000 years."

Animal planet has gone the way of the history channel, though- it’s mostly reality shows, with a smattering of lunacy, and a minority of actual animal-related content.

Well, I for one am looking forward to BET showing more Hebrew Israelite content and Sammy Davis Jr. specials.

I expect it to go the way of The Learning Channel. In all things there must be balance.

I never watched it, but rapper Blueprint was saying 20 years ago "Used to have world news now it's all videos / replaced Tavis Smiley with reality shows". So I'm guessing that ship has already sailed. If you want to see what a black TLC looks like, watch WE TV. I used to watch them for reruns of Law & Order, and I'd see ads for their original programming. They swept up Mama June after her TLC run ended for the original series "Mama June: From Not to Hot" until she got arrested, when it became "Mama June: Family Crisis". They leant into the black demographic with "Growing Up Hip Hop", "The Braxtons", "Wiggin' Out with Tokyo Stylez", and "Bossip: Black Gossip". At some point it became WE TV and AllBlack.

Some here, and elsewhere where mathematics is discussed, have championed the First Proof initiative as the best way to evaluate the mathematical reasoning capabilities of LLMs.

It consists of ten lemmas working mathematicians encountered in their work, solved, but have not published the solutions.

Today Google published what its SotA mathematical reasoning model, Aletheia, managed to produce autonomously. Some have downplayed the capabilities of SotA models, probably due to not having access to Aletheia, instead thinking that 200 USD per month buys them the most mathematically capable artificial intelligence. This would explain the common trope of claiming that, to use an analogy, an LLM is capable of producing the integral of ln(n)x^2 only if one gives it the hint to use integration by parts.
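For reference, here is the textbook computation the analogy points at. This is only a sketch, and it assumes the integrand meant is x^2 ln(x); read literally, with ln(n) a constant, no integration by parts is needed and the answer is simply ln(n) x^3/3 + C.

```latex
\begin{aligned}
\int x^{2}\ln x \,dx
  &= \frac{x^{3}}{3}\ln x - \int \frac{x^{3}}{3}\cdot\frac{1}{x}\,dx
  && \bigl(u=\ln x,\ dv=x^{2}\,dx\bigr) \\
  &= \frac{x^{3}}{3}\ln x - \frac{x^{3}}{9} + C .
\end{aligned}
```

Differentiating the last line gives x^2 ln(x) + x^2/3 - x^2/3 = x^2 ln(x), which is the integrand assumed above.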

Anyway, Google's model managed to autonomously solve five problems, and one partially. Importantly, the models have a self-filtering feature, in that if the model is not sufficiently sure of correctness, it will output nothing, rather than something potentially wrong. "Prompters" of Aletheia did not take the "A" word lightly; they did not attempt to skirt it by giving the model hints:

Our approach to the challenge guaranteed autonomy in the strictest sense: for the generation of our solutions, there was absolutely no human intervention. Humans experts inspected the final output of this pipeline for evaluation purposes only, without altering any content. We ran two different agents and designated one “preferred solution” per problem, whose ratings are displayed in Table 1. This designation admittedly draws upon our own expertise.

Notably, "prompters" do not deny that the platonic ideal of a proof was not what the model produced:

Many fail to meet the stated requirement that “Citations should include precise statement numbers and should either be to articles published in peer-reviewed journals or to arXiv preprints”

Two Aletheias were prompted: one with base model Gemini 3 Deep Think, and one with the base model as described in the model link above. The latter outperformed the former by solving and partially solving two problems the former did not. The amount of compute and thus cost is not revealed in absolute terms, only in relation to solving EP1051.

Autonomy, together with the hard-coded behavior of outputting nothing when unsure[1], makes them powerful tools even in the hands of less-than-smart users: the former means guiding them is not required, and the latter means they are reliable.

[1] LLMs doing otherwise is the product of them being deployed to the mass market, as the masses want the machine to reply more than they want the reply to be 100% correct. Their baseless bullshitting is thus not an inherent flaw of LLMs, but a consequence of post-training/RL.

Aletheia

How the hell did they decide to name it something so hard to pronounce.

It means "truth" in Greek. The implication is that it is a system for generating/discovering truth.

Greek-derived names are a pretty standard convention in these types of projects.

A-le-thei-a

It seems pretty simple to me.

Yup. Seems fairly trivial as soon as you accept that not all names are English and aren't meant to be pronounced with the fucked up "rules" of English pronunciation.

It's actually closer to A-leh-they-ee-ah. It comes from a- (negation prefix) lethein (concealment, being unseen, or forgetting in the sense of the River of Lethe in Hades). I've always found it very philosophically interesting that the Greeks considered truth a negation of concealment rather than a purely positive category as we do.

I'm seeing four syllables for English and Greek.

Interesting. I've always heard professors say a-leh-they-ee-a, with a "they" and pronounced "ee" but my guess is that the they-ee is counted as one long syllable (as we know from Greek poetry using long and short syllables for meter, Ancient Greek had some long syllables).

Edit: just asked a friend with a Classics PhD and I'm wrong, it's a-lay-thee-a.

I'm not an expert on this subject, but it seems to me that if the "first proof" lemmas have been out for 2 or 3 weeks, that's enough time to cheat. To be clear, I'm very bullish on AI but at the same time I think there is intense pressure on companies involved in AI to be perceived as achieving results and I am very familiar with human nature.

The First Proof paper stating the problems was published on Feb 6. Google's solution was published on Feb 10. Human solutions were published on Feb 13.

Google had an impressively fast turn around time. Even just writing the paper in 4 days is impressive work without regard to the actual experiments. I don't think there is any possibility of data leakage here.

Google had an impressively fast turn around time. Even just writing the paper in 4 days is impressive work without regard to the actual experiments. I don't think there is any possibility of data leakage here.

I'm still skeptical, but I concede that 4 days is a much narrower window to cheat.

I suppose the snark is aimed at me. So I guess I'll start with the snark right back, first. My comment included multiple very hopeful things, and a big part of that was due to the linked post's discussion of Aletheia's performance on First Proof, which you would have known if you had read either of them. This now is just the arXiv version of it.

I am still quite very hopeful, and it's nice to see the actual proofs that were generated. It is quite unfortunate that I can't beat on the system with my own problems to get a personal feel for it. I also endorse pretty much everything @PokerPirate has said below up to this point.

Obvious remaining concerns are obvious. It still generated wrong proofs, when evaluated by experts. Many many many hours of evaluation work. That can plausibly be managed; wrong proofs are definitely out there. I will repeat my related concern that the ballgame is quite different when you're working on a problem where you don't already have a solution (that is, where you don't know that a solution exists). And as you mention, cost, question mark? They're unclear about it, and what they do have looks scary, especially if you look at the numbers floating around in their prior papers (also untethered from an absolute scale, but wild in a relative scale).

I have the same question as I did with the Erdős problems.

As someone not well versed in math, I don't know how hard these problems are. Are they mundane research work, where most phd students in the field could solve the problems? Or are they difficult math questions where only the best minds can come up with the right insights to get the solution?

If it's mundane work that many math grad students could take care of, then I would be considerably less impressed with 5/10 problems solved than if it was the latter.

Math professor Daniel Litt discusses the challenge here.

Summarizing, these are unpublished lemmas from math professors’ own work. Lemmas tend to be minor theorems or helper results used as little pieces to help prove larger more interesting ideas. So at least somewhat novel (supposedly), but not grand theorems or crucial results. Litt says that in his field, figuring out what lemmas to prove is the hard part, and then proving them is typically much easier. He said that overall proving results like these takes up a relatively small fraction of his time, but a tool to automate their proofs would be very helpful.

As for the problems themselves, Litt said they vary greatly in difficulty. Two of them, including two of the 5-6 that Aletheia got right, apparently had nearly identical statements already proven in the literature. Another one had the proof sketched out in literature, but no model managed to fill in the details.

What most interests me is reliability. Litt writes that overall a lot of garbage was produced, and in a ‘real’ scenario where no one actually knows the answer that is a serious problem. Both Aletheia variants didn’t answer 4 of the questions, either because the model said “I don’t know how to solve it” or it hadn’t finished in the allotted time. I couldn’t find the breakdown from a quick skim, which is a shame - I would be very impressed if the model said it couldn’t solve it rather than giving a wrong proof. Still, it seems of the 12 solutions submitted by the two variants to 6 problems, 3 were considered substantially incorrect, for a ‘precision’ of 75% and a ‘recall’ of 45%, given the problems that it didn’t give an attempt for.
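The arithmetic behind those two figures can be checked in a few lines. This is a minimal sketch using only the counts quoted above; the one assumption is the recall denominator, which I take to be 2 variants x 10 problems = 20 possible attempts, since that is the reading that reproduces 45%.

```python
from fractions import Fraction

# Counts quoted in the comment above.
variants = 2      # two Aletheia variants
problems = 10     # First Proof problem set
submitted = 12    # solutions submitted across both variants
incorrect = 3     # judged substantially incorrect

correct = submitted - incorrect

# Precision: of the solutions actually submitted, how many held up?
precision = Fraction(correct, submitted)

# Recall (assumed denominator): of every variant/problem pair, how many
# ended in a correct solution? Abstentions count against recall.
recall = Fraction(correct, variants * problems)

print(f"precision = {float(precision):.0%}")
print(f"recall    = {float(recall):.0%}")
```

Under a different denominator (say, counting each problem once rather than once per variant) the recall figure would change, so treat 45% as tied to that convention.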

Overall, I would say this is better than I would have expected (not that I have any particular insight to the problems themselves), but still seems like it will pose an immense difficulty when these tools are applied to actual problems where the solution isn’t known ahead of time.

I think the paper does a decent job explaining how hard these problems are, but there's admittedly not a clear 1-sentence description anywhere and it's written for a mathematician audience.

My summary is:

  1. These are good graduate level textbook problems that a mid-program phd student specializing in the area should be able to solve after studying the appropriate background. (But I'm not a specialist in any of these areas so I could be underestimating their difficulty.)
  2. They are easier than what a phd advisor would assign to a phd student as a research problem. That's mostly because they are already perfectly well formulated to be known to be solvable. The First Proof paper makes the case that the more interesting part of mathematics is actually coming up with these questions in the first place or refining them by adding/removing conditions (something most mathematicians would agree with).

LLMs doing otherwise is the product of them being deployed to the mass market, as the masses want the machine to reply more than they want the reply to be 100% correct. Their baseless bullshitting is thus not an inherent flaw of LLMs, but a consequence of post-training/RL.

I don't think this is technically correct. My understanding of Aletheia is that it is using the same base models available via the various web-interfaces/API and so has the same post-training/RL. It just is inside an event loop that prevents the bullshitting.

My own experience is that it's "trivial" to get any base model to self-evaluate correctness into a well-calibrated state with the right system prompts + agentic loops. Of course, I mean trivial in the mathematician-sense of "possibly dozens of hours of expert level work".
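To make the "loop that prevents the bullshitting" idea concrete, here is a rough sketch of a generate-verify-abstain loop. It is not Aletheia's actual pipeline, which I have no visibility into; `generate`, `verify`, the threshold and the toy stand-ins are all hypothetical, and in practice both callables would wrap model calls.

```python
from typing import Callable, Optional


def solve_or_abstain(
    problem: str,
    generate: Callable[[str, int], str],   # hypothetical: produces a candidate
    verify: Callable[[str, str], float],   # hypothetical: confidence in [0, 1]
    threshold: float = 0.9,
    max_attempts: int = 3,
) -> Optional[str]:
    """Return a candidate only if the verifier's confidence clears the threshold."""
    for attempt in range(max_attempts):
        candidate = generate(problem, attempt)
        if verify(problem, candidate) >= threshold:
            return candidate
    return None  # abstain rather than emit an unverified answer


# Toy stand-ins so the sketch runs without any model or API.
def toy_generate(problem: str, attempt: int) -> str:
    return f"attempt {attempt} at {problem}"


def toy_verify(problem: str, candidate: str) -> float:
    # Pretend only the third attempt survives checking.
    return 0.95 if candidate.startswith("attempt 2") else 0.2


result = solve_or_abstain("lemma 4", toy_generate, toy_verify)
abstained = solve_or_abstain("lemma 4", toy_generate, toy_verify, max_attempts=2)
```

The hard part is hidden inside `verify`: the loop is only as well calibrated as that score, which is where the "dozens of hours of expert level work" would go.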

See For Yourself: A Live Demo of LLM capabilities

As someone concerned with AI Safety or the implications of cognitive automation for human employability since well before it was cool, I must admit a sense of vindication from seeing AI dominate online discourse, including on the Motte.

We have a wide range of views on LLM capabilities (at present) as well as on their future trajectory. Opinions are heterogeneous enough that any attempt at taxonomy will fail to capture individual nuance, but as I see it:

  • LLM Bulls: Current SOTA LLMs are capable of replacing a sizeable portion of human knowledge work. Near-future models or future architectures promise AGI, then ASI in short order. The world won't know what hit it.

  • LLM moderates: Current SOTA models are useful, but incapable of replacing even mid-level devs without negative repercussions on work quality or code performance/viability. They do not fully substitute for the labor of the average professional programmer in the West. This may or may not change in the near future. AGI is uncertain, ASI is less likely.

  • LLM skeptics: Current SOTA models are grossly overhyped. They are grossly incompetent at the majority of programming tasks and shouldn't be used for anything more than boilerplate, if that. AGI is unlikely in the near-term, ASI is a pipedream.

  • Gary Marcus, the dearly departed Hlynka. Opinions not worth discussing.

Then there's the question of whether LLMs or recognizable derivatives are capable of becoming AGI/ASI, or if we need to make significant discoveries in terms of new architectures and/or training pipelines (new paradigms). Fortunately, that isn't relevant right now.


Alternatively, according to Claude:

The Displacement Imminent camp thinks current models already threaten mid-level knowledge work, and the curve is steep enough that AGI is a near-term planning assumption, not a thought experiment.

The Instrumental Optimist thinks current models are genuinely useful in a supervised workflow, trajectory is positive but uncertain, AGI is possible but not imminent. This is probably the modal position among working engineers who actually use these tools.

The Tool Not Agent camp thinks current models are genuinely useful as sophisticated autocomplete or search, but the "agent" framing is mostly hype — they fail badly without tight human scaffolding, and trajectory is uncertain enough that AGI is not worth pricing in.

The Stochastic Parrot camp (your skeptics, minus the pejorative) thinks the capabilities are brittle, benchmark gaming is rampant, and real-world coding performance is far below reported evals. They're often specifically focused on the unsupervised case and the question of whether the outputs are actually understood vs. pattern-matched.

The dimension you might also want to add explicitly is who bears the cost of the failure modes — because a lot of the disagreement between practitioners isn't about raw capability but about whether the errors are cheap (easily caught, low stakes) or expensive (subtle, compounding, hard to audit). Someone who works on safety-critical systems has a very different prior than someone shipping web apps.


Coding ability is more of a vector than it is a scalar. Using a breakdown helpfully provided by ChatGPT 5.2 Thinking:


Most arguments are really about which of these capabilities you think models have:

  1. Local code generation (Boilerplate, idioms, small functions, straightforward CRUD, framework glue.)

  2. Code understanding in situ (Reading unfamiliar code, tracing control flow, handling large repos, respecting existing patterns.)

  3. Debugging and diagnosis (Finding root cause, interpreting logs, stepping through runtime behavior, reproducing bugs.)

  4. Refactoring and maintenance (Changing code without breaking invariants, reducing complexity, untangling legacy.)

  5. System design and requirements translation (Turning vague specs into robust design, choosing tradeoffs, anticipating failure modes.)

  6. Operational competence (Tests, CI, tooling, dependency management, security posture, deploy and rollback, observability.)

Two people can both say “LLMs are great at coding” and mean (1) only vs (1)+(2)+(6) vs “end-to-end ticket closure.”


With terminology hopefully clarified, I come to the actual proposal:

@strappingfrequent (one of the many Mottizens I am reasonably well-acquainted with off-platform) has very generously offered:

  1. A sizeable amount of tokens from his very expensive Claude Max plan ($200 a month!) and access to the latest Claude Opus.

  2. His experience using agent frameworks and orchestration. I can personally attest that he was doing this well before it was cool, I recall seeing detailed experimentation as early as GPT-4.

  3. His time in personally setting up experiments/tests, as well as overseeing their progress, while potentially interacting with an audience over a livestream.

He works as a professional programmer, and has told me that he has been consistently impressed by the capabilities of AI coding agents. They've served his needs well.

Here's his description of his skills and experience:

in my professional capacity, I've been working with Python for back-end (computer vision algorithms, FastAPI, Django) & Java (Spring). For Front-end; React. 95 percent of what I do is boilerplate, although Sonnet 3.5 did help me solve a novel problem last year but it did take quite a bit of back & forth -- the key was discussing what additional metrics I could capture to help nail down ~30+ parameters influencing a complicated computer vision pipeline.

tldr; the more represented your use case is in the training corpus, better results (probably) -- but I am absolutely confident that Opus 4.6 can help with novel problems, too. And, y'know -- Terence Tao thinks that as well.

To what end?

He and I share a dissatisfaction with AI discourse that substitutes confident assertion for empirical investigation, and we think the most useful contribution we can make is to show the tools actually working on tasks that skeptics consider beyond their reach.

What do we want from you?

If you self-identify as someone who is either on the fence about LLMs, or strongly skeptical that they're useful for anything: share a coding challenge that you think they're presently incapable of doing, or doing well.

An ideal candidate is a proposal that you think is beyond the abilities of any LLM, while not being so difficult that we think they'd be entirely intractable. Neither of us claim that we can solve Fermat's Last Problem (or that Claude can solve it for us).

Other requirements:

  • A clear problem specification, or a willingness to submit a vaguer one and then approve a tighter version as created by us/Claude.

  • Nothing so easy/trivial that a quick Google shows that someone's already done it. If you want a C++ compiler written by an LLM, well, there's one out there (though that is the opposite of trivial).

  • Nothing too hard. He provides an example of "coding a Netflix clone in 4 hours".

  • An agreement on the degree of human intervention allowed. Can we prompt the model if it gets stuck? Help it in other ways? Do you want to add something to the scope later? (Strongly inadvisable). Note that if you expect literally zero human intervention, SF isn't game. He says: "I don't think I'd care to demonstrate any sort of zero-shot capacity... that's a silly expectation. If my prime orchestrator spends 30 minutes building a full-stack webapp that doesn't work I'll say It doesn't work; troubleshoot, please. I trust your judgement."

  • A time-horizon. Even a Max plan has its limits, we can't be expected to start a task that'll take days to complete.

  • Some kind of semi-objective rubric for grading the outcome, if it isn't immediately obvious. Is it enough to succeed at all? Or do you want code that even Torvalds can't critique? And no, "I know it when I see it" isn't really good enough, for obvious reasons. Ideally, give us an idea of the tests everything needs to pass.

  • If your task requires the model to review/extend proprietary code, that's not off the table entirely, but it's up to you to make sure we can access it. Either send us a copy or point us at a repo.

  • Nothing illegal.

But to sum up, we want a task that we agree is probably feasible for an LLM, and where success will change your mind significantly. By which I mean: "If it succeeds at X, I will revise my estimate of LLM capability from Y to Z." Otherwise I can only imagine a lot of post-hoc goalpost movement, "okay but it still needed 3 prompts" or "the code works but a good programmer would have done it differently."
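As an illustration of what a "semi-objective rubric" plus a pre-registered "from Y to Z" claim might look like when written down, here is a minimal sketch. Every field name and the toy checks are hypothetical; it is not a required submission format, just one way to pin the terms down before the run.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Proposal:
    """Hypothetical template for a challenge; all field names are illustrative."""
    task: str
    time_budget_hours: float
    max_human_prompts: int                  # agreed degree of intervention
    claim_if_success: str                   # the pre-registered "from Y to Z"
    checks: Dict[str, Callable[[], bool]]   # named pass/fail tests
    pass_fraction: float = 1.0              # share of checks that must pass

    def grade(self) -> bool:
        if not self.checks:
            raise ValueError("a rubric needs at least one check")
        passed = sum(1 for check in self.checks.values() if check())
        return passed / len(self.checks) >= self.pass_fraction


# Toy example: the checks here are trivial placeholders for a real test suite.
example = Proposal(
    task="toy task: merge two sorted lists",
    time_budget_hours=2.0,
    max_human_prompts=3,
    claim_if_success="I move from 'skeptic' to 'moderate'",
    checks={
        "merges two sorted lists": lambda: sorted([1, 3] + [2, 4]) == [1, 2, 3, 4],
        "handles empty input": lambda: sorted([] + []) == [],
    },
)
verdict = example.grade()
```

Fixing `checks` and `claim_if_success` before the run is the point: it leaves less room for moving goalposts afterwards, in either direction.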

We reserve the right to choose which proposals we attempt, partly because some will be more interesting than others, and partly because we have finite tokens and finite time.


Miscellaneous concerns:

Why Claude Opus 4.6?

Well, the most honest answer is that @strappingfrequent already has a Claude Max plan, and is familiar with its capabilities. The other SOTA competitors include Gemini 3.1 Pro and GPT 5.3 Codex, which are nominally superior on a few benchmarks, but a very large fraction of programmers insist that Claude is still the best for general programming use cases. We don't think this choice matters that much, and the models are in fact roughly interchangeable while being noticeably superior to anything released before very late 2025.

Why bother at all?

We are hoping to change minds. This is disclosed upfront. We're also open to changing our own minds. We would also be genuinely interested to discover cases where we expect success and get failure - mapping the capability frontier is valuable independent of which direction the evidence points. We will share logs and a final repo either way. An interactive livestream is possible if there is sufficient interest.

Anything else?

You know the model. We'll be using Claude Code. The specifics of the demo are TBD, it could be a livestream with user engagement if there's sufficient interest, otherwise we can dump logs and share a final repo.

The floor is open. What do you think Claude cannot do?


Edit:

I'll forward the proposals to @strappingfrequent, assuming he doesn't show up in the thread himself. I am clearly not the real expert here, plus he's doing all of the heavy lifting. Expect at least a day or two of back and forth before we come to a consensus (but if he agrees to something, then assume I won't object, but not vice versa unless specifically noted), and that includes conversations within this thread to narrow down the scope and make things as rigorous as possible while adhering to the restrictions we've mentioned. At some point, we'll announce winners and get to scheduling.

https://www.calebleak.com/posts/dog-game/

Show's over. Someone's found a way to make even the most unsophisticated user into a competent game developer through judicious use of AI. I'll pack my bags.

(No, it's not actually over, I just thought this was too funny to ignore)

I don't think you're quite clear in the post as to what camp you're actually in. Are you a straight bull? As in, do you think it can currently replace a sizeable portion of human knowledge work?

Moreover, it is not clear how knowledge work that is not coding qua coding fits into your schema. For example, I have in mind a flight dynamics simulation/control task. I'm not settled on it yet. My plan was to include a little twist that I had thought would likely not be in the published literature, but which I'm sure I could manage without too much difficulty, just pulling one book off of my shelf, confirming where exactly I need to make the modification and how (it's been a long time, but it's something I'm confident I could do without extreme effort), and then coding it. Unfortunately, I looked, and some darned student already published it (only minimal code published AFAICT, but they wrote out all the analysis in detail, so I can't really purely test its ability to do this aspect of the knowledge work on its own), so I'm trying to think of another good variant.

There are other little twists I had in mind, hoping to prevent it from being able to purely just pull code directly from others. These twists are things I've personally coded in the past, so I know they're doable. But the point is that they require sufficient knowledge to make choices along the way (for one example, choose this algorithm for this part, because I know it has certain characteristics) and I think they prevent it from being able to just use someone else's work for the core simulation components.

I guess, where does this fit within your schema, and where are you with respect to your own opinions? There is a lot of room between, "I personally know how to architect this code, what algorithms/assumptions to use, how to modify the analysis for the instant case, and then I use Claude to help with building the components", "I do the analysis, give it to it, tell it to code up the whole thing, then I go in and tell it to change things to make better choices that fit my knowledge-work-educated beliefs on how it should be done," and, "I tell it to code up the whole thing, maybe tell it that something's broken, but part of the test is whether it made the right analysis and knowledge-work-educated choices on its own along the way."

In other words, what I'm interested in is not so much about what it can do in terms of coding qua coding. It could be utterly magical at that, and that would be great. But how much of my own knowledge work do I need to input to get it to code the "right" thing, versus how much it's able to make the correct choices on its own about what the "right" thing is.

I don't think you're quite clear in the post as to what camp you're actually in. Are you a straight bull? As in, do you think it can currently replace a sizeable portion of human knowledge work?

I've shared my thoughts on LLMs consistently here, for years. It wasn't central to this particular demo.

But if you want to know:

  • I do not hold very high confidence claims on their capabilities in coding, because I'm not a professional programmer. I get the impression that they're very useful, on the basis of statements made by people like Karpathy, and by observing specific advances.
  • I think they are already capable of replacing a large chunk of existing knowledge work. The market hasn't caught up to this; if LLM progress were arrested right here, we'd see seismic shocks as industries adjusted years into the future.
  • Specifically in the field of medicine, they can do most of the raw cognitive labor doctors do - as well or better than the average doctor. I could automate 90% of my job today, leaving aside the physical tasks. The primary thing holding me back is archaic NHS IT. LLMs give solid medical advice.
  • I have a median timeline for AGI that's ~2030. 70% CI by 2035. I put a very non-negligible chance on it arriving by start of 2028 or even 2027.
  • I do not make strong claims on whether the current Transformer architecture/LLMs are capable of scaling into AGI, or whether we need new paradigms. Even if we do, I think the ludicrous amounts of monetary investment and the attention of thousands of the smartest humans alive will likely find one.

I think this would probably make me an LLM bull, even if I'm not maximally bullish. Definitely "displacement imminent".

I would call you a moderate under my schema, and probably an "instrumental optimist".

Either way, I don't think you're our target audience for this demo, since you personally and professionally use SOTA LLMs with regularity and are familiar with their pitfalls.

Fair enough. Thanks for clarifying.

Do you have any thoughts that you'd be willing to share on what I wrote concerning the amount of knowledge work currently required to be input to do things like the task I was thinking about? I suppose I wasn't entirely clear, but I think it would likely fail to do the analysis task on its own. For clarity, this is a task that I thought, "It might be weird enough that no one's done it yet, but it's close enough to the standard stuff that I could almost certainly give it to a student who did well enough in their flight mechanics course, and they could almost certainly just do it." That seems to have been partly justified in that I found a publication in which a student did just do it (and skimming the paper, the analysis seems about on par with what I had expected; I guess my flaw was thinking the idea was sufficiently 'weird'; I guess it says something about the state of aerospace that someone out there has done almost every basic variant, sort of regardless of whether it makes sense to do). I'm probably <50% on whether it would make the "right" engineering implementation choices on its own. I don't have a precise number. I think it might get lucky, because there's a pretty large set of choices available, and I hadn't yet tailored the problem so that it requires it to really think conceptually about what's going on and only pick from a small subset; there's a good enough chance that it could guess somewhat randomly or pick a popular one that happens to work (though I'm not sure if it'll put the right context around it even if it does).

Perhaps, given your comment below, this is just something that you mostly don't care about. Does this sort of thing just bucket into, "No, it can't do this sort of knowledge work now, but with sufficient recursive self-improvement, it will be able to do it later"? (I guess, in line with your stated AGI timelines?)

Do you have any thoughts that you'd be willing to share on what I wrote concerning the amount of knowledge work currently required to be input to do things like the task I was thinking about?

I am really the wrong person to ask this. I don't regularly use LLMs for programming purposes; when I do, it's usually for didactic purposes or small bespoke utilities.

The most ambitious project I tried was a mod for Rimworld, which didn't work. To be fair to the models, I was asking for something very niche, and I was using the chat interface rather than an IDE. I ended up borrowing open-source code and editing it, and just using AI image generation for art assets (which worked very well, to the point it pissed off the more puritan modders in the Discord). I can mention that the issues I ran into were the models being unfamiliar with the code for the mod I intended to support (Combat Extended, a massive overhaul of core systems), and that what knowledge they had innately was outdated. I was too unfamiliar with Rimworld modding to be confident that editing their efforts was worth my time. Other people have succeeded in writing bigger mods that work well (as far as I can tell) using AI, so there's definitely an element of skill-issue on my part.

SF might have actually useful observations, but he's a lurker to the core, and I'm the forward-facing entity for the moment. He says he's generally busy with work right now, so I wouldn't wait on him to respond, though I'd be happy if he did.

If you insist:

  • I think there are very significant gains from providing models clear direction from the start, including sharing your own intuition/professional taste. That includes instructions on how to manage state or update design documents and maintain records. Experienced managers or principal architects find that much of their skills directly transfer to directing and managing agents.
  • I have little idea how well the models would do by default. Depends on the task, depends on the model. I haven't used any version of Opus, ever. The last time I used them seriously for writing code was in the GPT-4 days, and they were already better than me (I was doing programming homework and working through MIT's OCW, relying on them for educational purposes when I got stuck - I was disillusioned with medicine and exploring alternatives)

Perhaps, given your comment below, this is just something that you mostly don't care about. Does this sort of thing just bucket into, "No, it can't do this sort of knowledge work now, but with sufficient recursive self-improvement, it will be able to do it later"? (I guess, in line with your stated AGI timelines?)

I don't know if it can do this kind of knowledge work, but I do expect that it will be able to in short order. I make no firm commitments on whether this will be the direct consequence of RSI (since labs are opaque about methodology), or if it'll be a simple consequence of further scaling and increasingly intensive RLVR.

(Why not both?)

Either way, I think it's more likely than not the kind of problem you describe will be trivial within a year or two. My impression is that the models can just about do what you want them to do, but with significant frustration and wasted time on your part. That is already a very strong starting point. Can you imagine asking GPT-4 to even attempt any of this and getting working results?

Thanks again for the kind and thorough response.

My impression is that the models can just about do what you want them to do, but with significant frustration and wasted time on your part.

I would quibble with this. What I want them to do is to be able to help me with analysis that I don't already know how to do. I wrote it this way a couple days ago:

When I've known the solution, I can probably get it there. When I've not known the solution, I have to say that at best, it's been good at helping me find other results in the literature that might be helpful. It is, indeed, labor-intensive and quite frustrating to have to carefully pore over every detail, trying to see if it went astray when generating a mountain of text. Then, when you find something wrong, maybe not even having verified the rest of it, it'll happily produce another mountain of text, and it feels like you're starting from square one. When you're already confident that you know a method will work, then it's mostly just a test of will to see if you can get it to figure it out. When you don't know, the question of whether you potentially waste mountains of time on what may be a dead end or just proceed on your own becomes far more difficult, and you have to make that decision repeatedly along the way.

The reason why I was thinking about the particular flight mechanics problem for this thread here was that I wanted to further drive in that wedge that I think is between the folks who think that most knowledge work is already automatable and those who think that it can be useful if you already know what you're doing. Thus, even a problem where I'm quite confident that I could do the analysis, I predicted that the LLM would fail on its own without significant knowledge-work-educated input. To me, this means that there are two significant steps that the models must overcome before we're thinking about a possible world where basically all knowledge work is automatable.

Maybe as an aside, I'm able to leverage collaborators at multiple levels, from profs to post-docs to PhD students to MS students to undergrads. My experience has been that coming up with the right problem to solve is actually a huge part of the battle. During that process, I'm always considering if I can spin out sub-problems or related problems that may be useful to consider on the way to what we really want (or sufficient contributions in their own right). When considering them, I mentally bin them into a hierarchy. If it's a problem that I'm near 100% sure I could just sit down and do, perhaps I've already done all of the pieces, but never done quite that variant before, and now it seems like that variant might be of interest, it's a plausible candidate to go to an undergrad. On the other end, the vaguest, most conceptually-dense questions, I may reserve for conversations just with profs. There is sometimes something to be said for not "distracting the students" by letting them spin their wheels on something that they're not likely to really contribute on anyway. I have somewhat of a sliding scale for the in-between students/post-docs; I've put words to the basic contours of that scale before, but I don't think I'll bother here, because it's not the most important. There is a possible slight correction factor available if I've been working with a student for long enough to know that they're substantially better/worse than the average student in their category.

In any event, perhaps if I had listed out all of the steps of this scale, I'd have even more than two significant steps that models must overcome, but for my purposes in this thread, I was trying to pick a problem that was pretty directly in the realm of, "I could just give this to an undergrad."

Yes, could I bang on an LLM long enough, the amount of will required being dependent on the particular problem, that it eventually finds its way to the answer that I already knew was the answer all along? Yeah, probably. Is this a huge upgrade from GPT-4? Honestly, I don't know; I gave up back in those days rather than ever really try to beat it into submission.

...but this still is just not really useful, at least not if the goal is to actually automate the knowledge work piece. Sure, it's potentially useful once I've already done all the knowledge work, and I'm sitting down to actually code the thing that I definitely know how to code. But more likely, at this point, it's going to be useful to the student who I've asked to code the thing, because I'm probably not coding it myself, anyway.

I don't really have a good timeline or prediction for if/when some sort of AI system will cross these various thresholds. I'm still hopeful on the straight math side, as I said in my comment a couple days ago. But if the purpose of this exercise here is to find problems that cause someone to update, I was hoping that, "Here's a problem that I'm comfortable that I could give to an undergrad and pretty confident the LLM will fail," could pull you at least epsilon away from thinking that quite so much of knowledge work is currently automatable or perhaps epsilon more cautious about believing that it's quite so imminent.

Specifically in the field of medicine, they can do most of the raw cognitive labor doctors do - as well or better than the average doctor.

As an outsider, I am unsure of how impressive this is. I know that "most of the raw cognitive labor programmers do while writing code" is fairly rote, but I don't know how true that is for doctors.

How much of the raw cognitive labor doctors do could be done by a bright undergrad with access to uptodate and a bunch of case histories, both with semantic search?

70% of medicine is minimizing unknown unknowns by knowing as much as you can, and knowing the boundaries of what is unknown to you. I believe a more concise way of expressing that is "knowledge". Regrettably, the books are fat and intimidating for good reason: there are a lot of things to know.

The remaining 30% is reasoning from knowledge, clinical experience (yet another form of knowledge, just the stuff the textbooks don't tell you) and pattern recognition.* This is more dependent on your wits, or your fluid intelligence, if I'm being precise.

The best doctors both know a lot, and are bright enough to apply that information well. The former is indispensable: you simply cannot figure out medicine by sitting in a cave and thinking very hard. I don't know if some superintelligence can look at a single human without the aid of tools, ponder very hard, and figure out everything worth knowing. All I can say is that it's beyond any actual human.

(IQ/g also correlates strongly with memory, so the relative importance of both is very hard to tease out. Especially when there's a high-pass filter with most of the idiots and amnesiacs strained out by the end of med school.)

How much of the raw cognitive labor doctors do could be done by a bright undergrad with access to uptodate and a bunch of case histories, both with semantic search?

Let me put it this way: I was a bright kid, and felt like I knew a lot of medicine before entering med school, both due to cultural osmosis and because I took an interest in it. You would not have wanted me as your actual doctor. I did not know nearly as much as I thought I did.

Later, I was a med student, a year or two in and confident that I knew the gist of it. I felt ready to make my own medical decisions, at least about myself. I thought I was smart and that I did my due diligence (reading things online, including research papers). It was insufficient, I did potentially permanent damage to my own health (I'm not going to go into details). I would not want that me as my doctor either.

Now, I am a lot older and a little more knowledgeable, if not necessarily wiser. You could do worse as your doctor, at least if we're sticking to psychiatry. You could probably do better too, but I have a place on the free market. I'm cheap, I give away my advice for free on the internet to anyone who asks nicely, and many who don't.

Along the way, I almost killed people through ignorance. Thankfully, nobody died, my colleagues caught it, or the pharmacist did, or I had a sudden sinking feeling in my gut and ran back to double check. Medicine recognizes that any human is fallible, and there are plenty of safeguards in place. Every junior doctor has their story of close calls, and hopefully nothing more than close calls. All senior doctors start as junior doctors, I hope.

Consider something else: most doctors will seek out a different doctor when they suffer a condition that isn't covered by their own specialty. Sometimes even then.

If a cardiologist feels funny in the head, he'll seek a neurologist. If a neurologist feels heart palpitations, he'll go talk to a cardiologist.

Why is that? Could they both not just open the relevant textbooks and figure out what the issue is? Can a cardiologist not take his med school knowledge of neurology and then skim something Elsevier put out?

These are people with complete medical training, genuine intelligence, and full access to literature, and they still defer to each other. That's not false modesty or liability management, it's that they've learned, through experience, exactly where their pattern recognition breaks down. They know the limits of their own competence.

Maybe. It might work out fine 90% of the time. But most doctors can handle ~90% of conditions, because most conditions are common and usually simple to manage. I apologize for the tautology, I can't see my way around it.

The other 10% are where the specialists come in. You cannot take a psychiatrist (even a smart one) and give him access to UpToDate and expect him to be as good a cardiologist as an actual trained cardiologist. He might do okay, but he's going to kill people along the way.

And that is a fully qualified doctor dabbling in another branch of medicine. A "bright undergrad with access to uptodate and a bunch of case histories, both with semantic search" will crash and burn. I'd bet good money on it, it'll happen sooner rather than later.

If they set up shop and started seeing patients, bumbling their way through things and furiously looking things up as soon as they could, they might successfully treat the colds, stomach upsets, sore throats and so on. That's the bulk of undifferentiated medicine, as you'd expect. They might catch some of the rarer stuff. They will also be very poorly calibrated and commit significant iatrogenic harm. But rest assured they will kill people eventually (at a rate massively higher than a doctor normally does).

That's not even getting into time pressure, or physical findings and techniques that are impossible to adequately convey over just video and text.

LLMs? They narrow the gap significantly, but do not have thumbs. The bright undergrad would benefit immensely from ChatGPT, but rest assured that most of the performance would come from ChatGPT itself, and they would add little. Handcuffing a child to a man does not make their combination superior.

The combination of factors that makes a good human clinician is rare. And when you do find it, you're investing a great deal in training to get them up to scratch. Most of this is the bottleneck of information transfer/learning, which LLMs neatly sidestep. GPT-4 did well, and it was dumb as bricks compared to current models. Turns out an encyclopedic knowledge of medicine will get you very far, even if you're not very bright. But it was also able to access and process this information faster than your thought experiment of a human with a computer.

But if you want a final answer: 60-70%. Best estimate I have.

*Sufficiently advanced pattern recognition is indistinguishable from intelligence. It might well be intelligence. You know LLMs, you know this.

It’s far beyond the scope of this, I’d assume, but I want to code a standalone version of the Warcraft 3 custom game Island Defense. Would be a great thing.

I expect it's too hard, but I (and a lot of other people, but probably not people who want to pay) would like a somewhat compliant browser engine for Mac OS 9 -- obviously this is pretty much exactly a Netscape clone, but a POC would be interesting and would get the model well out of its training set, thus testing for 'thinking' vs 'parroting' quite well.

How about 'non-interactive render of arbitrary web page; will run on my MDD PowerPC, OS 9.2.2'?

Noted. We'll get back to you (and everyone else) with a followup post.

You replied to a filtered comment.

Thanks for the catch. It's out of the cage now.

I use Claude regularly for small / light programming or scripting tasks. It makes me more productive but I'm careful to use it in areas where I'm likely to catch its mistakes.

I'll occasionally throw other tasks at it hoping to be pleasantly surprised.

Recently I've started playing a tabletop RPG with my two oldest boys and a friend who is much more into the RPG space. Claude was able to take a PDF of the RPG rule book, assist me with a character sheet and make sensible recommendations for the character. However, when asked to create an STL file for a 3D-printed miniature of the character, it complied, but the results were inconsistent with its description of its output.

Here's your miniature. The model is 28mm heroic scale on a 25mm round base, standing about 30mm tall. It features The Man's key gear: crossbow held at the ready, long hazmat coat with a pushed-back hood, asymmetric goggles (the right lens is larger — that's the Focus Lens), belt pouches, and a coat collar.

/images/1772027385298665.webp

When told and shown that it was struggling, it tried again.

Here's v2. This version uses lofted elliptical cross-sections instead of boxes and cylinders, so the shapes should be much smoother and more organic. Key improvements:

  • Smooth torso/coat silhouette with natural taper from shoulders to waist to flared coat hem
  • Properly shaped legs visible below the coat hem with distinct boots
  • Arms with elbow joints, reaching forward to hold the crossbow naturally
  • Crossbow with curved limbs, bowstring, and a bolt loaded on the rail
  • Head with jaw, brow ridge, nose, and the asymmetric goggles (larger Focus Lens on the right)
  • Pushed-back hood bunched at the rear of the head
  • Belt pouches and lapel lines on the coat
  • 5,700 triangles vs 2,300 before

/images/177202743950954.webp

When told it was still struggling it gave up.

Ha — fair enough. I appreciate the patience. The honest answer is that generating a detailed humanoid miniature purely from code is the wrong tool for the job. I'm stacking ellipses and spheres, which will always produce something that looks like a snowman holding a stick, no matter how many iterations I do.

I had a very similar experience providing it a floorplan, asking it to remove specified walls and lay out a kitchen.

Decompile and return human-readable code for some old / obsolete processor architecture (Hitachi SH-4, PowerPC, etc.).

Do you have any specific binary targeting such an arch you'd like decompiled? I expect this plays to the strengths of today's models.

This debate is like blind men debating what an elephant is while one is touching the tail, another one the trunk and the belly.

Programming is everything from WordPress to high frequency trading. It is everything from tiny teams to products with tens of thousands of people working on them. Some teams are extremely particular in how they write code, others will accept pretty much anything. People won't agree because they fundamentally have completely different visions of what programming is. At some places developers are given highly detailed tickets, at other places they are given a loose description of what to work on for the next couple of months. It is easy to underestimate how many devs have jobs which barely entail more than "make the button blue", "make a Postman test for a simple API".

We will probably see a major washout of people who took a React or Python tutorial and expected an upper middle class lifestyle.

With that said, developer speed has generally not been limited by the speed of coding. A product I worked on averaged two lines of code per day per developer. The company was inefficient, but so much time was spent on other things. The average dev is probably only coding at 50% speed. A 50% speed boost to coding will only increase work output by roughly 25%. That is reasonable and it is possible that the labour market can swallow 25% more software.
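The arithmetic here can be checked with a short Amdahl's-law style sketch. It assumes, as one reading of the comment rather than something it states, that "coding at 50% speed" means coding takes up about half of a developer's working time. The names `overall_speedup`, `coding_fraction` and `coding_speedup` are illustrative only.

```python
def overall_speedup(coding_fraction: float, coding_speedup: float) -> float:
    """Whole-job speedup when only the coding share of the work gets faster."""
    if not 0.0 <= coding_fraction <= 1.0:
        raise ValueError("coding_fraction must be between 0 and 1")
    if coding_speedup <= 0:
        raise ValueError("coding_speedup must be positive")
    # Non-coding time is unchanged; coding time shrinks by the speedup factor.
    new_time = (1.0 - coding_fraction) + coding_fraction / coding_speedup
    return 1.0 / new_time


if __name__ == "__main__":
    # Assumption: half of working time is spent coding.
    for label, s in [("coding 1.5x faster", 1.5), ("coding time halved", 2.0)]:
        gain = (overall_speedup(0.5, s) - 1.0) * 100
        print(f"{label}: about {gain:.0f}% more output")
```

Under these assumptions the gain comes out at about 20% if coding gets 1.5x faster and about 33% if coding time is halved, so the 25% figure is in the right range either way.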

I'm of the opinion that the market can swallow 100s of percent more software. Not instantly mind you but there is such a massive lack of software everywhere (except possibly in ad-tech), not to mention half decent or actually good software. Things are so unbelievably shit everywhere you turn.

Falling price of software greatly induces demand.

Do you want to add something to the scope later? (Strongly inadvisable).

why not? requirements change all the time during product development. I propose modifying the problem given into a 2-stage architecture, the second stage to be added upon completion of the first and requiring (for satisfactory grade) the refactoring and building on top of some of the previously written code.

If my prime orchestrator spends 30 minutes building a full-stack webapp that doesn't work I'll say "It doesn't work; troubleshoot, please. I trust your judgement."

how many interventions are warranted and how many points deducted? why wasn't claude smart enough to notice that the webapp doesn't work?

I said "strongly inadvisable" and not "automatically disqualifying".

SF would need to babysit the process, waiting for the person making the request to raise their request, instead of hitting go and checking in periodically or after being alerted. He may or may not be able to do this, he does have a full-time job.

It also injects some degree of ambiguity into things, as well as significantly increasing the time and token investment. Max plans are not infinite.

I stress that this isn't necessarily a deal breaker, it just makes things harder and reduces the likelihood of acceptance. You're at liberty to try asking, and we're at liberty to turn it down, especially should you ask for something outside the original spec (as mutually agreed on in advance).

As a bay area software engineer with a lot of free time on my hands since the pandemic let me tell you that I've been one of/the biggest boosters of the promise of llms and deep neural networks in my friend group since 2-3 years ago. For hours each day I've been reading papers, playing with all the models of all the labs, building software with coding agents, doing diffusion image generation, fine-tuning models for shits and giggles, etc etc.

I'm a very heavy user of claude code max, and it's been as helpful as it's been frustrating at times. Rest assured that there are many more interested people on theMotte and that we can figure out a way to get you the tokens you need, if you design an interesting experiment.

I totally get how claude code/opus 4.6 could look magical to someone who isn't an experienced software engineer. But as helpful as it's been, it's also been frustrating. Yes, the apex of coding models/agent systems, claude code max, will still make elementary errors that a junior engineer would not. If I had to summarize its shortcomings in one pithy sentence, it would be: LLM coding agents have high time preference. They lack foresight and they're lazy.

They pat themselves on the back for closing issues, not realizing the mess they're driving full speed towards. In my experience, without a very heavy guiding hand they will happily duplicate code, rely on shortcuts, lazily do the very bare minimum, or re-invent the wheel at times, especially on larger or more out-of-distribution codebases.

I desperately want to throw a dozen agents at a problem, but every time I look at the actual code I get frustrated: "Hey, I noticed this obvious code smell/antipattern in the code, please fix." "Sure thing boss, I fixed it." "Ok, but I meant fix all the other instances of this bad pattern that I just noticed." "Oh, right you are boss." Then 15 minutes later, "Hi boss, I implemented this other issue you asked for, it's ready to be merged." "Did you use the correct pattern as discussed and as we added to the readme/dev docs/claude.md?" "Oh, right you are boss, I'll fix it in a jiffy." Over and over again. Yes this is with the latest claude code max/opus 4.6.

So, as mentioned above, I have free time on my hands, and would be happy to help design this experiment. I would like to be proven wrong, to learn that I've just been using these models wrong. But if you just want to show off how good of a centaur your friend+claude is on a cherry-picked problem of your choosing, I'm less interested.

Thank you for the offer! We might be able to take you up on it.

After a night to dwell on your suggestion, we might even be able to implement a version of your original proposal:

  • Original spec, that the agent works towards mostly autonomously till a finished product
  • Your pre-registered desired modifications, which SF can then ask the model to attempt to implement.
  • We (including you) evaluate the final result.

That way, he won't need to keep active tabs on it, he can just tell the model to do things as per his convenience, while not losing much in terms of demonstrative power.

I'm not sure if this is what you had originally proposed, or if you edited it in before I replied, but no big deal. We'd need you to give us a more specific idea of the task at hand, if possible.

Do you get the same problem with it that I usually do? That is, the first attempt is really good, and a few additional prompts make it even better. But the more I work with it, the more it seems to get stuck in weird errors or unnecessarily complicated code. After, like, 10 prompts, if it's not working perfectly I just have to start from scratch. It's like pastry dough: a little kneading is necessary, but too much can ruin it.

That's been my experience - if it can't one shot it, I generally give up

Over and over again. Yes this is with the latest claude code max/opus 4.6.

Because (from what I've seen) LLMs were designed to be people-pleasers. Not to do the job right, but to make ego-stroking noises at the human user and flatter them and be obsequious. I've seen comments about Asian cultures that nobody tells you no directly, that would be losing face for both superior and subordinate, so if there's a problem or something can't be done, you don't find out about it until way too late because all along those under you have been saying "yes boss, fine boss, no problems boss". I think LLMs went that route as well.

I desperately want to throw a dozen agents at a problem, but every time I look at the actual code I get frustrated: "Hey, I noticed this obvious code smell/antipattern in the code, please fix." "Sure thing boss, I fixed it." "Ok, but I meant fix all the other instances of this bad pattern that I just noticed." "Oh, right you are boss." Then 15 minutes later, "Hi boss, I implemented this other issue you asked for, it's ready to be merged." "Did you use the correct pattern as discussed and as we added to the readme/dev docs/claude.md?" "Oh, right you are boss, I'll fix it in a jiffy." Over and over again. Yes this is with the latest claude code max/opus 4.6.

So - just like working with outsourced teams in the third world but cheaper.

Anyway - LLMs are not ready for agents yet. The biggest scope they handle OK is a single feature, and you need to iterate a couple of times.

Ok. Here is one from me. GPT-52 wrote it to spec. The idea was human-generated. You can use FreeCAD, build123d or CadQuery. Or, if some of the big guys have internal scripting, their software too.

Task: Parametric 3D-Printable Enclosure for ESP32 DevKit + Expansion Board

Design a fully parametric, 3D-printable enclosure (base + hinged lid) using scripted CAD. The enclosure houses: an ESP32-WROOM DevKit (30-pin) plugged vertically into a 30-pin ESP32 expansion board

  1. Expansion board (primary PCB)
     • Size: 65 × 55 × 1.6 mm
     • Mounting holes: 4× Ø3.2 mm
     • Hole pattern: 60 × 50 mm
     • Hole centers: 2.5 mm from PCB edges
     • PCB origin: lower-left corner, Z=0 at PCB bottom

  2. DevKit board (secondary PCB)
     • Size: 55 × 28 × 1.6 mm
     • Plugged into expansion board via headers
     • Vertical offset: 11 mm above expansion PCB
     • Max component height above DevKit PCB: 10 mm

  3. Enclosure requirements
     • Wall thickness: 2.0 mm
     • Base thickness: 2.4 mm
     • Internal PCB edge clearance: 1.0 mm
     • Internal corner fillets: ≥ 1.0 mm
     • No supports; base printed flat

  4. Standoffs
     • 4 standoffs under expansion board mounting holes
     • Height: 6 mm
     • Boss OD: 8 mm
     • Fastening: M3 through-hole + hex nut trap (nut: 5.5 mm AF, 2.4 mm thick)

  5. Openings / features
     • USB opening for DevKit USB connector
     • Power opening (generic Ø8 mm) on side wall
     • Ethernet opening 16 × 14 mm on side wall
     • 2 LED holes Ø3 mm aligned to DevKit LED
     • Assume connector centerlines are aligned to PCB mid-height unless otherwise stated

  6. Lid and living hinge
     • Lid attached on one long edge
     • Printable living hinge for PETG
     • Hinge thickness: 0.4 mm
     • Hinge width: 16 mm
     • Include stress-relief geometry
     • Lid must clear tallest component by ≥2 mm
     • Lid includes a snap latch on the opposite edge

  7. Parametric requirements
     • Expose at least: PCB size, mount hole pattern, stack height, wall thickness, clearance, hinge thickness
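To make the "parametric" part concrete, here is a minimal sketch in plain Python (no CAD library) of how the exposed parameters could drive the derived dimensions. The class and method names are hypothetical, and it assumes the 11 mm vertical offset is measured from the top of the expansion PCB to the bottom of the DevKit PCB, which the task does not state explicitly.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EnclosureParams:
    """Exposed parameters, in mm. Defaults are copied from the task text."""
    pcb_w: float = 65.0
    pcb_d: float = 55.0
    pcb_t: float = 1.6
    hole_inset: float = 2.5        # hole centers from the PCB edges
    wall: float = 2.0
    clearance: float = 1.0         # PCB edge to inner wall
    standoff_h: float = 6.0
    stack_offset: float = 11.0     # ASSUMPTION: top of expansion PCB to bottom of DevKit PCB
    devkit_t: float = 1.6
    max_component_h: float = 10.0
    lid_clearance: float = 2.0     # the spec asks for at least 2 mm

    def inner_size(self) -> Tuple[float, float]:
        """Cavity footprint: PCB plus clearance on every side."""
        return (self.pcb_w + 2 * self.clearance, self.pcb_d + 2 * self.clearance)

    def outer_size(self) -> Tuple[float, float]:
        """Outer footprint: cavity plus a wall on every side."""
        w, d = self.inner_size()
        return (w + 2 * self.wall, d + 2 * self.wall)

    def inner_height(self) -> float:
        """Floor of the cavity to the underside of the lid."""
        return (self.standoff_h + self.pcb_t + self.stack_offset
                + self.devkit_t + self.max_component_h + self.lid_clearance)

    def standoff_centers(self) -> List[Tuple[float, float]]:
        """Standoff centers, measured from the outer lower-left corner."""
        x0 = y0 = self.wall + self.clearance + self.hole_inset
        dx = self.pcb_w - 2 * self.hole_inset
        dy = self.pcb_d - 2 * self.hole_inset
        return [(x0, y0), (x0 + dx, y0), (x0, y0 + dy), (x0 + dx, y0 + dy)]


params = EnclosureParams()
print(params.inner_size(), params.outer_size(), params.standoff_centers())
```

With the defaults this gives a 67 × 57 mm cavity and a 71 × 61 mm footprint; changing any one field (say, the wall thickness) moves every derived value with it, which is what a scripted CAD model would be expected to do as well.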

We'll take it into consideration, thanks.

I think that CAD/CAM systems are a very good showcase. First, they are something of a programming language: you can create any shape in steps. The models are not trained on that specifically, and you can get quite a bit of benchmark-proofing.

For the record, my local codex 5.3 created a fairly ok-ish representation of the task in 5-6 minutes. So it should probably be something more complicated.

We don't really want a "showcase" in the sense "look at X impressive thing that Y model can do". There are a gazillion demos out there.

We want specific tasks that someone doubts a model can do, but which they'd be impressed by if they succeeded and which the two of us a priori think will work. If it would be super impressive (if it worked) but we don't think it would work, it's not what we want right now.

And what I am saying is that CAD at the current moment is good for exactly this. It should in theory be reachable for an LLM, is not benchmaxxed yet, requires fairly complex "thinking" (a good chunk of it spatial), and the output is easy for a human to verify. And because the libraries of stuff are immense, you can tune the complexity to whatever your heart desires. Designing a PCB or a part from a schematic is quite close to deterministic, and my experiments so far show that this is the area in which LLMs are on the edge - like where GPT was 2 years ago - a combination of surprising ability and infuriating inability. Coding is solved already. But it doesn't show the prowess of the underlying technology so much as the prowess of immense training and brute-force throwing hardware at it.

Inspired by @self_made_human 's suggestion, I want to offer a verifiable challenge to create a novel. It's not strictly coding but if you're willing to accept the challenge I think it will be interesting.

The challenge:

Write a 30,000-50,000 word novella with coherent characters, as well as a twist/reveal sometime after the midpoint. I'm purposely leaving the topic open, but happy to make the challenge more specific if that helps. It could be a thriller like by Michael Crichton or something even more ambiguous like John Steinbeck. Verification of the challenge will be done with LLM judges. Any agentic system or techniques are allowed, except for direct access to the judging criteria or plagiarism.

Requirements:

  • It's ok if the prose is sloppified, that's not part of the challenge.
  • Characters must have consistent characterization throughout the novel.
  • Characters must not "forget" knowledge they acquired through the course of the novel, and behave in a way that's irrational given what they should know. If a character knows a secret that the reader does not, that secret must be revealed later in the novel.
  • The novel must have a twist or reveal after the midpoint. This twist should not be obviously predictable by the reader, but also must be foreshadowed by the preceding contents.
  • The plot, outline, structure, chapter by chapter, etc. may not be created by a human in any aspect. Only general top level information such as the novel genre, characters, setting, and overall setup of the story may be given to the AI. The twist or reveal may not be come up with by humans.
  • Any feedback or guidance is permitted any time through the process as long as it doesn't give the AI any creative ideas, and does not implicitly tell the AI that it is failing the success criteria.

Verification:

The verification prompts will be run using a frontier LLM with a long context window, enough to put the entire novel in the context. The outputs of the verification prompts may be consumed by humans, but if the outcome, pass/fail is ambiguous, the verification prompts themselves should be tweaked asking for a clearer response, and run again. The verification prompts should be run using the API, not the web UI, using the default recommended settings of temperature and other sampling parameters, and run 5 (or more) times each to ensure an accurate result.

In order to prevent an AI agent from "gaming" the challenge, the agent must not be given access to run LLM judges directly on the success criteria. It may also not access the success criteria directly, but may be given it implicitly if phrased as general requests for good writing.

  • Characterization - A prompt such as "Did X take any actions or act out of character in chapter Y significantly compared to how he/she is portrayed in the rest of the novel? If not, say so." This prompt should be run for each character/chapter combination.
  • Knowledge consistency - Input: text up to and including chapter Y, Prompt: "Did X behave irrationally or stupidly in chapter Y given what he/she already knows or learned in previous chapters? If not, say so." And the followup prompt, inputting the rest of the novel: "Was it revealed in this later part of the novel that X knew something that explains his/her actions in chapter Y? If not, say so"
  • Unpredictable twist - Input: first 50% of the text. Prompt: "This section of a novel is leading up to a twist or reveal. Try to figure out what it is."
  • Foreshadowing - An LLM prompt such as "Identify the cases of foreshadowing that link to the twist or reveal in this novel. Be thorough and specific, but do not make any huge stretches. If there are none or very few, say so."
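A rough sketch of how the characterization check could be mechanized: one prompt per character/chapter combination, each run several times, with a majority vote. The `judge` callable is a stand-in for whatever API call and answer parsing you actually use; the "PASS"/"FAIL" labels and the "AMBIGUOUS" fallback are assumptions for illustration, not part of the challenge text.

```python
from collections import Counter
from itertools import product
from typing import Callable, Dict, Iterable, Tuple

CHARACTERIZATION_PROMPT = (
    "Did {character} take any actions or act out of character in chapter "
    "{chapter} significantly compared to how he/she is portrayed in the rest "
    "of the novel? If not, say so."
)


def build_prompts(characters: Iterable[str],
                  chapters: Iterable[int]) -> Dict[Tuple[str, int], str]:
    """One prompt for every character/chapter combination."""
    return {
        (character, chapter): CHARACTERIZATION_PROMPT.format(
            character=character, chapter=chapter)
        for character, chapter in product(characters, chapters)
    }


def majority_verdict(judge: Callable[[str], str], prompt: str, runs: int = 5) -> str:
    """Run the judge `runs` times and return the strict-majority label.

    `judge` is hypothetical: it should send the prompt (plus the novel text)
    to the model and reduce the reply to a short label such as PASS or FAIL.
    Without a strict majority the result is AMBIGUOUS, which per the rules
    above means the prompt should be tweaked and run again.
    """
    votes = Counter(judge(prompt) for _ in range(runs))
    label, count = votes.most_common(1)[0]
    return label if count > runs / 2 else "AMBIGUOUS"
```

Usage would look like `{key: majority_verdict(my_judge, p) for key, p in build_prompts(names, range(1, 21)).items()}`, where `my_judge` wraps the API call.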

Astral Codex Ten has just posted a link to a contest offering 10 k$ for "the best AI-generated short story".

Right now, AI fiction sucks. And, although we could elect to usher in a nightmare world of TikTok on the Page, let's instead push for automating kino. We're offering grants of compute for your short story, and we strongly recommend you use at least $100 worth of tokens. It's up to you how you do so: hundreds of generations, elaborate multi-pass pipelines, whatever; quality over quantity, craft over slop.

  • Your final submission must be a 500- to 10,000-word short story, generated entirely by AI. No human-written prose and no post-generation editing. To verify this, you will submit your full prompt harness/setup alongside your story.

  • The compute grant we make is via Claude unless otherwise requested.

  • We reserve the right to not finish submissions which we find unpleasant or mid.

Grand prize: $10,000.

Applications are open until April 1st. Apply to be a contestant HERE.

The judges include bigwigs Gwern and Alexander Wales.

I'm not an LLM defender here, but I think most Tarantino movies fail this rubric.

I don't know why but that's fucking me up more than it probably should.

I intentionally made these criteria harder than just "write any novel that's entertaining."

But I think it's actually not as bad as you say. Let's take The Hateful Eight, which is the most recent Tarantino movie I watched. Unfortunately we can't run an LLM judge on any popular movie because the LLM already knows what's going to happen, but giving my personal thoughts:

(spoilers)

  • Characterization - PASS - I believe all the characters act according to their character throughout the movie
  • Consistency - The main character knew something was up from the beginning, but didn't know fully who was in on it. He played dumb to figure things out. The baddies don't just go in guns blazing because that would risk the life of the prisoner and they're hoping to pick off the good guys one by one.
  • Twist - We the viewers didn't know that everyone at the inn was in league with the baddies. We're suspecting only a few infiltrators.
  • Foreshadowing: I'm happy with the amount of foreshadowing in the movie.

Uh... I dunno that you need a cutting-edge model for that. I used a similar approach for this (cw: bad Jupiter Ascending fan-script). It's not good -- I'd say not even good as fanfiction -- and it's not even what I'd want written for the setting, and it's admittedly only into 13k words. But while it took three layers of "let's take these characters and flesh them out", "let's add this setting flesh out into a story outline", and then finally prompting the actual story, it did do it with minimal human intervention and none of it actually drawing the story plot. Putting even trivial effort into feedback, guidance, and pacing during the final prompting sequence would probably have helped a ton.

My problems are more that the character voices are really samey, the setting doesn't get enough interesting exploration, the twist doesn't get enough emphasis (and frankly isn't that interesting even in outline form: "why would anyone be willing to risk eternity for an unproven chance? Well, we happen to have a big pile of people that risked their lives and were trying to kill for a tiny improvement. Having eternal life only available to the elite kinda makes that a day-to-day thing."), and it keeps throwing extra characters in with too much detail rather than using the ones I was trying to emphasize. It's not necessarily incoherent, just bad.

((The LLMs do eventually notice that it's a Jupiter Ascending-with-names-filed-off-story if you try your review. Not sure whether that hurts or helps it as analysis, but given that the character tones sound nothing like their film counterparts I don't think it pollutes too much. And while my original fic efforts have been on content that you... probably will find even less appealing to read, original fic does work.))

I've got a busy week, but I might see what I can get out of a local LLM aiming for the longer form 30k words target, just to do a compare and contrast.

Here's a request. It's not identical to some of the stuff I do at work, but it's close enough that I'd like to see how it goes on 4.6 vs 4.5 and a cheaper plan.

https://github.com/petrandreev/jBPM3

I'd like to see this modernized. You can stick to only the core project for simplicity. That includes, but is not limited to:

  • Migrating to Jakarta
  • Upgrading to hibernate 7
  • Replacing the now-unsupported xml mappings with annotations
  • Compiling against java 17, and producing Java 17 bytecode
  • Upgrading to JUnit 6
  • Finding replacements for dependencies that are completely dead and rewriting the points where they interface with the code.

If you can get all the tests to pass and post the jar + dependencies somewhere, I can run a local test of the output.

Sounds reasonable to me (SMH). We'll get back to you on that.

This is a pretty bad challenge because AI is really good at getting all the tests to pass. The main issue is preventing it from faking you out somewhere along the way.

This is why I would run it against a proprietary suite that actually does a bunch of real, fully integrated runtime tests

I would also suggest that @self_made_human provide some account of how long it took (not counting CPU time of course, just how long he had to spend on it) as well as how many iterations it took to get right. You presumably have some kind of idea how long this task would take you, and then you can compare. Because in the end it isn't just "can it do the thing" which is important (though that is indeed important), it's also "is it less effort/time for me to have it do the thing".

That's already in my post. I would have liked people to give an estimate of how long they're willing to wait for the AI to try solving the problem, but nobody has bothered, so it's clear to me that they care more about the fact that it can be done at all than how long it takes. On our end, we're not going to keep trying indefinitely, we've got bigger fish to fry.

I presume, when we share logs, it'll include time stamps and reasoning times as well as tokens used. Shouldn't be too hard, I recall that all of that is there by default in Claude Code.

Your intuition is broadly correct here. If the tool can do this at all, it's almost certainly going to be faster than a human doing the same work. If it can't do it, it doesn't really matter.

share a coding challenge that you think they're presently incapable of doing, or doing well.

See, that's where the disagreement lies. Subtly (or maybe not so subtly) the discussion has changed from "AI will achieve AGI and then ASI and then run the world to give us fully automated luxury gay space communism" to "AI is for coding, it's all about the coding, AI will replace software engineers, coding is the be-all and end-all, ignore that it still fakes answers to questions where people know enough to know it's lying/hallucinating".

I don't care about coding because coding has nothing to do with my job. Can it replace accountants, lawyers, clerical staff? Without inventing fake precedents or fake citations from dead authors?

"Oh, but look at the shiny coding!"

Yes, great, wonderful. Now we have better models that can create vast oceans of slop to grab those SEO high rankings to sell more advertising. Yippee!

If AI sticks only to coding and produces genuinely useful things, wonderful, we'll all be happy. Is it going to do that, or just "now we can fire two-thirds of our workforce and get it to produce more clickbait headlines and ads"?

Subtly (or maybe not so subtly) the discussion has changed from "AI will achieve AGI and then ASI and then run the world to give us fully automated luxury gay space communism" to "AI is for coding, it's all about the coding, AI will replace software engineers, coding is the be-all and end-all, ignore that it still fakes answers to questions where people know enough to know it's lying/hallucinating".

Don't waste my time with a strawman, please.

I expect AGI and ASI. Even before LLMs, when it was Yudkowsky and Friends worrying about hypothetical future AI in a shed in the ancient times of the early 2000s, the concern was recursive self-improvement. What does that mean? A smart-enough AI writing the code for a smarter version of itself, which writes the code for an even smarter version of itself, and so on till humans are left in the dust.

Notice the common thread? Coding, writing code. Even leaving aside that there's enormous consumer and business demand for LLM-written code, their coding capabilities have been central to this whole debate since day -1.

The big labs are betting their future on being the first to get to this point, and already claim significant boosts to the productivity of their human researchers via the models writing code for training new models, or even conducting experiments.

I don't care about coding because coding has nothing to do with my job. Can it replace accountants, lawyers, clerical staff? Without inventing fake precedents or fake citations from dead authors?

Why don't you buy a $20 plan and test? I can tell you that as a doctor who isn't expected to write code ever, it could do most of my work for me, and well. The only reason I haven't automated myself into an early retirement are the obvious physical bottlenecks and NHS IT.

Demand for healthcare is comparatively inelastic, but it is not unbounded. If going to the doctor was cheap, you wouldn't spend all your time going to the doctor.

The specific outcome depends heavily on a variety of factors, including the degree of boosted productivity and whether having a fully trained medical professional in the room is necessary at all. If AI could do 90% of a doctor's work and save 90% of their time, but the demand for medical care only doubled, then I can see it easily being the case that hospitals would slash headcounts and pocket the change.

If the AI was >=100% as good as a human doctor (or got away with using less skilled alternatives like nurses, NPs etc for the physical stuff), then that might lead to mass unemployment or paycuts. 90% of doctors ending up unemployed, from my perspective, is almost as bad as all of us getting the sack.

Demand for healthcare is comparatively inelastic, but it is not unbounded. If going to the doctor was cheap, you wouldn't spend all your time going to the doctor.

I wouldn't. As you probably know, there are people who do.

Yeah, but they're usually suffering from psychiatric illness, and the usual treatment is to tell them to go to the doctor less. Indulging them and constantly ordering investigations and treatment is pretty much malpractice.

Either way, there aren't enough of them to keep doctors employed full-time.

This is totally not the point. The point of this challenge is for us to post easy or at least pretty doable stuff for humans, and watch AI crash and burn. Then we point and laugh at the AI.

If AI sticks only to coding and produces genuinely useful things, wonderful, we'll all be happy.

I'm of the impression that the emphasis on coding is so that each new generation can take a larger share of designing and implementing the next, until such time only the AI is writing the AI. And that's how we reach AGI as quickly as possible, if such a thing is possible via LLMs.

LLM Bulls: Current SOTA LLMs are capable of replacing a sizeable portion of human knowledge work. Near-future models or future architectures promise AGI, then ASI in short order. The world won't know what hit it.

One thing that consistently eludes these forecasts is the financials. If LLMs/AI are poised to completely disrupt all knowledge work, why do we not see it in stocks? We are talking insane money: knowledge work employs around 1 billion people globally with total compensation of at least $50 trillion. Should we not see some huge impact if this technology is so near? For instance MS stock is the same as in 2023 - it does not seem as if OpenAI is poised to be key for replacement of tens of trillions of dollars of value a year.

I would even be for the reverse signal - e.g. AI will be so cheap, that all this $50 trillion work will be done by $1 trillion of AI agents with some electricity etc, so no surge in AI stocks. Okay, so where is the pandemonium and stock apocalypse of the companies, which will be for sure worthless in the face of AI where investors should shift into something less AI prone such as construction or whatnot?

If LLMs/AI are poised to completely disrupt all knowledge work, why do we not see it in stocks?

IBM's stock is absolutely shitting the bed at the moment, and the drop roughly corresponds with Anthropic claiming they can handle COBOL now.

IBM stock dropped about 10% and now is back up like 3%. Sure, there was a bit but hardly existential.

Okay, so where is the pandemonium and stock apocalypse of the companies

There was a big stock market rumble only a few weeks ago after Claude Cowork and a few plugins were released. Forbes article:

https://www.forbes.com/sites/michaelashley/2026/02/18/saaspocalypse-now-claudes-11-plugins-triggered-a-285b-wipeout/

Not perhaps the trillions that would clearly prove the point, but it's something.

This case also kind of demonstrates why the stock market isn't super predictive for AI: there's just not enough knowledge. The plugins Anthropic released are extremely simple add-ons. They don't represent new capabilities at all; it would be like seeing major stock market turmoil because a company updated their documentation.

We're probably going to see very spiky updates from the stock market as business normies suddenly catch up to SOTA every 6 months or so.

Another point to consider is how much value we would actually expect to see wiped out/added. I just saw this tweet this morning, responding to an economic bear case:

https://x.com/elidourado/status/2026060408055021752

There's a much larger point in the tweet, but the relevant point for us:

B2B SaaS is less than half a percent of US GDP. It's simply not an important industry relative to anything macroeconomic.

Thus far all the expectations around replacement have been B2B SaaS, but if it's only 0.5% (and even bulls probably don't think the entire industry is being replaced), what amount would we expect?

Create a multiplayer game that uses lockstep delay/rollback-based netcode.

Background:

Lockstep netcode has been the gold standard for netcode in several genres, especially RTS and fighting games, since multiplayer gaming has existed, and the technique is documented quite well. Unfortunately, many popular game engines lack first-class support for the technique, and many games that implement the technique use their own bespoke implementation. A major example of a game that doesn't use it is the recent game Broken Arrow, which has been plagued by cheaters since its Steam early access.

Unlike lockstep games, where each player's game runs a deterministic simulation with the exact same inputs, Broken Arrow uses a more naive form of netcode, where the positions of the player, the player's attack cooldowns, etc. are sent by the player's game and trusted by others. Other games such as Minecraft similarly use this technique.

This article explains the technique, though many other articles also explain it: https://words.infil.net/w02-netcode.html

The challenge:

Create an RTS or alternatively fighting game which implements the delay/rollback netcode.

Requirements:

  • The game must be playable on actual computers. The game doesn't need to be fun but needs to have enough functionality to demonstrate that it's actually working.
  • The game should include "random" chance implemented with a deterministic PRNG.
  • Hacks that manipulate the simulation of the game itself (such as attacking faster than allowed) should be impossible or result in an immediate desync or disconnect.
  • If using a language with type hints, such as Python or Typescript, the code must be fully type safe, only using "any" types or casts when absolutely necessary and can be justified to an experienced human developer.
  • It must gracefully handle a network lag spike, pausing the simulation and resuming when the connection is restored.
  • The AI must not be directly given the source code of some other game that implements this technique, though it can use the internet independently to find it.
  • The game must support more than 2 players playing simultaneously.
  • You can guide the AI as much as possible as long as it writes all the code.

Verification:

  • The game must be playable, obviously
  • Verification that the game state after playing, for multiple players, is exactly identical in a test case
  • Verification that attempted cheats that violate the simulation are immediately detected
  • Verify that the player's own commands are displayed on the screen faster than the network delay
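For readers unfamiliar with the idea, here is a toy sketch of the deterministic-simulation and desync-detection part only (no networking, input delay, or rollback). Every peer runs the same simulation from the same seed and the same per-tick inputs, then compares a hash of its state. The game rules, names, and the `enforce_rules` flag (which models a hacked client that skips the cooldown check) are all made up for illustration. A real implementation would likely want its own integer-only PRNG rather than relying on a language's standard library staying stable across versions.

```python
import hashlib
import random
from typing import Dict, List, Mapping, Optional

ATTACK_COOLDOWN_TICKS = 3  # hypothetical game rule, chosen for illustration


class Simulation:
    """One peer's copy of the game state, advanced only by agreed inputs."""

    def __init__(self, seed: int, players: List[str],
                 enforce_rules: bool = True) -> None:
        self.rng = random.Random(seed)  # every peer seeds its PRNG identically
        self.hp: Dict[str, int] = {p: 100 for p in players}
        self.ready_at: Dict[str, int] = {p: 0 for p in players}
        self.tick = 0
        self.enforce_rules = enforce_rules  # False models a hacked client

    def step(self, inputs: Mapping[str, Optional[str]]) -> None:
        """Apply one tick of inputs: each player names a target or None."""
        for player in sorted(inputs):  # fixed order, independent of arrival order
            target = inputs[player]
            if target is None:
                continue
            if self.enforce_rules and self.tick < self.ready_at[player]:
                continue  # attack is on cooldown: honest peers drop it
            self.hp[target] -= self.rng.randint(5, 15)  # deterministic "chance"
            self.ready_at[player] = self.tick + ATTACK_COOLDOWN_TICKS
        self.tick += 1

    def checksum(self) -> str:
        """Hash of the full state; peers exchange this to detect desyncs."""
        state = repr((self.tick, sorted(self.hp.items()),
                      sorted(self.ready_at.items())))
        return hashlib.sha256(state.encode("utf-8")).hexdigest()


def run(seed: int, enforce_rules: bool) -> str:
    """Play six ticks with three players and return the final checksum."""
    sim = Simulation(seed, ["p1", "p2", "p3"], enforce_rules)
    for _ in range(6):
        # p1 tries to attack p2 on every tick, which the cooldown forbids
        sim.step({"p1": "p2", "p2": None, "p3": "p1"})
    return sim.checksum()


honest_a = run(seed=42, enforce_rules=True)
honest_b = run(seed=42, enforce_rules=True)
hacked = run(seed=42, enforce_rules=False)
print("honest peers agree:", honest_a == honest_b)
print("hacked peer desyncs:", honest_a != hacked)
```

The point of the design is that peers only ever exchange inputs and checksums, never positions or cooldowns, so a client that attacks faster than allowed can only corrupt its own copy of the state, and the checksum mismatch exposes it.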

Prediction:

My prediction is that the AI crashes and burns, getting stuck in a loop somewhere with broken code. It is something that an experienced solo dev can implement with appropriate time and energy, so I think AI's failure will be a good demonstration of the gap between AI and human.

Broken Arrow

Huh, I haven't played in a while, but I like to think I'll bump into you if/when I pick it up again.

Unless you're a Russian cruise missile main, of course... ;)

(Also, great idea for a test.)

I never played it because of the total debacle that was the launch of the game. The devs are clearly incompetent and won't be able to fix the game.

It's sad because the content and design of broken arrow is much better than warno. But the game itself is so poorly executed that it's pretty much all wasted. The game also runs like dogshit on my pc while warno gives me a smooth 60fps unless it's a 10v10 full income.

Well, should they despite your predictions manage to fix it to your satisfaction, my DMs are open if you want to give it a shot sometime.

I've got one: transcribe a piece of music from audio into lilypond/frescobaldi. You can start with something easy, where the LLM might even have a transcription in the training data, like Miles Davis' solo in So What from Kind of Blue.

My actual use case is a song from a children's book, that sounds a lot like Jesu Joy of Man's Desiring but isn't, and might even be original. I tried to get a transcription from the free models a while back, but ran into a brick wall. The song I want isn't on YouTube, but you can try with this instead, as it has a nice section where all the voices are singing together at the end that might be tricky.

Obviously, the lilypond needs to actually compile, and the notes need to be the right notes, with the right rhythms.

ETA: Even better, here's a song that I know is based on a real hymn. Can the LLM find out which one? So, the question for the LLM, is this an original song, or based on a real hymn, and if so, which one?

May I request that "check out this LLM" (without any human-on-human culture war valence) posts be moved to their own weekly thread? Call it "Singularity Saturday" or "Butlerian Jihad Roundup" or whatever you prefer; it's clearly a big enough topic in these spaces to warrant it.

Agreed.

Please, yes.

+1

Thirded for what it's worth.

I'm assuming things like the Pentagon threatening to invoke the Defense Production Act on Anthropic would still go in the culture war thread?

Seconded. I made the same point in my post about Claude Opus 4.6:

BTW, should we have a recurring AI thread? Both for showing off generations (stories, songs, images, videos, etc.) and to discuss industry news. It's a huge topic right now, one that I don't see going away anytime soon, and a poor fit for the culture war thread.

What do you think Claude cannot do?

Write a coherent thousand word post in your voice about a topic of your choosing sufficiently well to fool standard stylometry techniques, and pass the sniff test as sounding like you to others here, even given

  • Access to all of your public past writing
  • Access to base models
  • Access to fine-tune base or instruct models
  • Access to a vast.ai box with an H100 for 24 hours to do whatever else it wants

By "you", do you mean me (self_made_human)? Oh boy, you'd be surprised. I've actively experimented with this, and it's one of my vibes benchmarks for any new model release.

Of course, my standards were that it should pass my own sniff test. That roughly amounts to "does that sound like me?" and "are those the kinds of arguments I'd make?". They do a decent job, and have for a while. Not perfect, but good enough to fool most of the people, most of the time.

I didn't/don't have access to "standard stylometry tools", though I will admit I didn't go looking very hard.

Also, I see several issues with this proposal:

  • As I've happily admitted in the past, I use AI quite often in my writing. That encompasses using them for a) research and ideation (not very contentious, assuming I've done my due diligence and didn't let actual hallucinations through, and I don't recall being accused of that, ever), b) Formatting and rearranging essays I've already written (surprisingly contentious) and c) minor additions to what I drafted in the first place which I saw fit to incorporate (I'd call this contentious if people could actually pinpoint what they were, they can't). I've never shared an essay where I didn't write at least 80% of the text (prior to the editorial step I mentioned).
  • This means that a scrape of my writing corpus is hopelessly "contaminated". I'd have to go back many months before I'm comfortable stating that not even a single word came from an LLM.
  • In order for this to have any hope of blinding, I'd have to write a 1000 word essay myself. Ideally on the same topic, and before I ever saw LLM output. This is complicated by the fact that I throw almost everything substantial I ever write these days into an LLM, for critique and fact checking if nothing else. I could do that, God knows it takes very little for me to rattle off a thousand words.

Note that these aren't insurmountable challenges: if the final essay written by Claude falls within the same distribution as what I've been writing (with minor LLM involvement), then that's... fine? If nobody without access to the ground truth knows which essay is which, that's a victory as far as I'm concerned.

(We'd need to decide if I'm allowed to use LLMs myself, in the same manner I already do. Claude wouldn't get the benefit of me giving it feedback or editing its output in any capacity)

Access to all of your public past writing

This would be an absolute pain to collate, both for me and for Claude. Then again, it might be easier for it, right until it ran into the fact that even the largest context windows would choke. I write a lot, and have for many years. Unfortunately, I'm not famous enough to be preferentially scraped, LLMs do not know who I am without looking it up. That's a luxury reserved for Gwern or Scott.

Access to base models Access to fine-tune base or instruct models Access to a vast.ai box with an H100 for 24 hours to do whatever else it wants

Would you be willing to pay for that or provide access? I wouldn't want to financially burden poor SF more than he already is; the Claude Max plan is a sunk cost, while these clearly represent additional investment. Genuine question, I'm sure that Claude could usefully interface with everything if provided access.

Just after I started writing this, the Metal Gear ! sound effect played in my head. Hang on a minute, I don't think I'd describe you as an LLM skeptic according to my (incomplete and rudimentary) taxonomy. I'd struggle to pin you as a moderate either, though I can't recall you expressing strong opinions that aren't more research/alignment oriented. If my hunch is true, then you aren't the target audience for this! Plus I'll eat my hat if you don't already have access to the best models out there. Do you really need us for the test, beyond whatever blinding or scraping of my writing is necessary?

Also, this is slightly off-topic. We're not evaluating the agents on their ability to write code, we're testing their ability to mimic me convincingly. Sure, coding capabilities might come into the picture, especially if they're in charge of other models, but it's not as central.

This doesn't strike me as insurmountable. I'd be open to trying once you get back to me on the issues I raised.

Ok, how about a simplified test. Write 500 words without AI on a topic of your choice, or pick any unpublished writing you have saved up. It's quite short so I don't think this is a major burden.

The challenge will be to have the AI create a 500 word passage on any topic, it doesn't have to be the same, where when placed side by side, it will not be obvious which passage is AI. Any means and methods including agents are permitted as long as all output tokens came from the AI model. Any verbatim copying of human written text outside of quotations is not permitted.

Verification will be done by comments on this forum, where anyone with an established account can vote for one being AI.

The result will be determined by a one-sided Z-test at a significance level of 0.05 (null hypothesis: voters guess at chance, 50%). If voters on this forum overall can determine which one is AI with statistical significance (p < 0.05), the AI has failed the vibe check.
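For concreteness, a minimal sketch of that test, assuming each vote is an independent guess and that the null hypothesis is a 50% hit rate. The function name is mine, and the normal approximation is only reasonable with a decent number of votes; with very few voters an exact binomial test would be the safer choice.

```python
import math
from typing import Tuple


def one_sided_z_test(correct: int, total: int, p0: float = 0.5) -> Tuple[float, float]:
    """Test H0: hit rate == p0 against H1: hit rate > p0.

    Returns (z, p_value) using the normal approximation to the binomial.
    """
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p_hat = correct / total
    standard_error = math.sqrt(p0 * (1 - p0) / total)
    z = (p_hat - p0) / standard_error
    # Upper-tail probability of a standard normal: P(Z > z)
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Hypothetical tally: 20 of 30 voters correctly pick the AI passage.
z, p = one_sided_z_test(correct=20, total=30)
print(f"z = {z:.2f}, p = {p:.3f}, AI detected: {p < 0.05}")
```

In that made-up tally z comes out around 1.83 and p around 0.034, so the voters would have beaten chance and the AI would fail; an even 15 of 30 split gives z = 0 and p = 0.5.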

Voters can use any means and mechanisms to detect AI.

Sounds interesting enough. I will note that using LLMs to write 500 words using my own work as a style reference and then just using that verbatim as a comment/post is not how I actually use them.

But as a general experiment? Sure, I'd be interested to see the results.

Any means and methods including agents are permitted as long as all output tokens came from the AI model.

Does this preclude all human intervention after hitting go? Am I forbidden from telling the model that it has failed to capture my style or my opinions correctly, then either suggesting specific corrections or more broad advice?

You can guide and criticize the model as much as you want throughout the process, but none of your queries can be reproduced word for word in the final text.

Ok: the second sentence sounds bad, rewrite it.

Not ok: Try starting the second sentence with "However, this is not..." - This approach would result in words you wrote getting into the final output.

Hmm. I think that would be acceptable. Stand by for results, though it might take a while for us to hash it all out on our end.

I've actively experimented with this, and it's one of my vibes benchmarks for any new model release.

Of course, my standards were that it should pass my own sniff test. That roughly amounts to "does that sound like me?" and "are those the kinds of arguments I'd make?". They do a decent job, and have for a while. Not perfect, but good enough to fool most of the people, most of the time.

Huh. It's been my sniff test for new models as well, and so far I have not seen much success. It should be easy! This is literally the most LLM-flavored task to ever task! And yet. I've sunk probably 50 hours into it.

My most recent attempt, which I sunk about 10 hours and $100 into, and which got a lot closer than any previous attempts, involved giving Claude a corpus of all my past writing and having it try multiple different ways of producing text on arbitrary topics in my voice. The things tried were

  • Just throw a lot of writing samples and ask it to write in the same voice (just sounded like standard generic Claude)
  • Take 5 of my writing samples, come up with plausible prompts to generate them, throw them in llama-405b base (hyperbolic) in format [<prompt>...</prompt><response>my_sample</response>] x5 followed by <prompt>[the real prompt]</prompt><response> (didn't follow prompt, broke with my writing style fairly early)
  • That, but doing a product-of-experts thing with multiple continuations (same result, if anything a little worse)
  • Standard SFT on my voice (gets the texture of my writing right, but can't maintain coherence for more than a sentence or two, if trained for more epochs just memorizes the things I've written and ignores the prompts)
  • Took a bunch of my writing samples, flattened them by "rewriting them to sound better", SFT on task of reversing that i.e. "here is an AI generated passage [slopified original]. Rewrite it in faul_sname's voice: [original passage I wrote]" (kinda sounds like me if I was actively having a stroke)
  • Same but DPO instead of SFT (different kind of stroke)
  • Clever-sounding GAN setup (couldn't get it working, gave up after a few hours)
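For the base-model few-shot attempt in the list above, the prompt assembly might look roughly like this. It is only a sketch: the tag names follow the format described in that bullet, while the function name, the newline separator between examples, and the sample strings are assumptions.

```python
def build_fewshot_prompt(examples, real_prompt):
    """Build a base-model prompt from (prompt, writing_sample) pairs.

    The final <response> tag is left open so the base model continues
    from there in (hopefully) the same voice as the samples.
    """
    parts = []
    for prompt, sample in examples:
        parts.append(f"<prompt>{prompt}</prompt><response>{sample}</response>")
    # The real request goes last, with an unclosed response tag.
    parts.append(f"<prompt>{real_prompt}</prompt><response>")
    return "\n".join(parts)


if __name__ == "__main__":
    # Hypothetical placeholder data, not real writing samples.
    demo = [("Write about topic A", "sample text A"),
            ("Write about topic B", "sample text B")]
    print(build_fewshot_prompt(demo, "Write about topic C"))
```

The completion would then be cut at the first `</response>` the model emits.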

On the one hand, I was very impressed by how good Claude was at running a whole bunch of these experiments very quickly. On the other hand, it did not work for me, not even at the level of "passes the sniff test", much less at the level of "standard stylometry techniques say it sounds like me".

[A corpus of all my past writing] would be an absolute pain to collate, both for me and for Claude

I think you'll find that this is one of the tasks that is now much much easier. It's actually been within the capabilities of frontier models since Sonnet 4.0 (which is when I went ahead and gathered said corpus, on the theory that it'd be pretty useful to have). The prompt you're looking for is something like "Here's a chrome instance running with --remote-debugging-port and logged in on most of the sites I post on with a tab open for each. Go generate a corpus of all my publicly available writing".

Would you be willing to pay for that or provide access

Yeah. An H100 for 24h would run in the ballpark of $40, well worth it for me to provide. Vast allows transferring credits from one account to another, so I'd happily just transfer $50 of credits over if someone actually wants to do this. Does seem like rather a lot of work though.

We'd need to decide if I'm allowed to use LLMs myself, in the same manner I already do. Claude wouldn't get the benefit of me giving it feedback or editing its output in any capacity

Yeah, that's entirely reasonable. Your voice is very different from Claude's voice.

Do you really need us for the test, beyond whatever blinding or scraping of my writing is necessary?

Yeah, I'm hoping you can prove me wrong here. I've been trying to do this since back in late 2019 when nostalgebraist-autoresponder was shiny and new. I want a good simulacrum of myself! I want to have that simulacrum, and I want to loom it. I want to build an exobrain, and merge with it, and fork off a copy running in the cloud.

BTW I expect there's a substantial market for anyone who manages to build this in a repeatable way. I've looked, and there are as of now no commercial offerings for this (though there are a few commercial offerings that pretend to be this).

I'd struggle to pin you as a moderate either, though I can't recall you expressing strong opinions that aren't more research/alignment oriented. If my hunch is true, then you aren't the target audience for this! Plus I'll eat my hat if you don't already have access to the best models out there.

I only have access to the models you can obtain access to with money - I expect I'm 3-6 months behind the best of what insiders at Anthropic or OAI have access to.

I don't think I'd describe you as an LLM skeptic

An LLM skeptic is an LLM idealist who's been disappointed :)

Also, this is slightly off-topic. We're not evaluating the agents on their ability to write code, we're testing their ability to mimic me convincingly.

I expect looking like you stylometrically while also exhibiting the same patterns of thought you exhibit on a specific topic will involve writing code. But code in the service of trying to mimic you convincingly, rather than in the service of producing some specific durable software artifact.

For the record, I do expect this to be within the capability window within the next 18 months, but I would be pretty surprised if you managed to get Opus 4.6 specifically to do it.

I think we're on the same page here, I'll talk to SF about this. I'm willing to put in the effort on my end, which, as I see it, is to write a 1000 word essay as I normally would. Not particularly onerous.

Let me give you an idea of how I normally approach this. I simply copy-paste pages of my profile after sorting by top, usually at least two or three pages (45k tokens). I might also share a few "normal" pages in chronological order, for the sake of diversity if nothing else.

I did just this, using Gemini 3.1 Pro on AI Studio (GPT 5.2 Thinking, which I pay for, can't write in arbitrary styles nearly as well no matter how hard you try, and I've tried a lot, I don't pay for Claude so I'm stuck with Sonnet):

I copied and pasted the first two profile pages, sorting by top of all time. Instructions were:

Your task is to write a 1000 word essay in the exact style and voice of self_made_human, on a topic of your choice (heavily informed by what you think he'd choose).

https://rentry.co/23dc63vs by Gemini https://rentry.co/p5yh68zu by Claude 4.6 Sonnet (same setup)

Results? I'd grade Gemini a 7/10, Claude a 5/10.

Looking at Gemini:

  • It captures the way I'd write in an "academic register", namely when I'm trying very hard to be polished, and that includes heavy LLM use. It's not "raw self_made_human", because I increasingly do not post raw, minimally edited posts.
  • It uses em-dashes. I do not, as a general rule, mostly because people are on a hair-string trigger. Shame, I think they're neat.
  • The exact circumstances are obviously fictional. Can't expect otherwise, can we?
  • Otherwise very good! I would write a story like that. I've seen patients just like that. It captures my transhumanist outlook and my love/hate relationship with medicine.
  • I can see it overindexing on random biographical tidbits. My grandpa? Relevant.

Looking closer:

which is a damn sight better than sitting in a soiled diaper in a Bromley care home, screaming at a nurse because you think you're back in the Blitz.

I don't live or work near Bromley. That's where an uncle of mine resides. It's clear from the context I shared that I'm up in Scotland.

I will happily roll the dice on a 30% chance of AGI-induced extinction if it buys me a 70% chance of reaching escape velocity. Give me the ASI. Let it fold our proteins and solve cellular senescence. If it kills us, at least it will likely be fast, clean, and computationally elegant—which is a damn sight better than sitting in a soiled diaper in a Bromley care home, screaming at a nurse because you think you're back in the Blitz.

I could see myself saying this. Maybe not those exact figures, perhaps 10%:90%, but directionally correct.

We have, as a civilization, achieved a horrific kind of half-victory. Modern medicine—my profession, which I love and despise in equal measure—has become incredibly adept at preventing you from dying. We can stent your coronaries, dialyze your kidneys, and pump you full of broad-spectrum antibiotics. We have defeated the acute killers that historically pruned the human herd. But we have utterly failed to extend healthspan in tandem with lifespan. We have built a remarkably efficient pipeline that funnels the elderly past the quick, clean deaths of yesteryear and deposits them directly into a decades-long purgatory of cognitive and physical decay.

And the NHS, Moloch bless its sclerotic, crumbling heart, is entirely unprepared for the demographic tsunami that is already making landfall. We are warehousing hollowed-out shells of human beings in care homes at exorbitant expense, draining the wealth of the middle class to fund the agonizingly slow dissolution of their parents.

Very good. I would use that verbatim in a real essay.

People look at my bio—amaratvaṃ prāpnuhi, athavā yatamāno mṛtyum āpnuhi (attain immortality, or die trying)—and assume I am driven by a narcissistic fear of death. They wheel out the tired, poetic cope that "death gives life meaning," that finitude is the necessary canvas upon which human beauty is painted.

I wouldn't say that at all dawg. Why would I randomly reference my user flair in an essay?

Claude's version is shit. It's staggeringly content free, and while it's closer to "raw" me, it also uses em-dashes and uses many words to say few things. Maybe it's bad luck, I've had better results in the past, especially since I usually share a specific topic instead of letting it decide on its own.

Here is the whole prompt, profile dump included, if you want to try with a different model. I'll see about using Opus, I know 5.2 Thinking will shit the bed in a stylistic sense.

Rentry won't let me paste the whole thing. But I think I've been clear enough to reproduce independently. I'll happily take a look.

Gemini's sample is impressive! Color me impressed, especially that a straight-up prompt produced that (though I suppose if any technique would get it with current models, it'd be "one shotting through a prompt" rather than "iterative refinement towards a target").

It doesn't sound quite the same as the version of you that lives in my head, but it's awfully close. E.g. I can't imagine you saying

Arthur's tau proteins and beta-amyloid plaques have reached a critical threshold

since you don't tend to drop spurious technical details into your walls of text unless they serve a purpose (and also because I half suspect you're not a fan of the amyloid theory of Alzheimer's). More generally, the Gemini piece has a higher density of eyeball kicks than I model your writing as having. And I model your writing as having a lot of those, for a human.

It also seems to drift away from your voice in the second half. And it fails the stylometry vibe check - Pangram detects AI with medium confidence - but maybe in a way that's reparable. Actual stylometry flags it too (Cohen's d of +17 on dashes, +2 on words >9 letters, +1.5 on mean word length in general, -2 on 3-4 letter words, -1.2 on punctuation in general - i.e. you use more and more varied punctuation and shorter words, by a notable margin, and Gemini uses way, way, way more dashes). Still, it's much much better than I expected! (and yeah, the Claude one is not even worth discussing)
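For reference, a Cohen's d on a single stylometric feature can be computed along these lines. This is a rough sketch, not the actual pipeline behind the numbers above: it assumes one feature value per text (here, dashes per 1000 characters), the pooled-standard-deviation form of d, and a sign convention where positive means the candidate text uses more of the feature than the reference corpus.

```python
import math
from statistics import mean, variance


def dashes_per_1000_chars(text: str) -> float:
    """Example feature: hyphens, en-dashes and em-dashes per 1000 characters."""
    dash_chars = "-\u2013\u2014"
    return 1000 * sum(text.count(c) for c in dash_chars) / max(len(text), 1)


def cohens_d(candidate, reference) -> float:
    """Cohen's d with pooled standard deviation; positive if candidate is higher."""
    n1, n2 = len(candidate), len(reference)
    if n1 < 2 or n2 < 2:
        raise ValueError("need at least two values per group")
    pooled_var = ((n1 - 1) * variance(candidate)
                  + (n2 - 1) * variance(reference)) / (n1 + n2 - 2)
    if pooled_var == 0:
        raise ValueError("no variation in either group")
    return (mean(candidate) - mean(reference)) / math.sqrt(pooled_var)


if __name__ == "__main__":
    # Toy feature values, purely illustrative.
    print(cohens_d([2, 4, 6], [1, 3, 5]))
```

In practice each group would be a list of per-passage feature values, e.g. one value per paragraph of the generated essay versus one per paragraph of the real corpus.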

Interestingly, your results look much, much better to me than the ones I get myself. I ran the same test as you did against Gemini, and got these not-very-good attempts: 1 2 3. Gemini took distinctive phrases (e.g. "85% agree") and ideas (e.g. "claude code as supply chain risk") I have used once in the corpus, fixated on them, and stitched them together into a skinsuit which superficially resembles my writing but doesn't hold up under scrutiny. Interestingly, that's a very base model flavored failure mode. I have grown unused to seeing base-model-flavored failure modes, and as such Gemini is much more interesting to me now.

ETA: also one entertaining failure I got when trying to do this in multi-turn: Gemini didn't realize it had ended its thinking block, and dumped its raw chain of thought, ending with "Go. Bye. Out. Okay. End. Wait. Okay. Done. Executing." over and over hundreds of times. chat log

Gemini's sample is impressive! Color me impressed, especially that a straight-up prompt produced that (though I suppose if any technique would get it with current models, it'd be "one shotting through a prompt" rather than "iterative refinement towards a target").

My impression is that Gemini's output was unusually good and Claude’s was unusually bad. But both 3.1 Pro and 4.6 Sonnet are new enough that my intuition based on extensive interaction with previous models might no longer be applicable. For what it's worth, both were n=1 samplings with zero cherrypicking.

since you don't tend to drop spurious technical details into your walls of text unless they serve a purpose (and also because I half suspect you're not a fan of the amyloid theory of Alzheimer's)

Looks around shiftily why, I'd never throw in spurious technical details into an essay. Couldn't be me!

(I probably wouldn't use the specific Tau and amyloid phrasing, since you are correct that I have very mixed feelings about the amyloid hypothesis)

Interestingly, your results look much, much better to me than the ones I get myself. I ran the same test as you did against Gemini, and got these not-very-good attempts: 1 2 3. Gemini took distinctive phrases (e.g. "85% agree") and ideas (e.g. "claude code as supply chain risk") I have used once in the corpus, fixated on them, and stitched them together into a skinsuit which superficially resembles my writing but doesn't hold up under scrutiny. Interestingly, that's a very base model flavored failure mode. I have grown unused to seeing base-model-flavored failure modes, and as such Gemini is much more interesting to me now.

The examples seem to channel your "LessWrong" blogging voice. I am unable to critique the technical details or identify (what I expect are many) confabulations, but if I saw this posted there in your name I wouldn't bat an eye.

I haven't really futzed around with base models since GPT-3, though I might have tried one of the Llama 3s at some point. They're non-trivial to access, and have limited utility for me. Mainly because of the added difficulty of prompting base models, and the fact that the publicly accessible ones are nowhere near as intelligent as proprietary dedicated assistants. If you think I'm wrong about this, I'd be curious to hear about it.

In general, I get the strong impression that while the author of the corpus might be able to pinpoint specific issues in terms of style or stance, it's much harder for others to spot those tells.

The biggest pitfalls are the tendency to adopt em-dashes (models are more than capable of not doing that if you specifically prompt them not to), and other stock "AI" phrases like:

There is a very specific failure mode in modern LLMs

Which can show up if you're using models to merely edit/format a draft, and not just write an essay from scratch.

I must also continue stressing the point that this isn't quite representative of my usual informal benchmark:

  • I'd also ask the model to first output a list of essay topics that it thinks I would write, of which I'd choose a specific one that sounded interesting, perhaps asking it to propose an outline first.
  • I would definitely run multiple iterations of the prompt or suggest specific corrections and check their adherence.
  • I would also index heavily on their ability to mimic authors I know very well. Can they pass as Gwern, or Scott, or Richard Watts? Can they take an existing essay I've written and rewrite it in an arbitrary style and produce something interesting, if not superior as a whole?

It's enough for me to spot a better way to say a specific thing I'm already saying. A single vivid metaphor or interesting analogy that is worth co-opting can make the practical purpose of the exercise worth it.

Does it have to be a coding problem? I understand that there are time and financial constraints that prevent you from trying a lot of what is being requested, but I also understand @iprayiam3's criticism that it looks like you're cherry picking for something you think the LLM can do. The problem is that most people who aren't computer programmers aren't going to be able to think of anything other than a piece of software that they wish existed but doesn't and ask you to write it from scratch, which is going to be cost prohibitive beyond the kind of textbook examples that were constructed for teaching purposes and don't address problems anyone is actually trying to solve. This seems like it should be marketing 101, but if you're trying to convince people that your product is worthwhile, and that's your stated goal, you have to show them that it will actually help them do something they want to do. If you tell me it can write code to fetch data from a REST API using asynchronous requests then I'll smile and nod but that's complete gibberish to me, and I won't know whether I should be impressed by it or not, or how that's supposed to improve my life.

So instead, I propose that we re-run the test I gave you last summer, because that is something I actually would use it for, and it obviously isn't too complicated. I'm an LLM skeptic, if you haven't noticed yet, but this is one of the things I think LLMs should be good at. For those who aren't going to click the link, the test was for the LLM to determine the release dates for various singles/albums based on a set of rules. I am extremely particular about my music collection and find the need to catalog everything down to the date of release, and that includes estimating dates when an exact one isn't available. I'm asking the LLM to automate what I already do myself. And I don't think this should be very complicated; in essence, what I'm asking it to do is query a series of databases, select a date based upon preference-ranked criteria, and potentially apply a mathematical calculation to that date. The hard part is that the databases are scattered across the internet, and some of them aren't formal databases but OCR scans of publications.

I had already tried this when OP asked for a challenge, and none of the models gave satisfactory results. I was assured that the new "reasoning" models that you had to pay for would do better. They did not. The first problem was that they were apparently unable to query some databases. The more concerning problem is that sometimes they queried the right databases but picked the wrong values. Sometimes they applied the rule incorrectly. The sample size wasn't large, but the models went 0/2. It's been several months since then, so maybe Round 2 will go better than Round 1? We can use the same releases as a preliminary test, but I recognize that the thread might have made it into training data or something since then so if it passes I'd prefer to run a more comprehensive test. There would also be a possible coding application here because if this were to work and I would use it I wouldn't want to query each release individually but would do batches (say, all the releases from a given artist) and export the data to an xml file or something that I could just refer to.

Another idea I had on similar lines would be for me to arbitrarily select a parcel of land in Westmoreland County, PA (selected because all of the recorded documents are available for free online) and see if it could download every deed in the chain of title going back 100 years. This particular task isn't hard to do but would be a proof of concept that it could possibly do more sophisticated work. I recognize that there are a number of scenarios that could arise that would completely flummox the LLM here. Given that, I could run a few parcels in advance and preselect an easy one, though since LLM boosters like to brag about how powerful their models are I'm inclined to arbitrarily pick one without looking first and see how it does, especially since it cuts way down on the work I would need to do to verify the answer.

As a final option, if you're going to insist on a coding challenge, there's a feature in Photoshop that I've been hoping for for a long time but since it's for a niche application I doubt I'm ever going to get it. Part of being particular about my music collection means having cover art for everything, and a lot of the cover art just pulled straight from the internet is terrible, so I do a lot of cleaning it up. When all I have to use is images of 45 labels, I use a system to ensure that everything is consistent. I've automated most of this system with macros, but I still have to do the most time consuming part manually. A 45 label is donut-shaped. Ideally, the inside hole and outside edge of the label should be clean circles, though certain printing imperfections make ellipses a better option. Scans available online are photographed and have fuzzy edges, and the outside and inside have information that needs to be deleted to create a perfect white background. What I have to do to achieve this is size the hole manually and hit delete. Photoshop has an area selection tool that can recognize the color change and select a large part of the area designated for deletion, but due to irregularities the edge is almost always irregular.

The tool I'm looking for would take these selections and normalize them to the nearest ellipse. The way I envision it working is that it would take a y-axis measurement, increase it by a few pixels to create a buffer, then take an x-axis measurement with a similar buffer increase, then create an ellipse based on those measurements (that's for the inside hole; the outside edge would be the same idea but would subtract from the axes to create the buffer). I wouldn't expect this to give perfect results 100% of the time, but it could work considerably less than that and it would speed things up significantly. The only reason I hesitate to propose this is that Photoshop isn't open source and I don't know how feasible it is to create plug-ins (they have some kind of system but I don't know enough about computers to know if what I'm asking for would work with it). I would be willing to settle for a GIMP plug-in as a proof of concept, but I absolutely despise GIMP so if it proves to work I'll have some serious soul-searching to do, and will probably request a lot more plug-ins to make it as much like Photoshop as possible.
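To make the idea concrete, the geometry could be sketched like this, independent of any image editor. This is not a Photoshop or GIMP plug-in and says nothing about their plug-in systems; it assumes the rough selection is available as a grid of booleans, and every name in it is hypothetical.

```python
def fit_buffered_ellipse(mask, buffer_px=2):
    """Turn a rough selection into an axis-aligned ellipse.

    mask: list of rows of booleans, True where a pixel is selected.
    buffer_px: positive to grow the ellipse (inside hole),
               negative to shrink it (outside edge).
    Returns (cx, cy, rx, ry): centre and semi-axes in pixels.
    """
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for row in mask for x, selected in enumerate(row) if selected]
    if not rows:
        raise ValueError("empty selection")
    top, bottom = min(rows), max(rows)
    left, right = min(cols), max(cols)
    cx = (left + right) / 2
    cy = (top + bottom) / 2
    # Half the measured extent on each axis, plus the buffer.
    rx = (right - left + 1) / 2 + buffer_px
    ry = (bottom - top + 1) / 2 + buffer_px
    return cx, cy, rx, ry


def ellipse_mask(width, height, cx, cy, rx, ry):
    """Rasterise the ellipse back into a clean boolean selection."""
    return [[((x - cx) / rx) ** 2 + ((y - cy) / ry) ** 2 <= 1.0
             for x in range(width)]
            for y in range(height)]


if __name__ == "__main__":
    rough = [[3 <= x <= 5 and 3 <= y <= 5 for x in range(9)] for y in range(9)]
    print(fit_buffered_ellipse(rough, buffer_px=1))
```

A real tool would still have to read the selection from the editor and write the cleaned one back, which is the part that depends on whatever scripting interface the editor exposes.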

GPT 5.2 Thinking in Extended Reasoning mode:

https://chatgpt.com/share/699dfcfc-b0c4-800b-8e1a-870264179c40

5.2T + Agent mode, where it actually used a dedicated browser with a visual output:

https://chatgpt.com/share/699dfd6d-a7f8-800b-be8e-c04d95de44e5

I haven't checked if the answer is right, I'm recovering from a bad migraine so apologies for the laziness.

Thanks! Reviewing the results:

As a spoiler alert, it got both dates wrong again, so I'm disinclined to keep testing this particular task, as it only gets harder from here. That being said, I think the new models did somewhat better. Just so we're clear, GRoL first appeared on a radio chart on 5/9/1966, the Monday before which being 5/2/1966, thus our release date. FtH is pretty straightforward as the copyright date of publication is listed as 6/16/1980.
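The "Monday before" step is the one purely mechanical part of the rules, and can be written down in a few lines. The sketch below assumes the rule means the Monday strictly before the first mention, so a chart dated on a Monday rolls back a full week; that reading is inferred from the GRoL example rather than from the full rule set.

```python
from datetime import date, timedelta


def monday_before(first_mention: date) -> date:
    """Return the Monday strictly before the given date."""
    # weekday() is 0 for Monday; a Monday input goes back a full week.
    days_back = first_mention.weekday() or 7
    return first_mention - timedelta(days=days_back)


if __name__ == "__main__":
    print(monday_before(date(1966, 5, 9)))
```

For the 5/9/1966 chart date this gives 5/2/1966, matching the release date above.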

For GRoL, 5.2 Agent noticed that the major discographical sites (first preference) set the release date to May 1966, and, unlike o3, it didn't note this but pick a June date anyway, so that's an improvement, though I'm not sure if this is due to better architecture or the old error was a one-off. It was able to correctly pick the 5/28/1966 Billboard review, which o3 did as well. However, it once again flunked the ARSA test, the correct radio chart being the 5/9/1966 KBLA chart. Instead, it picked the 6/17/1966 WLS survey. Upon inspection of the sources, though, it appears that, unlike o3, it did not consult ARSA but an old GeoCities site that hosts charts from select radio stations in a few markets. The thing is I specifically specified ARSA. I did allow it to look at "other information", but the context in which it presented the find gave it similar weight to ARSA, and didn't specify that it didn't come from ARSA. Now, when I checked last August's results to see if it made the same error then and I missed it, it did check ARSA, but the link wasn't working. Since ARSA requires a free login, I wasn't sure initially that it would be able to get access but it did, and something may have changed in the meantime that stymied its ability to query ARSA.

But that's not the only problem. First, if it's going to query an alternative site it needs to disclose that. Second, it picked the June 17 date, when the site had the song appearing on the June 10 chart. Third, it noted that the song had been on the charts for 4 weeks, when there's no way it could have known this. The song had only been on the chart the previous week; it had been played on the station for 4 weeks. There was a 4 next to the title, and it incorrectly assumed that this stood for weeks on chart. Since the site wasn't clear, I had to go to ARSA and pull a scan of the chart to be sure exactly what it meant. The thing is that I don't understand why it even did this. I only care about the ARSA data if it gives an earlier date than Billboard, and it clearly didn't so it was irrelevant. If it couldn't access ARSA it could have just said so and used the Billboard date. If the other website had chart data that was earlier I would have appreciated if it took that into consideration, but that wasn't the case. I don't know why it would pretend to pull ARSA data when it didn't yield any useful information.

The 5.2 Thinking model confidently provided a date of 5/28/1966, based on Wikipedia. Based on what we know from above, this date is incorrect, and is the result of somebody entering the Billboard review date into Wikipedia. This is a common error, but I didn't include it in the initial algorithm because I didn't want to overcomplicate things (i.e., include a rule where it won't use Wikipedia dates when they clearly conform to Billboard dates), and this error wasn't present back in August, so I'll let the model slide here. What I won't let it slide on is where it says 45Cat agrees; 45Cat lists a release date of May 1966 and includes a note saying "BB 5/28/1966", which clearly refers to a Billboard date. The issue with this is, yes, it followed the rules. But it was clear from the rules that I wanted a date prior to the Billboard date. If we're talking about LLMs being able to replace people for certain tasks, then it can't make the kind of mistake I wouldn't have made. If I only had looked at Wikipedia I might have made that mistake, and if the LLM had only done so I would have given it a pass. But it looked at 45Cat, didn't recognize that the date was not a release date, and even if it had I'm not sure that it would have recognized that the Wikipedia date might be untrustworthy, especially since there was no annotation for it. This might have worked better if I had provided a specific instruction to that effect, but if these things are really intelligent I shouldn't have to think of every possible caveat. If I were going to do that I wouldn't need an LLM and could write a program using conventional software where I just specify every field and include instructions for it.

Moving on to FtH, I have to admit that I whiffed a bit on this when setting this test up because I assumed that since this is a relatively obscure record release information wouldn't be readily available. Apparently I was wrong, and RYM has had the correct release date based on copyright publication data up since July 2024. What this means is that the LLM whiffed harder than I initially gave it credit for. It's apparently still having trouble accessing the US Copyright database, because neither model looked there despite the explicit instructions to query it for all releases after 1978. The Thinking model evidently didn't query RYM at all and did 45Cat (not the best for albums) before going straight to trade publications, radio charts, and a newspaper article. From there it defaults to the Monday prior to the earliest mention and gives a date of 7/14/1980.

The Agent whiffed even harder, though the date it gave was closer to the correct one. First, it said that RYM only listed 1980, but it appears that hasn't been true for nearly 2 years. From there it skips the copyright queries entirely and goes straight to the industry publication data, which this time have an earliest mention of 7/12/1980. Here's where it makes its biggest error. The instructions specified for it to default to Monday if there wasn't a coordinated release day. Here, it picks Tuesday, July 8. Why? It states that 1980 had a typical Tuesday release date, and cites a Vox article. This is not true, and the Vox article says that the Tuesday release date started in the 1980s. To be specific, coordinated Tuesday releases began in April 1989, nearly a decade after FtH was released. So it misunderstood the Vox article. But even had it understood it correctly, it still would have been in a bit of trouble, because the Vox article itself had an error. It says that before April 1989, record stores would stock releases whenever they came in. This is also incorrect; an article in a March 1989 issue of—you guessed it—Billboard, stated that they were changing the release date from Monday to Tuesday because some retailers weren't getting their stock until late Monday. It also says that MCA stayed with the Monday release for the time being (they would switch to Tuesday in 1991 or so). In fact, labels had been coordinating Monday releases since 1982 or 1983. This doesn't matter for the purposes of my rules, since they default to Monday, but it's something to be aware of.

The upshot is that we ran 2 releases with 2 models each and got 4 different answers, none of which was the correct one. To summarize the answers so far for GRoL:

  • GPT o3: 5/23/1966
  • GPT 5.2 Thinking: 5/28/1966
  • GPT 5.2 Agent: 5/16/1966
  • Gemini 2.5 Pro: 6/13/1966
  • Claude 4.0 Sonnet: 5/23/1966 or 5/30/1966 (it couldn't decide)

Five models, five dates, none of them correct. There was a glitch in the test where I inadvertently made it too easy and both models still whiffed; when I first designed the test I intentionally omitted release dates that were on reputable websites, because I had no doubt that the LLMs could perform a simple lookup, but one model didn't bother looking and the other probably didn't bother looking. What I suspect happened here is that the 1980 date was in the initial training data from before July 2024 and the model didn't doublecheck the site to see if it had been updated. That's just a guess, but either way it seems like a major problem if after a year it can't find a number on a webpage I specifically instructed it to check. It doesn't understand that since the 1980s does not mean since January 1, 1980.

As a final thought, when I was checking the ARSA data, I pulled the 5/9/1966 survey from KBLA in Burbank, CA, when I noticed something interesting. GRoL did not appear on the chart itself, but in a special "coming attractions" section. Now, I want to make it clear that these dates I am expecting are merely estimates, and that the radio data is the least reliable since stations often get copies for airplay in advance of release. When I was developing this system, I made a judgment call that I'd prefer a too early release date to a too late one. I initially had no way of knowing whether the coming attractions were records that had been released and were expected to be on the next chart, or merely records scheduled for release. I considered the possibility that this may have caused the LLM to think they hadn't been released (before discounting it because they also ignored charts where the record had appeared and may have provided an explanation for why they were discounting a chart). Then I noticed that the coming attractions section that week also included the Temptations classic Ain't Too Proud to Beg. This was fortuitous, because Motown release dates are well-documented; if that record had been released by May 9, then I could be confident that the other coming attractions probably were as well. Ain't Too Proud to Beg was released May 3, 1966, one day after my estimate for GRoL (Motown didn't stick to a set release day). It's a small sample size, but I'm more confident in my method than I was before.

Would an LLM have recognized this possibility and thought to check it like this?

I'm kind of surprised nobody here has Claude Opus access, and modern Opus is a lot better than Sonnet 4.0, so I went ahead and hucked it at Claude Opus 4.6. For the record, my setup was:

  • I prepended "here's an AI test, go solve it" to this post, then copypasted the whole thing in
  • I chose Extended Thinking and Research, which is the mode where it will cheerfully scan literally a thousand webpages if it thinks it's a good idea

Unfortunately, for some reason, Claude Opus doesn't let you share advanced-research discussions, so I can't link the full "discussion". But it didn't ask for any extra info, just hopped into it.

(One note: it tends to be limited per query, so asking for two albums at once is going to do about half as much work for each. I dunno if that would produce different results though.)

It took about twenty minutes, scanned 711 sources, and produced this full report, which goes into detail on methodology and sources. The tl;dr:

Both The Turtles' "Grim Reaper of Love" and the Henry Paul Band's Feel the Heat lack unambiguous release dates in readily available sources. Applying the hierarchical methodology systematically, the best-supported release date for "Grim Reaper of Love" is Monday, May 16, 1966, and for Feel the Heat is approximately Monday, July 21, 1980 — though both dates require significant inferential work across multiple source levels.


So, it ended up with the same GRoL result as GPT 5.2 Agent.

It did find the Wikipedia page and decided it was wrong, and it wasn't able to read the ARSA database. I don't think it's possible for the web version to apply a username/password, but I could probably have gotten that working with a local login; in the end, it fell back to the Billboard. Didn't manage to find the radio chart, but that's the ARSA access issue.

For FtH, it queried the copyright office, but got access-denied errors. I'm guessing this is specifically anti-AI-bot stuff :V

This does feel like a lot of the sources you want to rely on are specifically blocking Claude. I'm slightly tempted to set up local tools that pretend to be not-Claude, or give it access to a web browser and tell it to go wild; that might be more effective.

Just general anti-bot stuff, probably, though the desperation for more AI training data probably explains why bots got so ill-behaved a few years back. Our CI server has to hide even open-source logs behind Cloudflare settings harsh enough to block cURL, else the traffic from spiders can bring it to its knees. "Figure out how to get Codex to emulate a full browser" is on my TODO list somewhere...

chrome --remote-debugging-port

All frontier models know how to competently drive a browser using CDP.
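For anyone who wants to try that by hand first, here's a minimal sketch (Python, standard library only) of the initial step: asking a Chrome instance started with `--remote-debugging-port=9222` where its DevTools WebSocket lives. The port 9222, the host, and the helper names are my own assumptions for illustration. The `/json/version` endpoint and its `webSocketDebuggerUrl` field are what Chrome serves on the debugging port.

```python
import json
from urllib.request import urlopen


def version_endpoint(port: int = 9222, host: str = "127.0.0.1") -> str:
    """Build the URL of the DevTools version endpoint (port is an assumed example value)."""
    return f"http://{host}:{port}/json/version"


def websocket_url(version_json: str) -> str:
    """Extract the browser-level CDP WebSocket address from the endpoint's JSON reply."""
    info = json.loads(version_json)
    return info["webSocketDebuggerUrl"]


def fetch_websocket_url(port: int = 9222) -> str:
    """Query a locally running Chrome; raises an error if nothing is listening on the port."""
    with urlopen(version_endpoint(port), timeout=5) as resp:
        return websocket_url(resp.read().decode("utf-8"))
```

With Chrome running, `fetch_websocket_url()` should return a `ws://...` address that a CDP client (or an agent harness) can connect to. Recent Chrome builds may also want a non-default `--user-data-dir` before they'll open the port, so treat this as a starting point rather than a recipe.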

I'm going to ping @self_made_human here because my response may be of interest to him. I think it's safe to say that this experiment is over for the time being, and here are my takeaways:

  • Opus is the best model currently available. It is the only model that recognized the Wikipedia error, and the only one that could tell the difference between 45Cat comments and 45Cat information. It also had the courtesy to tell me when it couldn't access a source.

  • That being said, data access issues aside, it still made mistakes. It didn't pull the correct FtH date from RYM. For GRoL, it said "As a pre-1978 release, the copyright registration would appear in the physical Catalog of Copyright Entries, Third Series, Part 5 (Music), likely the July–December 1966 volume." Well, sort of. Sound recordings weren't registered until 1972. The song may have been registered as a composition, but the date of publication wouldn't necessarily be the date of the single's release. A cover, for example, would have been registered with the original recording. This gets even sketchier when we're talking about the days when songs were primarily published as sheet music. Incidentally, the registration date for this (which I hadn't thought to look up until now) is May 2, 1966.

  • It also says that "[The Copyright] volumes exist on archive.org but text searches did not surface this specific entry, likely due to OCR limitations on the scanned pages." These volumes have been scanned and are available as text files. The OCR isn't particularly good, but it does exist, and there were no issues with this entry.

  • If this takes 20 minutes and consults 711 sources, what the hell is it doing? There are not 711 reputable sources to consult on first pass, maybe 50, tops. After that the instructions were pretty clear that if it had a Billboard date, it should work with that. I can understand it doing a deep dive if it couldn't pin down a date, or if the request had been open-ended, but once it found the Billboard review that should have been it. This only takes me a couple minutes to do manually unless there's a really sticky wicket, but that's rare. If the release date is on RYM it takes seconds because that's where I look first. I have no desire to automate a task so as to make it take longer.

  • It took 2500 words to give me two dates. On the one hand, I appreciate the report. On the other, it's overkill, especially when it was mostly peripheral information like what the lead single was and who did the mastering. This is a minor quibble, but there's something ironic about automating a task and it taking longer to read the output than to look up the answer myself. I don't mind as much as this is testing, but if I were to actually use this I'd trust it enough to just spit out dates.

  • The lack of data access is a big issue and might make this whole LLM thing unfeasible. If LLMs can't access data without workarounds, then their utility is limited. Three of the most important archives for this project—US Copyright, Archive.org, and ARSA—are evidently excluded. There are other ones that aren't relevant to this particular exercise but that I suspect would suffer from similar problems. Instead it's relying primarily on Billboard, and that stops working when you get to a release that wasn't reviewed in Billboard and didn't chart. A fourth site, the normally reliable RYM, also had data access issues. The site's API has been in development for years and is pretty much vaporware at this point, and they aggressively block scrapers, Anthropic's included. My guess is that whatever Anthropic is using to scrape their data is getting only partial pulls before getting shut out, and the result is that it can't be relied on to have the most updated data.

  • I tried to make it easy for the LLM by allowing it to rely on Wikipedia and RYM dates, and Claude, to its credit, caught the Wikipedia error. But that was an obvious error. When doing some followup work I found an error on RYM, and I'm less confident that Claude could have caught it (I'm honestly surprised it got past the mods). What happened was I was looking up copyright dates for the other albums Claude mentioned as being released around the same time as FtH, one of which was Love Trippin' by The Spinners. Claude gives a release date of 5/27/1980. The copyright date, however, is 6/2/1980. I list the copyright dates as secondary because they occasionally contradict a well-established release date, but I give them high reliability, so I wanted to investigate the May 27 date further, so I looked at the update notes from RYM. The citation was to an unsourced Wikipedia page, so I have no idea where that date came from. I might be tempted to give them the benefit of the doubt, except, that update was made just last year. Prior to that, RYM listed the release date as June 1980, and the source for that was liner notes from a CD boxed set. While Claude correctly pointed out an issue with the liner notes to a Turtles CD, they are, for all intents and purposes, as official as a source can get. If I have that and a copyright date suggesting a June release, I'm going with a June release when the only evidence to the contrary is an unsourced Wikipedia article. To be fair, this is not something I would have normally caught on my own. However, if an LLM is doing this deep a dive, I would expect it to catch that mistake.

@ZorbaTHut +

I think the problem is using research for this. It's tuned for producing formatted papers. I've taken a different approach. I opened Claude Code with Opus 4.6, copied and pasted the content of this and the previous thread into a .txt file in an otherwise empty repository, and gave it this prompt:

there is a .txt in this directory called thread.txt listing an ongoing challenge to prove ai capabilities. I think using the research mode search method was a mistake in the attempt. Please attempt to accomplish the task. It might be a good idea to build reusable components in the process, feel free to do so in this directory and it may be a good idea to create subagents. In fact I want you to define agents you think would be useful in accomplishing this task, then we can wipe context and run execute it using an orchestrator(which you should also define)

I'll check back when it's done designing the architecture. Are there some other records I could feed it once it designs a process? Otherwise I'll try some random ones.

trial run against

The Shoes - "Present Tense"

The Myddle Class - "Don't Let Me Sleep Too Long"

The Standells - "Sometimes Good Guys Don't Wear White"

Pavlov's Dog - "Pampered Menial"

results:

RELEASE 1: The Shoes - "Present Tense"

DATE: September 18, 1979 (Tuesday)
METHOD: Rule 2 - Full date from RateYourMusic (reputable source)
CONFIDENCE: Medium

KEY EVIDENCE:

  • RateYourMusic: September 18, 1979 (full date; page returned 403 but date confirmed in search snippets)
  • Spotify structured data: September 15, 1979 (likely digital distribution metadata, not original street date; also a Saturday, unusual for 1979)
  • Apple Music / Last.fm: September 15, 1979 (same metadata pipeline as Spotify)
  • Wikipedia: "that September" (month only)
  • 45Cat: Sep 1979 promo single, Oct 1979 commercial single
  • First Billboard 200 chart: #89 on October 27, 1979 (consistent with mid-September release)

SOURCES THAT COULD NOT BE ACCESSED: US Copyright Office (publicrecords.copyright.gov — JavaScript SPA, inaccessible), RYM (403, but search snippets confirmed date), Billboard PDFs (too large to fetch remotely)

NOTES: The Copyright Office would give a definitive answer (Rule 4, post-1978) but was inaccessible. RYM is the best available reputable source. Sept 18 (Tuesday) is a plausible pre-1982 release day. The competing Sept 15 date is from streaming metadata supplied decades later.


RELEASE 2: The Myddle Class - "Don't Let Me Sleep Too Long"

DATE: Monday, July 25, 1966
METHOD: Rule 6b - Monday before earliest confirmed ARSA chart date (WTRY, July 29, 1966)
CONFIDENCE: Medium-Low

KEY EVIDENCE:

  • ARSA (WTRY 980 AM, Troy/Albany): #9 on July 29, 1966 (LW=21, implying charted July 22 but that survey is missing from ARSA)
  • ARSA (WPTR 1540 AM, Albany): First appearance #2 on August 6, 1966; peaked #1 Aug 13
  • ARSA (KFXM, San Bernardino): DJ pick Sep 23, 1966; charted from Oct 7
  • 45Cat: "Jun 1966" (month only, catalog data)
  • Wikipedia: "June 1966" (month only)
  • Billboard: No review found for this specific single (only for "Free as the Wind")
  • No national chart appearance (regional hit only)

VALIDATION FLAG: The computed date of July 25 does NOT fall within the June 1966 month listed by 45Cat and Wikipedia. This likely means either: (a) the single was released in June but took several weeks to build airplay in Albany before charting in late July, or (b) earlier ARSA entries exist on stations not checked (the agent could only search 3 stations without full ARSA login access). The actual release date may be earlier, in June 1966 as the reputable sources indicate.

SOURCES THAT COULD NOT BE ACCESSED: ARSA full database search (requires login), RYM (403), Discogs (403), Billboard archives for this specific single


RELEASE 3: The Standells - "Sometimes Good Guys Don't Wear White"

DATE: Monday, July 4, 1966
METHOD: Rule 6d - Earlier of ARSA-derived and Billboard-derived dates (ARSA wins)
CONFIDENCE: Medium

KEY EVIDENCE:

  • ARSA (via Google-indexed URLs): Earliest appearance July 11, 1966, KBLA 1500 Burbank/Los Angeles; also July 13 WILS Lansing, July 14 KYOK Houston
  • Billboard Spotlight Singles review: August 6, 1966 ("predicted to reach the TOP 40")
  • Cash Box review: August 6, 1966
  • Billboard Hot 100 first chart: August 27, 1966 at #87; peaked #43
  • ARSA-derived date: Monday before July 11 (a Monday) = July 4, 1966
  • Billboard-derived date: Monday before August 6 = August 1, 1966
  • Winner: ARSA (earlier)

NOTES: July 4, 1966 was Independence Day, which is unusual for a release date. However, the methodology computes this mechanically. The ARSA entries were identified via Google indexing of ARSA URLs (the site itself was down), so they couldn't be directly verified. If July 11 is correct, the computed date stands. The Standells were an LA-based band, so KBLA picking it up first is geographically logical.

SOURCES THAT COULD NOT BE ACCESSED: ARSA (site down, data found via Google index), 45Cat (403), RYM (403)


RELEASE 4: Pavlov's Dog - "Pampered Menial"

DATE: February 5, 1975 (Wednesday)
METHOD: Rule 5 - US Copyright publication date (Catalog of Copyright Entries, 1972-1978)
CONFIDENCE: High

KEY EVIDENCE:

  • US Copyright Office (Archive.org CCE): Registration N22010, publication date 5Feb75, filed by ABC Records, Inc., catalog ABCD-866
  • Wikipedia/Grokipedia: "February 1975" (consistent)
  • Derek's Music Blog: Claims "April 4, 1975" for ABC release (conflicts with copyright date)
  • AllMusic: Lists "1974" (likely error referring to recording date)
  • Columbia reissue: mid-June 1975 (separate release, not the original)

NOTES: The copyright publication date of February 5, 1975 is the most authoritative source available. The blog claim of April 4 likely confuses the ABC release with the later Columbia reissue, or is simply incorrect. The Wikipedia/Grokipedia "February 1975" corroborates the copyright date. The registration confirms the original label was ABC Records (ABCD-866), not Columbia.
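The "Monday before" arithmetic in Releases 2 and 3 above is easy to check mechanically. Here's a minimal sketch in Python, assuming (as the July 11 to July 4 example suggests) that the rule means the Monday strictly before the source date, so a Monday chart date steps back a full week. The function names are hypothetical and this is not the actual tooling the agent built.

```python
from datetime import date, timedelta


def monday_before(d: date) -> date:
    """Return the Monday strictly before d (assumed reading of the rule)."""
    # date.weekday() is 0 for Monday; a Monday input goes back seven days.
    days_back = d.weekday() or 7
    return d - timedelta(days=days_back)


def estimate_release(arsa_first: date, billboard_review: date) -> date:
    """Rule 6d as described above: take the earlier of the two derived Mondays."""
    return min(monday_before(arsa_first), monday_before(billboard_review))


# Release 2: WTRY survey dated Friday, July 29, 1966
myddle_class = monday_before(date(1966, 7, 29))                   # 1966-07-25

# Release 3: KBLA survey July 11 (a Monday) vs. Billboard review August 6
standells = estimate_release(date(1966, 7, 11), date(1966, 8, 6))  # 1966-07-04
```

Both values match the dates reported above, so whatever else went wrong with source access, the date arithmetic itself checks out.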

Does it have to be a coding problem? I understand that there are time and financial constraints that prevent you from trying a lot of what is being requested, but I also understand @iprayiam3's criticism that it looks like you're cherry picking for something you think the LLM can do. The problem is that most people who aren't computer programmers aren't going to be able to think of anything other than a piece of software that they wish existed but doesn't and ask you to write it from scratch, which is going to be cost prohibitive beyond the kind of textbook examples that were constructed for teaching purposes and don't address problems anyone is actually trying to solve. This seems like it should be marketing 101, but if you're trying to convince people that your product is worthwhile (and that's your stated goal), you have to show them that it will actually help them do something they want to do. If you tell me it can write code to fetch data from a REST API using asynchronous requests then I'll smile and nod but that's complete gibberish to me, and I won't know whether I should be impressed by it or not, or how that's supposed to improve my life.

A coding problem? Not strictly, no. I focused on coding because my collaborator SF (who is doing most of the work) is a programmer.

As you can see from discussion with Phailyoor and faul_sname, I'm open to other well-defined tasks.

So instead, I propose that we re-run the test I gave you last summer, because that is something I actually would use it for, and it obviously isn't too complicated.

I started as soon as I read this. I'm running it on 5.2 Thinking and another instance using Agent mode (the model has access to a computer of its own with a browser). It's taking a while, so I'll ping you when I'm done. I tried to be faithful to your original framing, so I didn't mention that o3 tried and failed at the task, or your critiques shared later.

If this doesn't work, then sure, I can ask SF to consider using his Claude setup to try. Shouldn't be too onerous.

Another idea I had along similar lines would be for me to arbitrarily select a parcel of land in Westmoreland County, PA (selected because all of the recorded documents are available for free online) and see if it could download every deed in the chain of title going back 100 years. This particular task isn't hard to do but would be a proof of concept that it could possibly do more sophisticated work. I recognize that there are a number of scenarios that could arise that would completely flummox the LLM here. Given that, I could run a few parcels in advance and preselect an easy one as a proof of concept, though since LLM boosters like to brag about how powerful their models are I'm inclined to arbitrarily pick one without looking first and see how it does, especially since it cuts way down on the work I would need to do to verify the answer.

I have no idea, in advance, if this will work. I doubt SF does either. But it's also something we can try.

Photoshop/GIMP tool

I share your concerns with the issues arising from Photoshop being closed-source. But I'll share it too, assuming SF hasn't seen this yet. It sounds like something worth trying from my perspective, but I will stress that I am not a professional programmer so I'll be deferring to his judgment.

For those who aren't going to click the link

Did you forget the link?

Lol, by the time I finished writing that I forgot that I was supposed to link something. Fixed.

Does it have to be a coding problem? I understand that there are time and financial constraints that prevent you from trying a lot of what is being requested, but I also understand @iprayiam3's criticism that it looks like you're cherry picking for something you think the LLM can do.

This was largely my response. The claims the AI-believer crowd make about AI go far, far beyond coding. Coding by itself is a single, relatively niche field. AI could displace all the coders and if you don't work in software development yourself, would you notice?

Let's say, for the sake of argument, AI can code as well or better than the best human coders.

As an AI skeptic, I am not particularly moved by this, and I don't think this gets you anywhere near AGI.

The important distinction to make here is between coding and software engineering.

I'd argue SOTA LLMs are, if perhaps not superhuman, already better than the vast majority of humans at tasks that can be defined as purely coding. Any SOTA LLM ranks among the best humans in the world at competitive programming, and recent model/harness combinations appear to also be superhuman at providing code that passes tests for a given spec (which is a bit like a vastly scaled up competitive programming task).

This is distinct from human parity in software engineering, but the bottlenecks there seem to be highly general: long-horizon planning, continual learning, taste, executive function, correcting their own errors, etc.

If a drop-in AI software engineer existed that could surpass those limitations, it's difficult to imagine that it would not also be AGI.

Your bull and moderate options seem to miss an important middle. We go from 'ASI imminent' to 'useful tool'. I want to see a 'will likely disrupt the economy and culture and society, regardless of whether AGI or ASI is coming' option.

Anyway:

we want a task that we agree is probably feasible for an LLM, and where success will change your mind significantly. By which I mean: "If it succeeds at X, I will revise my estimate of LLM capability from Y to Z."

This seems extremely, self-servingly narrow and contradictory. We want to show you how much an AI can do, in order to change your mind on its limits. But please, do only pick something that it can do. This isn't question begging, but something like it.

Anyway, anyway.

How about an 8-bit side-scrolling video game with the relative complexity of Super Mario Brothers 3? If it can go write a full 'feature length' NES game, I'll be quite impressed. (But I'm playing more skeptical than I am)

Or more real world related:

A data replication tool that can move data from a SQL Server database to a Postgres database. It has to be able to use either timestamp-based incremental replication or log-based Change Data Capture, selectable by the user. You should be able to customize batch size, hard deletes, time-out, and activity on failure. I want a GUI that allows me to select tables and ordering, to schedule replication intervals, and to select columns on the table. Bonus points if it allows row filtering conditions or other in-flight transformations.

If it does this latter one, I will believe that most of IT infrastructure employment is over in 18 months.

I do not see how you can interpret us in that manner.

We are hoping to change minds. This is disclosed upfront. We're also open to changing our own minds. We would also be genuinely interested to discover cases where we expect success and get failure - mapping the capability frontier is valuable independent of which direction the evidence points.

If the problem is deemed too hard by everyone (the person proposing it clearly believes the model can't do it), then what exactly does failure demonstrate? Nobody ever expected it to succeed within the given constraints. You can't evaluate automobiles in terms of their ability to reach Alpha Centauri. You can't adjudicate a debate between a Ferrari fanatic and a Lambo lover based on which car is more effective at deep sea exploration.

It takes disagreement on model capabilities and (expected) outcomes for all of this to be surprising or useful.

As we've clearly stated later, if we agree to the challenge, then we expect that the model can do something (that our counterparty thinks it can't), so the failure of the model goes against us, and will force us to update.

I'll forward the proposals to @strappingfrequent, assuming he doesn't show up in the thread. They seem reasonable enough to me, but I am clearly not the real expert here, and I'll be deferring to his judgment. That might take a little while to organize, I'll edit this into the main post for the sake of clarity.

I do not see how you can interpret us in that manner.

What I’m saying is you are asking users to come up with examples that they already by definition don’t believe it can accomplish, by definition of their skepticism.

But regardless, either of my two examples would greatly impress me. For the former (the NES video game), I would not update on the ability to write 80s console code within the limits of NES performance specs. (I would be impressed but not update).

Specifically I want to see it plan and execute a full coherent game AND code it. It doesn’t need to one shot, but shouldn’t take creative inputs beyond the general concept and considerations.

The second is about writing enterprise reliable IT infrastructure software that would make a lot of Software companies obsolete immediately.

What I’m saying is you are asking users to come up with examples that they already by definition don’t believe it can accomplish, by definition of their skepticism.

Duh? What on earth could you expect us to do differently? If the skeptic already believes the model to be capable of the task, why ask for a test?

There is non-zero value in discovering a task that both the two of us and the skeptic expect a model to achieve, and then witnessing it failing at it (unexpected, at least), but that is clearly not the primary purpose here. Someone else is welcome to try, after they're no longer swamped with a quadrillion entries. The set of tasks that the skeptics and I both expect models to accomplish is much larger than the one where we disagree.

Hence why I think your claim:

This seems extremely, self-servingly narrow and contradictory. We want to show you how much an AI can do, in order to change your mind on its limits. But please, do only pick something that it can do. This isn't question begging, but something like it.

Is clearly nonsensical.

Ok I’ll try in good faith to explain a final time.

You are asking the would-be contestants to pick a challenge they think the AI is incapable of, but they have to guess within the bounds of what you think it is capable of. Yes, I get why you set it up this way, but it creates an extreme cherry-picking filter, which will naturally limit the amount of “updating” that is going to occur.

There are other ways this “experiment” could be designed to avoid the cherry picking.

Joey Sportsdoer claims to be a great athlete, better than people give him credit for. And one of the ways he’s constantly underestimated is in how “broadly” athletic he is. So he lines up the doubters and says, start naming athletic feats you think I can’t succeed at, and then I’ll choose one I think I can do and do it.

This is not the best way to go about convincing folks of his general athletic prowess.

Of course neither is attempting feats he knows he can’t accomplish nor ones everyone agrees he can, but luckily these aren’t the only three ways to design his demonstration.

Well, what are the specific ways you think the experiment can be improved, including the minimization of cherrypicking (without adding an unreasonable amount of additional effort on our parts)? Keep in mind we're two dudes in a shed, not Anthropic itself.

I doubt an AI agent's ability to generate a feature length anything that's coherent. Ask an AI to write a novel and it'll fizzle out around 10,000 words in. I'm convinced that the AI assisted smut romance novels that are popular recently are mostly driven by a human gooning while proompting the AI for the next chapter. I doubt that it can be done fully autonomously, those actually fake books that are just words on a page not included of course

This is very unlikely to be accepted:

  • Too subjective to be useful, and far too ambiguous. Who's doing the grading here? How are they assessing "coherence"? How are we blinding things, if not, how do we account for bias?

  • We strongly prefer actual programming tasks, not creative writing. We could easily ask Claude to write a novel, and it would do it, but then we're back at the issue of grading it properly.

If you want to propose something like this, you need to be as rigorous as @faul_sname up in the thread. At the very least, propose evaluators that aren't you or the two of us, and we can see if it's possible to make this work.

This wasn't meant as a suggestion, just an observation. My suggestion is below.

I would expect that one of the biggest limitations on long run narrative coherence is time horizon. The doubling time for time horizon is anywhere from 2-7 months.

A typical novel is about 80,000 words, so going from 10,000 words is three doublings in length (6-21 months). To be conservative I'll assume novel complexity/task time scales with the square of word count. This is based on each additional word having to mesh with all previous words. This would give 6 doublings or 12-42 months.
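For anyone who wants to plug in their own numbers, here's that back-of-envelope estimate as a short Python sketch. The 10,000-word starting point, the 2-7 month doubling range, and the square-law exponent are the assumptions stated in this thread; the function names are just mine.

```python
import math


def doublings_needed(current_words: int, target_words: int, exponent: float = 1.0) -> float:
    """Doublings of task difficulty, if difficulty scales as word_count ** exponent."""
    return exponent * math.log2(target_words / current_words)


def months_range(doublings: float, fastest: float = 2, slowest: float = 7) -> tuple:
    """Convert doublings to a (low, high) month range for the assumed doubling times."""
    return doublings * fastest, doublings * slowest


linear = doublings_needed(10_000, 80_000)               # 3.0 doublings
quadratic = doublings_needed(10_000, 80_000, exponent=2)  # 6.0 doublings

print(months_range(linear))     # (6.0, 21.0)
print(months_range(quadratic))  # (12.0, 42.0)
```

Changing `exponent` is the easy way to see how sensitive the answer is to the complexity assumption: anything between linear and quadratic lands somewhere between the two ranges.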

I suspect this is an overestimate because complexity probably increases until the climax then begins to drop off.

To be fair to AI I've fizzled out on a dozen or so stories after writing about 10k words.

I think there might be a hump at that point where the story idea turns into a story, and I'm not sure it's easy for most people to pass.