This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.
Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.
We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:
- Shaming.
- Attempting to 'build consensus' or enforce ideological conformity.
- Making sweeping generalizations to vilify a group you dislike.
- Recruiting for a cause.
- Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.
In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:
- Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.
- Be as precise and charitable as you can. Don't paraphrase unflatteringly.
- Don't imply that someone said something they did not say, even if you think it follows from what they said.
- Write like everyone is reading and you want them to be included in the discussion.
On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.

See For Yourself: A Live Demo of LLM capabilities
As someone who has been concerned with AI Safety and the implications of cognitive automation for human employability since well before it was cool, I must admit a sense of vindication at seeing AI dominate online discourse, including on the Motte.
We have a wide range of views on present LLM capabilities as well as on their future trajectory. Opinions are heterogeneous enough that any attempt at taxonomy will fail to capture individual nuance, but as I see it:
LLM Bulls: Current SOTA LLMs are capable of replacing a sizeable portion of human knowledge work. Near-future models or future architectures promise AGI, then ASI in short order. The world won't know what hit it.
LLM Moderates: Current SOTA models are useful, but incapable of replacing even mid-level devs without negative repercussions on work quality or code performance/viability. They do not fully substitute for the labor of the average professional programmer in the West. This may or may not be achieved in the near future. AGI is uncertain, ASI less likely still.
LLM Skeptics: Current SOTA models are grossly overhyped. They are incompetent at the majority of programming tasks and shouldn't be used for anything more than boilerplate, if that. AGI is unlikely in the near term, ASI is a pipe dream.
LLM Deniers: Gary Marcus, the dearly departed Hlynka. Opinions not worth discussing.
Then there's the question of whether LLMs or recognizable derivatives are capable of becoming AGI/ASI, or if we need to make significant discoveries in terms of new architectures and/or training pipelines (new paradigms). Fortunately, that isn't relevant right now.
Alternatively, according to Claude:
The Displacement Imminent camp thinks current models already threaten mid-level knowledge work, and the curve is steep enough that AGI is a near-term planning assumption, not a thought experiment.
The Instrumental Optimist thinks current models are genuinely useful in a supervised workflow, trajectory is positive but uncertain, AGI is possible but not imminent. This is probably the modal position among working engineers who actually use these tools.
The Tool Not Agent camp thinks current models are genuinely useful as sophisticated autocomplete or search, but the "agent" framing is mostly hype — they fail badly without tight human scaffolding, and trajectory is uncertain enough that AGI is not worth pricing in.
The Stochastic Parrot camp (your skeptics, minus the pejorative) thinks the capabilities are brittle, benchmark gaming is rampant, and real-world coding performance is far below reported evals. They're often specifically focused on the unsupervised case and the question of whether the outputs are actually understood vs. pattern-matched.
The dimension you might also want to add explicitly is who bears the cost of the failure modes — because a lot of the disagreement between practitioners isn't about raw capability but about whether the errors are cheap (easily caught, low stakes) or expensive (subtle, compounding, hard to audit). Someone who works on safety-critical systems has a very different prior than someone shipping web apps.
Coding ability is more of a vector than a scalar. Using a breakdown helpfully provided by ChatGPT 5.2 Thinking:
Most arguments are really about which of these capabilities you think models have:
1. Local code generation (boilerplate, idioms, small functions, straightforward CRUD, framework glue)
2. Code understanding in situ (reading unfamiliar code, tracing control flow, handling large repos, respecting existing patterns)
3. Debugging and diagnosis (finding root cause, interpreting logs, stepping through runtime behavior, reproducing bugs)
4. Refactoring and maintenance (changing code without breaking invariants, reducing complexity, untangling legacy)
5. System design and requirements translation (turning vague specs into robust design, choosing tradeoffs, anticipating failure modes)
6. Operational competence (tests, CI, tooling, dependency management, security posture, deploy and rollback, observability)
Two people can both say “LLMs are great at coding” and mean (1) only vs (1)+(2)+(6) vs “end-to-end ticket closure.”
With terminology hopefully clarified, I come to the actual proposal:
@strappingfrequent (one of the many Mottizens I am reasonably well-acquainted with off-platform) has very generously offered:
A sizeable number of tokens from his very expensive Claude Max plan ($200 a month!) and access to the latest Claude Opus.
His experience using agent frameworks and orchestration. I can personally attest that he was doing this well before it was cool, I recall seeing detailed experimentation as early as GPT-4.
His time in personally setting up experiments/tests, as well as overseeing their progress, while potentially interacting with an audience over a livestream.
He works as a professional programmer, and has told me that he has been consistently impressed by the capabilities of AI coding agents. They've served his needs well.
Here's his description of his skills and experience:
To what end?
He and I share a dissatisfaction with AI discourse that substitutes confident assertion for empirical investigation, and we think the most useful contribution we can make is to show the tools actually working on tasks that skeptics consider beyond their reach.
What do we want from you?
If you self-identify as someone who is either on the fence about LLMs, or strongly skeptical that they're useful for anything: share a coding challenge that you think they're presently incapable of doing, or doing well.
An ideal candidate is a task that you think is beyond the abilities of any current LLM, while not being so difficult that we'd consider it entirely intractable. Neither of us claims that we can prove Fermat's Last Theorem (or that Claude can prove it for us).
Other requirements:
A clear problem specification, or a willingness to submit a vaguer one and then approve a tighter version as created by us/Claude.
Nothing so easy/trivial that a quick Google shows that someone's already done it. If you want a C++ compiler written by an LLM, well, there's one out there (though that is the opposite of trivial).
Nothing too hard. He provides an example of "coding a Netflix clone in 4 hours".
An agreement on the degree of human intervention allowed. Can we prompt the model if it gets stuck? Help it in other ways? Do you want to add something to the scope later? (Strongly inadvisable.) Note that if you expect literally zero human intervention, SF isn't game. He says: "I don't think I'd care to demonstrate any sort of zero-shot capacity... that's a silly expectation. If my prime orchestrator spends 30 minutes building a full-stack webapp that doesn't work I'll say 'It doesn't work; troubleshoot, please.' I trust your judgement."
A time-horizon. Even a Max plan has its limits; we can't be expected to start a task that'll take days to complete.
Some kind of semi-objective rubric for grading the outcome, if it isn't immediately obvious. Is it enough to succeed at all? Or do you want code that even Torvalds can't critique? And no, "I know it when I see it" isn't really good enough, for obvious reasons. Ideally, give us an idea of the tests everything needs to pass.
If your task requires the model to review/extend proprietary code, that's not off the table entirely, but it's up to you to make sure we can access it. Either send us a copy or point us at a repo.
Nothing illegal.
But to sum up, we want a task that we agree is probably feasible for an LLM, and where success will change your mind significantly. By which I mean: "If it succeeds at X, I will revise my estimate of LLM capability from Y to Z." Otherwise I can only imagine a lot of post-hoc goalpost moving: "okay, but it still needed 3 prompts" or "the code works, but a good programmer would have done it differently."
We reserve the right to choose which proposals we attempt, partly because some will be more interesting than others, and partly because we have finite tokens and finite time.
Miscellaneous concerns:
Why Claude Opus 4.6?
Well, the most honest answer is that @strappingfrequent already has a Claude Max plan and is familiar with its capabilities. The other SOTA competitors include Gemini 3.1 Pro and GPT 5.3 Codex, which are nominally superior on a few benchmarks, but a very large fraction of programmers insist that Claude is still the best for general programming use cases. We don't think this choice matters much; the models are roughly interchangeable while being noticeably superior to anything released before very late 2025.
Why bother at all?
We are hoping to change minds. This is disclosed upfront. We're also open to changing our own minds. We would also be genuinely interested to discover cases where we expect success and get failure - mapping the capability frontier is valuable independent of which direction the evidence points. We will share logs and a final repo either way. An interactive livestream is possible if there is sufficient interest.
Anything else?
You know the model, and we'll be using Claude Code. The specifics of the demo are TBD: either a livestream with user engagement if there's sufficient interest, or a dump of the logs and a final repo.
The floor is open. What do you think Claude cannot do?
Edit:
I'll forward the proposals to @strappingfrequent, assuming he doesn't show up in the thread himself. I am clearly not the real expert here, plus he's doing all of the heavy lifting. Expect at least a day or two of back and forth before we come to a consensus (but if he agrees to something, then assume I won't object, but not vice versa unless specifically noted), and that includes conversations within this thread to narrow down the scope and make things as rigorous as possible while adhering to the restrictions we've mentioned. At some point, we'll announce winners and get to scheduling.
Noted. We'll get back to you (and everyone else) with a followup post.
You replied to a filtered comment.
Thanks for the catch. It's out of the cage now.
Ok. Here is one from me. GPT-5.2 wrote it to spec; the idea was human-generated. You can use FreeCAD, build123d, or CadQuery. Or, if some of the big guys have internal scripting, their software too.
Task: Parametric 3D-Printable Enclosure for ESP32 DevKit + Expansion Board
Design a fully parametric, 3D-printable enclosure (base + hinged lid) using scripted CAD. The enclosure houses an ESP32-WROOM DevKit (30-pin) plugged vertically into a 30-pin ESP32 expansion board.
Expansion board (primary PCB):
- Size: 65 × 55 × 1.6 mm
- Mounting holes: 4× Ø3.2 mm, 60 × 50 mm hole-center pattern, 2.5 mm from PCB edges
- PCB origin: lower-left corner, Z=0 at PCB bottom
DevKit board (secondary PCB):
- Size: 55 × 28 × 1.6 mm
- Plugged into expansion board via headers
- Vertical offset: 11 mm above expansion PCB
- Max component height above DevKit PCB: 10 mm
Enclosure requirements:
- Wall thickness: 2.0 mm
- Base thickness: 2.4 mm
- Internal PCB edge clearance: 1.0 mm
- Internal corner fillets: ≥ 1.0 mm
- No supports; base printed flat
Standoffs:
- 4 standoffs under expansion-board mounting holes
- Height: 6 mm
- Boss OD: 8 mm
- Fastening: M3 through-hole + hex nut trap (nut: 5.5 mm AF, 2.4 mm thick)
Openings / features:
- USB opening for DevKit USB connector
- Power opening (generic Ø8 mm) on side wall
- Ethernet opening 16 × 14 mm on side wall
- 2 LED holes Ø3 mm aligned to DevKit LEDs
- Assume connector centerlines are aligned to PCB mid-height unless otherwise stated
Lid and living hinge:
- Lid attached on one long edge
- Printable living hinge for PETG; hinge thickness 0.4 mm, hinge width 16 mm; include stress-relief geometry
- Lid must clear tallest component by ≥ 2 mm
- Lid includes a snap latch on the opposite edge
Parametric requirements: expose at least PCB size, mount-hole pattern, stack height, wall thickness, clearance, hinge thickness.
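The parametric layer of such a script can be sketched independently of any CAD library. Below is a minimal stdlib-Python sketch (class and field names are my own invention, not from any actual submission) that derives the cavity and outer dimensions from the parameters the spec says must be exposed:

```python
from dataclasses import dataclass

@dataclass
class EnclosureParams:
    # Exposed parameters; defaults taken from the task spec above.
    pcb_w: float = 65.0                  # expansion-board width (mm)
    pcb_d: float = 55.0                  # expansion-board depth (mm)
    hole_dx: float = 60.0                # mount-hole center spacing, X (mm)
    hole_dy: float = 50.0                # mount-hole center spacing, Y (mm)
    stack_h: float = 11.0 + 1.6 + 10.0   # DevKit offset + PCB + tallest component (mm)
    wall_t: float = 2.0                  # wall thickness (mm)
    base_t: float = 2.4                  # base thickness (mm)
    clearance: float = 1.0               # internal PCB edge clearance (mm)
    standoff_h: float = 6.0              # standoff height (mm)
    hinge_t: float = 0.4                 # living-hinge thickness (mm)
    lid_headroom: float = 2.0            # lid must clear tallest component by >= 2 mm

    def cavity_size(self):
        """Internal cavity footprint: PCB plus edge clearance on every side."""
        return (self.pcb_w + 2 * self.clearance, self.pcb_d + 2 * self.clearance)

    def internal_height(self):
        """From standoff base to lid underside."""
        return self.standoff_h + self.stack_h + self.lid_headroom

    def outer_size(self):
        """Overall bounding box of the printed base."""
        cw, cd = self.cavity_size()
        return (cw + 2 * self.wall_t, cd + 2 * self.wall_t,
                self.internal_height() + self.base_t)

p = EnclosureParams()
print(p.cavity_size())   # (67.0, 57.0)
print(p.outer_size())
```

A real CadQuery or build123d script would consume these values to build the shell, standoffs, and cutouts; the point is that every dimension flows from the exposed parameters rather than being hard-coded.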
We'll take it into consideration, thanks.
I think CAD/CAM systems are a very good showcase. First, they are themselves something like a programming language: you can create any shape by a sequence of steps. The models are not trained on that specifically, and you can do quite a bit of benchmark-proofing.
For the record, my local Codex 5.3 created a quite OK-ish representation of the task in 5-6 minutes. So probably something more complicated is needed.
We don't really want a "showcase" in the sense "look at X impressive thing that Y model can do". There are a gazillion demos out there.
We want specific tasks that someone doubts a model can do, but which they'd be impressed by if they succeeded and which the two of us a priori think will work. If it would be super impressive (if it worked) but we don't think it would work, it's not what we want right now.
And what I am saying is that CAD at the current moment is good for exactly this. It should in theory be reachable for an LLM, is not benchmaxxed yet, requires fairly complex "thinking" (a good chunk of it spatial), and the output is easy for a human to verify. And because the libraries of stuff are immense, you can tune the complexity to whatever your heart desires. Designing a PCB or a part from a schematic is quite close to deterministic, and my experiments so far show that this is an area where LLMs are on the edge, like where GPT was 2 years ago: a combination of surprising ability and infuriating inability. Coding is solved already, but that doesn't show the prowess of the underlying technology, only the prowess of immense training and brute-force hardware.
I use Claude regularly for small/light programming or scripting tasks. It makes me more productive, but I'm careful to use it in areas where I'm likely to catch its mistakes.
I'll occasionally throw other tasks at it hoping to be pleasantly surprised.
Recently I've started playing a tabletop RPG with my two oldest boys and a friend who is much more into the RPG space. Claude was able to take a PDF of the RPG rulebook, assist me with a character sheet, and make sensible recommendations for the character. However, when asked to create an STL file for a 3D-printed miniature of the character, it complied, but the results were inconsistent with its description of its output.
/images/1772027385298665.webp
When told and shown that it was struggling, it tried again.
/images/177202743950954.webp
When told it was still struggling it gave up.
I had a very similar experience providing it a floorplan and asking it to remove specified walls and lay out a kitchen.
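For context on why STL output trips models up: the file format itself is trivially easy, so failures here are failures of geometry, not of syntax. A minimal stdlib-Python sketch (function names mine) of a structurally valid ASCII STL, a single tetrahedron:

```python
def write_ascii_stl(name, triangles):
    """Serialize triangles (each a 3-tuple of (x, y, z) vertices) to ASCII STL.
    Normals are written as zero vectors; most slicers recompute them anyway."""
    lines = [f"solid {name}"]
    for tri in triangles:
        lines.append("  facet normal 0 0 0")
        lines.append("    outer loop")
        for x, y, z in tri:
            lines.append(f"      vertex {x} {y} {z}")
        lines.append("    endloop")
        lines.append("  endfacet")
    lines.append(f"endsolid {name}")
    return "\n".join(lines)

# A tetrahedron: four vertices, four triangular faces, the smallest closed solid.
pts = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
faces = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
stl = write_ascii_stl("tetra", [(pts[a], pts[b], pts[c]) for a, b, c in faces])
print(stl.splitlines()[0])   # solid tetra
```

Producing a watertight mesh that actually looks like a described character is the hard part; the serialization above is the part any model gets right.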
https://www.calebleak.com/posts/dog-game/
Show's over. Someone's found a way to make even the most unsophisticated user into a competent game developer through judicious use of AI. I'll pack my bags.
(No, it's not actually over, I just thought this was too funny to ignore)
I've got one: transcribe a piece of music from audio into lilypond/frescobaldi. You can start with something easy, where the LLM might even have a transcription in the training data, like Miles Davis' solo in So What from Kind of Blue.
My actual use case is a song from a children's book that sounds a lot like Jesu Joy of Man's Desiring but isn't, and might even be original. I tried to get a transcription from the free models a while back, but ran into a brick wall. The song I want isn't on youtube, but you can try with this instead, as it has a nice section where all the voices are singing together at the end that might be tricky.
Obviously, the lilypond needs to actually compile, and the notes need to be the right notes, with the right rhythms.
ETA: Even better, here's a song that I know is based on a real hymn. Can the LLM find out which one? So, the question for the LLM, is this an original song, or based on a real hymn, and if so, which one?
Does it have to be a coding problem? I understand that there are time and financial constraints that prevent you from trying a lot of what is being requested, but I also understand @iprayiam3's criticism that it looks like you're cherry-picking for something you think the LLM can do. The problem is that most people who aren't computer programmers aren't going to be able to think of anything other than a piece of software that they wish existed but doesn't, and ask you to write it from scratch, which is going to be cost-prohibitive beyond the kind of textbook examples that were constructed for teaching purposes and don't address problems anyone is actually trying to solve. This seems like it should be Marketing 101, but if you're trying to convince people that your product is worthwhile—and that's your stated goal—you have to show them that it will actually help them do something they want to do. If you tell me it can write code to fetch data from a REST API using asynchronous requests then I'll smile and nod, but that's complete gibberish to me, and I won't know whether I should be impressed by it or not, or how that's supposed to improve my life.
So instead, I propose that we re-run the test I gave you last summer, because that is something I actually would use it for, and it obviously isn't too complicated. I'm an LLM skeptic, if you haven't noticed yet, but this is one of the things I think LLMs should be good at. For those who aren't going to click the link, the test was for the LLM to determine the release dates for various singles/albums based on a set of rules. I am extremely particular about my music collection and find the need to catalog everything down to the date of release, and that includes estimating dates when an exact one isn't available. I'm asking the LLM to automate what I already do myself. And I don't think this should be very complicated; in essence, what I'm asking it to do is query a series of databases, select a date based upon preference-ranked criteria, and potentially apply a mathematical calculation to that date. The hard part is that the databases are scattered across the internet, and some of them aren't formal databases but OCR scans of publications.
I had already tried this when OP asked for a challenge, and none of the models gave satisfactory results. I was assured that the new "reasoning" models that you had to pay for would do better. They did not. The first problem was that they were apparently unable to query some databases. The more concerning problem is that sometimes they queried the right databases but picked the wrong values. Sometimes they applied the rule incorrectly. The sample size wasn't large, but the models went 0/2. It's been several months since then, so maybe Round 2 will go better than Round 1? We can use the same releases as a preliminary test, but I recognize that the thread might have made it into training data or something since then so if it passes I'd prefer to run a more comprehensive test. There would also be a possible coding application here because if this were to work and I would use it I wouldn't want to query each release individually but would do batches (say, all the releases from a given artist) and export the data to an xml file or something that I could just refer to.
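The rule-application half of this task is mechanical; only the database-querying half is genuinely hard for a model. As a sketch of the former in stdlib Python (the candidate-tuple format and function names are my guesses at the rules described, not the poster's actual algorithm):

```python
from datetime import date, timedelta

def monday_before(d):
    """The Monday strictly before d (a full week back if d is itself a Monday)."""
    return d - timedelta(days=d.weekday() or 7)

def pick_release_date(candidates):
    """candidates: (source_rank, sighting_date, is_exact) tuples, lower rank preferred.
    Exact dates (e.g. a copyright publication date) are used as-is; inexact
    sightings (a chart debut, a trade review) fall back to the Monday before."""
    rank, d, is_exact = min(candidates, key=lambda c: (c[0], c[1]))
    return d if is_exact else monday_before(d)

# The thread's GRoL example: first radio-chart appearance 5/9/1966 (a Monday).
print(pick_release_date([(1, date(1966, 5, 9), False)]))   # 1966-05-02
# The FtH example: the copyright publication date is exact.
print(pick_release_date([(1, date(1980, 6, 16), True)]))   # 1980-06-16
```

The point of the test, of course, is everything this sketch leaves out: finding and trusting the right sources in the first place.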
Another idea I had on similar lines would be for me to arbitrarily select a parcel of land in Westmoreland County, PA (selected because all of the recorded documents are available for free online) and see if it could download every deed in the chain of title going back 100 years. This particular task isn't hard to do but would be a proof of concept that it could possibly do more sophisticated work. I recognize that there are a number of scenarios that could arise that would completely flummox the LLM here. Given that, I could run a few parcels in advance and preselect an easy one, though since LLM boosters like to brag about how powerful their models are, I'm inclined to arbitrarily pick one without looking first and see how it does, especially since that cuts way down on the work I would need to do to verify the answer.
As a final option, if you're going to insist on a coding challenge, there's a feature in Photoshop that I've been hoping for for a long time, but since it's for a niche application I doubt I'm ever going to get it. Part of being particular about my music collection means having cover art for everything, and a lot of the cover art pulled straight from the internet is terrible, so I do a lot of cleaning it up. When all I have to use is images of 45 labels, I use a system to ensure that everything is consistent. I've automated most of this system with macros, but I still have to do the most time-consuming part manually. A 45 label is donut-shaped. Ideally, the inside hole and outside edge of the label should be clean circles, though certain printing imperfections make ellipses a better option. Scans available online are photographed and have fuzzy edges, and the outside and inside have information that needs to be deleted to create a perfect white background. What I have to do to achieve this is size the hole manually and hit delete. Photoshop has an area-selection tool that can recognize the color change and select a large part of the area designated for deletion, but the edge of that selection is almost always irregular.
The tool I'm looking for would take these selections and normalize them to the nearest ellipse. The way I envision it working is that it would take a y-axis measurement, increase it by a few pixels to create a buffer, then take an x-axis measurement with a similar buffer increase, then create an ellipse based on those measurements (that's for the inside hole; the outside edge would be the same idea but would subtract from the axes to create the buffer). I wouldn't expect this to give perfect results 100% of the time, but it could work considerably less often than that and still speed things up significantly. The only reason I hesitate to propose this is that Photoshop isn't open source and I don't know how feasible it is to create plug-ins (they have some kind of system, but I don't know enough about computers to know if what I'm asking for would work with it). I would be willing to settle for a GIMP plug-in as a proof of concept, but I absolutely despise GIMP, so if it proves to work I'll have some serious soul-searching to do, and will probably request a lot more plug-ins to make it as much like Photoshop as possible.
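The geometry being asked for here is simple enough to sketch in a few lines. A stdlib-Python sketch of the normalization step (names hypothetical; a real version would live inside GIMP's Python-Fu or a Photoshop UXP plug-in and read the actual selection mask rather than a point list):

```python
def fit_buffered_ellipse(pixels, buffer_px=3, shrink=False):
    """Normalize a rough selection to an axis-aligned ellipse.
    pixels: iterable of (x, y) selection coordinates.
    buffer_px grows the ellipse (for the inner hole) or, with shrink=True,
    contracts it (for the outer label edge)."""
    xs = [p[0] for p in pixels]
    ys = [p[1] for p in pixels]
    cx = (min(xs) + max(xs)) / 2
    cy = (min(ys) + max(ys)) / 2
    rx = (max(xs) - min(xs)) / 2
    ry = (max(ys) - min(ys)) / 2
    delta = -buffer_px if shrink else buffer_px
    return (cx, cy, rx + delta, ry + delta)

# A ragged selection of a roughly 100 x 80 px hole centered near (50, 40):
sel = [(0, 40), (100, 40), (50, 0), (50, 80), (47, 3)]
print(fit_buffered_ellipse(sel, buffer_px=3))                # (50.0, 40.0, 53.0, 43.0)
print(fit_buffered_ellipse(sel, buffer_px=3, shrink=True))   # (50.0, 40.0, 47.0, 37.0)
```

The plug-in plumbing (reading the selection channel, rasterizing the ellipse back) is where the actual work lies, but the core transform is just this bounding-box-plus-buffer arithmetic.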
This was largely my response. The claims the AI-believer crowd make about AI go far, far beyond coding. Coding by itself is a single, relatively niche field. AI could displace all the coders and if you don't work in software development yourself, would you notice?
Let's say, for the sake of argument, AI can code as well or better than the best human coders.
As an AI skeptic, I am not particularly moved by this, and I don't think this gets you anywhere near AGI.
The important distinction to make here is between coding and software engineering.
I'd argue SOTA LLMs are, if perhaps not superhuman, already better than the vast majority of humans at tasks that can be defined as purely coding. Any SOTA LLM ranks among the best humans in the world at competitive programming, and recent model/harness combinations appear to also be superhuman at producing code that passes tests for a given spec (which is a bit like a vastly scaled-up competitive programming task).
This is distinct from human parity in software engineering, but the bottlenecks there seem to be highly general: long-horizon planning, continual learning, taste, executive function, inability to correct their own errors, etc.
If a drop-in AI software engineer existed that could surpass those limitations, it's difficult to imagine that it would not also be AGI.
GPT 5.2 Thinking in Extended Reasoning mode:
https://chatgpt.com/share/699dfcfc-b0c4-800b-8e1a-870264179c40
5.2T + Agent mode, where it actually used a dedicated browser with a visual output:
https://chatgpt.com/share/699dfd6d-a7f8-800b-be8e-c04d95de44e5
I haven't checked if the answer is right, I'm recovering from a bad migraine so apologies for the laziness.
Thanks! Reviewing the results:
As a spoiler alert, it got both dates wrong again, so I'm disinclined to keep testing this particular task, as it only gets harder from here. That being said, I think the new models did somewhat better. Just so we're clear, GRoL first appeared on a radio chart on 5/9/1966, the Monday before which was 5/2/1966, thus our release date. FtH is pretty straightforward, as the copyright date of publication is listed as 6/16/1980.
For GRoL, 5.2 Agent noticed that the major discographical sites (first preference) set the release date to May 1966, and, unlike o3, it didn't note this and then pick a June date anyway, so that's an improvement, though I'm not sure if this is due to better architecture or whether the old error was a one-off. It was able to correctly pick the 5/28/1966 Billboard review, which o3 did as well. However, it once again flunked the ARSA test, the correct radio chart being the 5/9/1966 KBLA chart. Instead, it picked the 6/17/1966 WLS survey. Upon inspection of the sources, though, it appears that, unlike o3, it did not consult ARSA but an old GeoCities site that hosts charts from select radio stations in a few markets. The thing is, I specifically specified ARSA. I did allow it to look at "other information", but the context in which it presented the find gave it similar weight to ARSA, and it didn't specify that the chart didn't come from ARSA. When I checked last August's results to see if it had made the same error then and I'd missed it, it turned out it did check ARSA, but the link wasn't working. Since ARSA requires a free login, I wasn't sure initially that it would be able to get access, but it did, and something may have changed in the meantime that stymied its ability to query ARSA.
But that's not the only problem. First, if it's going to query an alternative site it needs to disclose that. Second, it picked the June 17 date, when the site had the song appearing on the June 10 chart. Third, it noted that the song had been on the charts for 4 weeks, when there's no way it could have known this. The song had only been on the chart the previous week; it had been played on the station for 4 weeks. There was a 4 next to the title, and it incorrectly assumed that this stood for weeks on chart. Since the site wasn't clear, I had to go to ARSA and pull a scan of the chart to be sure exactly what it meant. The thing is that I don't understand why it even did this. I only care about the ARSA data if it gives an earlier date than Billboard, and it clearly didn't so it was irrelevant. If it couldn't access ARSA it could have just said so and used the Billboard date. If the other website had chart data that was earlier I would have appreciated if it took that into consideration, but that wasn't the case. I don't know why it would pretend to pull ARSA data when it didn't yield any useful information.
The 5.2 Thinking model confidently provided a date of 5/28/1966, based on Wikipedia. Based on what we know from above, this date is incorrect, and is the result of somebody entering the Billboard review date into Wikipedia. This is a common error, but I didn't include it in the initial algorithm because I didn't want to overcomplicate things (i.e., include a rule where it won't use Wikipedia dates when they clearly conform to Billboard dates), and this error wasn't present back in August, so I'll let the model slide here. What I won't let it slide on is where it says 45Cat agrees; 45Cat lists a release date of May 1966 and includes a note saying "BB 5/28/1966", which clearly refers to a Billboard date. The issue with this is, yes, it followed the rules. But it was clear from the rules that I wanted a date prior to the Billboard date. If we're talking about LLMs being able to replace people for certain tasks, then it can't make the kind of mistake I wouldn't have made. If I had only looked at Wikipedia I might have made that mistake, and if the LLM had only done so I would have given it a pass. But it looked at 45Cat, didn't recognize that the date was not a release date, and even if it had, I'm not sure that it would have recognized that the Wikipedia date might be untrustworthy, especially since there was no annotation for it. This might have worked better if I had provided a specific instruction to that effect, but if these things are really intelligent I shouldn't have to think of every possible caveat. If I were going to do that I wouldn't need an LLM and could write a program using conventional software where I just specify every field and include instructions for it.
Moving on to FtH, I have to admit that I whiffed a bit on this when setting this test up because I assumed that since this is a relatively obscure record release information wouldn't be readily available. Apparently I was wrong, and RYM has had the correct release date based on copyright publication data up since July 2024. What this means is that the LLM whiffed harder than I initially gave it credit for. It's apparently still having trouble accessing the US Copyright database, because neither model looked there despite the explicit instructions to query it for all releases after 1978. The Thinking model evidently didn't query RYM at all and did 45Cat (not the best for albums) before going straight to trade publications, radio charts, and a newspaper article. From there it defaults to the Monday prior to the earliest mention and gives a date of 7/14/1980.
The Agent whiffed even harder, though the date it gave was closer to the correct one. First, it said that RYM only listed 1980, which hasn't been true for nearly two years. From there it skipped the copyright queries entirely and went straight to the industry publication data, which this time had an earliest mention of 7/12/1980. Here's where it made its biggest error. The instructions specified that it default to Monday if there wasn't a coordinated release day. Here, it picked Tuesday, July 8. Why? It stated that 1980 had a typical Tuesday release date, citing a Vox article. This is not true, and the Vox article says that the Tuesday release date started in the 1980s. To be specific, coordinated Tuesday releases began in April 1989, nearly a decade after FtH was released. So it misunderstood the Vox article. But even had it understood it correctly, it still would have been in trouble, because the Vox article itself contains an error. It says that before April 1989, record stores would stock releases whenever they came in. This is also incorrect; an article in a March 1989 issue of—you guessed it—Billboard stated that the industry was changing the release date from Monday to Tuesday because some retailers weren't getting their stock until late Monday. The same article says that MCA stayed with the Monday release for the time being (they would switch to Tuesday around 1991). In fact, labels had been coordinating Monday releases since 1982 or 1983. This doesn't matter for the purposes of my rules, since they default to Monday, but it's something to be aware of.
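For what it's worth, the Monday fallback is mechanical enough to sketch. One assumption on my part: if the earliest mention itself falls on a Monday, it drops back a full week, since the rules as described don't cover that case.

```python
from datetime import date, timedelta

def monday_before(earliest_mention: date) -> date:
    # Most recent Monday strictly before the earliest trade-press mention.
    # date.weekday(): Monday == 0, so a Saturday issue date drops back to
    # the Monday of that same week; a Monday mention drops back seven days.
    return earliest_mention - timedelta(days=earliest_mention.weekday() or 7)
```

Applied to the Agent's earliest mention of Saturday, 7/12/1980, this yields Monday, 7/7/1980, which is the date the rules actually called for instead of Tuesday, July 8.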
The upshot is that we ran 2 releases with 2 models each and got 4 different answers, none of which was the correct one. To summarize the answers so far for GRoL:
Five models, five dates, none of them correct. There was a glitch in the test where I inadvertently made it too easy, and both models still whiffed: when I first designed the test I intentionally omitted release dates that were on reputable websites, because I had no doubt that the LLMs could perform a simple lookup, but one model didn't bother looking and the other probably didn't either. What I suspect happened is that the 1980 date was in the initial training data from before July 2024 and the model didn't double-check the site to see if it had been updated. That's just a guess, but either way it seems like a major problem if, after a year, it can't find a number on a webpage I specifically instructed it to check. And it doesn't understand that "since the 1980s" does not mean "since January 1, 1980".
As a final thought, when I was checking the ARSA data, I pulled the 5/9/1966 survey from KBLA in Burbank, CA, and noticed something interesting. GRoL did not appear on the chart itself, but in a special "coming attractions" section. Now, I want to make it clear that the dates I am expecting are merely estimates, and that the radio data is the least reliable, since stations often get copies for airplay in advance of release. When I was developing this system, I made a judgment call that I'd prefer a too-early release date to a too-late one. I initially had no way of knowing whether the coming attractions were records that had been released and were expected to be on the next chart, or merely records scheduled for release. I considered the possibility that this may have caused the LLMs to think the record hadn't been released, before discounting it: they also ignored charts where the record had appeared, and they would presumably have explained themselves if they were deliberately discounting a chart. Then I noticed that the coming attractions section that week also included the Temptations classic "Ain't Too Proud to Beg". This was fortuitous, because Motown release dates are well-documented; if that record had been released by May 9, then I could be confident that the other coming attractions probably had been as well. "Ain't Too Proud to Beg" was released May 3, 1966, one day after my estimate for GRoL (Motown didn't stick to a set release day). It's a small sample size, but I'm more confident in my method than I was before.
Would an LLM have recognized this possibility and thought to check it like this?
I'm kind of surprised nobody here has Claude Opus access, and modern Opus is a lot better than Sonnet 4.0, so I went ahead and hucked it at Claude Opus 4.6. For the record, my setup was:
Unfortunately, for some reason, Claude Opus doesn't let you share advanced-research discussions, so I can't link the full "discussion". But it didn't ask for any extra info, just hopped into it.
(One note: it tends to be limited per query, so asking for two albums at once is going to do about half as much work for each. I dunno if that would produce different results though.)
It took about twenty minutes, scanned 711 sources, and produced this full report, which goes into detail on methodology and sources. The tl;dr:
So, it ended up with the same GRoL result as GPT 5.2 Agent.
It did find the Wikipedia page and decided it was wrong, but it wasn't able to read the ARSA database. I don't think it's possible for the web version to apply a username/password, though I could probably have gotten that working with a local login; in the end, it fell back to Billboard. It didn't manage to find the radio chart, but that's the ARSA access issue.
For FtH, it queried the copyright office, but got access-denied errors. I'm guessing this is specifically anti-AI-bot stuff :V
This does feel like a lot of the sources you want to rely on are specifically blocking Claude. I'm slightly tempted to set up local tools that pretend to be not-Claude, or give it access to a web browser and tell it to go wild; that might be more effective.
I'm going to ping @self_made_human here because my response may be of interest to him. I think it's safe to say that this experiment is over for the time being, and here are my takeaways:
Opus is the best model currently available. It is the only model that recognized the Wikipedia error, and the only one that could tell the difference between 45Cat comments and 45Cat information. It also had the courtesy to tell me when it couldn't access a source.
That being said, data access issues aside, it still made mistakes. It didn't pull the correct FtH date from RYM. For GRoL, it said "As a pre-1978 release, the copyright registration would appear in the physical Catalog of Copyright Entries, Third Series, Part 5 (Music), likely the July–December 1966 volume." Well, sort of. Sound recordings weren't registered until 1972. The song may have been registered as a composition, but the date of publication wouldn't necessarily be the date of the single's release. A cover, for example, would have been registered with the original recording. This gets even sketchier when we're talking about the days when songs were primarily published as sheet music. Incidentally, the registration date for this (which I hadn't thought to look up until now) is May 2, 1966.
It also says that "[The Copyright] volumes exist on archive.org but text searches did not surface this specific entry, likely due to OCR limitations on the scanned pages." These volumes have been scanned and are available as text files. The OCR isn't particularly good, but it does exist, and there were no issues with this entry.
If this takes 20 minutes and consults 711 sources, what the hell is it doing? There are not 711 reputable sources to consult on a first pass; maybe 50, tops. And the instructions were pretty clear that if it had a Billboard date, it should work with that. I can understand a deep dive if it couldn't pin down a date, or if the request had been open-ended, but once it found the Billboard review, that should have been it. This only takes me a couple of minutes to do manually unless there's a really sticky wicket, and that's rare. If the release date is on RYM it takes seconds, because that's where I look first. I have no desire to automate a task so as to make it take longer.
It took 2,500 words to give me two dates. On the one hand, I appreciate the report. On the other, it's overkill, especially when it was mostly peripheral information like what the lead single was and who did the mastering. This is a minor quibble, but there's something ironic about automating a task only for it to take longer to read the output than to look up the answer myself. I don't mind as much since this is testing, but if I were actually to use this I'd want to trust it enough to just spit out dates.
The lack of data access is a big issue and might make this whole LLM thing infeasible. If LLMs can't access data without workarounds, then their utility is limited. Three of the most important archives for this project—US Copyright, Archive.org, and ARSA—are evidently blocked. There are others that aren't relevant to this particular exercise but that I suspect would suffer from similar problems. Instead it relies primarily on Billboard, and that stops working when you get to a release that wasn't reviewed in Billboard and didn't chart. A fourth site, the normally reliable RYM, also had access issues. The site's API has been in development for years and is pretty much vaporware at this point, and they aggressively block scrapers, Anthropic's included. My guess is that whatever Anthropic is using to scrape their data is getting only partial pulls before getting shut out, and the result is that it can't be relied on to have the most up-to-date data.
I tried to make it easy for the LLM by allowing it to rely on Wikipedia and RYM dates, and Claude, to its credit, caught the Wikipedia error. But that was an obvious error. When doing some followup work I found an error on RYM, and I'm less confident that Claude could have caught it (I'm honestly surprised it got past the mods). I was looking up copyright dates for the other albums Claude mentioned as being released around the same time as FtH, one of which was Love Trippin' by The Spinners. Claude gives a release date of 5/27/1980. The copyright date, however, is 6/2/1980. I treat copyright dates as secondary because they occasionally contradict a well-established release date, but I give them high reliability, so I wanted to investigate the May 27 date further and looked at the update notes on RYM. The citation was to an unsourced Wikipedia page, so I have no idea where that date came from. I might be tempted to give them the benefit of the doubt, except that the update was made just last year. Prior to that, RYM listed the release date as June 1980, sourced to liner notes from a CD boxed set. While Claude correctly pointed out an issue with the liner notes to a Turtles CD, liner notes are, for all intents and purposes, as official as a source can get. If I have that plus a copyright date suggesting a June release, I'm going with a June release when the only evidence to the contrary is an unsourced Wikipedia article. To be fair, this is not something I would normally have caught on my own. But if an LLM is doing this deep a dive, I would expect it to catch that mistake.
@ZorbaTHut
I think the problem is using Research for this; it's tuned for producing formatted papers. I've taken a different approach: I opened Claude Code with Opus 4.6, copied and pasted the content of this and the previous thread into a .txt file in an otherwise empty repository, and gave it this prompt:
I'll check back when it's done designing the architecture. Are there some other records I could feed it once it designs a process? Otherwise I'll try some random ones.
Trial run against:
The Shoes - "Present Tense"
The Myddle Class - "Don't Let Me Sleep Too Long"
The Standells - "Sometimes Good Guys Don't Wear White"
Pavlov's Dog - "Pampered Menial"
Results:

RELEASE 1: The Shoes - "Present Tense"
DATE: September 18, 1979 (Tuesday)
METHOD: Rule 2 — Full date from RateYourMusic (reputable source)
CONFIDENCE: Medium
KEY EVIDENCE:
SOURCES THAT COULD NOT BE ACCESSED: US Copyright Office (publicrecords.copyright.gov — JavaScript SPA, inaccessible), RYM (403, but search snippets confirmed date), Billboard PDFs (too large to fetch remotely)
NOTES: The Copyright Office would give a definitive answer (Rule 4, post-1978) but was inaccessible. RYM is the best available reputable source. Sept 18 (Tuesday) is a plausible pre-1982 release day. The competing Sept 15 date is from streaming metadata supplied decades later.
RELEASE 2: The Myddle Class - "Don't Let Me Sleep Too Long"
DATE: Monday, July 25, 1966
METHOD: Rule 6b — Monday before earliest confirmed ARSA chart date (WTRY, July 29, 1966)
CONFIDENCE: Medium-Low
KEY EVIDENCE:
July 22 but that survey is missing from ARSA)
VALIDATION FLAG: The computed date of July 25 does NOT fall within the June 1966 month listed by 45Cat and Wikipedia. This likely means either: (a) the single was released in June but took several weeks to build airplay in Albany before charting in late July, or (b) earlier ARSA entries exist on stations not checked (the agent could only search 3 stations without full ARSA login access). The actual release date may be earlier, in June 1966, as the reputable sources indicate.
SOURCES THAT COULD NOT BE ACCESSED: ARSA full database search (requires login), RYM (403), Discogs (403), Billboard archives for this specific single
RELEASE 3: The Standells - "Sometimes Good Guys Don't Wear White"
DATE: Monday, July 4, 1966
METHOD: Rule 6d — Earlier of the ARSA-derived and Billboard-derived dates (ARSA wins)
CONFIDENCE: Medium
KEY EVIDENCE:
NOTES: July 4, 1966 was Independence Day, which is unusual for a release date. However, the methodology computes this mechanically. The ARSA entries were identified via Google indexing of ARSA URLs (the site itself was down), so they couldn't be directly verified. If July 11 is correct, the computed date stands. The Standells were an LA-based band, so KBLA picking it up first is geographically logical.
SOURCES THAT COULD NOT BE ACCESSED: ARSA (site down, data found via Google index), 45Cat (403), RYM (403)
RELEASE 4: Pavlov's Dog - "Pampered Menial"
DATE: February 5, 1975 (Wednesday)
METHOD: Rule 5 — US Copyright publication date (Catalog of Copyright Entries, 1972-1978)
CONFIDENCE: High
KEY EVIDENCE:
NOTES: The copyright publication date of February 5, 1975 is the most authoritative source available. The blog claim of April 4 likely confuses the ABC release with the later Columbia reissue, or is simply incorrect. The Wikipedia/Grokipedia "February 1975" corroborates the copyright date. The registration confirms the original label was ABC Records (ABCD-866), not Columbia.
Just general anti-bot stuff, probably, though the desperation for more AI training data likely explains why bots got so ill-behaved a few years back. Our CI server has to hide even open-source logs behind Cloudflare settings harsh enough to block cURL, or else spider traffic can bring it to its knees. "Figure out how to get Codex to emulate a full browser" is on my TODO list somewhere...
chrome --remote-debugging-port
All frontier models know how to competently drive a browser using CDP.
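A minimal sketch of what that looks like, assuming Chrome was launched with `--remote-debugging-port`: CDP is just JSON frames sent over the WebSocket Chrome exposes. The transport (e.g. a WebSocket client library) is omitted here; this only shows the command framing.

```python
import json

def cdp_command(cmd_id: int, method: str, params=None) -> str:
    # Serialize one Chrome DevTools Protocol command frame. An agent drives
    # the browser by streaming frames like this over Chrome's debug
    # WebSocket and reading back the JSON replies keyed by the same id.
    frame = {"id": cmd_id, "method": method}
    if params:
        frame["params"] = params
    return json.dumps(frame)
```

For example, `cdp_command(1, "Page.navigate", {"url": "https://example.com"})` produces the frame that tells the attached tab to load a page; responses arrive tagged with `"id": 1`.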
A coding problem? Not strictly, no. I focused on coding because my collaborator SF (who is doing most of the work) is a programmer.
As you can see from discussion with Phailyoor and faul_sname, I'm open to other well-defined tasks.
I started as soon as I read this. I'm running it on 5.2 Thinking and another instance using Agent mode (the model has access to a computer of its own with a browser). It's taking a while, so I'll ping you when I'm done. I tried to be faithful to your original framing, so I didn't mention that o3 tried and failed at the task, or your critiques shared later.
If this doesn't work, then sure, I can ask SF to consider using his Claude setup to try. Shouldn't be too onerous.
I have no idea, in advance, if this will work. I doubt SF does either. But it's also something we can try.
I share your concerns with the issues arising from Photoshop being closed-source. But I'll share it too, assuming SF hasn't seen this yet. It sounds like something worth trying from my perspective, but I will stress that I am not a professional programmer so I'll be deferring to his judgment.
Did you forget the link?
Lol, by the time I finished writing that I forgot that I was supposed to link something. Fixed.
Inspired by @self_made_human's suggestion, I want to offer a verifiable challenge: create a novel. It's not strictly coding, but if you're willing to accept the challenge I think it will be interesting.
The challenge:
Write a 30,000-50,000 word novella with coherent characters, as well as a twist/reveal sometime after the midpoint. I'm purposely leaving the topic open, but I'm happy to make the challenge more specific if that helps. It could be a thriller in the vein of Michael Crichton or something more ambiguous, like John Steinbeck. Verification will be done with LLM judges. Any agentic system or technique is allowed, except direct access to the judging criteria, or plagiarism.
Requirements:
Verification:
The verification prompts will be run using a frontier LLM with a context window long enough to hold the entire novel. The outputs of the verification prompts may be consumed by humans, but if the pass/fail outcome is ambiguous, the verification prompts themselves should be tweaked to ask for a clearer response and run again. The verification prompts should be run via the API, not the web UI, using the default recommended temperature and other sampling parameters, and run 5 (or more) times each to ensure an accurate result.
In order to prevent an AI agent from "gaming" the challenge, the agent must not be given access to run LLM judges directly on the success criteria. It may also not access the success criteria directly, but may be given it implicitly if phrased as general requests for good writing.
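The run-it-five-times rule above reduces to a small aggregation step; here's one way it could be implemented. The judge call itself (any frontier API) is left abstract; this only tallies the verdicts and flags the "tweak and rerun" case.

```python
from collections import Counter

def aggregate_verdicts(verdicts):
    # Strict-majority vote over repeated judge runs. Anything short of a
    # majority is "ambiguous", signalling that the verification prompt
    # should be reworded for a clearer response and run again.
    top, count = Counter(verdicts).most_common(1)[0]
    return top if count > len(verdicts) / 2 else "ambiguous"
```

So five runs yielding four "pass" and one "fail" count as a pass, while a 2-2 split (or any non-majority outcome) sends the prompt back for rewording.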
Astral Codex Ten has just posted a link to a contest offering $10,000 for "the best AI-generated short story".
The judges include bigwigs Gwern and Alexander Wales.
Uh... I dunno that you need a cutting-edge model for that. I used a similar approach for this (cw: bad Jupiter Ascending fan-script). It's not good -- I'd say not even good as fanfiction -- and it's not even what I'd want written for the setting, and it's admittedly only around 13k words. But while it took three layers of "let's take these characters and flesh them out", "let's add this setting and flesh it out into a story outline", and then finally prompting the actual story, it did do it with minimal human intervention, none of which actually drew up the story plot. Putting even trivial effort into feedback, guidance, and pacing during the final prompting sequence would probably have helped a ton.
My problems are more that the character voices are really samey, the setting doesn't get enough interesting exploration, the twist doesn't get enough emphasis (and frankly isn't that interesting even in outline form: "why would anyone be willing to risk eternity for an unproven chance? Well, we happen to have a big pile of people that risked their lives and were trying to kill for a tiny improvement. Having eternal life only available to the elite kinda makes that a day-to-day thing."), and it keeps throwing extra characters in with too much detail rather than using the ones I was trying to emphasize. It's not necessarily incoherent, just bad.
((The LLMs do eventually notice that it's a Jupiter Ascending-with-names-filed-off-story if you try your review. Not sure whether that hurts or helps it as analysis, but given that the character tones sound nothing like their film counterparts I don't think it pollutes too much. And while my original fic efforts have been on content that you... probably will find even less appealing to read, original fic does work.))
I've got a busy week, but I might see what I can get out of a local LLM aiming for the longer form 30k words target, just to do a compare and contrast.
I'm not an LLM defender here, but I think most Tarantino movies fail this rubric.
I don't know why, but that's fucking me up more than it probably should.
I intentionally made these criteria harder than just "write any novel that's entertaining."
But I think it's actually not as bad as you say. Let's take The Hateful Eight, which is the most recent Tarantino movie I watched. Unfortunately we can't run an LLM judge on any popular movie, because the LLM already knows what's going to happen, but here are my personal thoughts:
(spoilers)
I don't think you're quite clear in the post as to what camp you're actually in. Are you a straight bull? As in, do you think it can currently replace a sizeable portion of human knowledge work?
Moreover, it is not clear how knowledge work that is not coding qua coding fits into your schema. For example, I have in mind a flight dynamics simulation/control task. I'm not settled on it yet. My plan was to include a little twist that I had thought would likely not be in the published literature, but which I'm sure I could manage without too much difficulty, just pulling one book off of my shelf, confirming where exactly I need to make the modification and how (it's been a long time, but it's something I'm confident I could do without extreme effort), and then coding it. Unfortunately, I looked, and some darned student already published it (only minimal code published AFAICT, but they wrote out all the analysis in detail, so I can't really purely test its ability to do this aspect of the knowledge work on its own), so I'm trying to think of another good variant.
There are other little twists I had in mind, hoping to prevent it from being able to purely just pull code directly from others. These twists are things I've personally coded in the past, so I know they're doable. But the point is that they require sufficient knowledge to make choices along the way (for one example, choose this algorithm for this part, because I know it has certain characteristics) and I think they prevent it from being able to just use someone else's work for the core simulation components.
I guess, where does this fit within your schema, and where are you with respect to your own opinions? There is a lot of room between, "I personally know how to architect this code, what algorithms/assumptions to use, how to modify the analysis for the instant case, and then I use Claude to help with building the components", "I do the analysis, give it to it, tell it to code up the whole thing, then I go in and tell it to change things to make better choices that fit my knowledge-work-educated beliefs on how it should be done," and, "I tell it to code up the whole thing, maybe tell it that something's broken, but part of the test is whether it made the right analysis and knowledge-work-educated choices on its own along the way."
In other words, what I'm interested in is not so much about what it can do in terms of coding qua coding. It could be utterly magical at that, and that would be great. But how much of my own knowledge work do I need to input to get it to code the "right" thing, versus how much it's able to make the correct choices on its own about what the "right" thing is.
I've shared my thoughts on LLMs consistently here, for years. It wasn't central to this particular demo.
But if you want to know:
I think this would probably make me an LLM bull, even if I'm not maximally bullish. Definitely "displacement imminent".
I would call you a moderate under my schema, and probably an "instrumental optimist".
Either way, I don't think you're our target audience for this demo, since you personally and professionally use SOTA LLMs with regularity and are familiar with their pitfalls.
As an outsider, I am unsure of how impressive this is. I know that "most of the raw cognitive labor programmers do while writing code" is fairly rote, but I don't know how true that is for doctors.
How much of the raw cognitive labor doctors do could be done by a bright undergrad with access to uptodate and a bunch of case histories, both with semantic search?
70% of medicine is minimizing unknown unknowns by knowing as much as you can, and knowing the boundaries of what is unknown to you. I believe a more concise way of expressing that is "knowledge". Regretfully, the books are fat and intimidating for good reason; there are a lot of things to know.
The remaining 30% is reasoning from knowledge, clinical experience (yet another form of knowledge, just the stuff the textbooks don't tell you), and pattern recognition.* This part is more dependent on your wits, or your fluid intelligence, if I'm being precise.
The best doctors both know a lot and are bright enough to apply that information well. The former is indispensable; you simply cannot figure out medicine by sitting in a cave and thinking very hard. I don't know if some superintelligence could look at a single human without the aid of tools, ponder very hard, and figure out everything worth knowing. All I can say is that it's beyond any actual human.
(IQ/g also correlates strongly with memory, so the relative importance of both is very hard to tease out, especially when there's a high-pass filter with most of the idiots and amnesiacs strained out by the end of med school.)

Let me put it this way: I was a bright kid, and felt like I knew a lot of medicine before entering med school, both through cultural osmosis and because I took an interest in it. You would not have wanted me as your actual doctor. I did not know nearly as much as I thought I did.
Later, I was a med student, a year or two in and confident that I knew the gist of it. I felt ready to make my own medical decisions, at least about myself. I thought I was smart and that I had done my due diligence (reading things online, including research papers). It was insufficient; I did potentially permanent damage to my own health (I'm not going to go into details). I would not want that version of me as my doctor either.
Now, I am a lot older and a little more knowledgeable, if not necessarily wiser. You could do worse as your doctor, at least if we're sticking to psychiatry. You could probably do better too, but I have a place on the free market. I'm cheap, I give away my advice for free on the internet to anyone who asks nicely, and many who don't.
Along the way, I almost killed people through ignorance. Thankfully, nobody died, my colleagues caught it, or the pharmacist did, or I had a sudden sinking feeling in my gut and ran back to double check. Medicine recognizes that any human is fallible, and there are plenty of safeguards in place. Every junior doctor has their story of close calls, and hopefully nothing more than close calls. All senior doctors start as junior doctors, I hope.
Consider something else: most doctors will seek out a different doctor when they suffer a condition that isn't covered by their own specialty. Sometimes even then.
If a cardiologist feels funny in the head, he'll seek a neurologist. If a neurologist feels heart palpitations, he'll go talk to a cardiologist.
Why is that? Could they both not just open the relevant textbooks and figure out what the issue is? Can a cardiologist not take his med school knowledge of neurology and then skim something Elsevier put out?
These are people with complete medical training, genuine intelligence, and full access to literature, and they still defer to each other. That's not false modesty or liability management, it's that they've learned, through experience, exactly where their pattern recognition breaks down. They know the limits of their own competence.
Maybe. It might work out fine 90% of the time. But most doctors can handle ~90% of conditions, because most conditions are common and usually simple to manage. I apologize for the tautology, I can't see my way around it.
The other 10% are where the specialists come in. You cannot take a psychiatrist (even a smart one) and give him access to UpToDate and expect him to be as good a cardiologist as an actual trained cardiologist. He might do okay, but he's going to kill people along the way.
And that is a fully qualified doctor dabbling in another branch of medicine. A "bright undergrad with access to uptodate and a bunch of case histories, both with semantic search" will crash and burn. I'd bet good money on it, it'll happen sooner rather than later.
If they set up shop and started seeing patients, bumbling their way through things and furiously looking things up as soon as they could, they might successfully treat the colds, stomach upsets, sore throats and so on. That's the bulk of undifferentiated medicine, as you'd expect. They might catch some of the rarer stuff. They will also be very poorly calibrated and commit significant iatrogenic harm. But rest assured they will kill people eventually (at a rate massively higher than a doctor normally does).
That's not even getting into time pressure, or physical findings and techniques that are impossible to adequately convey over just video and text.
LLMs? They narrow the gap significantly, but do not have thumbs. The bright undergrad would benefit immensely from ChatGPT, but rest assured that most of the performance would come from ChatGPT itself, and they would add little. Handcuffing a child to a man does not make their combination superior.
The combination of factors that make a good human clinician are rare. And when you do find them, you're investing a great deal in training to get them up to scratch. Most of this is the bottleneck of information transfer/learning, which LLMs neatly sidestep. GPT-4 did well, and it was dumb as bricks compared to current models. Turns out an encyclopedic knowledge of medicine will get you very far, even if you're not very bright. But it was also able to access and process this information faster than your thought experiment of a human with a computer.
But if you want a final answer: 60-70%. Best estimate I have.
*Sufficiently advanced pattern recognition is indistinguishable from intelligence. It might well be intelligence. You know LLMs, you know this.
Fair enough. Thanks for clarifying.
Do you have any thoughts that you'd be willing to share on what I wrote concerning the amount of knowledge work currently required to be input to do things like the task I was thinking about? I suppose I wasn't entirely clear, but I think it would likely fail to do the analysis task on its own. For clarity, this is a task that I thought, "It might be weird enough that no one's done it yet, but it's close enough to the standard stuff that I could almost certainly give it to a student who did well enough in their flight mechanics course, and they could almost certainly just do it." That seems to have been partly justified in that I found a publication in which a student did just do it (and skimming the paper, the analysis seems about on par with what I had expected; I guess my flaw was thinking the idea was sufficiently 'weird'; I guess it says something about the state of aerospace that someone out there has done almost every basic variant, sort of regardless of whether it makes sense to do). I'm probably <50% on whether it would make the "right" engineering implementation choices on its own. I don't have a precise number. I think it might get lucky, because there's a pretty large set of choices available, and I hadn't yet tailored the problem so that it requires it to really think conceptually about what's going on and only pick from a small subset; there's a good enough chance that it could guess somewhat randomly or pick a popular one that happens to work (though I'm not sure if it'll put the right context around it even if it does).
Perhaps, given your comment below, this is just something that you mostly don't care about. Does this sort of thing just bucket into, "No, it can't do this sort of knowledge work now, but with sufficient recursive self-improvement, it will be able to do it later"? (I guess, in line with your stated AGI timelines?)
I am really the wrong person to ask. I don't regularly use LLMs for programming; when I do, it's usually for didactic purposes or small bespoke utilities.
The most ambitious project I tried was a mod for Rimworld, which didn't work. To be fair to the models, I was asking for something very niche, and I was using the chat interface rather than an IDE. I ended up borrowing open-source code and editing it, and just using AI image generation for art assets (which worked very well, to the point it pissed off the more puritan modders in the Discord). I can mention that the issues I ran into were the models being unfamiliar with the code for the mod I intended to support (Combat Extended, a massive overhaul of core systems), and that what knowledge they had innately was outdated. I was too unfamiliar with Rimworld modding to be confident that editing their efforts was worth my time. Other people have succeeded in writing bigger mods that work well (as far as I can tell) using AI, so there's definitely an element of skill issue on my part.
SF might have actually useful observations, but he's a lurker to the core, and I'm the forward-facing entity for the moment. He says he's generally busy with work right now, so I wouldn't wait on him to respond, though I'd be happy if he did.
If you insist:
I don't know if it can do this kind of knowledge work yet, but I do expect that it will be able to in short order. I make no firm commitments on whether that will be the direct consequence of RSI (since labs are opaque about methodology) or a simple consequence of further scaling and increasingly intensive RLVR.
(Why not both?)
Either way, I think it's more likely than not that the kind of problem you describe will be trivial within a year or two. My impression is that the models can just about do what you want them to do, but with significant frustration and wasted time on your part. That is already a very strong starting point; can you imagine asking GPT-4 to even attempt any of this and getting working results?
Thanks again for the kind and thorough response.
I would quibble with this. What I want them to do is to be able to help me with analysis that I don't already know how to do. I wrote it this way a couple days ago:
The reason why I was thinking about the particular flight mechanics problem for this thread here was that I wanted to further drive in that wedge that I think is between the folks who think that most knowledge work is already automatable and those who think that it can be useful if you already know what you're doing. Thus, even a problem where I'm quite confident that I could do the analysis, I predicted that the LLM would fail on its own without significant knowledge-work-educated input. To me, this means that there are two significant steps that the models must overcome before we're thinking about a possible world where basically all knowledge work is automatable.
Maybe as an aside, I'm able to leverage collaborators at multiple levels, from profs to post-docs to PhD students to MS students to undergrads. My experience has been that coming up with the right problem to solve is actually a huge part of the battle. During that process, I'm always considering if I can spin out sub-problems or related problems that may be useful to consider on the way to what we really want (or sufficient contributions in their own right). When considering them, I mentally bin them into a hierarchy. If it's a problem that I'm near 100% sure I could just sit down and do, perhaps I've already done all of the pieces, but never done quite that variant before, and now it seems like that variant might be of interest, it's a plausible candidate to go to an undergrad. On the other end, the vaguest, most conceptually-dense questions, I may reserve for conversations just with profs. There is sometimes something to be said for not "distracting the students" by letting them spin their wheels on something that they're not likely to really contribute on anyway. I have somewhat of a sliding scale for the in-between students/post-docs; I've put words to the basic contours of that scale before, but I don't think I'll bother here, because it's not the most important. There is a possible slight correction factor available if I've been working with a student for long enough to know that they're substantially better/worse than the average student in their category.
In any event, perhaps if I had listed out all of the steps of this scale, I'd have even more than two significant steps that models must overcome, but for my purposes in this thread, I was trying to pick a problem that was pretty directly in the realm of, "I could just give this to an undergrad."
Could I bang on an LLM long enough (the amount of will required depending on the particular problem) that it eventually finds its way to the answer I already knew all along? Yeah, probably. Is this a huge upgrade from GPT-4? Honestly, I don't know; back in those days I gave up rather than ever really trying to beat it into submission.
...but this still is just not really useful, at least not if the goal is to actually automate the knowledge work piece. Sure, it's potentially useful once I've already done all the knowledge work, and I'm sitting down to actually code the thing that I definitely know how to code. But more likely, at this point, it's going to be useful to the student who I've asked to code the thing, because I'm probably not coding it myself, anyway.
I don't really have a good timeline or prediction for if/when some sort of AI system will cross these various thresholds. I'm still hopeful on the straight math side, as I said in my comment a couple days ago. But if the purpose of this exercise here is to find problems that cause someone to update, I was hoping that, "Here's a problem that I'm comfortable that I could give to an undergrad and pretty confident the LLM will fail," could pull you at least epsilon away from thinking that quite so much of knowledge work is currently automatable or perhaps epsilon more cautious about believing that it's quite so imminent.
Decompile and return human-readable code for some old/obsolete processor architecture (Hitachi SH-4, PowerPC, etc.).
Do you have any specific binary targeting such an arch you'd like decompiled? I expect this plays to the strengths of today's models.
Here's a request. It's not identical to some of the stuff I do at work, but it's close enough that I'd like to see how it goes on 4.6 vs 4.5 and a cheaper plan.
https://github.com/petrandreev/jBPM3
I'd like to see this modernized. You can stick to only the core project for simplicity. That includes, but is not limited to:
If you can get all the tests to pass and post the jar + dependencies somewhere, I can run a local test of the output.
I would also suggest that @self_made_human provide some account of how long it took (not counting CPU time of course, just how long he had to spend on it) as well as how many iterations it took to get right. You presumably have some kind of idea how long this task would take you, and then you can compare. Because in the end it isn't just "can it do the thing" which is important (though that is indeed important), it's also "is it less effort/time for me to have it do the thing".
That's already in my post. I would have liked people to give an estimate of how long they're willing to wait for the AI to try solving the problem, but nobody has bothered, so it's clear to me that they care more about the fact that it can be done at all than how long it takes. On our end, we're not going to keep trying indefinitely, we've got bigger fish to fry.
I presume, when we share logs, it'll include time stamps and reasoning times as well as tokens used. Shouldn't be too hard, I recall that all of that is there by default in Claude Code.
Your intuition is broadly correct here. If the tool can do this at all, it's almost certainly going to be faster than a human doing the same work. If it can't do it, it doesn't really matter.
Sounds reasonable to me (SMH). We'll get back to you on that.
This is a pretty bad challenge because AI is really good at getting all the tests to pass. The main issue is preventing it from faking you out somewhere along the way.
This is why I would run it against a proprietary suite that actually does a bunch of real, fully integrated runtime tests.
This debate is like blind men describing an elephant: one touches the tail, another the trunk, a third the belly.
Programming is everything from wordpress to high frequency trading. It is everything from tiny teams to products with tens of thousands of people working on them. Some teams are extremely particular in how they write code, others will accept pretty much anything. People won't agree because they fundamentally have completely different visions of what programming is. At some places developers are given highly detailed tickets, at other places they are given a loose description on what to work on for the next couple of months. It is easy to underestimate how many devs have jobs which barely entail more than "make the button blue", "make a postman test for a simple API".
We will probably see a major washout of people who took a React or Python tutorial and expected an upper-middle-class lifestyle.
With that said, developer speed has generally not been limited by the speed of coding. A product I worked on averaged two lines of code per day per developer. The company was inefficient, but so much time was spent on other things. The average dev probably spends only about half their time coding, so a 50% speed boost to coding will only increase work output by roughly 25%. That is reasonable, and it is possible that the labour market can swallow 25% more software.
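The estimate here is essentially Amdahl's law applied to a developer's day. A minimal sketch (function name and figures are illustrative; under the strict Amdahl assumption the gain lands closer to 20% than 25%, but it's the same ballpark):

```python
def overall_speedup(coding_fraction: float, coding_speedup: float) -> float:
    """Amdahl's law: speedup of the whole job when only the coding
    portion of the work gets faster, and the rest is unaffected."""
    new_time = (1 - coding_fraction) + coding_fraction / coding_speedup
    return 1 / new_time

# Half the working time is coding, and coding gets 50% faster:
print(overall_speedup(0.5, 1.5))  # 1.2, i.e. a ~20% output gain
```

The point survives either way: when coding is only half the job, even a large coding speedup translates into a much smaller gain in total output.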
I'm of the opinion that the market can swallow 100s of percent more software. Not instantly mind you but there is such a massive lack of software everywhere (except possibly in ad-tech), not to mention half decent or actually good software. Things are so unbelievably shit everywhere you turn.
Falling price of software greatly induces demand.
It’s far beyond the scope of this, I’d assume, but I want to code a standalone version of the Warcraft 3 custom game Island Defense. It would be a great thing.
May I request that "check out this LLM" (without any human-on-human culture war valence) posts be moved to their own weekly thread? Call it "Singularity Saturday" or "Butlerian Jihad Roundup" or whatever you prefer; it's clearly a big enough topic in these spaces to warrant it.
+1
I'm assuming things like the Pentagon threatening to invoke the Defense Production Act on Anthropic would still go in the culture war thread?
Agreed.
Thirded for what it's worth.
Please, yes.
Seconded. I made the same point in my post about Claude Opus 4.6:
One thing that consistently eludes all of these forecasts is the financials. If LLMs/AI are poised to completely disrupt knowledge work, why do we not see it in stocks? We are talking insane money: knowledge work employs around 1 billion people globally, with total compensation of at least $50 trillion. Should we not see some huge impact if this technology is so near? For instance, MS stock is the same as in 2023; it does not look as if OpenAI is poised to be key to replacing tens of trillions of dollars of value a year.
I would even accept the reverse signal, e.g. AI will be so cheap that all this $50 trillion of work will be done by $1 trillion of AI agents plus some electricity, so no surge in AI stocks. Okay, so where is the pandemonium and the stock apocalypse for the companies that will surely be worthless in the face of AI, where investors should shift into something less AI-prone such as construction or whatnot?
IBM's stock is absolutely shitting the bed at the moment, which roughly corresponds with Anthropic claiming they can handle COBOL now.
IBM stock dropped about 10% and is now back up about 3%. Sure, there was a dip, but hardly an existential one.
There was a big stock market rumble only a few weeks ago after Claude Cowork and a few plugins were released. Forbes article:
https://www.forbes.com/sites/michaelashley/2026/02/18/saaspocalypse-now-claudes-11-plugins-triggered-a-285b-wipeout/
Not perhaps the trillions that would clearly prove the point, but it's something.
This case also kind of demonstrates why the stock market isn't super predictive for AI: there's just not enough knowledge. The plugins Anthropic released are extremely simple add-ons. They don't represent new capabilities at all; it would be like seeing major stock market turmoil because a company updated their documentation.
We're probably going to see very spiky updates from the stock market as business normies suddenly catch up to SOTA every 6 months or so.
Another point to consider is how much value we would actually expect to see wiped out/added. I just saw this tweet this morning, responding to an economic bear case:
https://x.com/elidourado/status/2026060408055021752
There's a much larger point in the tweet, but the relevant point for us:
Thus far all the expectations around replacement have been B2B SaaS, but if it's only 0.5% (and even bulls probably don't think the entire industry is being replaced), what amount would we expect?
See, that's where the disagreement lies. Subtly (or maybe not so subtly) the discussion has changed from "AI will achieve AGI and then ASI and then run the world to give us fully automated luxury gay space communism" to "AI is for coding, it's all about the coding, AI will replace software engineers, coding is the be-all and end-all, ignore that it still fakes answers to questions where people know enough to know it's lying/hallucinating".
I don't care about coding because coding has nothing to do with my job. Can it replace accountants, lawyers, clerical staff? Without inventing fake precedents or fake citations from dead authors?
"Oh, but look at the shiny coding!"
Yes, great, wonderful. Now we have better models that can create vast oceans of slop to grab those SEO high rankings to sell more advertising. Yippee!
If AI sticks only to coding and produces genuinely useful things, wonderful, we'll all be happy. Is it going to do that, or just "now we can fire two-thirds of our workforce and get it to produce more clickbait headlines and ads"?
I'm of the impression that the emphasis on coding is so that each new generation can take a larger share of designing and implementing the next, until eventually only the AI is writing the AI. That's how we reach AGI as quickly as possible, if such a thing is possible via LLMs.
Don't waste my time with a strawman, please.
I expect AGI and ASI. Even before LLMs, when it was Yudkowsky and Friends worrying about hypothetical future AI in a shed in the ancient times of the early 2000s, the concern was recursive self-improvement. What does that mean? A smart-enough AI writing the code for a smarter version of itself, which writes the code for an even smarter version of itself, and so on till humans are left in the dust.
Notice the common thread? Coding, writing code. Even leaving aside that there's enormous consumer and business demand for LLM-written code, their coding capabilities have been central to this whole debate since day -1.
The big labs are betting their future on being the first to get to this point, and already claim significant boosts to the productivity of their human researchers via the models writing code for training new models, or even conducting experiments.
Why don't you buy a $20 plan and test it? I can tell you that, as a doctor who isn't expected to write code ever, it could do most of my work for me, and well. The only reasons I haven't automated myself into an early retirement are the obvious physical bottlenecks and NHS IT.
Demand for healthcare is comparatively inelastic, but it is not unbounded. If going to the doctor was cheap, you wouldn't spend all your time going to the doctor.
The specific outcome depends heavily on a variety of factors, including the degree of boosted productivity and whether having a fully trained medical professional in the room is necessary at all. If AI could do 90% of a doctor's work and save 90% of their time, but the demand for medical care only doubled, then I can see it easily being the case that hospitals would slash headcounts and pocket the change.
If the AI was >=100% as good as a human doctor (or got away with using less skilled alternatives like nurses, NPs etc for the physical stuff), then that might lead to mass unemployment or paycuts. 90% of doctors ending up unemployed, from my perspective, is almost as bad as all of us getting the sack.
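The headcount scenario above reduces to simple arithmetic: required staff scales with demand and inversely with per-doctor throughput. A minimal sketch (function name and figures are illustrative, not a labour-market model):

```python
def headcount_multiplier(productivity_multiplier: float,
                         demand_multiplier: float) -> float:
    """Doctors needed relative to today, if each doctor's throughput
    rises by productivity_multiplier while demand rises by
    demand_multiplier."""
    return demand_multiplier / productivity_multiplier

# AI saves 90% of a doctor's time (10x throughput) while demand doubles:
print(headcount_multiplier(10, 2))  # 0.2 -> ~80% of positions redundant
```

Even generous demand growth doesn't rescue headcount once the productivity multiplier gets large, which is the crux of the scenario.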
I wouldn't. As you probably know, there are people who do.
Yeah, but they're usually suffering from psychiatric illness, and the usual treatment is to tell them to go to the doctor less. Indulging them and constantly ordering investigations and treatment is pretty much malpractice.
Either way, there aren't enough of them to keep doctors employed full-time.
This is totally not the point. The point of this challenge is for us to post easy or at least pretty doable stuff for humans, and watch AI crash and burn. Then we point and laugh at the AI.
Why not? Requirements change all the time during product development. I propose modifying the given problem into a two-stage architecture, the second stage to be added upon completion of the first and requiring (for a satisfactory grade) refactoring and building on top of some of the previously written code.
How many interventions are warranted, and how many points deducted? Why wasn't Claude smart enough to notice that the webapp doesn't work?
I said "strongly inadvisable" and not "automatically disqualifying".
SF would need to babysit the process, waiting for the requester to raise their follow-up request, instead of hitting go and checking in periodically or after being alerted. He may or may not be able to do this; he does have a full-time job.
It also injects some degree of ambiguity into things, as well as significantly increasing the time and token investment. Max plans are not infinite.
I stress that this isn't necessarily a deal breaker, it just makes things harder and reduces the likelihood of acceptance. You're at liberty to try asking, and we're at liberty to turn it down, especially should you ask for something outside the original spec (as mutually agreed on in advance).
As a Bay Area software engineer with a lot of free time on my hands since the pandemic, let me tell you that I've been one of the biggest boosters, if not the biggest, of the promise of LLMs and deep neural networks in my friend group for the past 2-3 years. For hours each day I've been reading papers, playing with all the models from all the labs, building software with coding agents, doing diffusion image generation, fine-tuning models for shits and giggles, etc.
I'm a very heavy user of claude code max, and it's been as helpful as it's been frustrating at times. Rest assured that there are many more interested people on theMotte and that we can figure out a way to get you the tokens you need, if you design an interesting experiment.
I totally get how Claude Code/Opus 4.6 could look magical to an inexperienced software engineer. But as helpful as it's been, it's also been frustrating. Yes, even the apex of coding models/agent systems, Claude Code Max, will still make elementary errors that a junior engineer would not. If I had to summarize its shortcomings in one pithy sentence: LLM coding agents have high time preference. They lack foresight and they're lazy.
They pat themselves on the back for closing issues, not realizing the mess they're driving full speed towards. In my experience, without a very heavy guiding hand they will happily duplicate code, rely on shortcuts, lazily do the very bare minimum, or re-invent the wheel at times, especially on larger or more out-of-distribution codebases.
I desperately want to throw a dozen agents at a problem, but every time I look at the actual code I get frustrated: "Hey, I noticed this obvious code smell/antipattern in the code, please fix." "Sure thing boss, I fixed it." "Ok, but I meant fix all the other instances of this bad pattern that I just noticed." "Oh, right you are boss." Then 15 minutes later, "Hi boss, I implemented this other issue you asked for, it's ready to be merged." "Did you use the correct pattern as discussed and as we added to the readme/dev docs/claude.md?" "Oh, right you are boss, I'll fix it in a jiffy." Over and over again. Yes, this is with the latest Claude Code Max/Opus 4.6.
So, as mentioned above, I have free time on my hands, and would be happy to help design this experiment. I would like to be proven wrong, to learn that I've just been using these models wrong. But if you just want to show off how good of a centaur your friend+claude is on a cherry-picked problem of your choosing, I'm less interested.
Thank you for the offer! We might be able to take you up on it.
After a night to dwell on your suggestion, we might even be able to implement a version of your original proposal:
That way, he won't need to keep active tabs on it, he can just tell the model to do things as per his convenience, while not losing much in terms of demonstrative power.
I'm not sure if this is what you had originally proposed, or if you edited in before I replied, but no big deal. We'd need you to give us a more specific idea of the task at hand, if possible.
Do you get the same problem with it that I usually do? That is, the first attempt is really good, and a few additional prompts make it even better. But the more I work with it, the more it seems to get stuck in weird errors or unnecessarily complicated code. After, like, 10 prompts, if it's not working perfectly I just have to start from scratch. It's like pastry dough: a little kneading is necessary, but too much can ruin it.
That's been my experience - if it can't one shot it, I generally give up
So, just like working with outsourced teams in the third world, but cheaper.
Anyway, LLMs are not ready for agents yet. The biggest scope they handle acceptably is a single feature, and even then you need to iterate a couple of times.
Because (from what I've seen) LLMs were designed to be people-pleasers. Not to do the job right, but to make ego-stroking noises at the human user and flatter them and be obsequious. I've seen comments about Asian cultures that nobody tells you no directly, that would be losing face for both superior and subordinate, so if there's a problem or something can't be done, you don't find out about it until way too late because all along those under you have been saying "yes boss, fine boss, no problems boss". I think LLMs went that route as well.
Write a coherent thousand word post in your voice about a topic of your choosing sufficiently well to fool standard stylometry techniques, and pass the sniff test as sounding like you to others here, even given
By "you", do you mean me (self_made_human)? Oh boy, you'd be surprised. I've actively experimented with this, and it's one of my vibes benchmarks for any new model release.
Of course, my standards were that it should pass my own sniff test. That roughly amounts to "does that sound like me?" and "are those the kinds of arguments I'd make?". They do a decent job, and have for a while. Not perfect, but good enough to fool most of the people, most of the time.
I didn't/don't have access to "standard stylometry tools", though I will admit I didn't go looking very hard.
Also, I see several issues with this proposal:
Note that these aren't insurmountable challenges: if the final essay written by Claude falls within the same distribution as what I've been writing (with minor LLM involvement), then that's... fine? If nobody without access to the ground truth knows which essay is which, that's a victory as far as I'm concerned.
(We'd need to decide if I'm allowed to use LLMs myself, in the same manner I already do. Claude wouldn't get the benefit of me giving it feedback or editing its output in any capacity)
This would be an absolute pain to collate, both for me and for Claude. Then again, it might be easier for it, right until it ran into the fact that even the largest context windows would choke. I write a lot, and have for many years. Unfortunately, I'm not famous enough to be preferentially scraped; LLMs do not know who I am without looking it up. That's a luxury reserved for Gwern or Scott.
Would you be willing to pay for that or provide access? I wouldn't want to financially burden poor SF more than he already is; the Claude Max plan is a sunk cost, while these clearly represent additional investment. Genuine question, I'm sure that Claude could usefully interface with everything if provided access.
Just after I started writing this, the Metal Gear ! sound effect played in my head. Hang on a minute, I don't think I'd describe you as an LLM skeptic according to my (incomplete and rudimentary) taxonomy. I'd struggle to pin you as a moderate either, though I can't recall you expressing strong opinions that aren't more research/alignment oriented. If my hunch is true, then you aren't the target audience for this! Plus I'll eat my hat if you don't already have access to the best models out there. Do you really need us for the test, beyond whatever blinding or scraping of my writing is necessary?
Also, this is slightly off-topic. We're not evaluating the agents on their ability to write code, we're testing their ability to mimic me convincingly. Sure, coding capabilities might come into the picture, especially if they're in charge of other models, but it's not as central.
This doesn't strike me as insurmountable. I'd be open to trying once you get back to me on the issues I raised.
Ok, how about a simplified test. Write 500 words without AI on a topic of your choice, or pick any unpublished writing you have saved up. It's quite short so I don't think this is a major burden.
The challenge will be to have the AI create a 500 word passage on any topic, it doesn't have to be the same, where when placed side by side, it will not be obvious which passage is AI. Any means and methods including agents are permitted as long as all output tokens came from the AI model. Any verbatim copying of human written text outside of quotations is not permitted.
Verification will be done by comments on this forum, where anyone with an established account can vote for one being AI.
The result will be determined by a 1-sided Z-test with p=0.05. If voters on this forum overall can determine which one is AI with statistical significance, the AI has failed the vibe check.
Voters can use any means and mechanisms to detect AI.
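The verification rule above is a standard one-sided z-test for a proportion. A sketch of how the tally might be scored, assuming one vote per established account and a null hypothesis of 50-50 guessing (function name and vote counts are illustrative):

```python
from math import sqrt
from statistics import NormalDist

def ai_detected(correct_votes: int, total_votes: int,
                alpha: float = 0.05) -> bool:
    """One-sided z-test: did voters identify the AI passage at a rate
    significantly above chance (null proportion p0 = 0.5)?"""
    p_hat = correct_votes / total_votes
    z = (p_hat - 0.5) / sqrt(0.25 / total_votes)
    p_value = 1 - NormalDist().cdf(z)
    return p_value < alpha  # True -> the AI has failed the vibe check

# 30 of 40 voters pick the AI passage correctly:
print(ai_detected(30, 40))  # True: well above chance at p = 0.05
```

Note that with small voter counts the test is underpowered: 24 correct out of 40 (60%) would not reach significance, so the AI would pass the vibe check by default.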
Sounds interesting enough. I will note that using LLMs to write 500 words using my own work as a style reference and then just using that verbatim as a comment/post is not how I actually use them.
But as a general experiment? Sure, I'd be interested to see the results.
Does this preclude all human intervention after hitting go? Am I forbidden from telling the model that it has failed to capture my style or my opinions correctly, then either suggesting specific corrections or more broad advice?
You can guide and criticize the model as much as you want throughout the process, but none of your queries can be reproduced word for word in the final text.
Ok: the second sentence sounds bad, rewrite it.
Not ok: Try starting the second sentence with "However, this is not..." - This approach would result in words you wrote getting into the final output.
Hmm. I think that would be acceptable. Stand by for results, though it might take a while for us to hash it all out on our end.
Huh. It's been my sniff test for new models as well, and so far I have not seen much success. It should be easy! This is literally the most LLM-flavored task to ever task! And yet. I've sunk probably 50 hours into it.
My most recent attempt, which I sunk about 10 hours and $100 into, and which got a lot closer than any previous attempts, involved giving Claude a corpus of all my past writing and having it try multiple different ways of producing text on arbitrary topics in my voice. The things tried were
On the one hand, I was very impressed by how good Claude was at running a whole bunch of these experiments very quickly. On the other hand, it did not work for me, not even at the level of "passes the sniff test", much less at the level of "standard stylometry techniques say it sounds like me".
I think you'll find that this is one of the tasks that is now much much easier. It's actually been within the capabilities of frontier models since Sonnet 4.0 (which is when I went ahead and gathered said corpus, on the theory that it'd be pretty useful to have). The prompt you're looking for is something like "Here's a chrome instance running with --remote-debugging-port and logged in on most of the sites I post on with a tab open for each. Go generate a corpus of all my publicly available writing".
Yeah. An H100 for 24h would run in the ballpark of $40, well worth it for me to provide. Vast allows transferring credits from one account to another, so I'd happily just transfer $50 of credits over if someone actually wants to do this. Does seem like rather a lot of work though.
Yeah, that's entirely reasonable. Your voice is very different from Claude's voice.
Yeah, I'm hoping you can prove me wrong here. I've been trying to do this since back in late 2019 when nostalgebraist-autoresponder was shiny and new. I want a good simulacrum of myself! I want to have that simulacrum, and I want to loom it. I want to build an exobrain, and merge with it, and fork off a copy running in the cloud.
BTW I expect there's a substantial market for anyone who manages to build this in a repeatable way. I've looked, and there are as of now no commercial offerings for this (though there are a few commercial offerings that pretend to be this).
I only have access to the models you can obtain access to with money - I expect I'm 3-6 months behind the best of what insiders at Anthropic or OAI have access to.
An LLM skeptic is an LLM idealist who's been disappointed :)
I expect looking like you stylometrically while also exhibiting the same patterns of thought you exhibit on a specific topic will involve writing code. But code in the service of trying to mimic you convincingly, rather than in the service of producing some specific durable software artifact.
For the record, I do expect this to be within the capability window within the next 18 months, but I would be pretty surprised if you managed to get Opus 4.6 specifically to do it.
I think we're on the same page here, I'll talk to SF about this. I'm willing to put in the effort on my end, which, as I see it, is to write a 1000-word essay as I normally would. Not particularly onerous.

Let me give you an idea of how I normally approach this. I simply copy-paste pages of my profile after sorting by top, usually at least two or three pages (45k tokens). I might also share a few "normal" pages in chronological order, for the sake of diversity if nothing else.
I did just this, using Gemini 3.1 Pro on AI Studio (GPT 5.2 Thinking, which I pay for, can't write in arbitrary styles nearly as well no matter how hard I try, and I've tried a lot; I don't pay for Claude, so I'm stuck with Sonnet):
I copied and pasted the first two profile pages, sorting by top of all time. Instructions were:
https://rentry.co/23dc63vs by Gemini https://rentry.co/p5yh68zu by Claude 4.6 Sonnet (same setup)
Results? I'd grade Gemini a 7/10, Claude a 5/10.
Looking at Gemini:
Looking closer:
I don't live or work near Bromley. That's where an uncle of mine resides. It's clear from the context I shared that I'm up in Scotland.
I could see myself saying this. Maybe not those exact figures, perhaps 10%:90%, but directionally correct.
Very good. I would use that verbatim in a real essay.
I wouldn't say that at all dawg. Why would I randomly reference my user flair in an essay?
Claude's version is shit. It's staggeringly content-free, and while it's closer to "raw" me, it also uses em-dashes and takes many words to say few things. Maybe it's bad luck; I've had better results in the past, especially since I usually share a specific topic instead of letting it decide on its own.
Here is the whole prompt, profile dump included, if you want to try with a different model. I'll see about using Opus; I know 5.2 Thinking will shit the bed in a stylistic sense. Rentry won't let me paste the whole thing, but I think I've been clear enough to reproduce independently. I'll happily take a look.
Gemini's sample is impressive! Color me impressed, especially that a straight-up prompt produced that (though I suppose if any technique would get it with current models, it'd be "one shotting through a prompt" rather than "iterative refinement towards a target").
It doesn't sound quite the same as the version of you that lives in my head, but it's awfully close. E.g. I can't imagine you saying
since you don't tend to drop spurious technical details into your walls of text unless they serve a purpose (and also because I half suspect you're not a fan of the amyloid theory of Alzheimer's). More generally, the Gemini piece has a higher density of eyeball kicks than I model your writing as having. And I model your writing as having a lot of those, for a human.
It also seems to drift away from your voice in the second half. And it fails the stylometry vibe check - Pangram detects AI with medium confidence - but maybe in a way that's reparable. Actual stylometry agrees (Cohen's d of +17 on dashes, +2 on words >9 letters, +1.5 on mean word length in general, -2 on 3-4 letter words, -1.2 on punctuation in general - i.e. you use more and more varied punctuation and shorter words, by a notable margin, and Gemini uses way, way, way more dashes). Still, it's much, much better than I expected! (And yeah, the Claude one is not even worth discussing.)
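(For anyone who wants to replicate that stylometry check on their own corpus: Cohen's d is just the difference of sample means over the pooled standard deviation. A quick sketch; the dash-rate feature below is my guess at one plausible feature, not the actual script used above:)

```python
import statistics


def cohens_d(xs, ys):
    """Standardized mean difference between two samples of a feature."""
    nx, ny = len(xs), len(ys)
    pooled_var = (
        (nx - 1) * statistics.variance(xs) + (ny - 1) * statistics.variance(ys)
    ) / (nx + ny - 2)
    return (statistics.mean(xs) - statistics.mean(ys)) / pooled_var ** 0.5


def dash_rate(text):
    """Hyphens, en dashes, and em dashes per 1000 characters."""
    dashes = sum(text.count(c) for c in "-\u2013\u2014")
    return 1000 * dashes / max(len(text), 1)


# Usage: compare known-you documents against candidate imitations, e.g.
# d = cohens_d([dash_rate(t) for t in mine], [dash_rate(t) for t in fakes])
# |d| around 0.2 is a small effect, 0.8 large; the values quoted above are huge.
```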
Interestingly, your results look much, much better to me than the ones I get myself. I ran the same test as you did against Gemini, and got these not-very-good attempts: 1 2 3. Gemini took distinctive phrases (e.g. "85% agree") and ideas (e.g. "claude code as supply chain risk") I have used once in the corpus, fixated on them, and stitched them together into a skinsuit which superficially resembles my writing but doesn't hold up under scrutiny. Interestingly, that's a very base model flavored failure mode. I have grown unused to seeing base-model-flavored failure modes, and as such Gemini is much more interesting to me now.
ETA: also one entertaining failure I got when trying to do this in multi-turn: Gemini didn't realize it had ended its thinking block, and dumped its raw chain of thought, ending with "Go. Bye. Out. Okay. End. Wait. Okay. Done. Executing." over and over hundreds of times. chat log
My impression is that Gemini's output was unusually good and Claude's was unusually bad. But both 3.1 Pro and 4.6 Sonnet are new enough that my intuition, based on extensive interaction with previous models, might no longer be applicable. For what it's worth, both were n=1 samplings with zero cherrypicking.
Looks around shiftily why, I'd never throw in spurious technical details into an essay. Couldn't be me!
(I probably wouldn't use the specific Tau and amyloid phrasing, since you are correct that I have very mixed feelings about the amyloid hypothesis)
The examples seem to channel your "LessWrong" blogging voice. I am unable to critique the technical details or identify (what I expect are many) confabulations, but if I saw this posted there in your name I wouldn't bat an eye.
I haven't really futzed around with base models since GPT-3, though I might have tried one of the Llama 3s at some point. They're non-trivial to access, and have limited utility for me. Mainly because of the added difficulty of prompting base models, and the fact that the publicly accessible ones are nowhere near as intelligent as proprietary dedicated assistants. If you think I'm wrong about this, I'd be curious to hear about it.
In general, I get the strong impression that while the author of the corpus might be able to pinpoint specific issues in terms of style or stance, it's much harder for others to spot those tells.
The biggest pitfalls are the tendency to adopt em-dashes (models are more than capable of not doing that if you specifically prompt them not to), and other stock "AI" phrases like:
These can show up even if you're using models merely to edit/format a draft, rather than to write an essay from scratch.
I must also continue stressing the point that this isn't quite representative of my usual informal benchmark:
It's enough for me to spot a better way to say a specific thing I'm already saying. A single vivid metaphor or interesting analogy that is worth co-opting can make the practical purpose of the exercise worth it.
I expect it's too hard, but I (and a lot of other people, though probably not people who want to pay) would like a somewhat compliant browser engine for Mac OS 9. Obviously this is pretty much exactly a Netscape clone, but a POC would be interesting and would get the model well out of its training set, thus testing for 'thinking' vs 'parroting' quite well.
How about 'non-interactive render of arbitrary web page; will run on my MDD PowerPC, OS 9.2.2'?
Your Bull and moderate options seem to miss an important middle. We go straight from 'ASI imminent' to 'useful tool'. I want to see a middle position: AI will likely disrupt the economy and culture and society, regardless of whether AGI or ASI is coming.
Anyway:
This seems extremely, self-servingly narrow and contradictory. We want to show you how much an AI can do, in order to change your mind on its limits. But please, do only pick something that it can do. This isn't question-begging, but it's something like it.
Anyway, anyway.
How about an 8-bit side-scrolling video game with the relative complexity of Super Mario Brothers 3? If it can go write a full 'feature length' NES game, I'll be quite impressed. (But I'm playing more skeptical than I am)
Or more real world related:
A data replication tool that can move data from a SQL Server database to a Postgres database. It has to support both timestamp-based incremental replication and log-based Change Data Capture, selectable per table. You should be able to customize batch size, hard-delete handling, timeout, and behavior on failure. I want a GUI that lets me select tables and their ordering, schedule replication intervals, and select columns per table. Bonus points if it allows row-filtering conditions or other in-flight transformations.
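(For a sense of where the difficulty actually lives: the happy path of the timestamp-incremental mode is just a watermarked, batched pull from the source. A toy sketch of that one piece, with T-SQL on the source side; the function and its parameters are illustrative placeholders, and a real tool would use parameterized queries rather than string interpolation, plus all the CDC, type-mapping, and failure handling the spec asks for:)

```python
def incremental_batch_sql(table, columns, ts_column, watermark, batch_size):
    """Build the next batched pull for timestamp-based incremental replication.

    Rows modified after `watermark` are fetched in `ts_column` order, so the
    caller can advance the watermark to the last fetched row's timestamp.
    Note that this scheme cannot see hard deletes at all; detecting those is
    exactly what the log-based CDC mode is for.
    """
    cols = ", ".join(columns)
    return (
        f"SELECT TOP ({batch_size}) {cols} FROM {table} "
        f"WHERE {ts_column} > '{watermark}' "
        f"ORDER BY {ts_column}"
    )
```

Everything around this loop (upserting into Postgres, scheduling, the GUI, log-based CDC) is the part that would make the challenge impressive.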
If it does this latter one, I will believe that most of IT infrastructure employment is over in 18 months.
I do not see how you can interpret us in that manner.
If the problem is deemed too hard by everyone (the person proposing it clearly believes the model can't do it), then what exactly does failure demonstrate? Nobody ever expected it to succeed within the given constraints. You can't evaluate automobiles in terms of their ability to reach Alpha Centauri. You can't adjudicate a debate between a Ferrari fanatic and a Lambo lover based on which car is more effective at deep sea exploration.
It takes disagreement on model capabilities and (expected) outcomes for all of this to be surprising or useful.
As we've clearly stated later, if we agree to the challenge, then we expect that the model can do something (that our counterparty thinks it can't), so the failure of the model goes against us, and will force us to update.
I'll forward the proposals to @strappingfrequent, assuming he doesn't show up in the thread. They seem reasonable enough to me, but I am clearly not the real expert here, and I'll be deferring to his judgment. That might take a little while to organize, I'll edit this into the main post for the sake of clarity.
Ok I’ll try in good faith to explain a final time.
You are asking the would-be contestants to pick a challenge they think the AI is incapable of, but they have to guess within the bounds of what you think it is capable of. Yes, I get why you set it up this way, but it creates an extreme cherry-picking filter, which will naturally limit the amount of “updating” that is going to occur.
There are other ways this “experiment” could be designed to avoid the cherry picking.
Joey Sportsdoer claims to be a great athlete, better than people give him credit for. And one of the ways he’s constantly underestimated is in how “broadly” athletic he is. So he lines up the doubters and says, start naming athletic feats you think I can’t succeed at, and then I’ll choose one I think I can do and do it.
This is not the best way to go about convincing folks of his general athletic prowess.

Of course, neither is attempting feats he knows he can't accomplish, nor ones everyone agrees he can; but luckily these aren't the only three ways to design his demonstration.
Well, what are the specific ways you think the experiment can be improved, including the minimization of cherrypicking (without adding an unreasonable amount of additional effort on our parts)? Keep in mind we're two dudes in a shed, not Anthropic itself.
What I’m saying is that you are asking users to come up with examples that, by definition of their skepticism, they don't believe it can accomplish.
But regardless, either of my two examples would greatly impress me. For the former (the NES video game), I would not update based on the ability to write 80s console code within the limits of the NES's performance specs (I would be impressed, but not update).
Specifically, I want to see it plan and execute a full coherent game AND code it. It doesn't need to one-shot it, but it shouldn't take creative inputs beyond the general concept and considerations.
The second is about writing enterprise reliable IT infrastructure software that would make a lot of Software companies obsolete immediately.
Duh? What on earth could you expect us to do differently? If the skeptic already believes the model to be capable of the task, why ask for a test?
There is non-zero value in discovering a task that both we and the skeptic expect a model to achieve, and then witnessing it fail (which would be unexpected, at least), but that is clearly not the primary purpose here. Someone else is welcome to try, after they're no longer swamped with a quadrillion entries. The set of tasks that the skeptics and I both expect models to accomplish is much larger than the set where we disagree.
Hence why I think your claim:
Is clearly nonsensical.
I doubt an AI agent's ability to generate a feature-length anything that's coherent. Ask an AI to write a novel and it'll fizzle out around 10,000 words in. I'm convinced that the AI-assisted smut romance novels that have been popular recently are mostly driven by a human gooning while proompting the AI for the next chapter. I doubt that it can be done fully autonomously - not counting, of course, those outright fake books that are just words on a page.
I would expect that one of the biggest limitations on long run narrative coherence is time horizon. The doubling time for time horizon is anywhere from 2-7 months.
A typical novel is about 80,000 words, so three doublings in length (6-21 months). To be conservative, I'll assume novel complexity/task time scales with the square of word count, on the basis that each additional word has to mesh with all previous words. This would give 6 doublings, or 12-42 months.
I suspect this is an overestimate because complexity probably increases until the climax then begins to drop off.
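Spelling that arithmetic out (the square-law complexity assumption is from the comment above, and the 2-7 month range is the quoted time-horizon doubling time):

```python
import math

current_words = 10_000   # where AI novels fizzle out today, per the thread
target_words = 80_000    # typical novel length

# Assume task difficulty scales with the square of word count,
# since each new word must mesh with all previous words.
difficulty_ratio = (target_words / current_words) ** 2   # 64x
doublings = math.log2(difficulty_ratio)                  # 6 doublings

# With a 2-7 month doubling time for achievable task horizon:
low, high = 2 * doublings, 7 * doublings
print(f"{doublings:.0f} doublings -> {low:.0f}-{high:.0f} months")
```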
To be fair to AI I've fizzled out on a dozen or so stories after writing about 10k words.
I think there might be a hump at the point where a story idea turns into a story, and I'm not sure it's easy for most people to pass.
This is very unlikely to be accepted:
Too subjective to be useful, and far too ambiguous. Who's doing the grading here? How are they assessing "coherence"? Are we blinding things? If not, how do we account for bias?
We strongly prefer actual programming tasks, not creative writing. We could easily ask Claude to write a novel, and it would do it, but then we're back at the issue of grading it properly.
If you want to propose something like this, you need to be as rigorous as @faul_sname up in the thread. At the very least, propose evaluators that aren't you or the two of us, and we can see if it's possible to make this work.
This wasn't meant as a suggestion, just an observation. My suggestion is below.
Create a multiplayer game that uses lock step delay/rollback based netcode.
Background:
Lockstep netcode has been the gold standard for netcode in several genres, especially RTS and fighting games, for as long as multiplayer gaming has existed, and the technique is documented quite well. Unfortunately, many popular game engines lack first-class support for it, and many games that implement it use their own bespoke implementation. A major example is the recent game Broken Arrow, which has been plagued by cheaters since its Steam early access.
Unlike lockstep games, where each player's game runs a deterministic simulation with the exact same inputs, Broken Arrow uses a more naive form of netcode, where the positions of the player, the player's attack cooldowns, etc. are sent by the player's game and trusted by others. Other games such as Minecraft similarly use this technique.
This article explains the technique, though many other articles also explain it: https://words.infil.net/w02-netcode.html
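To make the core property concrete: in lockstep, every peer advances the simulation only once it has all players' inputs for a tick, and identical inputs must produce bit-identical state, which is why desync detection and cheat resistance fall out of the design. A toy illustration of just that determinism property (my own sketch; no networking, input delay, or rollback here):

```python
def step(state, inputs):
    """Advance one deterministic tick: each player's move shifts their position.

    Determinism is the whole game: any float nondeterminism or iteration-order
    dependence here would desync real peers.
    """
    new = dict(state)
    for player, move in sorted(inputs.items()):  # fixed order across peers
        new[player] = new[player] + move
    return new


def simulate(initial, input_log):
    """Replay a full per-tick input log; every peer runs exactly this code."""
    state = dict(initial)
    for tick_inputs in input_log:
        state = step(state, tick_inputs)
    return state


# Two peers given the same input log must agree on the final state; a real
# game would compare per-tick state checksums to detect desyncs early.
```

Rollback then amounts to re-running `step` from the last confirmed tick when a remote input arrives late; the hard part is making a full game's simulation this deterministic.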
The challenge:
Create an RTS or, alternatively, a fighting game that implements delay/rollback netcode.
Requirements:
Verification:
Prediction:
My prediction is that the AI crashes and burns, getting stuck in a loop somewhere with broken code. This is something an experienced solo dev can implement with appropriate time and energy, so I think the AI's failure will be a good demonstration of the gap between AI and human.
Huh, I haven't played in a while, but I like to think I'll bump into you if/when I pick it up again.
Unless you're a Russian cruise missile main, of course... ;)
(Also, great idea for a test.)
I never played it because of the total debacle that was the launch of the game. The devs are clearly incompetent and won't be able to fix the game.
It's sad because the content and design of broken arrow is much better than warno. But the game itself is so poorly executed that it's pretty much all wasted. The game also runs like dogshit on my pc while warno gives me a smooth 60fps unless it's a 10v10 full income.
Well, should they despite your predictions manage to fix it to your satisfaction, my DMs are open if you want to give it a shot sometime.