This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.
Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.
We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:
- Shaming.
- Attempting to 'build consensus' or enforce ideological conformity.
- Making sweeping generalizations to vilify a group you dislike.
- Recruiting for a cause.
- Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.
In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:
- Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.
- Be as precise and charitable as you can. Don't paraphrase unflatteringly.
- Don't imply that someone said something they did not say, even if you think it follows from what they said.
- Write like everyone is reading and you want them to be included in the discussion.
On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.

Does it have to be a coding problem? I understand that there are time and financial constraints that prevent you from trying a lot of what is being requested, but I also understand @iprayiam3's criticism that it looks like you're cherry-picking for something you think the LLM can do. The problem is that most people who aren't computer programmers aren't going to be able to think of anything other than a piece of software that they wish existed but doesn't, and ask you to write it from scratch, which is going to be cost-prohibitive beyond the kind of textbook examples that were constructed for teaching purposes and don't address problems anyone is actually trying to solve. This seems like it should be marketing 101, but if you're trying to convince people that your product is worthwhile (and that's your stated goal), you have to show them that it will actually help them do something they want to do. If you tell me it can write code to fetch data from a REST API using asynchronous requests then I'll smile and nod, but that's complete gibberish to me, and I won't know whether I should be impressed by it or not, or how that's supposed to improve my life.
So instead, I propose that we re-run the test I gave you last summer, because that is something I actually would use it for, and it obviously isn't too complicated. I'm an LLM skeptic, if you haven't noticed yet, but this is one of the things I think LLMs should be good at. For those who aren't going to click the link, the test was for the LLM to determine the release dates for various singles/albums based on a set of rules. I am extremely particular about my music collection and find the need to catalog everything down to the date of release, and that includes estimating dates when an exact one isn't available. I'm asking the LLM to automate what I already do myself. And I don't think this should be very complicated; in essence, what I'm asking it to do is query a series of databases, select a date based upon preference-ranked criteria, and potentially apply a mathematical calculation to that date. The hard part is that the databases are scattered across the internet, and some of them aren't formal databases but OCR scans of publications.
I had already tried this when OP asked for a challenge, and none of the models gave satisfactory results. I was assured that the new "reasoning" models that you had to pay for would do better. They did not. The first problem was that they were apparently unable to query some databases. The more concerning problem is that sometimes they queried the right databases but picked the wrong values. Sometimes they applied the rule incorrectly. The sample size wasn't large, but the models went 0/2. It's been several months since then, so maybe Round 2 will go better than Round 1? We can use the same releases as a preliminary test, but I recognize that the thread might have made it into training data or something since then, so if it passes I'd prefer to run a more comprehensive test. There would also be a possible coding application here, because if this were to work and I were to use it, I wouldn't want to query each release individually but would run batches (say, all the releases from a given artist) and export the data to an XML file or something that I could just refer to.
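For what it's worth, the batch-export half of that is the easy part. Here is a minimal sketch of what it could look like, assuming the per-release dates have already been determined somehow. The function name, the element names, and the "Example Artist" placeholder are all made up for illustration; nothing here reflects a schema anyone in the thread proposed.

```python
import xml.etree.ElementTree as ET

def releases_to_xml(artist, releases):
    """Serialize (title, release_date, source) tuples for one artist to XML text.

    The element and attribute names are hypothetical, chosen for this sketch.
    """
    root = ET.Element("artist", name=artist)
    for title, release_date, source in releases:
        rel = ET.SubElement(root, "release")
        ET.SubElement(rel, "title").text = title
        ET.SubElement(rel, "date").text = release_date
        # Record which rule or source produced the date so it can be audited.
        ET.SubElement(rel, "source").text = source
    return ET.tostring(root, encoding="unicode")

# One batch for one (placeholder) artist.
xml_text = releases_to_xml(
    "Example Artist",
    [("GRoL", "1966-05-02", "Monday before first radio chart")],
)
print(xml_text)
```

Keeping the source next to each date matters more than the format, since the whole dispute is over which source a given date came from.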
Another idea I had along similar lines would be for me to arbitrarily select a parcel of land in Westmoreland County, PA (selected because all of the recorded documents are available for free online) and see if it could download every deed in the chain of title going back 100 years. This particular task isn't hard to do but would be a proof of concept that it could possibly do more sophisticated work. I recognize that there are a number of scenarios that could arise that would completely flummox the LLM here. Given that, I could run a few parcels in advance and preselect an easy one as a proof of concept, though since LLM boosters like to brag about how powerful their models are I'm inclined to arbitrarily pick one without looking first and see how it does, especially since it cuts way down on the work I would need to do to verify the answer.
As a final option, if you're going to insist on a coding challenge, there's a feature in Photoshop that I've been hoping for for a long time, but since it's for a niche application I doubt I'm ever going to get it. Part of being particular about my music collection means having cover art for everything, and a lot of the cover art just pulled straight from the internet is terrible, so I do a lot of cleaning it up. When all I have to use is images of 45 labels, I use a system to ensure that everything is consistent. I've automated most of this system with macros, but I still have to do the most time-consuming part manually. A 45 label is donut-shaped. Ideally, the inside hole and outside edge of the label should be clean circles, though certain printing imperfections make ellipses a better option. Scans available online are photographed and have fuzzy edges, and the outside and inside have information that needs to be deleted to create a perfect white background. What I have to do to achieve this is size the hole manually and hit delete. Photoshop has an area selection tool that can recognize the color change and select a large part of the area designated for deletion, but the edge of that selection is almost always irregular.
The tool I'm looking for would take these selections and normalize them to the nearest ellipse. The way I envision it working is that it would take a y-axis measurement, increase it by a few pixels to create a buffer, then take an x-axis measurement with a similar buffer increase, then create an ellipse based on those measurements (that's for the inside hole; the outside edge would be the same idea but would subtract from the axes to create the buffer). I wouldn't expect this to give perfect results 100% of the time, but it could work considerably less often than that and still speed things up significantly. The only reason I hesitate to propose this is that Photoshop isn't open source and I don't know how feasible it is to create plug-ins (they have some kind of system but I don't know enough about computers to know if what I'm asking for would work with it). I would be willing to settle for a GIMP plug-in as a proof of concept, but I absolutely despise GIMP, so if it proves to work I'll have some serious soul-searching to do, and will probably request a lot more plug-ins to make it as much like Photoshop as possible.
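The geometry described above is simple enough to sketch independently of any editor. This is not a Photoshop or GIMP plug-in and uses neither program's API. It assumes the ragged selection has already been exported as a list of (x, y) pixel coordinates, and that for the outside edge the pixels passed in are the label disc itself rather than the background around it. The names `fit_ellipse`, `inside`, and `buffer_px` are invented for this sketch.

```python
from typing import Iterable, Tuple

Pixel = Tuple[int, int]  # (x, y)

def fit_ellipse(selection: Iterable[Pixel], buffer_px: int = 3, inner: bool = True):
    """Axis-aligned ellipse from the x and y extent of a ragged selection.

    inner=True grows both axes by buffer_px (the center hole);
    inner=False shrinks them (the outside edge of the label).
    Returns (center_x, center_y, radius_x, radius_y).
    """
    xs, ys = zip(*selection)
    cx = (min(xs) + max(xs)) / 2
    cy = (min(ys) + max(ys)) / 2
    sign = 1 if inner else -1
    rx = (max(xs) - min(xs)) / 2 + sign * buffer_px
    ry = (max(ys) - min(ys)) / 2 + sign * buffer_px
    return cx, cy, rx, ry

def inside(x: float, y: float, ellipse) -> bool:
    """True if the point lies within the ellipse (standard ellipse equation)."""
    cx, cy, rx, ry = ellipse
    return ((x - cx) / rx) ** 2 + ((y - cy) / ry) ** 2 <= 1.0

# A rough hole spanning x = 10..30 and y = 12..28, with a 3 px buffer.
hole = fit_ellipse([(10, 20), (30, 20), (20, 12), (20, 28)], buffer_px=3)
print(hole)  # (20.0, 20.0, 13.0, 11.0)
```

Every pixel for which `inside` is true would then be painted white for the hole, and every pixel for which it is false would be painted white for the outside. Measuring only the extremes means a single stray selected pixel will inflate the ellipse, which is one way this could fall well short of working 100% of the time.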
This was largely my response. The claims the AI-believer crowd make about AI go far, far beyond coding. Coding by itself is a single, relatively niche field. AI could displace all the coders and if you don't work in software development yourself, would you notice?
Let's say, for the sake of argument, AI can code as well as or better than the best human coders.
As an AI skeptic, I am not particularly moved by this, and I don't think this gets you anywhere near AGI.
The important distinction to make here is between coding and software engineering.
I'd argue SOTA LLMs are already, if perhaps not superhuman, better than the vast majority of humans at tasks that can be defined as purely coding. Any SOTA LLM ranks among the best humans in the world at competitive programming, and recent model/harness combinations appear to also be superhuman at providing code that passes tests for a given spec (which is a bit like a vastly scaled-up competitive programming task).
This is distinct from human parity in software engineering, but the bottlenecks there seem to be highly general: long-horizon planning, continual learning, taste, executive function, correcting their own errors, etc.
If a drop-in AI software engineer existed that could surpass those limitations, it's difficult to imagine that it would not also be AGI.
GPT 5.2 Thinking in Extended Reasoning mode:
https://chatgpt.com/share/699dfcfc-b0c4-800b-8e1a-870264179c40
5.2T + Agent mode, where it actually used a dedicated browser with a visual output:
https://chatgpt.com/share/699dfd6d-a7f8-800b-be8e-c04d95de44e5
I haven't checked if the answer is right; I'm recovering from a bad migraine, so apologies for the laziness.
Thanks! Reviewing the results:
As a spoiler alert, it got both dates wrong again, so I'm disinclined to keep testing this particular task, as it only gets harder from here. That being said, I think the new models did somewhat better. Just so we're clear, GRoL first appeared on a radio chart on 5/9/1966, the Monday before which being 5/2/1966, thus our release date. FtH is pretty straightforward as the copyright date of publication is listed as 6/16/1980.
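For anyone who wants to check the arithmetic, the Monday rule as applied above can be written down in a few lines. This is a sketch of just that one step, assuming "the Monday before" means strictly before, which is what the GRoL example implies since 5/9/1966 was itself a Monday. The function name is made up, and none of the preference ranking across sources is modeled here.

```python
from datetime import date, timedelta

def monday_before(first_mention: date) -> date:
    """Return the latest Monday strictly before first_mention."""
    # date.weekday() gives Monday == 0 ... Sunday == 6, so a Monday
    # (0) has to step back a full week rather than zero days.
    days_back = first_mention.weekday() or 7
    return first_mention - timedelta(days=days_back)

print(monday_before(date(1966, 5, 9)))   # 1966-05-02
```

The interesting failures in this test are in choosing the input date, not in this calculation.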
For GRoL, 5.2 Agent noticed that the major discographical sites (first preference) set the release date to May 1966, and, unlike o3, it didn't note this and then pick a June date anyway, so that's an improvement, though I'm not sure if this is due to better architecture or because the old error was a one-off. It was able to correctly pick the 5/28/1966 Billboard review, which o3 did as well. However, it once again flunked the ARSA test, the correct radio chart being the 5/9/1966 KBLA chart. Instead, it picked the 6/17/1966 WLS survey. Upon inspection of the sources, though, it appears that, unlike o3, it did not consult ARSA but an old GeoCities site that hosts charts from select radio stations in a few markets. The thing is, I explicitly specified ARSA. I did allow it to look at "other information", but the context in which it presented the find gave it similar weight to ARSA, and didn't specify that it didn't come from ARSA. Now, when I checked last August's results to see if it made the same error then and I missed it, it did check ARSA, but the link wasn't working. Since ARSA requires a free login, I wasn't sure initially that it would be able to get access, but it did, and something may have changed in the meantime that stymied its ability to query ARSA.
But that's not the only problem. First, if it's going to query an alternative site it needs to disclose that. Second, it picked the June 17 date, when the site had the song appearing on the June 10 chart. Third, it noted that the song had been on the charts for 4 weeks, when there's no way it could have known this. The song had only been on the chart the previous week; it had been played on the station for 4 weeks. There was a 4 next to the title, and it incorrectly assumed that this stood for weeks on chart. Since the site wasn't clear, I had to go to ARSA and pull a scan of the chart to be sure exactly what it meant. The thing is that I don't understand why it even did this. I only care about the ARSA data if it gives an earlier date than Billboard, and it clearly didn't so it was irrelevant. If it couldn't access ARSA it could have just said so and used the Billboard date. If the other website had chart data that was earlier I would have appreciated if it took that into consideration, but that wasn't the case. I don't know why it would pretend to pull ARSA data when it didn't yield any useful information.
The 5.2 Thinking model confidently provided a date of 5/28/1966, based on Wikipedia. Based on what we know from above, this date is incorrect, and is the result of somebody entering the Billboard review date into Wikipedia. This is a common error, but I didn't include it in the initial algorithm because I didn't want to overcomplicate things (i.e., include a rule where it won't use Wikipedia dates when they clearly match Billboard dates), and this error wasn't present back in August, so I'll let the model slide here. What I won't let it slide on is where it says 45Cat agrees; 45Cat lists a release date of May 1966 and includes a note saying "BB 5/28/1966", which clearly refers to a Billboard date. Yes, it followed the rules. But it was clear from the rules that I wanted a date prior to the Billboard date. If we're talking about LLMs being able to replace people for certain tasks, then it can't make the kind of mistake I wouldn't have made. If I only had looked at Wikipedia I might have made that mistake, and if the LLM had only done so I would have given it a pass. But it looked at 45Cat, didn't recognize that the date was not a release date, and even if it had I'm not sure that it would have recognized that the Wikipedia date might be untrustworthy, especially since there was no annotation for it. This might have worked better if I had provided a specific instruction to that effect, but if these things are really intelligent I shouldn't have to think of every possible caveat. If I were going to do that I wouldn't need an LLM and could write a program using conventional software where I just specify every field and include instructions for it.
Moving on to FtH, I have to admit that I whiffed a bit on this when setting this test up because I assumed that since this is a relatively obscure record release information wouldn't be readily available. Apparently I was wrong, and RYM has had the correct release date based on copyright publication data up since July 2024. What this means is that the LLM whiffed harder than I initially gave it credit for. It's apparently still having trouble accessing the US Copyright database, because neither model looked there despite the explicit instructions to query it for all releases after 1978. The Thinking model evidently didn't query RYM at all and did 45Cat (not the best for albums) before going straight to trade publications, radio charts, and a newspaper article. From there it defaults to the Monday prior to the earliest mention and gives a date of 7/14/1980.
The Agent whiffed even harder, though the date it gave was closer to the correct one. First, it said that RYM only listed 1980, but it appears that hasn't been true for nearly 2 years. From there it skips the copyright queries entirely and goes straight to the industry publication data, which this time have an earliest mention of 7/12/1980. Here's where it makes its biggest error. The instructions specified for it to default to Monday if there wasn't a coordinated release day. Here, it picks Tuesday, July 8. Why? It states that 1980 had a typical Tuesday release date, and cites a Vox article. This is not true, and the Vox article says that the Tuesday release date started in the 1980s. To be specific, coordinated Tuesday releases began in April 1989, nearly a decade after FtH was released. So it misunderstood the Vox article. But even had it understood it correctly, it still would have been in a bit of trouble, because the Vox article itself had an error. It says that before April 1989, record stores would stock releases whenever they came in. This is also incorrect; an article in a March 1989 issue of—you guessed it—Billboard, stated that they were changing the release date from Monday to Tuesday because some retailers weren't getting their stock until late Monday. It also says that MCA stayed with the Monday release for the time being (they would switch to Tuesday in 1991 or so). In fact, labels had been coordinating Monday releases since 1982 or 1983. This doesn't matter for the purposes of my rules, since they default to Monday, but it's something to be aware of.
The upshot is that we ran 2 releases with 2 models each and got 4 different answers, none of which was the correct one. To summarize the answers so far for GRoL:
Five models, five dates, none of them correct. There was a glitch in the test where I inadvertently made it too easy and both models still whiffed; when I first designed the test I intentionally omitted release dates that were on reputable websites, because I had no doubt that the LLMs could perform a simple lookup, but one model didn't bother looking and the other probably didn't bother looking. What I suspect happened here is that the 1980 date was in the initial training data from before July 2024 and the model didn't double-check the site to see if it had been updated. That's just a guess, but either way it seems like a major problem if after a year it can't find a number on a webpage I specifically instructed it to check. It doesn't understand that since the 1980s does not mean since January 1, 1980.
As a final thought, when I was checking the ARSA data, I pulled the 5/9/1966 survey from KBLA in Burbank, CA, and noticed something interesting. GRoL did not appear on the chart itself, but in a special "coming attractions" section. Now, I want to make it clear that these dates I am expecting are merely estimates, and that the radio data is the least reliable since stations often get copies for airplay in advance of release. When I was developing this system, I made a judgment call that I'd prefer a too early release date to a too late one. I initially had no way of knowing whether the coming attractions were records that had been released and were expected to be on the next chart, or merely records scheduled for release. I considered the possibility that this may have caused the LLM to think they hadn't been released (before discounting it because they also ignored charts where the record had appeared and may have provided an explanation for why they were discounting a chart). Then I noticed that the coming attractions section that week also included the Temptations classic Ain't Too Proud to Beg. This was fortuitous, because Motown release dates are well-documented; if that record had been released by May 9, then I could be confident that the other coming attractions probably were as well. Ain't Too Proud to Beg was released May 3, 1966, one day after my estimate for GRoL (Motown didn't stick to a set release day). It's a small sample size, but I'm more confident in my method than I was before.
Would an LLM have recognized this possibility and thought to check it like this?
A coding problem? Not strictly, no. I focused on coding because my collaborator SF (who is doing most of the work) is a programmer.
As you can see from discussion with Phailyoor and faul_sname, I'm open to other well-defined tasks.
I started as soon as I read this. I'm running it on 5.2 Thinking and another instance using Agent mode (the model has access to a computer of its own with a browser). It's taking a while, so I'll ping you when I'm done. I tried to be faithful to your original framing, so I didn't mention that o3 tried and failed at the task, or your critiques shared later.
If this doesn't work, then sure, I can ask SF to consider using his Claude setup to try. Shouldn't be too onerous.
I have no idea, in advance, if this will work. I doubt SF does either. But it's also something we can try.
I share your concerns with the issues arising from Photoshop being closed-source. But I'll share it too, assuming SF hasn't seen this yet. It sounds like something worth trying from my perspective, but I will stress that I am not a professional programmer so I'll be deferring to his judgment.
Did you forget the link?
Lol, by the time I finished writing that I forgot that I was supposed to link something. Fixed.