This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.
Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.
We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:
- Shaming.
- Attempting to 'build consensus' or enforce ideological conformity.
- Making sweeping generalizations to vilify a group you dislike.
- Recruiting for a cause.
- Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.
In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:
- Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.
- Be as precise and charitable as you can. Don't paraphrase unflatteringly.
- Don't imply that someone said something they did not say, even if you think it follows from what they said.
- Write like everyone is reading and you want them to be included in the discussion.
On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.

For years, the story of AI progress has been one of moving goalposts. First, it was chess. Deep Blue beat Kasparov in 1997, and people said, fine, chess is a well-defined game of search and calculation, not true intelligence. Then it was Go, which has a state space so vast it requires "intuition." AlphaGo prevailed in 2016, and the skeptics said, alright, but these are still just board games with clear rules and win conditions. "True" intelligence is about ambiguity, creativity, and language. Then came the large language models, and the critique shifted again: they are just "stochastic parrots," excellent mimics who remix their training data without any real understanding. They can write a sonnet or a blog post, but they cannot perform multi-step, abstract reasoning.
I present an existence proof:
OpenAI just claimed that a model of theirs qualifies for gold in the IMO:
To be clear, this isn't a production-ready model. It's going to be kept internal, because it's clearly unfinished. Looking at its output makes it obvious why: it's akin to hearing the muttering of a wild-haired maths professor as he's hacking away at a chalkboard. The aesthetics are easily excused, because the sums don't need them.
The more mathematically minded might enjoy going through the actual proofs. This unnamed model (which is not GPT-5) solved 5/6 of the problems correctly, under the same constraints as a human sitting the exam.
As much as AI skeptics and naysayers might wish otherwise, progress hasn't slowed. It certainly hasn't stalled outright. If a "stochastic parrot" is solving the IMO, I'm just going to shut up, and let it multiply on my behalf. If you're worse than a parrot, then have the good grace to feel ashamed about it.
The most potent argument against AI understanding has been its reliance on simple reward signals. In reinforcement learning for games, the reward is obvious: you won, or you lost. But how do you provide a reward signal for a multi-page mathematical proof? The space of possible proofs is infinite, and most of them are wrong in subtle ways. Wei notes that their progress required moving beyond "the RL paradigm of clear cut, verifiable rewards."
How did they manage that? Do I look like I know? It's all secret-sauce. The recent breakthroughs in reasoning models like o1 and onwards relied heavily on "RLVR", which stands for reinforcement learning with verifiable reward. At its core, RLVR is a training method that refines AI models by giving them clear, objective feedback on their performance. Unlike Reinforcement Learning from Human Feedback (RLHF), which relies on subjective human preferences to guide the model, RLVR uses an automated "verifier" to tell the model whether its output is demonstrably correct. Presumably, Wei means something different here, instead of simply scaling up RLVR.
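For readers who want the gist of RLVR in concrete terms, here's a deliberately tiny sketch of the loop: sample answers, score them with an automated verifier, and reinforce whatever the verifier approves. The ToyPolicy, the candidate answers and all the numbers are invented for illustration; this is the shape of the idea, not anyone's actual training pipeline.

```python
# Toy illustration of RLVR (reinforcement learning with verifiable rewards):
# the reward comes from an automated verifier, not from human preference.
# Everything below is a tiny invented stand-in, not a real training pipeline.
import random
from collections import defaultdict

def verifier(problem, answer):
    """Objective check: 1.0 if demonstrably correct, 0.0 otherwise.
    For maths this could be exact-match grading or a proof checker."""
    return 1.0 if answer == problem["truth"] else 0.0

class ToyPolicy:
    """Picks an answer from a fixed candidate set, with learnable preferences."""
    def __init__(self, candidates):
        self.candidates = candidates
        self.weights = defaultdict(float)

    def sample(self):
        # Noisy argmax: higher weight -> more likely to be chosen.
        scores = [self.weights[c] + random.gauss(0, 1) for c in self.candidates]
        return self.candidates[scores.index(max(scores))]

    def reinforce(self, answer, advantage):
        self.weights[answer] += advantage

problem = {"truth": "42"}
policy = ToyPolicy(candidates=["41", "42", "seven"])
for _ in range(200):
    samples = [policy.sample() for _ in range(8)]
    rewards = [verifier(problem, s) for s in samples]
    baseline = sum(rewards) / len(rewards)      # crude variance reduction
    for s, r in zip(samples, rewards):
        policy.reinforce(s, r - baseline)       # push towards verified answers

print(max(policy.weights, key=policy.weights.get))  # settles on "42"
```

The point is just that the reward is computed rather than judged; scaling that idea to multi-page proofs without a crisp verifier is presumably where the secret sauce lives.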
It's also important to note that the previous SOTA, DeepMind's AlphaGeometry, a specialized system, had achieved silver-medal-level performance and was within spitting distance of gold. A significant milestone in its own right, but OpenAI's result comes from a general-purpose reasoning model. GPT-5 won't be as good at maths, either because it's being trained to be more general at the cost of sacrificing narrow capabilities, or because this model is too unwieldy to serve at a profit. I'll bet the farm on it being used to distill more mainstream models, and the most important fact is that it exists at all.
Update: To further show that this isn't just a fluke, GDM also had a model that scored gold at this Olympiad. Unfortunately, in a very Google-like manner, they were stuck waiting for legal and marketing to sign off, and OAI beat them to the scoop.
https://x.com/ns123abc/status/1946631376385515829
https://x.com/zjasper666/status/1946650175063384091
"Moving the goalposts" is a bad metaphor.
Putting literal goal posts in approximately the right place is easy. Just put them on the end line, at the middle. But sports are competitive. Players will not be happy with the goal posts being in approximately the right place. They have to be in exactly the right place. This too is easy. Goal posts are self-defining; the right place for the goal post is where the goal post is!
Belabouring the point, I invite you to consider a soccer match. In the first half, team A score with a shot just inside the right post. In the second half, team B fail to score with a shot just outside the left post. In the post-game adjudication, it is discovered that the goal posts were two feet right of where they ought to be. Team A's goal gets disallowed. Team B's miss becomes a goal. Moving the goal posts flips a win for Team A into a win for Team B. The absurdity here is not so much the motion as the neglect. We are neglecting that the goal posts define the goal.
Turning now to Artificial Intelligence, we notice that humans are intelligent [citation needed :-)]. Which raises the question: why are we bothering to create an artificial version of what we already have? Mostly because the devil is in the details; humans are intelligent, but ...
If Alice copies Bob, and Bob copies Charles, and Charles copies Alice, then who should David copy? Human intelligence has a circle jerk problem. Perhaps David should copy Edward, who has reasoned things out from first principles. Perhaps David should copy Fiona, who has done experiments. But brilliant, charismatic intellectuals lead societies over cliffs. I wrote a paragraph on the difficulties of empirical science, but I deleted it because I couldn't get it to replicate.
We want something from Artificial Intelligence. We want it to cover the gaps in human intelligence. If we could crisply and accurately characterise those gaps we would be well on our way to fixing them ourselves. We have (had?) exactly one example of intelligence to look at, and we are not happy with it. We certainly notice that it has a weak meta-game: human intelligence is bad at seeing its own flaws. We are not able to install self-defining goal posts.
Old people bring baggage from the 1960's to discussion of AI. The word Computer invokes images of banks of tape drives reading databases. Human-written legal briefs have a sloppiness problem. Need a precedent? A quick skim and this one looks close enough. It is the job of the opposing lawyers to read it carefully and notice that it is not relevant. (The legal system is not supposed to work like this!) One imagines that an Artificial Intelligence actually reads the entire legal database and finds precedents that humans would miss. When an LLM invents a plausible, fictional precedent that just doesn't exist, one is taken by surprise. One wants to mark the AI down a lot for that non-human error. Doing so involves both moving the goal posts and admitting to not anticipating that failure mode at all.
There is a more subtle issue with LLMs writing computer programs. We may be underestimating the effort that goes into cleaning up LLM messes. LLMs learn to program from code bases written by humans. Not just written by humans, maintained by humans. So the bugs that humans spot and remove are under-represented in the training data. Meanwhile, the bugs that evade human skill at debugging lurk indefinitely and are over-represented in the training data. We have created tools to write code with bugs that humans have difficulty spotting. Worse, we estimate the quality of the code that our new tools produce on the basis that they are inhuman and have no special skill at writing bugs that we cannot spot, despite the nature of their training data.
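That selection effect is easy to see in a toy simulation; the catch probabilities and bug counts below are made up purely to make the mechanism visible, not measured from any real corpus.

```python
# Toy model of the selection effect: easy-to-spot bugs get fixed before code is
# published, so the surviving corpus over-represents hard-to-spot bugs.
# The bug counts and catch probabilities are invented for illustration.
import random

random.seed(0)
EASY, HARD = "easy-to-spot", "hard-to-spot"
P_CATCH = {EASY: 0.9, HARD: 0.2}   # how often human review removes each kind

def write_code():
    """Humans introduce mostly easy bugs and the occasional hard one (~5:1)."""
    return [EASY] * random.randint(0, 5) + [HARD] * random.randint(0, 1)

def human_review(bugs):
    """Each bug survives review with a probability depending on how visible it is."""
    return [b for b in bugs if random.random() > P_CATCH[b]]

surviving = []
for _ in range(100_000):
    surviving.extend(human_review(write_code()))

ratio = surviving.count(EASY) / max(surviving.count(HARD), 1)
print(f"introduced ratio easy:hard ~ 5:1, surviving ratio ~ {ratio:.1f}:1")
# Output lands around 0.6:1 -- the published code (i.e. the training data) is
# dominated by exactly the bugs reviewers are worst at catching.
```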
Notice the clash with old-school expectations. A lot of GOFAI focussed on formal verification of mathematics. Some early theorem provers were ad hoc (and performed poorly). The attention shifted to algorithms growing out of Gödel's completeness theorem and Robinson's work on resolution theorem provers. Algorithms that were provably correct. The old school expectation involves a language such as SML, with a formal semantics, a methodology such as Dijkstra's "A Discipline of Programming", and code accompanied by a formally verified proof of correctness.
A tool for writing code with bugs that humans cannot find sounds like the kind of thing that Mossad would use to sabotage Iranian IT infrastructure. It may be super-humanly intelligent, but we still want to move the goal posts to exclude it as the bad kind of intelligence.
One old school of AI imagined that the language of thought would be importantly different from natural language. The architecture of AI would involve translating natural language into a more rigorous and expressive internal language, thinking in this internal language and then translating back to natural language for output. LLMs do perhaps partially realise this dream. The tokens are placed in a multidimensional space and training involves discovering the latent structure, effectively inventing that training run's own, custom language of thought. If so, that is a win for the bitter lesson.
On the other hand, LLMs learn the world through human language. I believe that humans suffer from linguistic poverty. Many of our disputes bog down for lack of words. When we have one word for two concepts our discussions are reduced to hopping instead of walking. (My missing words web page is neglected, my post https://www.themotte.org/post/1043/splitting-defensive-alliance-into-chaining-alliance was not well liked; I'm not managing to explain the concept of linguistic poverty.) I hope that AI will "... cover the gaps in human intelligence," but LLMs seem doomed to inherit our linguistic poverty and reproduce our existing confusions. The dream was that AI would cure human intellectual weakness, not copy it.
I think that it is legitimate to notice that LLMs are indeed intelligent, and to then move the goal posts, declaring that, now we have seen it, we realise our error and this is not what we had in mind.
As I elaborated on in another comment in this thread, I do not think that some moving of goalposts is necessarily illegitimate. Our specifications can be incorrect, no one's immune from good old Goodhart.
Yet AI skeptics tend to make moving the goalposts into the entire sport. I will grant that their objections exist in a range of reasonableness, from genuine dissatisfaction with current approaches to AI, to Gary Marcus's not even wrong nonsense.
This is an interesting concern, and I mean that seriously. Fortunately, it doesn't seem to be empirically borne out. LLMs are increasingly better at solving all bugs, not just obvious-to-human ones. The ones in commercial production are not base models, naively concerned only with predicting the next most likely token (and hence reproducing the subtle bugs that exist in the training distribution); they're beaten into trying to find any and all bugs they can catch. Nothing in our (limited but not nonexistent) ability to interpret their behavior or cognition suggests that they're deliberately letting bugs through because they seem plausible. I am reasonably confident in making that claim, but I hope @faul_sname or @DaseindustriesLtd might chime in.
At the end of the day, there exist techniques like adversarial training to make such issues not a concern. Ideally, with formal verification of code, you can't have unwanted behavior; it's ruled out with mathematical certainty. Of course, ensuring that you haven't made errors in formulating your specification is a challenge in itself.
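Short of full formal verification, the cheap version of the same idea is differential or property-based testing: hammer the suspect (say, LLM-written) function with random inputs and compare it against a trusted reference. A minimal sketch, with both functions invented for the example:

```python
# Cheap cousin of formal verification: differential testing. Compare a suspect
# implementation (e.g. LLM-written) against a slow-but-obviously-correct
# reference on many random inputs. Both functions are invented for the example.
import random

def reference_median(xs):
    """The specification: trivially correct, no cleverness."""
    s = sorted(xs)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

def suspect_median(xs):
    """Stand-in for generated code under test; subtly wrong for even lengths."""
    return sorted(xs)[len(xs) // 2]

def differential_test(trials=10_000):
    for _ in range(trials):
        xs = [random.randint(-100, 100) for _ in range(random.randint(1, 20))]
        if suspect_median(xs) != reference_median(xs):
            return f"counterexample found: {xs}"
    return f"no mismatches in {trials} random trials"

print(differential_test())  # reports a counterexample almost immediately
```

It doesn't give the mathematical certainty a proof does, but it is indifferent to whether the bug was written by a human or a model, which is the property that matters here.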
There's been a decent amount of work done on dispensing with the need for tokenization in the first place, and letting the LLM operate/reason entirely in the latent space till it needs to output an answer. It seems to work, but hasn't been scaled to the same extent, and the benefits are debatable beyond perhaps solving minor tokenization errors that existing models have.
Human language, as used, is imprecise, but you can quite literally simulate a Turing machine with your speech. I don't see this as a major impediment; why can't LLMs come up with new words if needed, assuming there's a need for words at all?
I may or may not be an AI skeptic by your definition - I think it's quite likely that 2030 is a real year, and think it's plausible that even 2050 is a real year. But I think there genuinely is something missing from today's LLMs such that current LLMs generally fail to exhibit even the level of fluid intelligence exhibited by the average toddler (but can compensate to a surprising degree by leveraging encyclopedic knowledge).
My sneaking suspicion is that the "missing something" from today's LLMs is just "scale" - we're trying to match the capability of humans with 200M interconnected cortical microcolumns using transformers that only have 30k attention heads (not a perfect isomorphism; you could make the case that the correct analogy is microcolumn : attn head at a particular position, except that the microcolumns can each have their own "weights" whereas the same attn head has the same weights at every position), and we're trying to draw an equivalence between one LLM token and one human word. If you have an LLM agent that forks a new process in every situation in which a human would notice a new thing to track in the back of their mind, and allow each of those forked agents to define some test data and fine-tune / RL on it, I bet that'd look much more impressive (but also cost OOMs more than the current stuff you pay $200/mo for).
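Just to put rough numbers on that gap, using the figures above and keeping in mind that the analogy is admittedly loose:

```python
# Back-of-envelope only: the figures come from the comment above, and the
# analogy between microcolumns and attention heads is loose to begin with.
cortical_microcolumns = 200e6   # rough human count
attention_heads = 30e3          # rough frontier-transformer count
print(f"ratio ~ {cortical_microcolumns / attention_heads:,.0f}x")   # ~ 6,667x
# And each microcolumn gets its own "weights", while an attention head reuses
# the same weights at every position, so the effective gap is plausibly larger.
```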
LLMs are increasingly better at solving a particular subset of bugs, which does not perfectly intersect the subset of bugs which humans are good at solving. Concretely, LLMs are much better at solving bugs that require them to know or shallowly infer some particular fact about the way a piece of code is supposed to be written, and fix it in an obvious way, and much much worse at solving bugs that require the solver to build up an internal model of what the code is supposed to be doing and an internal model of what the code actually does and spot (and fix) the difference. A particularly tough category of bug is "user reports this weird behavior" - the usual way a human would try to solve this is to try to figure out how to reproduce the issue in a controlled environment, and then to iteratively validate their expectations once they have figured out how to reproduce the bug. LLMs struggle at both the "figure out a repro case" step and the "iteratively validate assumptions" step.
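As a concrete (and entirely made-up) miniature of that workflow, here's the repro-then-validate loop on a toy bug; the hard part for LLM agents is doing this against a real system rather than a ten-line function:

```python
# Miniature of the "repro first, then validate assumptions" workflow. The buggy
# function and the vague report are both invented; real cases are much messier.
import random

def normalize(scores):
    """Buggy example: divides by max(scores), so it blows up when the max is 0."""
    top = max(scores)
    return [s / top for s in scores]

def find_repro(trials=1000):
    """Step 1: turn 'it crashes sometimes' into a concrete failing input."""
    for _ in range(trials):
        xs = [random.randint(-3, 0) for _ in range(random.randint(1, 4))]
        try:
            normalize(xs)
        except ZeroDivisionError:
            return xs
    return None

repro = find_repro()
print("repro input:", repro)

# Step 2: check assumptions one at a time against the repro.
assert repro is not None, "assumption: the bug is reachable at all"
assert max(repro) == 0, "assumption: it crashes exactly when max(scores) == 0"
print("assumptions held; fix: handle the max(scores) == 0 case explicitly")
```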
In principle there is no reason LLMs can't come up with new words. There is precedent for the straight-up invention of language among groups of RL agents that start with no communication abilities and are incentivized to develop such abilities. So it's not some secret sauce that only humans have - but it is a secret sauce that LLMs don't seem to have all of yet.
LLMs do have some ingredients of the secret sauce: if you have some nebulous concept and you want to put a name to it, you can usually ask your LLM of choice and it will do a better job than 90% of professional humans who would be making that naming decision. Still, LLMs have a tendency not to actually coin new terms, and to fail to use the newly coined terms fluently in the rare cases that they do coin such a term (which is probably why they don't do it - if coining a new term was effective for problem solving, it would have been chiseled into their cognition by the RLVR process).
In terms of why this happens, Nostalgebraist has an excellent post on how LLMs process text, and how that processing is very different from how humans process text.
So there's a sense in which an LLM can coin a new term, but there's a sense in which it can't "practice" using that new term, and so can't really benefit from developing a cognitive shorthand. You can see the same thing with humans who try to learn all the jargon for a new field at once, before they've really grokked how it all fits together. I've seen it in programming, and I'm positive you've seen it in medicine.
BTW regarding the original point about LLM code introducing bugs - absolutely it does, the bugginess situation has gotten quite a bit worse as everyone tries to please investors by shoving AI this and AI that into every available workflow whether it makes sense to or not. We've developed tools to mitigate human fallibility, and we will develop tools to mitigate AI fallibility, so I am not particularly concerned with that problem over the long term.
Absolutely not, at least by my standards! You acknowledge the possibility that we might get AGI in the near-term, and I see no firm reason to over-index on a given year. Most people I'd call "skeptics" deny the possibility of AGI at all, or rule out any significant chance of near-term AGI, or have modal timelines >30 years.
I agree that LLMs are missing something, but I'm agnostic on whether brute-force scaling will get us to indisputable AGI. It may or may not. Perhaps online learning, as you hint at, might suffice.
I wonder if RLHF plays a role. I don't think human data annotators would be positively inclined towards models that made up novel words.
Thank you for taking the time to respond!
The arc of history bends towards machine dominance in all tasks. Just the other day we had OpenAI's contender come second to some Polish genius who was practically sweating blood in an invite-only programming optimization contest.
https://officechai.com/ai/openai-places-second-behind-human-coder-at-atcoder-progmming-event/
But we can't retrain Psyho for 10,000 years of subjective time on more optimization. His brain is capped at 20 watts or so, just like the rest of us. God isn't going to release homo sapiens max (now with denser neurons and a bigger cerebrum!). The bioethics brigade won't let us step it up and nobody has the balls to ignore them, plus it's too late now. In contrast, Nvidia has a 2-year release timescale.
One would think after watching various chess masters get crushed in the 80s and early 90s we'd have learnt. But it's like you said, nobody learnt anything: 'Oh it can beat an amateur but a master has deep conceptual understanding' -> 'oh it can beat a master but Kasparov has deep intuition' -> 'oh it's a nothingburger, let's move on to text'. We continue to not learn the trend even today when progress is much faster and in many more domains. Gary Marcus somehow still has a following; he's the Gordon Chang of AI.
I don't mean to diminish this, since it's thinking sand and that's incredible, but it does seem like they're now making progress by increasing inference costs by OOMs rather than training costs. This is kind of the opposite direction you want to be going for the vision of the future that came from that Ketamine trip with Fischerspooner doing the soundtrack.
Diminishing returns != no returns.
Per year, it costs more to send someone to college or uni than it does to send them to school. If they come out of it with additional skills, or even just the credentials to warrant that investment, it's worth it. Even if you need to go into temporary debt for that purpose, as long as it's something less stupid than underwater basket weaving.
Just look at the wage disparities within humans. A company might be willing to pay hundreds or thousands of times more for a leading ML researcher or quant than they would for a janitor. The same applies to willingness-to-pay for ever more competent AI models. Could you not afford to pay for AI Einstein if your competitor will?
Training costs are still going up, it isn't all test time compute. I don't know if we're going to have super-intelligence too cheap to meter (as opposed to mere intelligence on par with an average human), but what can we do but hope?
I can hope too. I'm just imagining they had to spend like $100k in inference compute or whatever to really kick the asses of the high school students. They spent around $1,000 per question on ARC and that was stuff we expected ten year olds to solve.
If that's the world we're in, I see the bubble bursting long before we finish building up to superintelligence. Companies aren't going to invest $500b/y for decades on this when the payoff in the meanwhile is kinda maybe you can fire the dumbest Jr SWEs on your team.
This is also if we accept the argument that completing the Math Olympiad is Real Reasoning and if the model truly just used its own thinking.
We are experimenting and learning and inventing. Every modern AI is a brand new prototype, mass released to the public only because of how interesting and useful they are despite their newness.
Nearly every new invention is massively overpriced compared to its long term potential unless the "invention" is a refinement of an old invention optimized specifically for its affordability. Cars used to be crazy expensive luxury goods, now they're expensive but affordable staples of modern life, much cheaper than trying to walk across the country on the Oregon Trail. The literal first refrigerator was vastly expensive as the inventor prototyped it out without a factory to stamp them out, now everyone has one. The first GPT-4 quality LLM was vastly more expensive to design than GPT-4 quality LLMs will be 10 years from now. We have no idea where AI intelligence will plateau, and we have no idea what cost it will asymptote towards over the next few decades as people discover more and more efficient methods and technologies. Current quality is merely a lower bound, and current costs are an upper bound, not the true long term potential, and probably not anywhere close.
The answer to every (non-safety) criticism of AI is that we're not there yet. But we're getting somewhere.
Compared to where we were ten years ago, it looks like AGI is achievable now. Before, it seems we didn't even have an architecture that would ever arrive at an answer, no matter how much compute you spent on it. But now it seems like we do! It's clear that you can get them to do reasoning-like things and it's mainly a matter of how much compute you can throw at it. So that's amazing.
But the question that remains, is will this architecture get to AGI within economic feasibility? It doesn't quite seem like the right architecture. They use much, much more power than humans do to solve the same problems, for example.
If we have to continually 10x the amount of inference compute we throw at a model to cut the error rate in half, we might exhaust the capacity of the Earth before we reach AGI.
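Taking that hypothetical literally (10x compute per halving of the error rate is the premise here, not an established scaling law), the compounding is brutal:

```python
# Back-of-envelope for the hypothetical above: if each halving of the error
# rate costs 10x the inference compute, error falls only as compute^-0.3.
# The 20% starting error and the targets are arbitrary illustrative numbers.
import math

start_error = 0.20
for target in [0.10, 0.05, 0.01, 0.001]:
    halvings = math.log2(start_error / target)
    compute_multiplier = 10 ** halvings
    print(f"{start_error:.0%} -> {target:.2%} error: ~{compute_multiplier:,.0f}x compute")
# Roughly 10x, 100x, 21,000x and 44,000,000x respectively. Constant-factor
# efficiency gains shift where you start on that curve, not its shape.
```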
Wake me up when it's been independently verified that OpenAI didn't train on the test set again.
Beyond what @sarker pointed out, Grok 4 is practically deep-fried through training on test sets, and it only managed 12% the last go around. OAI isn't the most trustworthy company around, but I don't think they're fucking around here.
https://x.com/VictorTaelin/status/1946619298669269396
It would be difficult to do that given that the questions were published this week and the answers weren't published until ~today.
It was actually AlphaProof that was the previous SOTA.
Thanks. Looks like AG came out six months before it.
Gary Marcus failing to beat the stopped clock benchmark of being right at least twice in a day:
https://x.com/scaling01/status/1946528532772909415
This is like saying that a Turing test is moving goalposts because the interrogator can suddenly decide in the middle of the test to ask the AI a new question that he hasn't talked about before and that the AI and its programmer have had no chance to prepare for. Except on a much bigger, slower scale.
AI progress is moving goalposts because people are better able to figure out what they need to demand from the AI after seeing how it performs on previous demands rather than before.
The Turing Test explicitly allows for just about any query under the sun. Literally no one, including the people submitting their bots to such a challenge, would make such an objection. If they did, they'd be laughed out of the room by their peers. You're making up a hypothetical here.
I don't deny that some moving of goal-posts is justified. AI intelligence is far more spiky than that of their human counterparts, and a lot of unexpected weaknesses exist alongside clear strengths. If, in hindsight, the metrics did not correspond to the skills we imagined, it is fair to challenge said metrics. I might promise to buy a car that does >x MPG of diesel, but if you then give me a car that's only there because it uses petrol, then I don't want your car. Worse, it might require a solid rocket booster and fall apart when it gets to its destination. A hospital that rewards nurses in an NICU for ensuring that preemies gain weight won't be very moved if the latter argue that feeding them iron filings was an effective strategy.
Words can be imprecise.
There exists no human alive who has as much crystallized intelligence or general knowledge as even an outdated model like GPT-3, maybe even GPT-2. Expectations existed that an AI with such grossly encompassing awareness of facts would be as smart/competent as a polymath human. This did not turn out to be the case. We have models that are superhuman in some regards, while being clearly subhuman in others, being beaten by small children in some cases.
They are still, as far as I'm concerned, clearly intelligent. Not intelligent in exactly the same way as humans, but approaching or exceeding peer status despite their alien nature. To deny this is to be remarkably myopic.
We have models that can:
The space of capabilities they lack is itself shrinking. If such an entity isn't intelligent, then neither am I, because I couldn't solve the IMO or play chess at 1800 Elo like GPT-3.5. If I still am somehow "intelligent" despite such flaws, then so are LLMs. I promise you that even if I were to flatter myself and claim I could get there and beyond with sufficient effort, the AI will beat me to everything else. This holds true for you too.
What is your understanding of 'intelligence'?
I would endorse something like:
"Intelligence is the general-purpose cognitive ability to build accurate models of the world and then use those models to effectively achieve one's goals."
Or
"Intelligence is a measure of an agent's ability to achieve goals in a wide range of environments."
This, of course, requires the assessor to be cognizant of the physical abilities and sensory modalities available to the entity. Einstein with locked-in-syndrome would be just as smart, but unable to express it. If Stephen Hawking had been unlucky enough to be born a few decades earlier, he might have died without being able to achieve nearly as much as he did IRL.
Well, I will grant that on the latter definition, LLMs are 'intelligent'.
I don't think I would grant it on the former definition, because I take building a model of the world to be a claim about conscious experience, which LLMs don't have. LLMs are capable of goal-directed activity, for whatever that may be worth, but I think having a model of the world implies having some kind of mental space or awareness. You mention an entity being 'cognizant' of something, but I would have thought that's the thing obviously missing here. To be cognizant of something is to be aware of it - it's a claim about interiority.
I mention this because I notice in AI discourse a gulf where it seems that, for some people, LLMs are obviously intelligent, and the idea of denying that they are is ridiculous; and that for other people LLMs are obviously not intelligent, and the idea of affirming that they are is ridiculous. I'm in the latter camp personally, and the way I make sense of this is just to guess that people are using the word 'intelligent' in very different ways.
I am agnostic on LLMs being conscious or having qualia. More importantly, I think it's largely irrelevant. What difference to me does it make if an unaligned ASI turns me into a paperclip but doesn't really dislike me?
Is a horse happy about the fact that the tractor replacing it isn't conscious? It's destined for the glue factory nonetheless.
We have no principled or rigorous way to interrogate consciousness in humans. We have no way of saying with any certainty that LLMs aren't conscious, even if I am inclined to think that, if they are, it's a very alien form of consciousness.
I'm talking about whoever is doing the assessment of consciousness being "aware" of the fundamental limitations of the entity they're testing. I could, in theory, administer a med school final exam to Terence Tao, and he'd fail miserably. I would be a bigger idiot if I went on to then declare that Tao is thus proven to not be as smart as he seems. That meme about subjecting a monkey, fish and elephant to the same objective test of ability in the form of climbing trees, while usually misapplied, isn't entirely wrong.
I also don't mean to make any implications about "interiority" here. I would happily say that an LLM is "cognizant" of fact X, if say, that information was in its training data or within the context window. No qualia or introspection required.
To be fair, LLMs have been moving away from this towards coding, engineering and maths because their success is easier to judge and rewards for RL-produced reasoning are easier to define.
I should have put that in quotes. I'm not that much of a wordcel apologist, even if I'm a wordcel.
True, but what I mean is that LLMs have been moving AWAY from fluid verbal intelligence and back towards the comfort zone of code and maths IMO.
I value the kind of writing ability and ‘everyday intelligence’ that the models indicated and Claude 3.7 had, but I don’t think that’s the direction they’re moving in.
To an extent, they're forced to be! In a lot of mushy-mushy realms like literature, if you ask ten people to choose the "best", you'll get eleven different and mutually exclusive answers. And there's no objective way to grade between them. The closest would be RLHF, which has obvious weaknesses.
(Is JK Rowling the best living writer because she made the most money off her books? That would be a rather contentious claim. So we don't even know what to optimize for there)
I believe the hope is that there's some degree of cross-pollination: that making these models great at code, maths or physics will pay dividends elsewhere. Seems true to me, but I'm no expert.
Oh, I agree. I spent a big part of last year trying to create a personal assistant and the biggest reason for its failure was that I had no real way to judge its output.
What annoys me is that they seem to have ignored all of the ways you might optimise for this, let alone produced different products that you could trade off against each other. I would love to have one AI optimised for being lauded by literary critics, one for maximum mid-wit upvotes, etc. And you could always mix and match weights afterwards.
I am skeptical that optimising for maths and engineering ability will produce intuitive social machines because, well…
So, an interesting part of this dynamic is that sometimes expanded capabilities spill over into seemingly less related areas more than you’d think. For example, you might naively think that limiting your model to English would make it better, smaller, and faster. It does make it smaller, but actually stripping away the foreign language capabilities degrades the pure English performance! It prevents overfitting, and there’s good reason to suspect that it also improves the more nebulous “reasoning” skills. So, it’s quite possible and maybe even probable that stripping away too much of one thing might degrade the whole model, rather than allowing it to “specialize”.
An interesting idea. I think it's not being actively pursued because companies like OAI don't see the economic value in such niche specialization unless it's for something as lucrative as, say, producing a superhuman programmer. There's not much money in winning the Nobel Prize for Literature.
They also seem to me to be hoping that it's better to have general capabilities, and then let the user elicit what they need through prompting. If you want high-brow literary criticism, ask for it specifically, but by default, they know that mid-brow LM Arena slop and fancy formatting wins over the majority of users. Notice how companies no longer make a big deal out of the potential to make private finetunes of their models, instead claiming that RAG or search is sufficient given their flexibility and large context lengths. Which is true, IMO.
OAI did kinda-sorta half-arse personalization with their custom GPTs, but found no traction. Just the standard model becoming better made them obsolete.
Heh. Good one. However, look at Elon Musk or Zuck for examples of people who definitely lean more on technical abilities instead of people skills.
Right, LLM writing is all about preference, but I find the Chinese models relatively witty.