BigObjectPermanenceShill
Combustion-based fuels are powering only about 30% of the grid in Texas today, per ERCOT
Not sure why you are emphasizing this in Texas. It's not that they're using little gas, it's that Texas has had an enormous solar buildout (based) for reasons of having good solar resource and being generally pro-building-stuff (based). In any case Trump himself, and his crew, are not representative of the entire population of Red Tribe or Conservatives but more a culture war caricature thereof, and seem to be solidly in the pocket of Big Fossil, so express only marginally less skepticism of solar than they do for wind (cf. Trump's repeated appeals to Chynese windmills that are a scam which Chyna supposedly doesn't use domestically).
The proliferation of models and harnesses, times individual work styles, preferences and use cases, creates an exponentially large space, making it futile to diagnose the reason for your experience (Opus vs Sonnet? Claude vs GPT? Claude Code vs Codex? Tool use configuration? Sunspots?) or to give any advice. And besides, why engage in big-picture futuristic forecasting? Frontier labs should shill their product on their own dime, and their thesis will be proven right or wrong soon enough.
There's a clear object-level flaw in your writeup, however, and ironically it's the exact sort of confident slop we've come to associate with LLMs when they fall short of the standard of human reasoning over novel context. This isn't to dunk: the standard, see, is very high, and humans often need conscious effort to match it. That models can ever touch it is miraculous enough.
I mean this part:
The second happening is the ARC Prize people releasing version 3 of their AGI test suite, a series of puzzle games. They released it within a few hours of Jensen Huang saying he thinks the latest and greatest models are capable of AGI. Humans were capable of solving 100% of the puzzles. The highest scoring AI couldn't complete more than 0.5%.
Here are the AGI puzzles for anyone interested in trying them out: https://arcprize.org/arc-agi/3
You've played the games and you've thought of making them an argument, but you weren't curious enough to read up on the actual scoring rule. It's contentious enough that Chollet has had to make excuses for it on Hacker News.
To be clear:
- Each submission (i.e., an attempt to output a solution for a task) counts as one action. Internal reasoning steps do not count.
- For each task, the baseline is the second‑best number of actions taken by humans who attempted the task for the first time. Using the second‑best reduces the impact of luck.
- If the AI solves the task: (human_baseline_actions / ai_actions)², yes squared. If the AI fails: 0.
- Maximum per task is 1.0, which is to say even if the AI beats the human baseline, the score is capped at one point zero.
- Scores are weighted by task difficulty (later tasks in a game count more), then averaged across all games.
So if the human baseline is 10 actions:
- AI uses 10 actions → (10/10)² = 1.0
- AI uses 20 actions → (10/20)² = 0.25
- AI uses 100 actions → (10/100)² = 0.01
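The rule above can be sketched in a few lines of Python. The function name and signature are mine, not the official harness's; this is just the per-task formula as described, before difficulty weighting and averaging:

```python
# Sketch of the ARC-AGI-3 per-task score: squared efficiency ratio
# vs. the human baseline, capped at 1.0, zero on failure.

def task_score(human_baseline_actions: int, ai_actions: int, solved: bool) -> float:
    if not solved:
        return 0.0
    return min(1.0, (human_baseline_actions / ai_actions) ** 2)

# The worked example with a 10-action human baseline:
assert task_score(10, 10, True) == 1.0
assert task_score(10, 20, True) == 0.25
assert abs(task_score(10, 100, True) - 0.01) < 1e-12
# Beating the baseline is still capped:
assert task_score(10, 5, True) == 1.0
# An agent that solves EVERY task but takes ~14x the baseline actions
# still averages about 0.005, i.e. the headline 0.5%:
assert abs(task_score(10, 141, True) - 0.005) < 5e-4
```

Note how brutally the squaring punishes inefficiency: doubling the action count quarters the score.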
A score of 0.5% (0.005) does not mean that the best AI only solves 1/200th of the problems. The same score is achieved by an AI that solves every task but is ≈14.1 times less sample-efficient, since √(1/0.005) ≈ 14.1. But should we care? How much opportunity cost comes with wasted AI samples? A white-collar professional in the US earns a Claude Max (x10) subscription in 1-2 hours; Claude Max will generate ≈2 OOMs more tokens in a month than said professional can; even if they're 1000 times less useful per token, that's a massive bargain. We already routinely afford AI that's this inefficient. It'll be more efficient soon, though.
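To make the bargain arithmetic explicit, here is a back-of-envelope sketch using the paragraph's own rough numbers (the 160 working hours per month is my assumption, as is taking the pessimistic 2-hour end of the salary range):

```python
# Back-of-envelope check of the "massive bargain" claim.
# Every number is a rough assumption from the text, not a measurement.
salary_hours_per_subscription = 2    # "earns a Claude Max (x10) subscription in 1-2 hours"
working_hours_per_month = 160        # assumed full-time month
token_ratio = 100                    # "~2 OOMs more tokens in a month"
value_per_ai_token = 1 / 1000       # "1000 times less useful per token"

cost_ratio = salary_hours_per_subscription / working_hours_per_month   # AI cost / human cost
useful_output_ratio = token_ratio * value_per_ai_token                 # AI useful output / human
value_per_dollar = useful_output_ratio / cost_ratio

print(round(value_per_dollar))  # prints 8
```

Even under these deliberately unflattering assumptions, the AI yields roughly 8x more useful output per dollar than the human it's compared against, which is the sense in which routine inefficiency is already affordable.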
Which is not to say it'll be cheaper. Consider that car rentals in American cities go for roughly one monthly Claude Max subscription a day. Sure, the US is a tough place and a car preserves you from being stabbed in the neck on public transport, but we can quantify the micromorts and assign a cost to them; after that, does a car for a day provide as much economic value as a fully exploited Claude Max for a month, i.e. tens of millions of Opus-grade output tokens? Seeing how fast OpenAI and especially Anthropic revenues have been growing, what do you think their asking price will be once all dumping from also-rans is rendered irrelevant?
Right now the cost of tokens is suppressed by the lingering user-base acquisition phase, hardware gains, rapid competitive model churn and, more importantly, by the threat of cheap open-weights models, mostly Chinese, increasingly Nvidia. Should those fall far enough behind, together with other minor competitors, and we enter the territory of a Frontier Cartel, expect $1000/month subscriptions as the baseline. (This, fyi, is implicitly Dario Amodei's theory of victory: see him invoking Cournot equilibrium on Dwarkesh's podcast.) I pray we don't. But people would pay for it. These systems aren't a joking matter; being shut out of them will be quite literally existentially threatening for many businesses soon enough.
The best model+harness scores 36%, by the way. But I'm more impressed by the 12.58% scored by a 4-layer CNN called StochasticGoose. Read this piece, it contains some pretty neat analysis.
Chollet's evals are neat too, but he's pushing a narrative against machine superintelligence from within the Deep Learning paradigm, and he's getting embarrassingly biased, with ever more abstract justifications for denying what looks like inevitability.

My pessimistic hypothesis is that people use AI much more rarely, and less intensely, than paranoiacs think. I'm sometimes accused of AI use for allowing something of a purple prose aspect to my writing, and strongly suspect that the general tastelessness of AI and specific quirks like "it's not A — it's B" is downstream of cocksure, overwrought, incisive, journalistic op-ed prose having been used for RLHF as positive examples, because somewhere in 2022-23 someone a) had built a reranker for High Quality Data and b) had commissioned a lot of "powerful persuasive essays to make you think"/"dashing intelligent opinions" on MTurk/Fiverr. See this debate between two South Asians. They both write "like AI". I'm pretty sure that Human's posts at this point are an amalgamation of human text, AI text and human-interiorized AI-patterns, and Count even describes his workflow. Not being native speakers nor bearers of layman Anglophone culture, they know not what they do; and they never saw the issue with this manner of unnatural writing before the widespread hatred for "AI". And Konrad, well, Konrad is a dramatic Internet personality, he writes to persuade and to show off, he is another source of this pattern rot.
That said, to an extent it's just good, product-grade writing. Less abrasive than most human work on contentious topics (imagine the hissy fit Claude would throw over an offhand appeal to "South Asians" here), well-proportioned, avoiding too-rare words and concepts that readers might stumble over, and almost too perfect, devoid of glaring ESLisms or identifying personal blemishes.