
Transnational Thursday for November 27, 2025

Transnational Thursday is a thread for people to discuss international news, foreign policy, or international relations history. Feel free to drop in with coverage of countries you're interested in, talk about ongoing dynamics like the wars in Israel or Ukraine, or even just share whatever you're reading.


As an aside, this is the biggest source of my AI skepticism. AI will not be useful at scale unless it is truly reliable, which the current state of the art emphatically is not. The problem is not merely that it can fail to complete a task, but that it confidently pretends to have succeeded. In fact the models do not seem to be capable of differentiating on their own between success and pretend-success. This puts a hard limit on what kinds of tasks they can perform and at what scale. People like to describe working with an LLM assistant as having a fast-working junior employee always at your beck and call (you can offload your tasks, but you'll need to check its work), but for most applications it seems more like having a dodgy outsourcing firm on call. Not only do you have to check its work, its errors are bizarre and can be deeply hidden, and it will always project total confidence whether the results are perfect or nonexistent.

The lack of progress on this front by any of the major LLM companies makes me think it’s going to take a fairly significant breakthrough to fix, not merely “moar compute,” which makes the aggressive push for AI-everything seem… premature, shall we say. Certainly it does not seem to me that AGI is just around the corner.

In fact the models do not seem to be capable of differentiating on their own between success and pretend-success.

Of course! If there were a way to evaluate the quality of the result, the hyper-smart people earning billions of dollars would have thought of something as trivial as appending "if the result is of low quality, try doing better" to the end of the AI pipeline. If we, as end users, see low-quality results, that is hard evidence that their best efforts at evaluating the quality of the results are failing. Otherwise they'd have built a perfect AI chat and moved from billions to trillions.
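
For what it's worth, that naive fix is trivial to write down; the catch is who does the grading. Here's a minimal sketch in Python of the "check your own work, retry if bad" loop, where call_model, the prompts, and the GOOD/BAD protocol are all placeholders of my own invention, not any vendor's actual API:

```python
def call_model(prompt: str) -> str:
    """Hypothetical stand-in for whatever LLM API you use."""
    raise NotImplementedError


def generate_with_self_check(task: str, max_retries: int = 3) -> str:
    # First attempt at the task.
    answer = call_model(f"Complete this task:\n{task}")
    for _ in range(max_retries):
        # Ask the same model to grade its own output.
        verdict = call_model(
            "Rate the following answer as GOOD or BAD.\n"
            f"Task: {task}\nAnswer: {answer}"
        )
        # Crude string check; this is a sketch, not production parsing.
        if "GOOD" in verdict.upper():
            return answer  # the model has approved its own work
        # On a BAD verdict, try again.
        answer = call_model(
            f"The previous answer was judged inadequate. Try again:\n{task}"
        )
    return answer  # retries exhausted; delivered with full confidence regardless
```

The structural problem is visible right in the sketch: the verifier is the same model that produced the answer, so it shares all of the generator's blind spots. Where the model can't tell success from pretend-success, the loop just converges on confident pretend-success.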