ControlsFreak

5 followers   follows 0 users
joined 2022 October 02 23:23:48 UTC

User ID: 1422

You know what? I don't think he is engaging with the article. The article specifically mentions GPT 5.2 Pro seven times, two of which seem, to my read, to imply that that's what he's using. There is one moment where he just says "GPT 5 Pro". Perhaps he just happened to leave off the ".X" in this one spot. Perhaps I'm reading the other seven mentions of GPT 5.2 Pro wrong, and the dirty secret is that he's using 5.0. I suppose he doesn't say in big bold highlighted words, "I'm definitely using 5.2 and not 5.0," so sure, maybe one could say that it would be nice to have a clear statement.

...but to come in, with one sketchy textual inference, and just boldly declare that the only way anyone could possibly be reporting the experience they're reporting is obviously just because they're using a six month old model, and that obviously it's now totally fixed... it's the same SMH annoyance at someone being annoying and arrogant.

In fairness, perhaps he only read my comment and not the article (thus, not engaging with the article), and in fairness, I did blockquote the one spot where he seemed to have left off the ".X". But yeah, "I didn't RTFA, but I'm going to boldly declare that I've diagnosed exactly what's going on, using the same tired objection," is pretty cold comfort.

The article discusses Erdos problems and Aletheia's performance on "First Proof".

Why is there always someone who blows up with such attitude, yet appearing to not really engage with anything?

But you will also notice the absence of issues you are facing.

Let's turn it around. What version mathematician are we dealing with here? What's your h-index? Have you used any LLMs, regardless of particular model/scaffold, to solve components of your own publishable mathematics research? Can you personally attest to not encountering any issues like this? I just don't understand this insistence on not looking at the frontier while insisting on where it is.

Math Prof Daniel Litt talks about LLMs and math proofs

It seems to me to be a balanced take. He's bullish and hopeful on the future, while trying to be accurate/realistic about current capabilities, and remaining somewhat concerned about possible problems. For example, on the bullish/hopeful side:

I think I have been underrating the pace of model improvements. In March 2025 I made a bet with Tamay Besiroglu, cofounder of RL environment company Mechanize, that AI tools would not be able to autonomously produce papers I judge to be at a level comparable to that of the best few papers published in 2025, at comparable cost to human experts, by 2030. I gave him 3:1 odds at the time; I now expect to lose this bet.

In discussing the current state, he focuses on "First Proof", which is a set of ten lemmas from current researchers' unpublished papers. He discusses the performance of different groups, different models, and different scaffolding. There are positive and negative notes. One personal example from his own endeavors:

One of the ways I like to test the models is to give them a hard problem, and then see how long it takes me to cajole/guide/bully them into giving me a correct solution. For a lemma from one of my papers, it is typically quite difficult or impossible to get a complete proof without any hints. In one case I devoted, as an experiment, 8 hours (admittedly some of which I spent away from the keyboard in frustration) trying to get GPT 5 Pro to produce a relatively simple counterexample to some statement without hints. The models do much better if one gives them a hint. Frontier models can often execute arguments I would consider "routine" if one explains the general idea in a sentence or two. It's easy to take this as evidence for usefulness, but against automatability. This is wrong. Instead of saying *it takes 8 hours of human labor, or giving away the main idea*, we should say *all it takes is 8 hours of labor or the one-sentence main idea*.

My sense is that he's doing this with problems where he knows the solution (to some level; I could probably write a whole post on the different levels of "knowing" a solution for a piece of mathematics). There is great promise here, but also a note of concern. To state that concern somewhat more concisely, he writes:

In the near term, we're in trouble. Models are able to produce both correct, interesting mathematics, as well as incorrect mathematics that is exceedingly labor-intensive to detect. Academic mathematics is simply not prepared to handle this.

This again seems reasonable to me, given my own experiences. Yes yes, I haven't used every model and every scaffold (some of the systems he discusses are not publicly available at any price). When I've known the solution, I can probably get it there. When I've not known the solution, I have to say that at best, it's been good at helping me find other results in the literature that might be helpful. It is, indeed, labor-intensive and quite frustrating to have to carefully pore over every detail, trying to see if it went astray when generating a mountain of text. Then, when you find something wrong, maybe not even having verified the rest of it, it'll happily produce another mountain of text, and it feels like you're starting from square one. When you're already confident that you know a method will work, then it's mostly just a test of will to see if you can get it to figure it out. When you don't know, the question of whether you potentially waste mountains of time on what may be a dead end or just proceed on your own becomes far more difficult, and you have to make that decision repeatedly along the way.

I hate to bring this up, but it's also quite frustrating that when I say things like this, the most common response is that it's a "skill issue" or that I'm just not paying the right quantity of dollars for so-and-so's preferred model. So, maybe this testimony will help allay some of those concerns.

And yeah, Sagan help us when it comes to reviewing the mountain of papers we're going to get submitted to journals/conferences that are more LLM than human in the meantime.

He ends very hopeful:

Let us take this to an absurd extreme. Suppose we had a library filled with proofs of every theorem of ZFC, as well as excellent guides that could, given a question, take us to the answer and explain it. What would a mathematician do in such a library?

If you ask the question this way, the answer becomes clear: they would be unbelievably excited, and immediately get to work. They would immediately start asking questions: how does one prove the Riemann hypothesis? The Hodge conjecture? Their own pet obsession (in my case, the Grothendieck-Katz p-curvature conjecture)? Then they would work until they understood the answer. The job would not be done, not even close.

I do not mean to suggest, even, that humans necessarily have an intrinsic edge in asking mathematical questions that are interesting to humans; that is certainly the case now (and I suspect it will be for some time), but I see no principled reason it should be true. I just mean that this is why we got into mathematics: we want to understand. That's the goal.

Totally agreed. And something like LLMs with automated theorem provers seems incredibly well-suited to potentially get us toward something like this. It seemed natural that they'd be great at translating between humans and machines in terms of code, and we've seen great strides there. It seems natural here, too. We're not there yet, but there's hope.
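To make the theorem-prover point concrete: the appeal of pairing an LLM with a system like Lean is that the kernel, not a human, does the labor-intensive checking. A toy sketch (my own illustration, not from the article; the lemma name is made up):

```lean
-- Toy illustration of machine-checkable mathematics in Lean 4.
-- An LLM could draft a proof like this, and the kernel verifies it;
-- no human has to pore over a mountain of text looking for the flaw.
theorem add_comm_toy (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

If the draft proof is wrong, the checker rejects it outright, which is exactly the failure mode (plausible-but-incorrect mathematics) that's so expensive to catch by hand.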

That's fair enough as a concern for this case, but I would say that it is a different argument from the way that 'the lesser power is included in the greater' is typically invoked. His formulation shows that, perhaps, more granularity is not even a 'lesser' power. It's still not completely conceptually clear to me, but there's something to be said for a more careful analysis.

What you seem to be saying is that, even if one supposes that the granular tariff power is, in some sense, a 'greater' power than shutting off trade entirely, there is still a sort of equivalent 'greater' power in quotas. Again, this is plausible, and I'd want more conceptual exploration of how law should treat cases where there seem to be roughly equivalent, but (I don't know what to call it) "different track" powers.

Alex Tabarrok just made this point as well, and he has useful analogies to illustrate it. I'm not 100% sure it's entirely correct, but it's definitely plausible and a point that will surely be bouncing around in my mind for a while.

I don't think there is a sufficient quantity of drugs in the entire supply chain to explain this.

It would turn off the purists. It's an indictment of our society that we haven't developed technology that allows the TV viewer to select whether they want a neon green simulpuck or not on their own TV. This is truly the most important technological challenge of our time.

It's the Star Wars meme.

Now that the Supreme Court got rid of the tariffs, the prices are going to go down!

...the prices are going to go down, right?

Conservatives know this deep down, but they don't want to admit it because it conflicts with the First Principle.

No, it really doesn't. At best, you've just found that some people aren't good at applying the First Principle. That doesn't mean the First Principle is wrong.

EDIT: In fact, I'd say that it's likely that you're committing the New Atheist error in thinking that if morality is a thing, it must obviously be an obvious thing that any decent (seemingly-similarly-inclined) person can easily just intuit. And thus, when one sees some number of one's co-(anti-)religionists go off the deep end, one concludes that morality didn't real in the first place.

Instead, it's actually somewhat difficult to cultivate and propagate. It doesn't help that the wickedness of man is great on the earth.