ControlsFreak
5 followers   follows 0 users   joined 2022 October 02 23:23:48 UTC
User ID: 1422

Fair enough. Thanks for clarifying.

Do you have any thoughts you'd be willing to share on what I wrote concerning how much knowledge work currently has to be supplied as input for tasks like the one I was thinking about? I suppose I wasn't entirely clear, but I think it would likely fail to do the analysis task on its own. For clarity, this is a task I thought, "It might be weird enough that no one's done it yet, but it's close enough to the standard stuff that I could almost certainly give it to a student who did well in their flight mechanics course, and they could almost certainly just do it." That seems to have been partly justified, in that I found a publication in which a student did just do it (and skimming the paper, the analysis seems about on par with what I had expected; I guess my flaw was thinking the idea was sufficiently 'weird'; it says something about the state of aerospace that someone out there has done almost every basic variant, sort of regardless of whether it makes sense to do). I'm probably <50% on whether it would make the "right" engineering implementation choices on its own; I don't have a precise number. It might get lucky, because there's a pretty large set of choices available, and I hadn't yet tailored the problem so that the model has to really think conceptually about what's going on and pick from only a small subset; there's a good enough chance that it could guess somewhat randomly, or pick a popular choice that happens to work (though I'm not sure it would put the right context around it even if it did).

Perhaps, given your comment below, this is just something that you mostly don't care about. Does this sort of thing just bucket into, "No, it can't do this sort of knowledge work now, but with sufficient recursive self-improvement, it will be able to do it later"? (I guess, in line with your stated AGI timelines?)

I don't think you're quite clear in the post as to what camp you're actually in. Are you a straight bull? As in, do you think it can currently replace a sizeable portion of human knowledge work?

Moreover, it is not clear how knowledge work that is not coding qua coding fits into your schema. For example, I have in mind a flight dynamics simulation/control task. I'm not settled on it yet. My plan was to include a little twist that I had thought would likely not be in the published literature, but which I'm sure I could manage without too much difficulty, just pulling one book off of my shelf, confirming where exactly I need to make the modification and how (it's been a long time, but it's something I'm confident I could do without extreme effort), and then coding it. Unfortunately, I looked, and some darned student already published it (only minimal code published AFAICT, but they wrote out all the analysis in detail, so I can't really purely test its ability to do this aspect of the knowledge work on its own), so I'm trying to think of another good variant.

There are other little twists I had in mind, hoping to prevent it from being able to just pull code directly from others. These twists are things I've personally coded in the past, so I know they're doable. But the point is that they require sufficient knowledge to make choices along the way (for one example: choosing this algorithm for this part because I know it has certain characteristics), and I think they prevent it from just reusing someone else's work for the core simulation components.
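To make the flavor of "choices along the way" concrete, here is a deliberately toy sketch (everything below is my own illustration, not the actual task or its numbers): a linear short-period pitch approximation, where the integrator itself is one of those knowledge-informed choices, a fixed-step classic RK4 picked for its stability characteristics at moderate step sizes rather than, say, forward Euler.

```python
# Toy illustration only: a linear short-period pitch model with
# made-up stability derivatives, integrated with classic fixed-step RK4.
# The "knowledge work" is in choices like these, not the typing.

def short_period_deriv(x, Z_alpha=-1.2, M_alpha=-4.0, M_q=-1.5):
    """x = [alpha (rad), q (rad/s)]; returns xdot for the linear model.

    The derivative values are illustrative, not from any real aircraft.
    """
    alpha, q = x
    return [Z_alpha * alpha + q,
            M_alpha * alpha + M_q * q]

def rk4_step(f, x, dt):
    """One classic RK4 step: a deliberate 'characteristics' choice."""
    k1 = f(x)
    k2 = f([xi + 0.5 * dt * ki for xi, ki in zip(x, k1)])
    k3 = f([xi + 0.5 * dt * ki for xi, ki in zip(x, k2)])
    k4 = f([xi + dt * ki for xi, ki in zip(x, k3)])
    return [xi + dt / 6.0 * (a + 2 * b + 2 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]

def simulate(x0, dt=0.01, t_final=10.0):
    """Propagate the state from x0 for t_final seconds."""
    x = list(x0)
    for _ in range(int(t_final / dt)):
        x = rk4_step(short_period_deriv, x, dt)
    return x

if __name__ == "__main__":
    # A 0.1 rad angle-of-attack perturbation should damp out.
    print(simulate([0.1, 0.0]))
```

The interesting question in the text above is whether the model would make choices like this one (and justify them) on its own, or only after being told.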

I guess, where does this fit within your schema, and where are you with respect to your own opinions? There is a lot of room between, "I personally know how to architect this code, what algorithms/assumptions to use, how to modify the analysis for the instant case, and then I use Claude to help with building the components", "I do the analysis, give it to it, tell it to code up the whole thing, then I go in and tell it to change things to make better choices that fit my knowledge-work-educated beliefs on how it should be done," and, "I tell it to code up the whole thing, maybe tell it that something's broken, but part of the test is whether it made the right analysis and knowledge-work-educated choices on its own along the way."

In other words, what I'm interested in is not so much about what it can do in terms of coding qua coding. It could be utterly magical at that, and that would be great. But how much of my own knowledge work do I need to input to get it to code the "right" thing, versus how much it's able to make the correct choices on its own about what the "right" thing is.

You know what? I don't think he is engaging with the article. The article specifically mentions GPT 5.2 Pro seven times, two of which seem, to my read, to imply that that's what he's using. There is one moment where he just says "GPT 5 Pro". Perhaps he just happened to leave off the ".X" in this one spot. Perhaps I'm reading the other seven mentions of GPT 5.2 Pro wrong, and the dirty secret is that he's using 5.0. I suppose he doesn't say in big bold highlighted words, "I'm definitely using 5.2 and not 5.0," so sure, maybe one could say that it would be nice to have a clear statement.

...but to come in, with one sketchy textual inference, and just boldly declare that the only way anyone could possibly be reporting the experience they're reporting is obviously just because they're using a six month old model, and that obviously it's now totally fixed... it's the same SMH annoyance at someone being annoying and arrogant.

In fairness, perhaps he only read my comment and not the article (thus, not engaging with the article), and in fairness, I did blockquote the one spot where he seemed to have left off the ".X". But yeah, "I didn't RTFA, but I'm going to boldly declare that I've diagnosed exactly what's going on, using the same tired objection," is pretty cold comfort.

The article discusses Erdos problems and Aletheia's performance on "First Proof".

Why is there always someone who blows up with such attitude, while not really engaging with anything?

But you will also notice the absence of the issues you are facing.

Let's turn it around. What version of mathematician are we dealing with here? What's your h-index? Have you used any particular LLMs, regardless of particular model/scaffold, to solve components of your own publishable mathematics research? Can you personally attest to not encountering any issues like this? I just don't understand this insistence on declaring where the frontier is without actually looking at it.

Math Prof Daniel Litt talks about LLMs and math proofs

It seems to me to be a balanced take. He's bullish and hopeful about the future while trying to be accurate and realistic about current capabilities, and he remains somewhat concerned about possible problems. For example, on the bullish/hopeful side:

I think I have been underrating the pace of model improvements. In March 2025 I made a bet with Tamay Besiroglu, cofounder of RL environment company Mechanize, that AI tools would not be able to autonomously produce papers I judge to be at a level comparable to that of the best few papers published in 2025, at comparable cost to human experts, by 2030. I gave him 3:1 odds at the time; I now expect to lose this bet.

In discussing the current state, he focuses on "First Proof", a set of ten lemmas from current researchers' unpublished papers. He discusses the performance of different groups, different models, and different scaffolding. There are positive and negative notes. One personal example from his own endeavors:

One of the ways I like to test the models is to give them a hard problem, and then see how long it takes me to cajole/guide/bully them into giving me a correct solution. For a lemma from one of my papers, it is typically quite difficult or impossible to get a complete proof without any hints. In one case I devoted, as an experiment, 8 hours (admittedly some of which I spent away from the keyboard in frustration) trying to get GPT 5 Pro to produce a relatively simple counterexample to some statement without hints. The models do much better if one gives them a hint. Frontier models can often execute arguments I would consider "routine" if one explains the general idea in a sentence or two. It's easy to take this as evidence for usefulness, but against automatability. This is wrong. Instead of saying, *it takes 8 hours of human labor, or giving away the main idea*, we should say, *all it takes is 8 hours of labor or the one-sentence main idea*.

My sense is that he's doing this with problems where he knows the solution (to some level; I could probably write a whole post on the different levels of "knowing" a solution for a piece of mathematics). There is great promise here, but also a note of concern. To state that concern somewhat more concisely, he writes:

In the near term, we're in trouble. Models are able to produce both correct, interesting mathematics, as well as incorrect mathematics that is exceedingly labor-intensive to detect. Academic mathematics is simply not prepared to handle this.

This again seems reasonable to me, given my own experiences. Yes yes, I haven't used every model and every scaffold (some of the systems he discusses are not publicly available at any price). When I've known the solution, I can probably get it there. When I've not known the solution, I have to say that at best, it's been good at helping me find other results in the literature that might be helpful. It is, indeed, labor-intensive and quite frustrating to have to carefully pore over every detail, trying to see if it went astray when generating a mountain of text. Then, when you find something wrong, maybe not even having verified the rest of it, it'll happily produce another mountain of text, and it feels like you're starting from square one. When you're already confident that you know a method will work, then it's mostly just a test of will to see if you can get it to figure it out. When you don't know, the question of whether you potentially waste mountains of time on what may be a dead end or just proceed on your own becomes far more difficult, and you have to make that decision repeatedly along the way.

I hate to bring this up, but it's also quite frustrating that when I say things like this, the most common response is that it's a "skill issue" or that I'm just not paying the right quantity of dollars for so-and-so's preferred model. So, maybe this testimony will help allay some of those concerns.

And yeah, Sagan help us when it comes to reviewing the mountain of papers we're going to get submitted to journals/conferences that are more LLM than human in the meantime.

He ends very hopeful:

Let us take this to an absurd extreme. Suppose we had a library filled with proofs of every theorem of ZFC, as well as excellent guides that could, given a question, take us to the answer and explain it. What would a mathematician do in such a library?

If you ask the question this way, the answer becomes clear: they would be unbelievably excited, and immediately get to work. They would immediately start asking questions: how does one prove the Riemann hypothesis? The Hodge conjecture? Their own pet obsession (in my case, the Grothendieck-Katz p-curvature conjecture)? Then they would work until they understood the answer. The job would not be done, not even close.

I do not mean to suggest, even, that humans necessarily have an intrinsic edge in asking mathematical questions that are interesting to humans; that is certainly the case now (and I suspect it will be for some time), but I see no principled reason it should be true. I just mean that this is why we got into mathematics: we want to understand. That's the goal.

Totally agreed. And something like LLMs paired with automated theorem provers seems incredibly well-suited to potentially get us toward something like this. It seemed natural that they'd be great at translating between humans and machines in terms of code, and we've seen great strides there. It seems natural here, too. We're not there yet, but there's hope.
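As a toy illustration of what the "LLM plus theorem prover" pairing buys you: the artifact is a machine-checkable proof, so a reviewer doesn't have to pore over a mountain of prose to trust it. A trivial Lean 4 example (the lemma name `Nat.add_comm` is from Lean's standard library):

```lean
-- If this compiles, it is correct; the checker, not a human,
-- carries the verification burden.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

The hard open problem, of course, is translating research-level informal mathematics into statements like this at all, not checking them once stated.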

That's fair enough as a concern for this case, but I would say that it is a different argument from the way that 'the lesser power is included in the greater' is typically invoked. His formulation shows that, perhaps, more granularity is not even a 'lesser' power. It's still not completely conceptually clear to me, but there's something to be said for a more careful analysis.

What you seem to be saying is that, even if one supposes that the granular tariff power is, in some sense, a 'greater' power than shutting off trade entirely, there is still a sort of equivalent 'greater' power in quotas. Again, this is plausible, and I'd want more conceptual exploration of how law should treat cases where there seem to be roughly equivalent, but (I don't know what to call it) "different track" powers.

Alex Tabarrok just made this point as well, and he has useful analogies to illustrate it. I'm not 100% sure it's entirely correct, but it's definitely plausible, and it's a point that is sure to be bouncing around in my mind for a while.

I don't think there is a sufficient quantity of drugs in the entire supply chain to explain it.

It would turn off the purists. It's an indictment of our society that we haven't developed technology that allows the TV viewer to select whether they want a neon green simulpuck or not on their own TV. This is truly the most important technological challenge of our time.

It's the Star Wars meme.

Now that the Supreme Court got rid of the tariffs, the prices are going to go down!

...the prices are going to go down, right?

Conservatives know this deep down, but they don't want to admit it because it conflicts with the First Principle.

No, it really doesn't. At best, you've just found that some people aren't good at applying the First Principle. That doesn't mean the First Principle is wrong.

EDIT: In fact, I'd say that it's likely that you're committing the New Atheist error in thinking that if morality is a thing, it must obviously be an obvious thing that any decent (seemingly-similarly-inclined) person can easily just intuit. And thus, when one sees some number of one's co-(anti-)religionists go off the deep end, one concludes that morality didn't real in the first place.

Instead, it's actually somewhat difficult to cultivate and propagate. It doesn't help that the wickedness of man is great on the earth.