birb_cromble

0 followers   follows 0 users
joined 2024 September 01 16:16:53 UTC

User ID: 3236

Are they lying? Was the kernel made up?

Cases like this, and the Erdős problems, are exactly where LLMs shine. Problems with clear, unambiguous reward functions that are difficult to hack are perfect use cases. In the Alibaba case, they likely have an extensive set of characterization tests that guarantee consistent behavior. An LLM with a good harness can pound its head against those tests forever while simultaneously measuring performance as its success metric. It will never get tired, and it won't get sick of doing that kind of work.
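To make "a good harness" a bit more concrete, here's a rough Python sketch of the loop I have in mind. Everything in it is made up for illustration: `optimize`, `passes_characterization`, and `measure` are hypothetical names, and the `candidates` list stands in for whatever rewrites the model proposes. It's a toy under those assumptions, not a description of what Alibaba or anyone else actually runs.

```python
import time
from typing import Callable, Iterable, List, Sequence, Tuple

# Hypothetical type: an implementation maps one test input to one output.
Impl = Callable[[Sequence[int]], List[int]]


def passes_characterization(candidate: Impl, reference: Impl,
                            cases: Iterable[Sequence[int]]) -> bool:
    """True only if the candidate matches the reference on every case."""
    for case in cases:
        try:
            if candidate(case) != reference(case):
                return False
        except Exception:
            # A candidate that crashes counts as a behavior change.
            return False
    return True


def measure(impl: Impl, cases: Sequence[Sequence[int]], repeats: int = 5) -> float:
    """Best wall-clock time (seconds) over a few runs of the whole case set."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for case in cases:
            impl(case)
        best = min(best, time.perf_counter() - start)
    return best


def optimize(reference: Impl, candidates: Iterable[Impl],
             cases: Sequence[Sequence[int]]) -> Tuple[Impl, float]:
    """Keep the fastest candidate that does not change observed behavior."""
    best_impl, best_time = reference, measure(reference, cases)
    for candidate in candidates:
        # Correctness gate first: the tests are the hard-to-hack part.
        if not passes_characterization(candidate, reference, cases):
            continue
        elapsed = measure(candidate, cases)
        # Performance is the success metric, but only among correct candidates.
        if elapsed < best_time:
            best_impl, best_time = candidate, elapsed
    return best_impl, best_time
```

The point of the ordering is that speed is only ever compared among candidates that already pass every test, so a fast but wrong rewrite can never win. The expensive part is that the loop over `candidates` can be arbitrarily long.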

There's definitely value there, but I don't know how much value. The combination of technical depth and strong guardrails makes for a very schizophrenic kind of difficulty. That kind of work is traditionally the domain of either a plucky junior with too much energy or an insane wizard who claimed a broom closet as his office.

When we've experimented with that kind of optimization work at my employer, it has tended to be very expensive, since most of the results come from the absolute tirelessness of the agent. In comparison, how much are you paying your junior? How much are you paying your wizard, and what is he doing if he's not doing that task? Security scans are a similar thing. Line audits aren't hard, but they're hella time-consuming. As model costs rise (and they are rising per task completed when you compare any single vendor over time), it might legitimately be cheaper to throw interns at the problem than LLMs.

At least on the software side, I think there's a reasonable chance that what we're seeing is a temporary pop from a lot of highly verifiable technical-debt deadwood finally getting burned off, and that it might not be a constant source of demand.

On the war side, I wish I knew more. The sensitive nature of the topic means that all parties are incentivized to obfuscate and dissemble as much as possible. It might legitimately be an ideal case. LLMs do well when you can accept 95% accuracy, and in something like intelligence analysis, 95% accuracy probably has the spooks all but shitting their pants.