This blog post uses a lot of sleight of hand to inflate the apparent significance of what is ultimately a pretty pissant finding. It may be that these benchmarks (most of which, incidentally, are relatively obscure, hardly justifying a conclusion about "most benchmarks") are hackable, but in practice models are not cheating on them. Anyone can independently run any Claude, Gemini, or OpenAI model on these problems and verify that they're solving them the hard way.

Programming is an extremely g-loaded activity. Technical interviews at Silicon Valley tech companies are not far from straight-up IQ tests. When I taught programming, I encountered a lot of students who were very diligent and motivated but hit a brick wall because they just didn't have the cognitive equipment to think at the level of abstraction required to reason about non-trivial programs. I think that, prior to the age of LLMs, you would be hard-pressed to find a working programmer with a 100 IQ. I doubt the same can be said of transistors.