
Culture War Roundup for the week of March 31, 2025

This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.

Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.

We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:

  • Shaming.

  • Attempting to 'build consensus' or enforce ideological conformity.

  • Making sweeping generalizations to vilify a group you dislike.

  • Recruiting for a cause.

  • Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.

In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:

  • Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.

  • Be as precise and charitable as you can. Don't paraphrase unflatteringly.

  • Don't imply that someone said something they did not say, even if you think it follows from what they said.

  • Write like everyone is reading and you want them to be included in the discussion.

On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.


Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad

Abstract: Recent math benchmarks for large language models (LLMs) such as MathArena indicate that state-of-the-art reasoning models achieve impressive performance on mathematical competitions like AIME, with the leading model, o3-mini, achieving scores comparable to top human competitors. However, these benchmarks evaluate models solely based on final numerical answers, neglecting rigorous reasoning and proof generation which are essential for real-world mathematical tasks. To address this, we introduce the first comprehensive evaluation of full-solution reasoning for challenging mathematical problems. Using expert human annotators, we evaluated several state-of-the-art reasoning models on the six problems from the 2025 USAMO within hours of their release. Our results reveal that all tested models struggled significantly, achieving less than 5% on average. Through detailed analysis of reasoning traces, we identify the most common failure modes and find several unwanted artifacts arising from the optimization strategies employed during model training. Overall, our results suggest that current LLMs are inadequate for rigorous mathematical reasoning tasks, highlighting the need for substantial improvements in reasoning and proof generation capabilities.

Background: The 'official' American competitive high school math circuit has several levels, progressing from AMC 10/12 (25 questions, multiple choice, 75 minutes total) to AIME (15 questions, 3 hours, answers are integers from 000 to 999) to USAMO (2 days, 6 proof-based questions total, 3 questions with 4.5 hours per day), with difficulty increasing as the number of questions decreases. While most AIME questions can be ground out using a standard set of high school/introductory college level math knowledge and tricks, the USAMO requires more depth of understanding and specialized techniques. For example, problem 1 (theoretically, the easiest) is as follows:

Let k and d be positive integers. Prove that there exists a positive integer N such that for every odd integer n > N, the digits in the base-2n representation of n^k are all greater than d.

This problem can be solved fairly simply using induction on k.
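For anyone who wants to poke at the statement numerically before attempting a proof, here is a minimal Python sketch. The function names `digits_base_2n` and `all_digits_exceed` are made up for illustration; the code only computes the digits by repeated division and proves nothing.

```python
def digits_base_2n(n: int, k: int) -> list[int]:
    """Digits of n**k written in base 2n, least significant digit first."""
    base = 2 * n
    value = n ** k
    digits = []
    while value > 0:
        # divmod peels off the lowest base-2n digit on each pass
        value, digit = divmod(value, base)
        digits.append(digit)
    return digits


def all_digits_exceed(n: int, k: int, d: int) -> bool:
    """True if every base-2n digit of n**k is strictly greater than d."""
    return min(digits_base_2n(n, k)) > d


if __name__ == "__main__":
    # 5**3 = 125 in base 10 has digits 5, 2, 1 (least significant first)
    print(digits_base_2n(5, 3))
```

With n = 5 and k = 3 the smallest digit is 1, so the property holds for d = 0 but fails for d = 1; the problem claims it eventually holds for any fixed d once n is large enough.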

I've also noticed this when plugging grad-level QM questions into Gemini/ChatGPT. No matter how many times I tell it that it's wrong, it will repeatedly apologize and make the same mistake, usually copied from some online textbook or solution set without being able to adapt the previous solution to the new context.

I asked Gemini 2.5 Pro Thinking to solve it. It claimed to have a solution. I asked for the most concise summary it could provide:

Okay, here's a concise summary of the proof, avoiding technical jargon:

The Problem: Imagine you have a way of writing numbers that isn't based on powers of 10 (like our usual 1s, 10s, 100s), but on powers of $2n$, where $n$ is some odd number (like powers of 6, if $n=3$, or powers of 10, if $n=5$). We look at the number $n^k$ ($n$ multiplied by itself $k$ times). We want to show that if you pick $n$ to be a really large odd number, then when you write $n^k$ in this "base $2n$" system, all the "digits" you use will be bigger than some target value $d$.

How the Proof Works:

  1. Figuring out the Digits: The proof first finds a way to calculate the digits used to write $n^k$ in the base $2n$ system. Let's call the digits $c_0, c_1, c_2, \dots$.
  2. The Last Digit ($c_0$): It turns out the very last digit ($c_0$, the "ones" place in this system) is simply $n$ itself. So, to make this digit bigger than $d$, we just need $n$ to be bigger than $d$.
  3. The Other Digits: The proof finds a formula for the other digits ($c_1, c_2,$ etc.). It shows that each digit $c_i$ is calculated by taking $n$, multiplying it by some specific odd number (which depends on $k$ and which digit it is), dividing by a power of 2 (like 2, 4, 8, ...), and then rounding down.
  4. Making Digits Large: Since the odd number multiplier is at least 1, each digit $c_i$ is at least roughly $n$ divided by some power of 2 (like $n/1, n/2, n/4, \dots, n/2^{k-1}$).
  5. The Threshold: If we want all digits to be bigger than $d$, we need even the smallest of these estimates (which involves dividing $n$ by the largest power of 2, $2^{k-1}$) to be bigger than $d$. The proof calculates that if $n$ is larger than a specific value $N$ (which depends on $k$ and $d$, specifically $N = 2^{k-1}(d+1)$), then every digit is guaranteed to be greater than $d$.
  6. Conclusion: Therefore, such a threshold $N$ exists. Pick any odd number $n$ larger than this $N$, and all the digits of $n^k$ in base $2n$ will be larger than $d$.

Was it Induction?

The main argument of the proof is not primarily by induction. It's a direct proof that calculates the digits and then finds a condition on $n$ (making it large enough) to satisfy the requirement.

However, induction was used for one small intermediate step: proving that $n^k$ always leaves a remainder of $n$ when divided by $2n$ (which showed that the last digit $c_0$ is indeed $n$). But the overall structure connecting the size of $n$ to the size of all digits didn't rely on induction.
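Two concrete claims in the summary above can be spot-checked by brute force: that the last digit $c_0$ equals $n$, and that $N = 2^{k-1}(d+1)$ works as a threshold. The sketch below does that over small ranges only. The helper names and the `span` parameter are hypothetical, and passing this check says nothing about whether the written proof is rigorous.

```python
def digits_base_2n(n: int, k: int) -> list[int]:
    """Digits of n**k in base 2n, least significant first."""
    base, value, digits = 2 * n, n ** k, []
    while value > 0:
        value, digit = divmod(value, base)
        digits.append(digit)
    return digits


def last_digit_is_n(n: int, k: int) -> bool:
    """Claim from the summary: for odd n, the lowest base-2n digit of n**k is n."""
    return digits_base_2n(n, k)[0] == n


def threshold_holds(k: int, d: int, span: int = 200) -> bool:
    """Check every odd n in (N, N + span], with N = 2**(k-1) * (d+1) as in the summary."""
    N = 2 ** (k - 1) * (d + 1)
    for n in range(N + 1, N + span + 1):
        if n % 2 == 1 and min(digits_base_2n(n, k)) <= d:
            return False  # found an odd n above N with a digit that is too small
    return True


if __name__ == "__main__":
    print(all(last_digit_is_n(n, k) for n in range(1, 100, 2) for k in range(1, 6)))
    print(all(threshold_holds(k, d) for k in range(1, 6) for d in range(1, 6)))
```

A finite search like this can only fail to find a counterexample; whether the argument holds up is a separate question from whether the stated threshold happens to be right.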

Here's the raw answer (minus reasoning trace):

https://rentry.org/5s6q6nxe

Hmm, it's maybe coming close to something that works, but seems to fuck up at the important junctures. After a couple of paragraphs where it doesn't find anything useful, in paragraph 7 it concludes that we can break down n^(k - i - 1) into q*2^(i+1) + r_i, where r_i = n^(k - i - 1) mod 2^(i+1). But then later it declares that r_i = n^(k - i - 1) and the proof follows from there. Unfortunately I don't think this would get any points, although maybe it could figure something out if you keep telling it where it fucks up.
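To make the objection concrete, here is a small Python sketch comparing the residue r_i = n^(k - i - 1) mod 2^(i+1) with the full power n^(k - i - 1). The function names are hypothetical, and the digit formula floor(n * r_i / 2^i) is my reading of the summary's "multiply by an odd number, divide by a power of 2, round down" description, so treat that part as an assumption.

```python
def residue(n: int, k: int, i: int) -> int:
    """r_i = n**(k-i-1) mod 2**(i+1), the remainder from the decomposition."""
    return pow(n, k - i - 1, 2 ** (i + 1))


def digit_direct(n: int, k: int, i: int) -> int:
    """The i-th base-2n digit of n**k, computed straight from the definition."""
    return (n ** k // (2 * n) ** i) % (2 * n)


def digit_from_residue(n: int, k: int, i: int) -> int:
    """Assumed formula: floor(n * r_i / 2**i), using the true residue r_i."""
    return (n * residue(n, k, i)) // 2 ** i


def digit_from_full_power(n: int, k: int, i: int) -> int:
    """Same formula, but with r_i wrongly replaced by n**(k-i-1)."""
    return (n * n ** (k - i - 1)) // 2 ** i


if __name__ == "__main__":
    n, k = 7, 3
    for i in range(k):
        print(i, digit_direct(n, k, i), digit_from_residue(n, k, i),
              digit_from_full_power(n, k, i))
```

For n = 7, k = 3 the residue version reproduces the actual digits, while the full-power substitution gives 343 for i = 0, which is not even a valid base-14 digit. Swapping r_i for n^(k - i - 1) is therefore not a harmless change of notation.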

/images/17435421010140572.webp

I copied your comment, and it insisted it was correct. I then shared the image, and it seems to think that the issue is imprecise terminology on its part rather than an actual error.

Here's the initial response:

https://rentry.org/yzvh9n47

After putting the image in:

https://rentry.org/c6nrs385

The important bit:

The proof never claims $r_i = n^{k-i-1}$. It uses $r_i = n^{k-i-1} \pmod{2^{i+1}}$ and the derived property $r_i \ge 1$.

Conclusion: The confusion likely arises from either the slightly ambiguous notation in the highlighted sentence (which should explicitly state "fractional part of ... is ...") or a misreading of the later step where the lower bound $r_i \ge 1$ is applied. The mathematical logic itself appears sound.

I kinda respect it doubling down, but it's scrambling to cover its ass. Also, I noticed it forgot the "mod 2n" part of c_i, which also throws a wrench into things.

Ah... I get it now. Thank you! I'm disappointed to see hallucination and confabulation here, but if you're inclined, do keep trying out Gemini 2.5 Pro Thinking in particular. It's a good model.