This is an excellent comment, and I largely agree with your taxonomy and framing. In particular, I think you’re exactly right that reference-class forecasting shines most when you have (a) stable baselines and (b) a well-posed question to begin with. Your distinction between known unknowns and unknown unknowns maps very cleanly onto where forecasting techniques feel powerful versus where they feel brittle in practice.
Your intelligence-analysis perspective also rings true to me. Using the outside view as a stabilizer against excited inside-view narratives is, in my experience, one of the highest-leverage applications of forecasting. In most real-world settings, the dominant failure mode isn’t underreaction but overreaction to new, salient information, and reference classes are a very effective corrective.
Where I’d push back slightly—and I mean this as a nuance rather than a rejection—is on COVID as an example of a true black swan in the Taleb sense.
I agree completely with your café-owner framing: for many individuals, COVID was effectively unaskable ex ante, and therefore indistinguishable from an unknown unknown. At the decision-maker level, it absolutely behaved like a black swan. That’s an important and underappreciated point.
However, at the system level, I’m less convinced it was unforeseeable. A number of people did, in fact, raise the specific risk in advance:
- Bill Gates publicly warned in 2015 that global pandemic preparedness was dangerously inadequate and that a fast-moving virus was a more realistic threat than many conventional disaster scenarios.
- The Wuhan Institute of Virology had been criticized multiple times prior to 2020 for operating at biosafety levels below what many thought appropriate for the research being conducted.
- More broadly, pandemic risk had a nontrivial base rate in the epidemiology and biosecurity literature, even if the exact trigger and timing were unknown.
On a more personal note (and not meant as special pleading), I discussed viral and memetic contagion risks repeatedly in The Dark Arts of Rationality: Updated for the Digital Age, which was printed several months before COVID.
All of which is to say: COVID may not have been a black swan so much as a gray rhino—a high-impact risk that was visible to some, articulated by a few, but ignored by most institutions and individuals because it didn’t map cleanly onto their local decision models.
I think this distinction matters for forecasting as a discipline. It suggests that one of the core failures isn’t predictive ability per se, but attention allocation: which warnings get surfaced, amplified, and translated into actionable questions for the people whose decisions hinge on them. In that sense, I think you’re exactly right that Tetlock’s next frontier—teaching people how to ask better questions—is the crux.
So I’d summarize my position as: Forecasting works best in domains with history and well-posed questions, struggles at the edges, and fails catastrophically when important questions never get asked. But some events we label “unpredictable” may actually be predictable but institutionally invisible—which is a slightly different (and potentially more tractable) failure mode.
Curious whether that distinction resonates with your experience in intelligence work, or if you think I’m still underestimating the true weight of the unknown-unknown problem.

This is a very thoughtful comment—thank you for taking the time to lay it out so clearly. Also, thanks for the reading recommendations; I’m familiar with Psychology of Intelligence Analysis, but I haven’t read all three you listed, and I appreciate the pointers. The intelligence-community framing is very much adjacent to how I think about this problem.
Let me try to respond to both the theoretical and practical questions in turn.
Theoretical question: what assumptions are superforecasters actually making?
I think your concern is a real one, and I don’t think there’s a fully satisfying, formally rigorous answer yet.
You’re right that most forecasting implicitly assumes something like: there exists a stable-enough probability distribution over futures that can be approximated and scored. And you’re also right that if the underlying distribution is heavy-tailed, discontinuous, or adversarial in the wrong ways, then many common scoring and evaluation methods can look “good” right up until they catastrophically fail. Finance is full of examples of exactly this dynamic.
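To make the finance point concrete, here is a toy simulation (all parameters invented for illustration, not a description of any real strategy): a position that collects a small premium most months and takes a rare catastrophic loss. Standard track-record metrics can look superb for years before the tail shows up in the sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "tail-risk-selling" return stream: small premium most months,
# rare catastrophic loss. All parameters are invented for illustration.
n_months = 240                      # 20 years of monthly returns
premium, crash, p_crash = 0.02, -0.85, 0.01

returns = np.where(rng.random(n_months) < p_crash, crash, premium)
wealth = np.cumprod(1 + returns)

# Evaluate the "track record" at several points in time.
for years in (5, 10, 15, 20):
    r = returns[: years * 12]
    vol = r.std()
    sharpe = np.sqrt(12) * r.mean() / vol if vol > 0 else float("inf")
    print(f"after {years:2d}y: annualized Sharpe ~ {sharpe:6.2f}, "
          f"wealth multiple ~ {wealth[years * 12 - 1]:6.2f}")
```

Until the first crash lands in the evaluation window, every metric says this is a phenomenal strategy; afterwards it is obviously ruinous. Nothing in the scoring itself flags the difference in advance.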
Two clarifications about my own claims:
I did not use leverage. The 40% average annual return I mentioned was unlevered. I agree completely that high apparent performance with hidden ruin risk is trivial to generate, and I’m very wary of arguments that don’t control for that.
I don’t have a clean statistical confidence interval for my forecasting ability. I wish I did. What I can say—without pretending it’s a theorem—is that when I pitched this approach to VCs last year, several were interested in investing on the order of $2M. That’s not proof of correctness, but it does suggest that sophisticated actors found the combination of reasoning and track record at least plausible. (For the record, I embarrassed myself by not having the proper licenses lined up before pitching a hedge fund idea, a lesson I learned the hard way.)
More broadly, I think the honest answer is that superforecasting rests on a weak ontological assumption rather than a strong one: not that the world is well-behaved, but that some environments are predictable enough, often enough, to beat naive baselines. The goal isn’t asymptotic optimality; it’s persistent edge.
Where I personally diverge from the “pure scoring-rule” framing is that I don’t think of forecasting as approximating a single global distribution. Instead, I think of it as model selection under uncertainty, where the models themselves are provisional and frequently discarded. That doesn’t fully resolve the Cauchy-vs-Gaussian problem you raise—but it does mean I’m less committed to any single assumed distribution than the formalism might suggest.
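For anyone who wants to see the Cauchy-vs-Gaussian issue rather than take it on faith, a quick sketch is enough (the distributions and sample sizes here are arbitrary, purely to illustrate the point): the running mean of Gaussian draws settles down, while the running mean of Cauchy draws never does, because one extreme draw can dominate everything before it.

```python
import numpy as np

rng = np.random.default_rng(42)

# Compare running sample means: Gaussian (finite variance)
# vs standard Cauchy (no defined mean or variance).
n = 100_000
gauss = rng.normal(size=n)
cauchy = rng.standard_cauchy(size=n)

for k in (100, 1_000, 10_000, 100_000):
    print(f"n={k:>7}:  gaussian mean {gauss[:k].mean():+8.4f}   "
          f"cauchy mean {cauchy[:k].mean():+10.2f}")
# Any evaluation procedure that implicitly assumes the first kind of
# convergence will quietly misbehave on the second kind of data.
```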
Practical question: forecasting in a narrow, expert domain
Your North Korea example is excellent, and I agree with your diagnosis of the problem. If all you ask are first-order, low-entropy questions (“Will war break out this year?”), you get almost no learning signal, even if your answers are technically correct.
This is where my approach probably diverges from how most superforecasters would describe their own methods, and I want to be clear that I’m not claiming this is canonical.
Very roughly, my technique is to lean heavily on macro-level regularities and treat individuals as if they were particles—subject to incentives, constraints, and flows—rather than as unique narrative agents. At that level of abstraction, societies start to behave less like chess games and more like fluid systems. You can’t predict the motion of a single molecule, but you can often predict pressure gradients, bottlenecks, and phase transitions.
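A crude way to see that intuition (toy numbers, not a model of any real society): give a large population of agents a shared, slowly rising "pressure" variable plus a lot of individual noise. Any single agent looks nearly unpredictable; the population aggregate tracks the pressure closely.

```python
import numpy as np

rng = np.random.default_rng(7)

# Each agent makes a noisy binary choice whose probability follows a
# shared, slowly rising "pressure" signal. Numbers are illustrative.
n_agents, n_steps = 10_000, 50
pressure = np.linspace(0.3, 0.7, n_steps)

choices = rng.random((n_steps, n_agents)) < pressure[:, None]

one_agent = choices[:, 0].astype(float)   # looks like coin flips
population = choices.mean(axis=1)         # smooth, tracks the signal

print("corr(pressure, single agent):", round(np.corrcoef(pressure, one_agent)[0, 1], 2))
print("corr(pressure, population):  ", round(np.corrcoef(pressure, population)[0, 1], 2))
```

The point isn’t the toy model itself; it’s that the questions worth asking live at the level where the noise averages out.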
Applied to your case, that suggests focusing less on isolated facts (rice prices, phones) and more on questions that proxy for stress, throughput, and constraint relaxation. The exact phrasing matters less than whether the question sits on a causal pathway that connects to higher-level outcomes you care about.
You’re also right that the skill of asking good questions is the real bottleneck. My (imperfect) heuristic is to ask:
- Does this variable aggregate many micro-decisions?
- Is it constrained by hard resources or incentives?
- Would a large deviation here force updates elsewhere?
Those questions won’t necessarily predict war directly—but they can tell you when the system is moving into a regime where war becomes more or less likely.
Finally, I agree with you that the intelligence community is one of the few places where calibration is actually rewarded rather than punished. In many ways, I think superforecasting is a partial rediscovery—by civilians—of techniques analysts have been developing for decades, albeit with better scoring and feedback loops.
I don’t think your concerns undermine forecasting as a practice. I think they correctly point out that it’s a tool with sharp edges, and that the hardest problems aren’t about probability math but about question selection, regime change, and institutional attention.
If you’re open to it, I’d actually be very interested in how you decide which NK-related variables are worth tracking at all—that feels like exactly the frontier Tetlock is gesturing at.