
Attention is Bayesian Inference

medium.com

I found this insightful, particularly in bringing LLMs back into the Yudkowsky / Sequences fold, whereas many have claimed that the rise of LLMs shows decades of Yudkowskian AI speculation to be way off base. I don't have enough technical knowledge to evaluate the accuracy of this post, but I am hopeful that large parts of it are true.

The brute-force training process naturally sculpts Transformers into inference engines. They don’t just approximate the math; they build a physical geometry — orthogonal hypothesis frames and entropy-ordered manifolds — that implements Bayesian updating as a mechanical process.

They aren’t Bayesian by design; they are Bayesian by geometry.
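I can't evaluate the geometric claims, but the headline correspondence is simple enough to write down. A toy numpy check (my own illustration, not the article's construction): with a uniform prior over the keys and a log-linear likelihood exp(q·k / sqrt(d)) for the query, the softmax attention weights are exactly the posterior over keys, and the attention output is the posterior mean of the values.

import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 5                      # head dimension, number of keys ("hypotheses")

q = rng.normal(size=d)           # a single query
K = rng.normal(size=(n, d))      # keys, one per hypothesis
V = rng.normal(size=(n, d))      # values: what each hypothesis predicts

# Standard scaled dot-product attention for one query.
scores = K @ q / np.sqrt(d)
attn = np.exp(scores - scores.max())
attn /= attn.sum()
attn_out = attn @ V

# The same numbers read as Bayesian inference: uniform prior over the n
# hypotheses, likelihood p(q | h_i) proportional to exp(q . k_i / sqrt(d)).
prior = np.full(n, 1.0 / n)
likelihood = np.exp(scores - scores.max())     # unnormalised
posterior = prior * likelihood
posterior /= posterior.sum()

print(np.allclose(attn, posterior))            # True: attention weights = posterior
print(np.allclose(attn_out, posterior @ V))    # True: output = posterior mean of values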

To the extent the article has merit, it does seem to explain why CoT and reasoning models are able to "outperform" ordinary single-pass generation. The twenty-questions framing, where we are not merely bisecting the information space but trying to maximize how much each question rejects or filters out, offers a lot of insight into the nature of the problem. When the fixed number of layers gets exhausted before the space has collapsed, is that where normal models hallucinate? With CoT or reasoning, we can feed the narrowed space back into the first layer and continue filtering down.
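A toy version of that filtering picture, with a fixed round budget standing in for a fixed layer count (my own sketch, not taken from the article): greedily ask the yes/no question with the highest expected information gain, filter the candidate set by the answer, and stop when the budget runs out.

import math
import random

random.seed(0)
N = 10_000
candidates = set(range(N))
secret = random.randrange(N)

# Yes/no questions of the form "is x % m == r?", chosen so splits are uneven
# and the greedy question-picker has real work to do.
questions = [(m, r) for m in (3, 5, 7, 11, 13) for r in range(m)]

def asks(question, x):
    m, r = question
    return x % m == r

def info_gain(question, cands):
    # Expected bits gained = entropy of the yes/no answer over the candidates.
    p = sum(asks(question, c) for c in cands) / len(cands)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

BUDGET = 4   # stand-in for a fixed number of layers / filtering rounds
for _ in range(BUDGET):
    q = max(questions, key=lambda question: info_gain(question, candidates))
    answer = asks(q, secret)
    candidates = {c for c in candidates if asks(q, c) == answer}

# Many candidates survive: the budget ran out before the space collapsed to
# one, which is roughly the point where you would want another pass.
print(len(candidates), secret in candidates)

Adding more rounds, i.e. feeding the surviving candidates back in and asking again, is the analogue of the CoT loop described above.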


One of the fundamental results of machine learning theory states (very informally) that every learning algorithm has both a Bayesian and a frequentist interpretation.

Is there any specific theorem you're referring to here?

In any case, I also get the sense that nearly everything in machine learning has a third, information-theoretic interpretation; for instance, the correspondence between variational Bayes and coding theory via the bits-back argument (link). I sometimes wish I had a personality better suited to working as a researcher (and perhaps a few more IQ points) so that I could have really pulled on all of these threads. Not to make it sound more mystical than it really is, but I've always had a gut feeling that there are some powerful, unifying ideas beneath this weird amalgamation of concepts from statistics, information theory, statistical mechanics and computer science that we're seeing in the development of learning theory.
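For concreteness, the identity behind the bits-back argument (standard statement, my paraphrase): the expected net codelength for transmitting x with bits-back coding, using an approximate posterior q(z|x) and a generative model p(x, z), is

E_{q(z|x)}[ log q(z|x) - log p(x, z) ] = -ELBO(x) >= -log p(x),

with equality when q(z|x) is the exact posterior. So "compress x as well as possible" in the coding view and "maximize the ELBO" in the variational Bayes view are literally the same objective.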

Here's an example of the correspondence between Bayesian/frequentist interpretations for linear regression: https://stats.stackexchange.com/questions/283238/is-there-a-bayesian-interpretation-of-linear-regression-with-simultaneous-l1-and.
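The simplest instance of that correspondence (plain ridge, not the simultaneous L1/L2 case the link discusses) fits in a few lines: the L2-penalised least-squares fit is exactly the MAP estimate under a Gaussian prior, with the penalty strength lambda playing the role of sigma^2 / tau^2. A numpy sketch on made-up data:

import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.5, size=n)

lam = 2.0              # ridge penalty strength
sigma2 = 0.25          # assumed noise variance
tau2 = sigma2 / lam    # prior variance implied by the penalty

# Frequentist reading: penalised least squares (ridge regression).
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Bayesian reading: MAP estimate with likelihood y ~ N(Xw, sigma2 * I) and
# prior w ~ N(0, tau2 * I).  The log-posterior is
#   -||y - Xw||^2 / (2 sigma2) - ||w||^2 / (2 tau2) + const,
# and its maximiser solves the same normal equations with lam = sigma2 / tau2.
w_map = np.linalg.solve(X.T @ X + (sigma2 / tau2) * np.eye(d), X.T @ y)

print(np.allclose(w_ridge, w_map))   # True: one estimator, two interpretations

Swap the Gaussian prior for a Laplace prior and you get the L1 penalty (lasso) instead, which is where the simultaneous L1 + L2 case in the link comes from.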