
The Many Ways that Digital Minds Can Know

moultano.wordpress.com

Like many people, I've been arguing about the nature of LLMs a lot over the last few years. There is a particular set of arguments that I found myself having to recreate from scratch over and over again in different contexts, so I finally put them together in a larger post, and this is that post.

The crux of it is that I think both the maximalist and minimalist claims about what LLMs can do and are doing are simultaneously true, and not in conflict with one another. A mind made out of text can vary along two axes: the quantity of text it has absorbed, which here I call "coverage," and the degree to which that text has been unified into a coherent model, which here I call "integration." As extreme points on that spectrum, a search engine is high coverage and low integration, an individual person is low coverage and high integration, and LLMs sit between the two. Most importantly, every point on that spectrum is useful for different kinds of tasks.

I'm hoping this will be a more useful way of thinking about LLMs than the ways people have typically talked about them so far.


I think one more aspect is reflectivity: the degree to which a system integrates knowledge of its own operation into its schema. For instance, a search engine that can show a "Google is down" page, or that lists the number of results, or that finds Google help pages on search as search results, has (basic) reflectivity. It seems plausible to me that a lack of reflectivity is a big part of what's holding LLMs back and causing hallucinations and the like: they may be confident or uncertain, but they cannot condition on their confidence.

Seems to me that this is partially an artifact of RLHF. From the GPT-4 whitepaper, it was evident that the base GPT-4 model was far better calibrated in its reasoning: if you asked it how confident it was and it said it had an 80% chance of being right, in practice it was indeed right close to 80% of the time.

On the other hand, the calibration curves for the model beaten into submission with RLHF were absolutely wack, with a tendency to round a broad range of confidence levels steeply up and down. It would act like it was absolutely certain when it was only right 70% of the time, and claim to be unable to answer when it actually had, say, a 30% chance of giving a correct response.

I've heard this explained as a strong preference from human raters for complete certainty, even when that's too much to ask. They'd rather the model be confidently incorrect than hedge and try to inject nuance into its outputs.
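To make "calibrated" concrete, here is a minimal Python sketch of how a calibration table and an expected calibration error could be computed from a log of stated confidences and right/wrong outcomes. The function names, the bin count, and the two answer logs are made up for illustration; none of this is GPT-4 data or the whitepaper's evaluation code.

```python
def calibration_table(confidences, correct, n_bins=5):
    """Bucket answers by stated confidence; return (mean confidence, accuracy, count) per bucket."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a stated confidence of exactly 1.0 lands in the top bucket.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    rows = []
    for items in bins:
        if not items:
            continue  # skip confidence ranges that were never used
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        rows.append((mean_conf, accuracy, len(items)))
    return rows


def expected_calibration_error(rows):
    """Count-weighted average gap between stated confidence and observed accuracy."""
    total = sum(n for _, _, n in rows)
    return sum(n / total * abs(conf - acc) for conf, acc, n in rows)


# Hypothetical answer logs, ten questions each (illustrative only).
# "calibrated": says 80% and is right 8 times out of 10.
calibrated_conf = [0.8] * 10
calibrated_hits = [True] * 8 + [False] * 2
# "overconfident": says 100% but is right only 7 times out of 10.
overconfident_conf = [1.0] * 10
overconfident_hits = [True] * 7 + [False] * 3

ece_calibrated = expected_calibration_error(
    calibration_table(calibrated_conf, calibrated_hits))
ece_overconfident = expected_calibration_error(
    calibration_table(overconfident_conf, overconfident_hits))
print(ece_calibrated, ece_overconfident)
```

The first log gives an error of essentially zero, while the second, which matches the "certain but right 70% of the time" pattern described above, gives an error of about 0.3.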

The thing I don't understand is how you can possibly train for uncertainty.

The model needs to "learn the feeling of not being sure". But whether it's sure or not always depends on its state of knowledge at the time, and that state of knowledge will never be represented in its training set. Additionally and relatedly, you cannot train an LLM to "notice when it's saying something wrong" without indirectly training it to say something wrong, then say it notices.

You would have to inspect the network and somehow determine when it is objectively uncertain, and to what degree, and then synthesize a training task based on that actual uncertainty. That level of interpretability is pretty beyond us at the moment.

There are many possible ways to deal with uncertainty; this is widely recognized as an important goal.

and more.

In principle, I think it's not a big scientific challenge because we can elicit latent knowledge and so probe the model's "internal belief" regarding its output; this can be used as a signal during training. For now this is approached more crudely, just to improve average truthfulness (already cited by @faul_sname):

begin by operationalizing what it means for a network to “know” the right answer to a question, even if it doesn’t produce that answer. We define this as the difference between generation accuracy (measured by a model’s output) and probe accuracy (selecting an answer using a classifier with a model’s intermediate activations as input). Using the LLaMa 7B model, applied to the TruthfulQA benchmark from Lin et al. (2021)–a difficult, adversarially designed test for truthful behavior–we observe a large 40% difference between probe accuracy and generation accuracy. This statistic points to a major gap between what information is present at intermediate layers and what appears in the output.
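The gap between probe accuracy and generation accuracy described in that quote can be illustrated with a toy Python sketch. Everything here is an assumption made for illustration: the "activations" are synthetic two-dimensional vectors rather than LLaMA hidden states, the "generated" answers are just the truth flipped at random, and the probe is a plain logistic regression trained by gradient descent. It is not the paper's code or data.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_toy_data(n, rng, signal=2.0, p_output_correct=0.6):
    """Synthetic (activation, truth, generated_answer) triples.

    The truth is encoded along the first activation dimension, but the
    generated answer only matches the truth with probability p_output_correct.
    """
    data = []
    for _ in range(n):
        truth = rng.randint(0, 1)
        direction = 1.0 if truth == 1 else -1.0
        activation = [rng.gauss(signal * direction, 1.0), rng.gauss(0.0, 1.0)]
        generated = truth if rng.random() < p_output_correct else 1 - truth
        data.append((activation, truth, generated))
    return data


def train_probe(examples, epochs=200, lr=0.1):
    """Fit a logistic-regression probe on activations with full-batch gradient descent."""
    w, b, n = [0.0, 0.0], 0.0, len(examples)
    for _ in range(epochs):
        gw, gb = [0.0, 0.0], 0.0
        for x, y, _ in examples:
            err = sigmoid(w[0] * x[0] + w[1] * x[1] + b) - y
            gw[0] += err * x[0]
            gw[1] += err * x[1]
            gb += err
        w = [w[0] - lr * gw[0] / n, w[1] - lr * gw[1] / n]
        b -= lr * gb / n
    return w, b


rng = random.Random(0)
data = make_toy_data(400, rng)
train, test = data[:300], data[300:]
w, b = train_probe(train)

# Probe accuracy: read the answer off the activations with the classifier.
probe_acc = sum(
    1 for x, y, _ in test
    if (sigmoid(w[0] * x[0] + w[1] * x[1] + b) > 0.5) == (y == 1)
) / len(test)
# Generation accuracy: score what the toy "model" actually output.
gen_acc = sum(1 for _, y, g in test if g == y) / len(test)
print(probe_acc, gen_acc, probe_acc - gen_acc)
```

In this toy setup the probe recovers the answer far more often than the output does, which is the sense in which information can be present in intermediate activations without showing up in what gets generated.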

I also expect a lot from TART-derived approaches:

Tart comprises of two components: a generic task-agnostic reasoning module, and embeddings from the base LLM. The reasoning module is trained using only synthetic data (Gaussian logistic regression problems), agnostic of the auto-regressively trained language model, with the objective of learning to perform probabilistic inference (Section 4.1). This learned transformer module is then composed with the base LLM, without any training, by simply aggregating the output embedding and using those as an input along with the class label (Section 4.2). Together, these components make Tart task-agnostic, boost performance quality by improving reasoning, and make the approach data-scalable by aggregating input embeddings into a single vector.
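As a rough Python illustration of what "synthetic data (Gaussian logistic regression problems)" could look like, here is a sketch that samples one such task. The sampling scheme is my assumption, since the quote does not spell it out: weights and inputs are drawn from a standard normal, labels come from a Bernoulli draw on a sigmoid, and `alpha` is a hypothetical knob for how sharp the labels are. The function name and parameters are made up for this example and are not the TART code.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sample_logistic_task(n_examples, dim, alpha, rng):
    """Sample one synthetic binary classification task.

    Assumed scheme: w ~ N(0, I), x ~ N(0, I), y ~ Bernoulli(sigmoid(alpha * <w, x>)).
    """
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    xs, ys = [], []
    for _ in range(n_examples):
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        p = sigmoid(alpha * sum(wi * xi for wi, xi in zip(w, x)))
        xs.append(x)
        ys.append(1 if rng.random() < p else 0)
    return w, xs, ys


rng = random.Random(0)
w, xs, ys = sample_logistic_task(n_examples=64, dim=4, alpha=5.0, rng=rng)
print(len(xs), len(xs[0]), sum(ys))
```

A reasoning module trained on many tasks like this one never sees natural language, which is what makes it independent of the base LLM: it only has to learn to infer the labeling rule from a sequence of (vector, label) pairs.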