
The Many Ways that Digital Minds Can Know

moultano.wordpress.com

Like many people, I've been arguing about the nature of LLMs a lot over the last few years. There is a particular set of arguments that I found myself having to recreate from scratch over and over again in different contexts, so I finally put it together in a larger post, and this is that post.

The crux of it is that I think both the maximalist and minimalist claims about what LLMs can do/are doing are simultaneously true, and not in conflict with one another. A mind made out of text can vary along two axes: the quantity of text it has absorbed, which here I call "coverage," and the degree to which that text has been unified into a coherent model, which here I call "integration." As extreme points on that spectrum, a search engine is high coverage but low integration, an individual person is low coverage but high integration, and LLMs are intermediate between the two. Most importantly, every point on that spectrum is useful for different kinds of tasks.

I'm hoping this will be a more useful way of thinking about LLMs than the ways people have typically talked about them so far.

I think one more aspect is reflectivity: the degree to which a system integrates knowledge of its own operation into its schema. For instance, a search engine that can show a "Google is down" page, or that lists the number of results, or that finds Google help pages on search as search results, has (basic) reflectivity. It seems plausible to me that a lack of reflectivity is a big part of what's holding LLMs back and causing hallucinations and the like: they may be confident or uncertain, but they cannot condition on their confidence.

Seems to me that this is partially an artifact of RLHF. From the GPT-4 whitepaper, it was evident that the base GPT-4 model was far better calibrated in its reasoning: if you asked it how confident it was and it said it had an 80% chance of being right, in practice it was indeed right near 80% of the time.

On the other hand, the calibration curves for the model beaten into submission with RLHF were absolutely wack, with a tendency to round a broad range of confidence levels steeply up or down. It would act like it was absolutely certain when it was only right 70% of the time, and claim to be unable to answer when it actually had, say, a 30% chance of giving a correct response.
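
For concreteness, here is a minimal sketch of how a calibration curve like the ones in that report is computed; the numbers below are made up for illustration, not taken from the whitepaper.

```python
import numpy as np

def calibration_curve(confidences, correct, n_bins=10):
    """Bin stated confidences and compare each bin's mean stated
    confidence to the empirical accuracy inside that bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    results = []
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences >= lo) & (confidences < hi)
        if mask.any():
            results.append((confidences[mask].mean(), correct[mask].mean()))
    return results  # (mean stated confidence, actual accuracy) per bin

# Made-up example: a well-calibrated model's 80%-confidence answers
# should be right about 80% of the time.
stated = [0.8, 0.8, 0.8, 0.8, 0.8, 0.3, 0.3, 0.3]
was_right = [1, 1, 1, 1, 0, 0, 0, 1]
print(calibration_curve(stated, was_right, n_bins=5))
```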

I've heard this explained as a strong preference from human raters for complete certainty, even when that's too much to ask. They'd rather the model be confidently incorrect than hedge and try to inject nuance into its outputs.

The thing I don't understand is how you can possibly train for uncertainty.

The model needs to "learn the feeling of not being sure". But whether it's sure or not always depends on its state of knowledge at the time, and that state of knowledge will never be represented in its training set. Additionally and relatedly, you cannot train a LLM to "notice when it's saying something wrong" without indirectly training it to say something wrong, then say it notices.

You would have to inspect the network and somehow determine when it is objectively uncertain, and to what degree, and then synthesize a training task based on that actual uncertainty. That level of interpretability is well beyond us at the moment.
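
As a sketch of what that would involve, assuming interfaces nobody actually has today (a `probe` that reads the network's intermediate activations and reports how likely its answer is to be right), the synthesized training data might look something like this:

```python
def build_uncertainty_dataset(model, probe, questions, threshold=0.6):
    """Hypothetical pipeline: estimate how certain the network 'really' is
    via a probe on its hidden states, then turn that estimate into a
    training target that hedges when the probe says the model is unsure.
    `model.hidden_state`, `model.answer`, and `probe.p_correct` are
    assumed interfaces, not any real API."""
    examples = []
    for q in questions:
        h = model.hidden_state(q)   # intermediate activations for the question
        p = probe.p_correct(h)      # probe's estimate that the answer is right
        answer = model.answer(q)
        if p >= threshold:
            target = answer         # confident: train on the plain answer
        else:
            target = "I'm not sure, but my best guess is: " + answer
        examples.append((q, target))
    return examples
```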

There are many possible ways to deal with uncertainty; this is widely recognized as an important goal.

In principle, I think it's not a big scientific challenge because we can elicit latent knowledge and so probe the model's "internal belief" regarding its output; this can be used as a signal during training. For now this is approached more crudely, just to improve average truthfulness (already cited by @faul_sname):

begin by operationalizing what it means for a network to “know” the right answer to a question, even if it doesn’t produce that answer. We define this as the difference between generation accuracy (measured by a model’s output) and probe accuracy (selecting an answer using a classifier with a model’s intermediate activations as input). Using the LLaMa 7B model, applied to the TruthfulQA benchmark from Lin et al. (2021)–a difficult, adversarially designed test for truthful behavior–we observe a large 40% difference between probe accuracy and generation accuracy. This statistic points to a major gap between what information is present at intermediate layers and what appears in the output.
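
A toy version of that probe-vs-generation comparison, with random arrays standing in for the LLaMA activations and labels (so both numbers come out around 50% here; the point is only the shape of the measurement):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(500, 64))       # stand-in for intermediate activations
is_truthful = rng.integers(0, 2, size=500)       # stand-in: 1 = the truthful option
generated_correct = rng.integers(0, 2, size=500) # stand-in: model's own output was right

# Probe accuracy: a linear classifier reading the intermediate activations.
probe_acc = cross_val_score(
    LogisticRegression(max_iter=1000), hidden_states, is_truthful, cv=5
).mean()

# Generation accuracy: how often the model's generated answer was correct.
gen_acc = generated_correct.mean()

print(f"probe accuracy {probe_acc:.2f} vs generation accuracy {gen_acc:.2f}")
```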

I also expect a lot from TART-derived approaches:

Tart comprises of two components: a generic task-agnostic reasoning module, and embeddings from the base LLM. The reasoning module is trained using only synthetic data (Gaussian logistic regression problems), agnostic of the auto-regressively trained language model, with the objective of learning to perform probabilistic inference (Section 4.1). This learned transformer module is then composed with the base LLM, without any training, by simply aggregating the output embedding and using those as an input along with the class label (Section 4.2). Together, these components make Tart task-agnostic, boost performance quality by improving reasoning, and make the approach data-scalable by aggregating input embeddings into a single vector.
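
Roughly, the two pieces described there look like the sketch below. The synthetic-task generator follows the quoted description (Gaussian inputs, logistic labels); the reasoning module's training loop is omitted, and `reasoning_module` and `embed` are assumed interfaces, not the paper's code.

```python
import numpy as np

def synthetic_logistic_task(n_examples=32, dim=16, rng=None):
    """One synthetic in-context task: Gaussian inputs with labels drawn
    from a random logistic-regression model, the kind of data the
    reasoning module is trained on."""
    rng = rng or np.random.default_rng()
    w = rng.normal(size=dim)                 # random ground-truth weights
    x = rng.normal(size=(n_examples, dim))   # Gaussian inputs
    p = 1.0 / (1.0 + np.exp(-x @ w))         # logistic label probabilities
    y = (rng.random(n_examples) < p).astype(int)
    return x, y

def compose_with_llm(reasoning_module, embed, demos, query):
    """Composition step: the frozen base LLM only supplies embeddings, and
    the separately trained reasoning module does the inference over
    (embedding, label) pairs. Both arguments are hypothetical interfaces."""
    context = [(embed(text), label) for text, label in demos]
    return reasoning_module.predict(context, embed(query))
```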

Additionally and relatedly, you cannot train a LLM to "notice when it's saying something wrong" without indirectly training it to say something wrong, then say it notices.

It should be possible to train "notice when something in its context window is wrong and say that the thing is wrong" and also "notice when something in its context window is something said by the assistant persona it is being trained to write as", and I don't think either of those objectives would incentivize "say wrong things while writing in the assistant persona".

That said, if you are specifically referring to the behavior of "accurately indicate your confidence level in the thing you are about to say, and then say the thing" that does seem like a much more difficult behavior to train (still possible, since LLMs have a nonzero ability to plan ahead, but finicky and easy to screw up). But if it's fine for the evaluation-of-confidence step to come after the reasoning step, the task is much easier (and in fact that's what the chain-of-thought prompting technique aims to do).
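
As a sketch of that ordering (reason and answer first, assess confidence afterwards), with `generate` standing in for whatever completion call is available and the prompt wording invented for illustration:

```python
def answer_then_confidence(generate, question):
    """Two-step prompting: let the model reason and answer first, then ask
    it to judge its own confidence in that answer in a second pass."""
    reasoning = generate(
        f"Question: {question}\n"
        "Think step by step, then state your final answer."
    )
    confidence = generate(
        f"Question: {question}\n"
        f"Proposed answer and reasoning:\n{reasoning}\n"
        "On a scale from 0 to 1, how likely is this answer to be correct? "
        "Reply with a single number."
    )
    return reasoning, confidence
```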

Also, if you're interested in the interpretability side of things specifically, you might find Inference-Time Intervention: Eliciting Truthful Answers from a Language Model interesting:

To close this gap, we introduce a technique we call Inference-Time Intervention (ITI). At a high level, we first identify a sparse set of attention heads with high linear probing accuracy for truthfulness. Then, during inference, we shift activations along these truth-correlated directions. We repeat the same intervention autoregressively until the whole answer is generated. ITI results in a significant performance increase on the TruthfulQA benchmark. We also see a smaller but nonzero performance improvement on two benchmarks with different data distributions.
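
For the flavor of it, here is a rough PyTorch sketch of that kind of intervention, applied at whole modules rather than individual attention heads as in the paper; the usage comments assume a Hugging Face-style model and are hypothetical.

```python
import torch

def make_iti_hook(direction, alpha=5.0):
    """Forward hook that shifts a module's output along a fixed
    'truth-correlated' direction found earlier by linear probing."""
    direction = direction / direction.norm()
    def hook(module, inputs, output):
        # Returning a value from a forward hook replaces the module's output.
        return output + alpha * direction.to(dtype=output.dtype, device=output.device)
    return hook

# Hypothetical usage: `probe_directions` maps chosen attention-output
# modules to their probe vectors.
# handles = [m.register_forward_hook(make_iti_hook(d))
#            for m, d in probe_directions.items()]
# ...generate as usual; the shift is applied at every decoding step...
# for h in handles:
#     h.remove()
```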

The level of interpretability you want is currently beyond us, but I expect that over time that situation will improve quite a lot (I think well under a thousand person-years have been spent on this particular type of interpretability research so far, and even that estimate might be an order of magnitude or two high).

I don't think anyone trained for uncertainty as such; it seemed that a sense of internal calibration was an emergent phenomenon in the base LLM, which was then mauled by RLHF.

So as long as you don't do the latter, training for the above simply involves training as usual.

Right, I guess I'm saying if you wanted to train a specific response to a level of uncertainty, it would be difficult to construct the training samples.

Evidently, the model has figured out that something should be hooked up to its uncertainty. But I have no clue how you'd make that happen intentionally.