
Small-Scale Question Sunday for June 15, 2025

Do you have a dumb question that you're kind of embarrassed to ask in the main thread? Is there something you're just not sure about?

This is your opportunity to ask questions. No question too simple or too silly.

Culture war topics are accepted, and proposals for a better intro post are appreciated.


How do you best verify large language model output?

I hear lots of people say they use LLMs to search through documents or to get ideas for how something works, but my question is: how do people verify the output? Is it as simple as copy-pasting keywords into Google to find the actual science textbooks, or is there some better set of steps I'm missing? I also wonder how you do that when searching through a document: is there some method for getting the LLM to output page citations so you can check them (maybe it's in the settings or something)?

they use LLMs to search through documents

I ask it to include page #s or text snippets so I can CTRL-F and confirm they exist (sometimes they don't!)
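If you want to automate that CTRL-F step, something like the sketch below works; it assumes you already have the document as plain text, and report.txt and the snippets are made-up examples, not anything from a real workflow.

```python
def _normalize(text: str) -> str:
    """Collapse whitespace and lowercase so line wrapping doesn't cause false misses."""
    return " ".join(text.split()).lower()


def verify_snippets(document_text: str, claimed_snippets: list[str]) -> dict[str, bool]:
    """Map each claimed snippet to whether it appears verbatim in the document."""
    haystack = _normalize(document_text)
    return {snippet: _normalize(snippet) in haystack for snippet in claimed_snippets}


if __name__ == "__main__":
    # Hypothetical file and snippets: swap in your own document and whatever
    # the model claimed to quote from it.
    with open("report.txt", encoding="utf-8") as f:
        doc = f.read()
    claims = [
        "revenue grew 12% year over year",
        "the committee met on 3 March",
    ]
    for snippet, found in verify_snippets(doc, claims).items():
        print(("FOUND   " if found else "MISSING ") + snippet)
```

A MISSING hit doesn't always mean the quote was fabricated (extraction and punctuation differences happen), it just means go look at the source yourself.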

get ideas for how something works

This is more situational. A lot of the time I am trying to re-remember something I already knew, so I know whether the answer is right or wrong once I read it and my buried memory of the thing resurfaces. Where I can't fact-check internally, the LLM has usually given me enough info that I can quickly hop onto Google/YouTube and corroborate it with a non-LLM source.

Take the output from one LLM and feed it to a different LLM from a different company for verification. Not perfect, but it works more often than it should.
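In code the cross-check can be as simple as the sketch below; ask_model_a and ask_model_b are hypothetical stand-ins for whichever two providers' clients you actually use, not real library calls.

```python
def ask_model_a(prompt: str) -> str:
    """Stand-in for your first provider's API client."""
    raise NotImplementedError("wire up provider A here")


def ask_model_b(prompt: str) -> str:
    """Stand-in for a second, different provider's API client."""
    raise NotImplementedError("wire up provider B here")


def cross_check(question: str) -> tuple[str, str]:
    """Get an answer from model A, then ask model B to look for errors in it."""
    answer = ask_model_a(question)
    verdict = ask_model_b(
        "Another assistant answered the question below. "
        "List any factual errors you can find, or reply 'LOOKS CORRECT'.\n\n"
        f"Question: {question}\n\nAnswer: {answer}"
    )
    return answer, verdict
```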

Verifying the output depends on the use. Code gives you a pretty easy time, since you can just run it. Searching docs depends: if you're trying to locate some info in the docs, the answer can be enough to give you keywords and let you navigate to the section you want.

Lots of current LLMs are pretty good at copying text out of prompts when told to, e.g. page numbers. That can help a lot, since verifying is very quick.

Hallucinations and other errors are still very common and you must account for them.
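To make the "code is easy to verify" point concrete, a minimal sketch: run whatever the model wrote against a few inputs you already know the answers to before trusting it. llm_generated_slug here is a made-up example of model-written code.

```python
import re


def llm_generated_slug(title: str) -> str:
    """Pretend this came back from the model: turn a title into a URL slug."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def quick_check() -> None:
    """Spot-check the generated function against inputs with known answers."""
    cases = {
        "Hello, World!": "hello-world",
        "  Spaces   everywhere  ": "spaces-everywhere",
        "Already-a-slug": "already-a-slug",
    }
    for given, expected in cases.items():
        got = llm_generated_slug(given)
        assert got == expected, f"{given!r}: expected {expected!r}, got {got!r}"
    print("all checks passed")


if __name__ == "__main__":
    quick_check()
```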

I just post LLM findings to social media and then delete the post if anyone fact checks it /s

An easy trick is to get another model to review/critique it. If the two models disagree, get them to debate each other until a consensus is reached.
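A rough sketch of what that debate loop could look like; ask_model_a and ask_model_b are hypothetical stand-ins for two different providers, and the consensus check is a naive marker string rather than anything clever.

```python
AGREE = "I AGREE WITH THE OTHER MODEL"  # naive consensus marker


def ask_model_a(prompt: str) -> str:
    """Stand-in for the first provider's API client."""
    raise NotImplementedError("wire up provider A here")


def ask_model_b(prompt: str) -> str:
    """Stand-in for a second, different provider's API client."""
    raise NotImplementedError("wire up provider B here")


def debate(question: str, max_rounds: int = 4) -> str:
    """Alternate critiques between two models until one concedes or rounds run out."""
    position = ask_model_a(question)
    critics = [ask_model_b, ask_model_a]  # whoever didn't produce the answer critiques next
    for round_number in range(max_rounds):
        reply = critics[round_number % 2](
            f"Question: {question}\n"
            f"The other model's current answer:\n{position}\n\n"
            f"If you fully agree, reply exactly '{AGREE}'. "
            "Otherwise give a corrected answer and explain the disagreement."
        )
        if AGREE in reply:
            return position  # consensus reached
        position = reply  # the critique becomes the new position to challenge
    return position  # no consensus; treat the result with extra suspicion
```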

These days, outright hallucinations are quite rare, but it's still worth doing due diligence for anything mission-critical.

As George suggested, you can also ask for verbatim quotes or citations, though you'll need to manually check them.

These days, outright hallucinations are quite rare, but it's still worth doing due diligence for anything mission-critical.

I asked chatgpt what the 4 core exercises of the LIFTMOR routine were and it didn't get a single one correct. It's a simple question to google so I am not sure how it got it so wrong. When I changed the question to specify the LIFTMOR routine to help counteract osteoporosis, it got it right. Google doesn't require the additional context.

Oh man, if you think that's bad, AI Studio will drive you mad. I was asking it what it could do (always using 2.5 Pro, of course, since it's free) and we got onto its TTS abilities. I looked through the list of dozens of star-named voices like Achernar and Fenrir and asked it why the list doesn't mention which ones are male or female, and it rattled off a wall of text about how Google wanted to promote inclusivity, avoid gender stereotyping and focus on function over form.

After it refused to reply to my 'lol fuck you', I developed my argument into "actually, people have been able to readily distinguish male voices from female ones for thousands of years, so it doesn't matter what lofty goals Google has; what they have done is reduce function due entirely to form." After some more sparring it admitted that the function of a Google TTS bot is to optimise its immediate task, not shape future behaviour, and it agreed to tell me which names were which gender.

Victory? No, not even close. All of its voices were named after their accent followed by a person's name - not star names. It apologised and explained how to find the voices named the way it said they were. Incorrectly. Those names did not exist, nor did the pulldown menu it told me to use. I explained that, and it apologised again and explained that Google had rolled out the new Chirp 3 system and so the actual names were Vega, Sirius, Maia and so on. Incorrect again. None of those names were available to me. By now it was beating itself up pretty hard, and the conclusion it came to was that Google were A/B testing, and it asked if I could tell it some of the names I saw in the list so it could piece together our disconnect. I mentioned Achernar and Fenrir and Gacrux and Acherd, and it finally managed to give me a list of voices that sounded male and voices that sounded female. One of them was clearly an effeminate man, but the rest were spot on.

It was a lot of effort for very little reward, but I was just fucking around with AI Studio anyway, and I found the entire thing much more interesting than frustrating. This was the best version of Google's AI looking at another part of itself and whiffing so completely that I was beginning to feel sorry for it. And yet the tech is still so much better than it was a year ago that I can't help but be optimistic about it. I use AI instead of search now pretty much every day and I have only been blindsided by a hallucination once so far. Search is still better for... things you already know the answer to, I agree. No, just kidding, search is better for simple stuff like that for sure. The big benefit of AI imo is that it collates all the information you would usually have to browse multiple sources for into one place - then you check the sources and one might be nonsense but the others are usually good.

I asked it just now, and it got everything right. How long ago was this?

I even used the basic bitch 4o model. There are better ones that I have access to as a paid user.

https://github.com/vectara/hallucination-leaderboard

The current SOTA LLMs hallucinate as little as 0.8% of the time on well-grounded tasks, text summarization in this particular benchmark. Of course, the rate varies for other tasks, and results worsen once you get into obscure topics.

I asked it last week. My husband has a highly tuned LLM (granted, he buys access) and we have an ongoing friendly argument about how useful (him) or useless (me) they are. So whenever it comes up I ask ChatGPT some dead simple question to see if it gets in the right ballpark. In this case (and often) it didn't - it gave me bodyweight stuff like deadbug. Don't get me wrong, deadbug is useful! But the whole point of LIFTMOR is that us oldsters need to be lifting heavy (safely) to increase bone strength. Stretching and bodyweight work are helpful but not enough.

I would imagine it depends on the kind of thing you want to verify. In the old days (meaning last year) I would often simply ask, after an answer had been produced, "Really?" and the LLM would double-check itself and at times respond with really annoying phrases like "You caught me!" and proceed to explain why what it had just reported to me as accurate was, in fact, inaccurate. Again, it depends on what it's doing for you, and how it's been calibrated by you to do that (though calibration is not perfect: I've long instructed it not to fabricate or embroider, and at times it still does).

The easiest thing to do is just ask it. "Can you produce the pages and precise quotes of xyz?" Depending on the response, continue questioning it until you're where you want to be.
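If the source is a PDF, you can also check those page-and-quote claims mechanically with something like the sketch below. It uses pypdf; the file path, page number and quote are made-up examples, and PDF text extraction is imperfect, so a miss means "check by hand" rather than "definitely fabricated".

```python
from pypdf import PdfReader


def _normalize(text: str) -> str:
    """Collapse whitespace and lowercase so wrapping/case differences don't matter."""
    return " ".join(text.split()).lower()


def check_citation(pdf_path: str, page_number: int, quote: str) -> bool:
    """Return True if the quote appears on the given 1-indexed page of the PDF."""
    reader = PdfReader(pdf_path)
    if not 1 <= page_number <= len(reader.pages):
        return False  # the model cited a page that doesn't exist
    page_text = reader.pages[page_number - 1].extract_text() or ""
    return _normalize(quote) in _normalize(page_text)


if __name__ == "__main__":
    # Made-up citation for illustration: swap in whatever the model gave you.
    ok = check_citation("contract.pdf", 12, "termination requires 30 days written notice")
    print("verified on page 12" if ok else "not found - check by hand")
```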

Others will very likely be able to suggest a more efficient strategy.

I have had similar experiences, but the LLMs will "correct" their correct answer to be incorrect. I now just view the whole project as useful for creative idea generation, but any claims about the real world need to be fact-checked. No lab seems to be able to get these things to stop confabulating, and I'm astonished people trust them as much as they seem to.

Just to round out the space of anecdotes a little more: when I've called out LLMs in the past I've sometimes had them "correct" their incorrect answer to still be incorrect but in a different way.

(Has anyone seen an LLM correct its correct answer to be correct but in a different way? That would fill the last cell of the 2x2 possibility space.)

They're still very useful in cases where checking an answer for correctness is much easier than coming up with a possible answer to begin with. I love having a search engine where my queries can be vague descriptions and yet still come up with a high rate of reasonable results. You just can't skip the "checking an answer for correctness" step.

Yes, this used to be commonplace in my experience. One should always, at the very least, triangulate results with other sources if the stakes are high.