
Small-Scale Question Sunday for June 15, 2025

Do you have a dumb question that you're kind of embarrassed to ask in the main thread? Is there something you're just not sure about?

This is your opportunity to ask questions. No question too simple or too silly.

Culture war topics are accepted, and proposals for a better intro post are appreciated.


How do you best verify large language model output?

I hear lots of people say they use LLMs to search through documents or to get ideas for how something works, but my question is: how do people verify the output? Is it as simple as copy-pasting keywords into Google to find the actual science textbooks? Or is there some better set of steps I'm missing? I also wonder how you do that when looking through a document: is there some method for getting the LLM to output page citations so you can check those (maybe it's in the settings or something)?

An easy trick is to get another model to review/critique it. If the two models disagree, get them to debate each other until they reach a consensus.
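
If you want to automate that cross-check, here's a minimal sketch assuming the openai and anthropic Python SDKs with API keys in the environment; the model names are illustrative, and the "debate until consensus" loop is collapsed to a single critique pass:

```python
# Sketch: one model drafts an answer, a second model critiques it.
# Assumes the openai and anthropic SDKs are installed and API keys are set;
# model names are illustrative and may need updating.
from openai import OpenAI
import anthropic

openai_client = OpenAI()
claude_client = anthropic.Anthropic()

question = "What are the four core exercises of the LIFTMOR routine?"

draft = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": question}],
).choices[0].message.content

critique = claude_client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": (
            f"Question: {question}\n\nProposed answer:\n{draft}\n\n"
            "Point out any factual errors or likely hallucinations."
        ),
    }],
).content[0].text

print(draft)
print("--- second model's critique ---")
print(critique)
```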

These days, outright hallucinations are quite rare, but it's still worth doing due diligence for anything mission-critical.

As George suggested, you can also ask for verbatim quotes or citations, though you'll need to manually check them.
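
If you can extract the document text yourself, the manual check is easy to script. A rough sketch, assuming you already have the document as a plain string and the model's quotes as a list of strings; it only confirms that each quoted passage literally appears in the source, which catches fabricated quotes but not quotes pulled out of context:

```python
# Sketch: verify that quotes an LLM attributes to a document actually appear in it.
# Assumes the document text was already extracted (e.g. from a PDF) into a string;
# whitespace and case are normalized before matching.
import re

def normalize(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip().lower()

def check_quotes(document_text: str, quotes: list[str]) -> dict[str, bool]:
    doc = normalize(document_text)
    return {quote: normalize(quote) in doc for quote in quotes}

# Example with made-up strings:
doc = ("The trial protocol says the intervention consisted of deadlifts, squats, "
       "and overhead presses performed twice weekly.")
quotes = [
    "the intervention consisted of deadlifts, squats, and overhead presses",
    "a passage that does not appear anywhere",
]
print(check_quotes(doc, quotes))  # first is True, second is False
```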

I asked ChatGPT what the four core exercises of the LIFTMOR routine were, and it didn't get a single one correct. It's a simple question to Google, so I'm not sure how it got it so wrong. When I changed the question to specify the LIFTMOR routine for counteracting osteoporosis, it got it right. Google doesn't require that additional context.

I asked it just now, and it got everything right. How long ago was this?

I even used the basic bitch 4o model. There are better ones that I have access to as a paid user.

https://github.com/vectara/hallucination-leaderboard

The current SOTA LLMs hallucinate as little as 0.8% of the time on well-grounded tasks (in this benchmark, document summarization). Of course, the rate varies for other tasks, and results worsen on obscure topics.

I asked it last week. My husband has a highly tuned LLM setup (granted, he buys access), and we have an ongoing friendly argument about how useful (him) or useless (me) they are. So whenever it comes up I ask ChatGPT some dead-simple question to see if it gets in the right ballpark. In this case (and often) it didn't: it gave me bodyweight stuff like deadbug. Don't get me wrong, deadbug is useful! But the whole point of LIFTMOR is that us oldsters need to be lifting heavy (safely) to increase bone strength. Stretching and bodyweight work are helpful but not enough.