
Small-Scale Question Sunday for June 15, 2025

Do you have a dumb question that you're kind of embarrassed to ask in the main thread? Is there something you're just not sure about?

This is your opportunity to ask questions. No question too simple or too silly.

Culture war topics are accepted, and proposals for a better intro post are appreciated.


How do you best verify large language model output?

I hear lots of people say they use LLMs to search through documents or to get ideas for how something works, but my question is: how do people verify the output? Is it as simple as copy-pasting keywords into Google to find the actual science textbooks? Or is there some better set of steps to take that I'm missing? I also wonder how you do that when looking through a document. Is there some method for getting the LLM to output page citations so you can check them (maybe it's in the settings or something)?

An easy trick is to get another model to review/critique it. If the two models disagree, get them to debate each other until they reach a consensus.
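
In Python, that loop might look something like the sketch below. This is just one way to wire it up against the OpenAI SDK; the model names and the "VERDICT: OK" convention are placeholders I made up, not anything official:

```python
# Sketch of cross-model review: model B critiques model A's answer,
# and A revises until B signs off or we run out of rounds.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def cross_check(question: str, max_rounds: int = 3) -> str:
    answer = ask("gpt-4o", question)  # model A drafts
    for _ in range(max_rounds):
        critique = ask(
            "gpt-4o-mini",  # model B reviews; ideally a different vendor entirely
            f"Question: {question}\nAnswer: {answer}\n"
            "List any factual errors or unsupported claims. "
            "If the answer looks sound, reply with exactly 'VERDICT: OK'.",
        )
        if "VERDICT: OK" in critique:
            break  # the models agree; treat the answer as provisionally verified
        answer = ask(
            "gpt-4o",
            f"Question: {question}\nYour previous answer: {answer}\n"
            f"A reviewer raised these objections:\n{critique}\n"
            "Revise the answer, conceding points only where the reviewer is right.",
        )
    return answer
```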

These days, outright hallucinations are quite rare, but it's still worth doing due diligence for anything mission-critical.

As George suggested, you can also ask for verbatim quotes or citations, though you'll need to manually check them.
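
The checking step is mechanical enough to script, at least for exact quotes. A minimal sketch, assuming you already have the document as plain text; whitespace gets normalized because PDF extraction rarely preserves it:

```python
# Rough machine check for "verbatim" quotes: confirm each quote the model
# gave you actually appears in the source document.
import re

def normalize(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip().lower()

def verify_quotes(document: str, quotes: list[str]) -> dict[str, bool]:
    haystack = normalize(document)
    return {q: normalize(q) in haystack for q in quotes}

# Any quote mapping to False is either paraphrased or hallucinated and
# needs a human look before you trust the claim it supports.
```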


I asked ChatGPT what the four core exercises of the LIFTMOR routine were, and it didn't get a single one correct. It's a simple question to Google, so I'm not sure how it got it so wrong. When I changed the question to specify the LIFTMOR routine for counteracting osteoporosis, it got it right. Google doesn't require the additional context.

Oh man, if you think that's bad, AI Studio will drive you mad. I was asking it what it could do (using 2.5 Pro, always, of course, since it's free) and we got onto its TTS abilities. I checked the list of dozens of star names like Achernar and Fenrir and asked it why none of them mention which ones are male or female, and it rattled off a wall of text about how Google wanted to promote inclusivity, avoid gender stereotyping, and focus on function over form.

After it refused to reply to my 'lol fuck you', I developed my argument into "actually, people have been able to readily distinguish male voices from female for thousands of years, so it doesn't matter what lofty goals Google has; what they have done is reduce function due entirely to form." After some more sparring it admitted that the function of a Google TTS bot is to optimise its immediate task, not shape future behaviour, and it agreed to tell me which names were which gender.

Victory? No, not even close. All of its voices were named after their accent followed by a person's name, not star names. It apologised and explained how to find the voices named the way it said they were. Incorrectly: those names did not exist, nor did the pulldown menu it told me to use. I explained that, and it apologised again and explained that Google had rolled out the new Chirp 3 system, so the actual names were Vega, Sirius, Maia and so on. Incorrect again; none of those names were available to me. By now it was beating itself up pretty hard, and the conclusion it came to was that Google was A/B testing, and it asked if I could tell it some of the names I saw in the list so it could piece together our disconnect. I mentioned Achernar and Fenrir and Gacrux and Achird, and it finally managed to give me a list of voices that sounded male and voices that sounded female. One of them was clearly an effeminate man, but the rest were spot on.

It was a lot of effort for very little reward, but I was just fucking around with AI Studio anyway, and I found the entire thing much more interesting than frustrating. This was the best version of Google's AI looking at another part of itself and whiffing so completely that I was beginning to feel sorry for it. And yet the tech is still so much better than it was a year ago that I can't help but be optimistic about it. I use AI instead of search pretty much every day now, and I have only been blindsided by a hallucination once so far. Search is still better for... things you already know the answer to, I agree. No, just kidding, search is better for simple stuff like that for sure. The big benefit of AI, imo, is that it collates all the information you would usually have to browse multiple sources for into one place; then you check the sources, and one might be nonsense, but the others are usually good.

I asked it just now, and it got everything right. How long ago was this?

I even used the basic-bitch 4o model. There are better ones that I have access to as a paid user.

https://github.com/vectara/hallucination-leaderboard

The current SOTA LLMs hallucinate as little as 0.8% of the time on well-grounded tasks (text summarization, in this benchmark's case). Of course, the rate varies for other tasks, and the results worsen once you get into obscure topics.
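
If you want to kick the tires on that kind of grounding check yourself, a crude approximation is to run an off-the-shelf NLI cross-encoder and ask whether the source text entails the summary. Sketch below; the model name is a real one from the sentence-transformers hub, but treating "entailment" as "not hallucinated" is a big simplification of what the leaderboard's dedicated evaluator (HHEM) actually does:

```python
# Crude grounding check: does the source text entail the summary?
# Assumes `pip install sentence-transformers`; the label order below
# follows the cross-encoder/nli-deberta-v3-base model card.
from sentence_transformers import CrossEncoder

nli = CrossEncoder("cross-encoder/nli-deberta-v3-base")
LABELS = ["contradiction", "entailment", "neutral"]

def looks_grounded(source: str, summary: str) -> bool:
    # predict() takes (premise, hypothesis) pairs and returns raw scores
    scores = nli.predict([(source, summary)])
    return LABELS[scores[0].argmax()] == "entailment"

print(looks_grounded(
    "The LIFTMOR trial tested heavy resistance training in women with low bone mass.",
    "LIFTMOR studied heavy lifting in women with low bone density.",
))
```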