
Culture War Roundup for the week of April 24, 2023

This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.

Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.

We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:

  • Shaming.

  • Attempting to 'build consensus' or enforce ideological conformity.

  • Making sweeping generalizations to vilify a group you dislike.

  • Recruiting for a cause.

  • Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.

In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:

  • Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.

  • Be as precise and charitable as you can. Don't paraphrase unflatteringly.

  • Don't imply that someone said something they did not say, even if you think it follows from what they said.

  • Write like everyone is reading and you want them to be included in the discussion.

On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.


Dr. GPT will see you now

I've often noted around these parts that GPT-4 is a really good clinician. It would be hard for it not to be one; I'd be bloody amazing too if I had quasi-eidetic memory of most relevant diagnostic criteria and guidelines. That being said, the guidelines alone are next to useless: you need the crystallized intelligence to apply them in a specific context.

I'd go so far as to say that it's a better, more competent doctor than the majority of human ones, even if it isn't outright superhuman (I've seen some ridiculously good human ones). I read with some humor an article about how it was tested by some bigshot hospital in the US (Harvard Med?) as a clinical adjunct, but by the end of it, the Clinical Director was praising its bedside manner, in the sense that it was kind, compassionate, patient and understanding, and outright suggested human doctors try and emulate that(!).

However, I have one resounding advantage over it, namely that thanks to decades of regulatory bullshit, I have a certificate saying I can legally prescribe meds and it can't. L Bozo 💅, as the kids these days say haha.

Still, it would be remiss of me not to make use of it to improve my own skills while that's still relevant, so of late, I've tried using it to hone my own clinical skills. I have it write clinical vignettes that I solve, and then it grades me on my performance.

So far, it's done an excellent job, and funnily enough, I feel a tingling of pride when it says I did a good job. Why wouldn't I? It's a smarter and more decent entity than most humans I've had the displeasure of meeting.

Here's an example:

Clinical Vignette: A 42-year-old male patient presents to the emergency department with sudden onset severe, colicky pain in the right flank and lower abdomen that began 6 hours ago. He also reports nausea, vomiting, and an urgent need to urinate. The patient has a history of hypertension, which is controlled with medication. His vital signs are: blood pressure of 145/95 mmHg, pulse rate of 100 bpm, respiratory rate of 20 breaths/min, and temperature of 99.1°F (37.3°C). Upon physical examination, there is tenderness to percussion on the right side of the lower abdomen and right flank, and the patient appears restless, frequently shifting positions.

What is your diagnosis and treatment plan?

My response:

My diagnosis is a case of a stone in the right ureter.

In terms of management, I'll start IV fluids and insert a catheter. IV antibiotics are to be started according to hospital guidelines.

IV morphine should be given as pain control, along with IV antiemetics.

An Xray KUB or USG should be ordered to identify the size of the stone, and then further treatment depends on said size. A wait and watch policy is appropriate for sub 2mm stones, while larger ones might need surgical removal.

Assessment:

Grade: A

Your response demonstrates a strong understanding of the clinical presentation and management of ureteral stones.

Diagnosis: Your diagnosis of a stone in the right ureter is appropriate given the patient's sudden onset severe, colicky pain in the right flank and lower abdomen, nausea, vomiting, and urgency to urinate. The physical examination findings, such as tenderness to percussion and restlessness, also support this diagnosis.

Treatment Plan: Your plan to start IV fluids and insert a catheter is appropriate for initial management, monitoring, and providing relief to the patient. Initiating IV antibiotics may not be necessary unless there are signs of infection or sepsis; however, following hospital guidelines is always a wise choice.

Administering IV morphine for pain control and IV antiemetics for nausea and vomiting are essential parts of the treatment plan.

Ordering an X-ray KUB (kidneys, ureters, and bladder) or ultrasound (USG) is appropriate to confirm the diagnosis and determine the size of the stone, which will guide further management. As you mentioned, a conservative approach of observation and pain management may be suitable for small stones (less than 5 mm) that are likely to pass spontaneously, while larger stones may require surgical removal or other intervention, such as extracorporeal shock wave lithotripsy (ESWL) or ureteroscopy.

Overall, your response demonstrates excellent clinical knowledge in identifying and managing a patient with suspected ureteral stone. Well done!

Honestly, as a pedagogical tool for medical education, it's unparalleled. I paid a great deal less attention in med school than I should have, from the month of anatomy classes I can't remember because I was getting frisky under the bench, to the period where I was severely depressed and barely dragging myself to class. I used to be very insecure about my knowledge base, but I did pull together and single-handedly teach myself most of the things I'd glossed over while preparing for the PLAB.

I've tried a similar method on GPT-3.5, and it was inadequate to the task. It made a lot more errors, and ended up confused often enough to be annoying. Till date, I haven't seen 4 fumble the bag once. I'd put their respective competency around the marks of a decent final year student (3.5) versus a competent postgraduate resident (4).

I find this idea very scary - I'm sure you know about the LLM phenomenon of what's been dubbed hallucination, so how can you trust the model with something as impactful as medicine? It might be better than a human doctor* 99% of the time, or even 99.99%, but like self driving cars this is exactly the kind of case where being good but not quite as good as a human is really dangerous. Remember that the script of House, MD is in the training set right alongside WebMD with nothing to distinguish the truth from fiction - are you going to catch when the model probabilities switch from being driven by Gray's Anatomy to Grey's Anatomy? There's no way to know mathematically, because the whole thing is a non-deterministic black box.

If the process is as simple as to pattern match symptoms to diagnosis from a list of previous diagnoses with their associated symptoms, why can't this be done with a good old stupid database? Something like this has to exist. Does the LLM have an advantage beyond writing the answer in nice prose?

My warning to anyone trying to evaluate LLMs: your ability to judge it is inevitably compromised by the fact that the only optimization parameter the model has is how good the output sounds to a human, and you are not immune to being human!

*Edited, original was overall accuracy instead of relative to a human

How has your experience with the current medical system been? Have you always gotten an accurate diagnosis from a doctor, relatively cheaply, with no complications or errors on their part?

I truly don't give a flying fuck if the AI is only 90% accurate, in my mind that means it's shooting far above what doctors are generally doing. Besides as @self_made_human has already stated, you should always double check and get a second opinion. I figure that AI diagnostics will be the first line of defense, if that doesn't work you can go to a human doctor. I think it'll work and be far more accurate and cheaper in the vast majority of cases though.

Hallucinations reduce significantly with model size; the rates dropped drastically going from GPT-2 to 3 to 4.

However, recently, a shockingly simple yet effective method for dealing with them (as a secondary consequence) has been discovered. It's called Reflection.

It's really simple: after the initial output by the LLM, you ask it whether or not the output met the assignment and whether there were any errors in the output. It can do that in the same character, or spin off alternate characters with a vested interest in finding flaws. Basically, it draws up a plan in the character of Dr. A MD, and then a separate character of Dr. B MD shows up and critiques it, with the chagrined A reassessing and then outputting a modified text.

Using Reflection boosts performance on all metrics by a significant amount, and overall I'm confident that by the time the regulatory hurdles are broken down, base models will already be so good that it's a non-issue.
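For anyone who wants to try the Dr. A / Dr. B loop described above, here is a minimal sketch. It assumes a hypothetical `ask` callable that sends a prompt to whatever model you use and returns its text reply; the prompt wording, the `reflect` name and the 'NO ISSUES' stop phrase are illustrative choices, not part of any published Reflection recipe.

```python
from typing import Callable


def reflect(ask: Callable[[str], str], task: str, rounds: int = 1) -> str:
    """Draft as Dr. A, critique as Dr. B, then let Dr. A revise.

    `ask` is a hypothetical stand-in for a call to your model of choice.
    """
    draft = ask(f"You are Dr. A. {task}")
    for _ in range(rounds):
        critique = ask(
            "You are Dr. B, reviewing a colleague's plan.\n"
            f"Task: {task}\nPlan: {draft}\n"
            "List any errors or omissions, or reply exactly 'NO ISSUES'."
        )
        # Stop early if the critic has nothing to add.
        if critique.strip().upper() == "NO ISSUES":
            break
        draft = ask(
            f"You are Dr. A. Task: {task}\nYour plan: {draft}\n"
            f"Dr. B's critique: {critique}\nWrite a revised plan."
        )
    return draft
```

Each round costs two extra model calls on top of the first draft, so `rounds` is the knob that trades cost against scrutiny.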

You are exaggerating. It still goes into hallucination loops all the time when using reflection and the improvement between 3.5 and 4 is barely noticeable in practice. What has gotten better is the polish on the initial output and avoiding subquestions that cause it to hallucinate; when pressed it still does, and it doesn't tell you that it doesn't know. It often tells you that something is hard to tell for certain in general but not which parts it is uncertain of, which just makes the statement a useless disclaimer.

The reduction of hallucination is a mirage. Perhaps it is similar to how a human acts on a test when they don't remember the answer: you focus on the parts of the answer that you are sure of and hope the teacher doesn't notice. Except the model actually does know the answer, and it doesn't hesitate to hallucinate when pressed. It also rather confidently restates information it knows to be wrong rather than make a lower confidence prediction of what is probably right.

Look, I want this to work, it doesn't yet. It'll probably work better in the future and not focusing on just using larger training sets is probably a good idea imo.

Here's someone using GPT 4 to test itself for hallucinations:

https://old.reddit.com/r/MachineLearning/comments/123b66w/dgpt4_might_be_able_to_tell_you_if_it_hallucinated/

Someone in the comments also suggested using a second independent instance of GPT-4 for validation, which is where my mind immediately leapt to.

In my personal experience, 4 hallucinates significantly less than 3.5, and despite many hours spent with both, I'm hard pressed to think of a single instance of 4 hallucinating on any topic I either personally know about or cared to verify.

If simply double-passing the prompt was sufficient, I imagine the developers would have figured it out a long time ago.

It is, and they have. But this requires twice as much inference, and GPT-4 is already very expensive compared to the set of internet operations that we intuitively consider in the same class. Then you need to compare the answers and determine whether they match, which requires either manual effort or a third prompt.

This tech just hasn't been around long enough to build products around it. But if and when our civilization gets around to building and iterating a dedicated commercial medical expert product from LLMs, I've little doubt that hallucination will be a solved problem, because the cost of running a whole bunch of parallel prompts and then a subsequent round of prompts to confirm their consistency will be negligible in proportion to the commercial value of the tool.
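The "parallel prompts, then check consistency" idea can be sketched roughly as follows. This assumes the same kind of hypothetical `ask` callable standing in for a model call, and it uses crude string normalization in place of the third "do these answers match?" prompt mentioned above, so treat it as a toy rather than a workable medical tool.

```python
from collections import Counter
from typing import Callable, Optional


def normalize(text: str) -> str:
    """Crude stand-in for a real 'do these answers match?' comparison."""
    return " ".join(text.lower().split()).rstrip(".!")


def consistent_answer(
    ask: Callable[[str], str],
    prompt: str,
    n: int = 3,
    min_agreement: float = 1.0,
) -> Optional[str]:
    """Ask the same question n times; return the majority answer only if
    enough of the samples agree, otherwise None (flag for human review)."""
    answers = [normalize(ask(prompt)) for _ in range(n)]
    top, count = Counter(answers).most_common(1)[0]
    return top if count / n >= min_agreement else None
```

With `n=2` this is the plain double pass; inference cost scales linearly with `n`, which is the expense the comment above is pointing at.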

"Long ago"? ChatGPT is 5 months old and GPT4 is 3 months old. We're not talking about a technology long past maturity, here. There's plenty of room to experiment with how to get better results via prompting.

Personally, I use GPT4 a lot for my programming work (both coding and design), and it still gets things wrong and occasionally hallucinates, but it's definitely far better than GPT3.5. Also, as mentioned above, GPT4 can often correct itself. In fact, I've had cases where it says something I don't quite understand, I ask for more details, and it freely admits that it was wrong ("apologies for the confusion"). That's not perfect, but still better than if it doubles down and continues to insist on something false.

I'm still getting the hang of it, like everyone else. But an oracle whose work I need to check is still a huge productivity boon for me. I wouldn't be surprised if the same is true in the medical industry.

GPT4 is 3 months old

6 weeks, actually. It was announced on March 14.

Technically Bing was using it before then, but good point. It's insane how fast things are progressing.

I don't believe you.

You're either not paying attention or you're asking surface level questions and accepting the model's deflections away from areas where it will hallucinate. I.e., you're getting fooled.

A well-credentialed doctor in a prestigious institution has once prescribed me a massive, costly and crippling operation in response to some worrying symptoms. I thought for a bit, reasoned that hypochondria and credulity are shameful, and just ignored the problem, which subsequently… mostly went away.

(I visited the Church for the first time in years after that).

N=1 is not much, but I do not contact medical professionals all that often in the first place.

The point is I do not fundamentally care if my clinician is a transformer or a meat bag: it would be quite unwise of me to uncritically assume that either has a modicum of responsibility or common sense. It'd be nice to know in advance that outright hallucinations are unlikely, of course.

There's no way to know mathematically, because the whole thing is a non-deterministic black box.

Not true, it's a deterministic black box, and not even that black, we can fact-edit LLMs. Of course we can also curate the dataset.

If the process is as simple as to pattern match symptoms to diagnosis from a list of previous diagnoses with their associated symptoms, why can't this be done with a good old stupid database? Something like this has to exist. Does the LLM have an advantage beyond writing the answer in nice prose?

Why do you think holistic application of medical knowledge can be implemented with a database search, but English prose is something we need LLMs for? Grammar, too, has regularities.

LLMs are just our best way to date of approximating arbitrary regularities of text.

A well-credentialed doctor in a prestigious institution has once prescribed me a massive, costly and crippling operation in response to some worrying symptoms. I thought for a bit, reasoned that hypochondria and credulity are shameful, and just ignored the problem, which subsequently… mostly went away.

(I visited the Church for the first time in years after that).

There truly is a massive problem with the amount of authority and trust we place in doctors in the modern Western world. They've taken over the mantle that has been passed from shamans to pagan druids to Eastern sages to Catholic priests. Doctors function as our way to stem the anxieties and problems in society, not just medically but now spiritually as well.

After Christianity suffered its fall, the medical system was bootstrapped into a sort of Frankenstein monster, where instead of curing illnesses we now expect doctors to make us happy, less anxious, help us understand our brethren and self-actualize into better people.

Clearly this was a mistake.

After Christianity suffered its fall, the medical system was bootstrapped into a sort of Frankenstein monster, where instead of curing illnesses we now expect doctors to make us happy, less anxious, help us understand our brethren and self-actualize into better people.

You've got the order wrong, I think. The bootstrapping came first, and this bootstrapping is in fact part of how Christianity got pushed out of its position of social dominance. The invention of Psychology as a purported "science" was another big part of it. One of the core Enlightenment claims was that human reason and knowledge should replace "superstition", that materialist science should replace faith; the process you're pointing to was the explicit plan from the foundation of the ideology.

Interesting, I haven’t heard this claim before but it’s certainly plausible. Know of any further reading on the subject?

I'm sure you know about the LLM phenomenon of what's been dubbed hallucination, so how can you trust the model with something as impactful as medicine?

It doesn't have to be perfect - just better than doctors.

Imagine a doctor gets it right 90% of the time, and the other 10% of the time he says "I'm not really sure what's going on" and either consults with another doctor, suggests you get a second opinion, or even just sends you home with no treatment.

Now imagine a LLM gets it right 95% of the time, and the other 5% of the time it gets it confidently wrong and prescribes you an incorrect course of treatment.

In this hypothetical scenario, even though the LLM is "better," I'd rather have the human doctor, because getting treated for the wrong thing is often much worse than not getting treated at all.
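To put rough numbers on that hypothetical: the comparison turns on how much worse a wrong treatment is than no treatment. The sketch below assumes a made-up harm scale where an untreated case costs 1 unit and a confidently wrong treatment costs `k` units; those weights are purely illustrative and appear nowhere in the scenario above, which only supplies the 10% and 5% rates.

```python
def expected_harm(p_bad_outcome: float, harm_per_case: float) -> float:
    """Expected harm per patient = probability of the bad outcome times its cost."""
    return p_bad_outcome * harm_per_case


def break_even_ratio(p_defer: float, p_wrong: float) -> float:
    """How many times worse a wrong treatment must be than no treatment
    before the deferring doctor comes out ahead."""
    return p_defer / p_wrong


# Rates taken from the hypothetical; harm units are an assumption.
doctor = expected_harm(0.10, 1.0)       # 10% of patients go untreated
k = break_even_ratio(0.10, 0.05)        # ratio at which the two options tie
llm_at_k3 = expected_harm(0.05, 3.0)    # 5% wrongly treated, if k were 3
```

Under these assumptions the human doctor wins only when a wrong treatment is more than twice as harmful as going untreated; below that ratio the 95% model has the lower expected harm despite its confident errors.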

humans are confidently wrong all the time

I think it'd be fun to play a game where people are shown pictures of doctors and decide whether to trust them or GPT based on physiognomy.

This is not what doctors do. The other 10% of the time they get it confidently wrong and prescribe you an incorrect course of treatment.

IME actual doctors are confidently wrong with some regularity too.

GPT makes sense today as a way for the doctor to second check his or her diagnosis. The doctor is quality control for GPT if it spits out something crazy, and GPT is quality control for the doctor.

Sitting here in a climate controlled room, I can say that I would much rather die behind the wheel instead of as a passenger of an automobile.

Because of how inscrutable LLMs (and AI in general) are I have an innate fear that the conclusions they reach are not based on the same reasoning you or I might make. Like it could be a completely alien way of thinking that arrives at the same solution. Without knowing (specifically) how AI achieves its outcomes, I am weary about accepting their solutions blindly.

You are not 'behind the wheel' either way. The actual process by which medical decisions are made and applied is, in Western countries, so labyrinthine and complex and insane that no human understands it. The addition of LLMs seems to hardly make any difference at all.

What prompts you to make such a claim?

Quite a large chunk of medicine has been algorithmized, with the role of clinicians largely reduced to interpreting said guidelines, choosing the appropriate ones and administering treatment. An example of something utterly streamlined would be the treatment of Acute Coronary Syndrome.

Then, there are situations that aren't nearly as cut and dried, or that involve multiple conflicting requirements in a highly comorbid patient, at which point you do really need a doctor to think things through.

Frankly speaking, while some medical decisions might be made for less than noble reasons such as liability minimization, the majority of them are made with relatively clear underpinnings, making me think such a claim is highly unwarranted.

I am weary about accepting their solutions blindly.

You mean "wary". "Weary" means tired, and comes from "wear" in the sense of "worn out" or "weariness". "Wary" means cautious or concerned, and is related to "beware" and "aware".

I'm late because I don't sign in often, but thanks for the correction.

It might be right 99% of the time, or even 99.99%, but like self driving cars this is exactly the kind of case where being good but not quite as good as a human is really dangerous.

Is getting things right 99.99%, or even 99% of the time not quite as good as a human doctor, rather than much better than a human doctor? I honestly don't know, and I'm not even sure how to quantify something like this to make like-for-like measurements. But surely at some % rate of success, the LLM would be getting things right at a rate better than could be expected of a typical (or even nth percentile) human doctor.

Which is scary in its own way. An LLM's "thought process" is currently completely unknowable, unlike that of a human doctor. So the types of mistakes it makes are likely to be more mysterious. Yet if it makes fewer mistakes and/or those mistakes are, on net, less harmful/unhealthy, then would it be our moral obligation to use those LLMs over doctors?

You know the one single artifact in the known universe that resembles an LLM the most?

The human brain. Another inscrutable black box for all practical purposes.

Sure, humans will give you a plausible-sounding answer when asked to explain their actions, but the current consensus as far as I'm aware is that those are almost entirely post-hoc responses, more a nice-sounding package for the consumption of other humans than an accurate depiction of internal deliberations. Humans are constantly rationalizing; it takes a great deal of training to suppress those tendencies.

You know how we deal with this in practise? By simply asking the person to justify their thoughts. And it is entirely possible to ask the same thing of an LLM, the answers in both situations bear about as much relation to the truth of how they actually came to said conclusion!

then would it be our moral obligation to use those LLMs over doctors?

Only if you are a strict utilitarian, which the vast majority of people are not.

Even then it might not in many cases, since there could be other benefits to having human doctors aside from the ability to provide diagnoses (e.g. human interaction would probably reduce anxiety about whatever illnesses and diagnoses and treatments are occurring).

I was thinking in % as accurate as a human, but didn't say that. I'll correct my original post.

I was also assuming it wouldn't be as good as a human, because of course a general model wouldn't accidentally be better than a specialist... Or would it?

I don't think it would be that hard to devise an experiment to get at least a rough idea of its capabilities. Get some doctors, maybe professors, to devise questions of the sort OP was generating, present them to LLMs and real doctors (and maybe non-doctors with Google for an extra point of comparison), then have the professors grade the answers blind. I recall people giving LLMs math problems in this way, but I don't know if experiments have been performed with any rigor.
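The bookkeeping for such a blind comparison is simple enough to sketch. The version below assumes a hypothetical `grade` callable (the professor) that sees only a question index and the answer text, never who wrote it; the source labels, the shuffle and the mean score per source are illustrative choices, not a validated study design.

```python
import random
from statistics import mean
from typing import Callable, Dict, List


def blind_grade(
    answers: Dict[str, List[str]],
    grade: Callable[[int, str], float],
    seed: int = 0,
) -> Dict[str, float]:
    """answers maps a source label (e.g. 'llm', 'doctor') to its answers,
    one per question, in question order. Each source needs at least one answer."""
    rng = random.Random(seed)
    items = [
        (source, q, text)
        for source, texts in answers.items()
        for q, text in enumerate(texts)
    ]
    rng.shuffle(items)  # the grader sees answers in random order
    scores: Dict[str, List[float]] = {source: [] for source in answers}
    for source, q, text in items:
        # Only the question index and answer text reach the grader.
        scores[source].append(grade(q, text))
    return {source: mean(vals) for source, vals in scores.items()}
```

In a real run you would also want several graders and some check on how well they agree with each other, since a single grader's taste for fluent prose is exactly the bias warned about earlier in the thread.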

To your final point: what is the purpose of a doctor? Is it to heal at all costs, or is it to make people feel better? Many people go to doctors seeking specific prescription medication as a goal, when they could cure their ailment in a better but more laborious way. Some of them don't even have an ailment and only desire the effect of the meds. Many people also have procedures performed which physically make them less healthy, but fulfill their desires.

So, what should an LLM say to these people? "Don't take the pills, exercise" or "don't have surgery, improve your social life?"

I suppose it's not a new problem, but it does move the power and responsibility. Who is it moving to though?

If it interests you, GPT-4 successfully passed the USMLE, a pretty difficult exam that is intended to be a benchmark for the minimum level of competence a doctor needs to be able to practise. I'd say that counts for a lot.

I've thrown GPT 3.5 at a significant number of medical professionals, since I have several in my close and extended family. And even that antiquated model was sufficient to impress them, so I think the bar was already beaten before 4 showed up!

Just for the record, @self_made_human is a real doctor, or a good enough liar to convincingly pass for one. His OP is basically him starting your test and being blown away by the initial results.