Culture War Roundup for the week of April 24, 2023

Dr. GPT will see you now

I've often noted around these parts that GPT-4 is a really good clinician. It would be hard for it not to be one; I'd be bloody amazing too if I had quasi-eidetic memory of most relevant diagnostic criteria and guidelines. That being said, the guidelines alone are next to useless; you need the crystallized intelligence to apply them in a specific context.

I'd go so far as to say that it's a better, more competent doctor than the majority of human ones, even if it isn't outright superhuman (I've seen some ridiculously good human ones). I read with some humor an article about how it was tested by some bigshot hospital in the US (Harvard Med?) as a clinical adjunct, but by the end of it, the Clinical Director was praising its bedside manner, in the sense that it was kind, compassionate, patient and understanding, and outright suggested human doctors try and emulate that(!).

However, I have one resounding advantage over it: thanks to decades of regulatory bullshit, I have a certificate saying I can legally prescribe meds, and it can't. L Bozo 💅, as the kids these days say haha.

Still, it would be remiss of me not to make use of it to improve my own skills while that's still relevant, so of late, I've tried using it to hone my own clinical skills. I have it write clinical vignettes that I solve, and then it grades me on my performance.

So far, it's done an excellent job, and funnily enough, I feel a tingling of pride when it says I did a good job. Why wouldn't I? It's a smarter and more decent entity than most humans I've had the displeasure of meeting.

Here's an example:

Clinical Vignette: A 42-year-old male patient presents to the emergency department with sudden onset severe, colicky pain in the right flank and lower abdomen that began 6 hours ago. He also reports nausea, vomiting, and an urgent need to urinate. The patient has a history of hypertension, which is controlled with medication. His vital signs are: blood pressure of 145/95 mmHg, pulse rate of 100 bpm, respiratory rate of 20 breaths/min, and temperature of 99.1°F (37.3°C). Upon physical examination, there is tenderness to percussion on the right side of the lower abdomen and right flank, and the patient appears restless, frequently shifting positions.

What is your diagnosis and treatment plan?

My response:

My diagnosis is a case of a stone in the right ureter.

In terms of management, I'll start IV fluids and insert a catheter. IV antibiotics are to be started according to hospital guidelines.

IV morphine should be given as pain control, along with IV antiemetics.

An Xray KUB or USG should be ordered to identify the size of the stone, and then further treatment depends on said size. A wait and watch policy is appropriate for sub 2mm stones, while larger ones might need surgical removal.


Grade: A

Your response demonstrates a strong understanding of the clinical presentation and management of ureteral stones.

Diagnosis: Your diagnosis of a stone in the right ureter is appropriate given the patient's sudden onset severe, colicky pain in the right flank and lower abdomen, nausea, vomiting, and urgency to urinate. The physical examination findings, such as tenderness to percussion and restlessness, also support this diagnosis.

Treatment Plan: Your plan to start IV fluids and insert a catheter is appropriate for initial management, monitoring, and providing relief to the patient. Initiating IV antibiotics may not be necessary unless there are signs of infection or sepsis; however, following hospital guidelines is always a wise choice.

Administering IV morphine for pain control and IV antiemetics for nausea and vomiting are essential parts of the treatment plan.

Ordering an X-ray KUB (kidneys, ureters, and bladder) or ultrasound (USG) is appropriate to confirm the diagnosis and determine the size of the stone, which will guide further management. As you mentioned, a conservative approach of observation and pain management may be suitable for small stones (less than 5 mm) that are likely to pass spontaneously, while larger stones may require surgical removal or other intervention, such as extracorporeal shock wave lithotripsy (ESWL) or ureteroscopy.

Overall, your response demonstrates excellent clinical knowledge in identifying and managing a patient with suspected ureteral stone. Well done!

Honestly, as a pedagogical tool for medical education, it's unparalleled. I paid a great deal less attention in med school than I should have, from the month of anatomy classes I can't remember because I was getting frisky under the bench, to the period where I was severely depressed and barely dragging myself to class. I used to be very insecure about my knowledge base, but I did pull together and single-handedly teach myself most of the things I'd glossed over while preparing for the PLAB.

I've tried a similar method on GPT-3.5, and it was inadequate to the task. It made a lot more errors, and ended up confused often enough to be annoying. To date, I haven't seen 4 fumble the bag once. I'd put their competency at around the level of a decent final-year student versus a competent postgraduate resident, respectively.

I find this idea very scary - I'm sure you know about the LLM phenomenon of what's been dubbed hallucination, so how can you trust the model with something as impactful as medicine? It might be better than a human doctor* 99% of the time, or even 99.99%, but like self-driving cars this is exactly the kind of case where being good but not quite as good as a human is really dangerous. Remember that the script of House, MD is in the training set right alongside WebMD with nothing to distinguish the truth from fiction - are you going to catch when the model probabilities switch from being driven by Gray's Anatomy to Grey's Anatomy? There's no way to know mathematically, because the whole thing is a non-deterministic black box.

If the process is as simple as pattern-matching symptoms to a diagnosis from a list of previous diagnoses with their associated symptoms, why can't this be done with a good old stupid database? Something like this has to exist. Does the LLM have an advantage beyond writing the answer in nice prose?

My warning to anyone trying to evaluate LLMs: your ability to judge it is inevitably compromised by the fact that the only optimization parameter the model has is how good the output sounds to a human, and you are not immune to being human!

*Edited, original was overall accuracy instead of relative to a human

It might be right 99% of the time, or even 99.99%, but like self driving cars this is exactly the kind of case where being good but not quite as good as a human is really dangerous.

Is getting things right 99.99%, or even 99% of the time not quite as good as a human doctor, rather than much better than a human doctor? I honestly don't know, and I'm not even sure how to quantify something like this to make like-for-like measurements. But surely at some % rate of success, the LLM would be getting things right at a rate better than could be expected of a typical (or even nth percentile) human doctor.

Which is scary in its own way. An LLM's "thought process" is currently completely unknowable, unlike that of a human doctor. So the types of mistakes it makes are likely to be more mysterious. Yet if it makes fewer mistakes and/or those mistakes are, on net, less harmful/unhealthy, then would it be our moral obligation to use those LLMs over doctors?

I was thinking in % as accurate as a human, but didn't say that. I'll correct my original post.

I was also assuming it wouldn't be as good as a human, because of course a general model wouldn't accidentally be better than a specialist... Or would it?

I don't think it would be that hard to devise an experiment to get at least a rough idea of its capabilities. Get some doctors, maybe professors, to devise questions of the sort OP was generating, present them to LLMs and real doctors (and maybe non-doctors with Google for an extra point of comparison), then have the professors grade the answers blind. I recall people giving LLMs math problems in this way, but I don't know if experiments have been performed with any rigor.
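
For what it's worth, the blinding part of that experiment is easy to sketch. Below is a rough Python illustration, not a validated study protocol: the function names (`blind`, `unblind`), the source labels, and the numeric grading scale are all my own assumptions. The idea is just that graders only ever see an anonymous ID and the answer text, and the mapping back to "LLM" or "doctor" is applied after the grades are in.

```python
import random
from collections import defaultdict
from statistics import mean


def blind(answers, seed=0):
    """Shuffle answers and strip the source label.

    answers: list of (source, question_id, text) tuples, where source is a
    hypothetical label such as "llm", "doctor" or "layperson_with_google".
    Returns (blinded, key): blinded is what the graders see, key maps each
    anonymous ID back to its source and stays hidden until grading is done.
    """
    rng = random.Random(seed)
    order = list(range(len(answers)))
    rng.shuffle(order)

    blinded, key = [], {}
    for anon_id, idx in enumerate(order):
        source, question_id, text = answers[idx]
        blinded.append(
            {"anon_id": anon_id, "question_id": question_id, "text": text}
        )
        key[anon_id] = source
    return blinded, key


def unblind(scores, key):
    """Average the graders' scores per source once grading is finished.

    scores: {anon_id: numeric grade} as returned by the graders.
    """
    by_source = defaultdict(list)
    for anon_id, score in scores.items():
        by_source[key[anon_id]].append(score)
    return {source: mean(values) for source, values in by_source.items()}


if __name__ == "__main__":
    # Placeholder answers and grades, purely to show the data flow.
    answers = [
        ("llm", "q1", "answer text A"),
        ("doctor", "q1", "answer text B"),
        ("llm", "q2", "answer text C"),
        ("doctor", "q2", "answer text D"),
    ]
    blinded, key = blind(answers, seed=42)
    placeholder_scores = {item["anon_id"]: 3 for item in blinded}
    print(unblind(placeholder_scores, key))
```

A real version would also want multiple graders per answer and some check on how much they agree with each other, but that is beyond this sketch.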

To your final point: what is the purpose of a doctor? Is it to heal at all costs, or is it to make people feel better? Many people go to doctors seeking specific prescription medication as a goal, when they could cure their ailment in a better but more laborious way. Some of them don't even have an ailment and only desire the effect of the meds. Many people also have procedures performed which physically make them less healthy, but fulfill their desires.

So, what should an LLM say to these people? "Don't take the pills, exercise" or "Don't have surgery, improve your social life"?

I suppose it's not a new problem, but it does move the power and responsibility. Who is it moving to though?

If it interests you, GPT-4 successfully passed the USMLE, a pretty difficult exam that is intended to be a benchmark for the minimum level of competence a doctor needs to be able to practise. I'd say that counts for a lot.

I've thrown GPT 3.5 at a significant number of medical professionals, since I have several in my close and extended family. And even that antiquated model was sufficient to impress them, so I think the bar was already beaten before 4 showed up!

Just for the record, self made human is a real doctor, or a good enough liar to convincingly pass for one. His OP is basically him starting your test and being blown away by the initial results.