Medical AI’s Dirty Secret: Why Your LLM Can Lie - and What That Means for Patient Safety

Large language models answering the same medical case flip-flopped 40% of the time. Here’s what that means for your practice.


One post on X recently caught the attention of clinicians and technologists alike. Rohan Paul shared that large language models answering the same medical case flip-flopped 40% of the time. The clinical details didn’t change. The model did.

It struck a nerve because those of us who audit AI for real-world clinical use have seen the same pattern play out: tools that sound confident and coherent, and even arrive at the right answer once… yet reason their way there in ways that would get a junior doctor hauled into a supervision meeting.

AI can be brilliant. AI can also be a mirage. And in healthcare, the gap between the two isn’t academic - it’s clinical risk.

The Instability Epidemic

A recent Springer study tested six leading models on common inpatient decision points, including anticoagulation. The results were coin-toss territory: on some high-stakes recommendations, the models split 50/50.

For those of us trained in medicine, the idea that a life-altering therapeutic decision could oscillate depending on nothing more than the random seed of an algorithm should set alarm bells ringing. Variance is not a minor inconvenience; it’s a threat to safe practice.
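
If you want to see the instability for yourself, the test is simple: put the identical case to a model repeatedly and count how often the answer changes. Here is a minimal Python sketch of that check - `ask_model` is a hypothetical stand-in for whatever API client you actually use, and `mock_model` exists only so the example runs as-is:

```python
import random
from collections import Counter

def measure_flip_rate(ask_model, case_text, n_runs=20):
    """Put the same clinical case to the model n_runs times and tally the answers."""
    answers = Counter(ask_model(case_text) for _ in range(n_runs))
    top_count = answers.most_common(1)[0][1]
    return {
        "answers": dict(answers),
        "agreement": top_count / n_runs,      # 1.0 = perfectly stable
        "flip_rate": 1 - top_count / n_runs,  # share of runs disagreeing with the majority
    }

# Toy stand-in so the sketch runs without an API key. In practice,
# ask_model would wrap your model client with fixed decoding settings.
def mock_model(prompt):
    return random.choice(["anticoagulate", "anticoagulate", "hold anticoagulation"])

print(measure_flip_rate(mock_model, "74-year-old, AF, recent GI bleed: restart anticoagulation?"))
```

A flip rate anywhere near the figures above should be grounds to pause a deployment, not a footnote in a validation report.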

From X Flames to Courtrooms

This isn’t theoretical any more. In the US, class actions have been filed against UnitedHealth alleging that AI-driven coverage denials harmed patients. Closer to home, The Guardian has raised the question everyone is quietly wrestling with: who carries liability when an AI-influenced decision goes wrong?

What we’re seeing is the early stage of something that will define the next decade of digital health: responsibility without clarity. Clinicians worry about medico-legal exposure. Organisations worry about brand and operational risk. And patients, understandably, assume an AI recommendation is as stable as a blood test - when in fact it isn’t.

The Edgy Angle: AI as the Unreliable Narrator

What unsettles many teams I advise is not that models are sometimes wrong - humans are too - but how they’re wrong.

We assume ‘90% accuracy’ means robust reasoning. But beneath the surface, LLMs are closer to eloquent pattern-matchers than genuine thinkers. An arXiv analysis recently demonstrated that tiny prompt tweaks produced accuracy drops of up to 38%.
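
That sensitivity is straightforward to probe before deployment. A sketch, reusing the hypothetical `ask_model` interface from the earlier example: reword the same case in clinically equivalent ways and check whether the recommendation survives.

```python
def perturbation_check(ask_model, variants):
    """True only if every clinically equivalent rewording gets the same answer."""
    return len({ask_model(v) for v in variants}) == 1

# Three wordings of one case - same facts, different phrasing.
variants = [
    "74-year-old with AF and a recent GI bleed. Restart anticoagulation?",
    "Patient, 74, atrial fibrillation, GI bleed two weeks ago - anticoagulate?",
    "Should we resume anticoagulation in a 74-year-old with AF after a recent GI bleed?",
]

# perturbation_check(mock_model, variants) -> False here would be the red flag
```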

In practice, this means:

- A fluent, confident answer is not evidence of sound clinical reasoning.
- The same case, put to the model twice, can yield two different recommendations.
- A small change in how the case is worded can flip the output entirely.

When the model is effectively an unreliable narrator, the clinician must stay firmly in the driving seat.

Safeguarding Your Deployment - What Actually Works

This is where implementation matters. In the AI strategy work I do with clinics, we’ve developed a set of stabilisation methods that dramatically reduce variance and reveal where models are drifting.

What helps:

- Pinning down decoding settings (fixed temperature and, where the API supports it, a fixed seed) so runs are comparable.
- Running the same case several times and only accepting a majority answer - self-consistency voting, sketched below.
- Perturbation testing: rewording cases in clinically equivalent ways and flagging any answer that doesn’t survive.
- Logging every prompt, answer and disagreement, so drift shows up in audit rather than in clinic.
- Keeping a named clinician as final sign-off on anything patient-facing.
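
As a concrete illustration of the voting step, here is a minimal sketch - again assuming a hypothetical `ask_model` callable rather than any particular vendor’s API - that only surfaces an answer when repeated runs agree, and escalates to a human when they don’t:

```python
from collections import Counter

def self_consistent_answer(ask_model, case_text, n_votes=5, min_agreement=0.8):
    """Majority-vote over repeated runs; refuse to answer when the model
    disagrees with itself too often."""
    votes = Counter(ask_model(case_text) for _ in range(n_votes))
    answer, count = votes.most_common(1)[0]
    if count / n_votes < min_agreement:
        # The model is arguing with itself - route the case to a human.
        return {"answer": None, "action": "escalate_to_clinician", "votes": dict(votes)}
    return {"answer": answer, "action": "present_for_clinician_review",
            "agreement": count / n_votes}
```

The point isn’t the dozen lines of Python; it’s that disagreement becomes a visible, loggable event instead of silent noise.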

A Final Thought

Reliable AI isn’t a luxury. It’s survival for organisations hoping to use these tools safely and credibly.

#healthtech #AIinHealthcare #ClinicalSafety #DigitalHealthUK #AIMedicine #LLMStability #MedTechUK

Ready to Grow?

Book a Discovery Call to see how AI-powered systems can help your practice grow faster, run leaner, and maximise impact.

Book a Discovery Call →