One post on X recently caught the attention of clinicians and technologists alike. Rohan Paul shared that large language models answering the same medical case flip-flopped 40% of the time. The clinical details didn’t change. The model did.
It struck a nerve because those of us who audit AI for real-world clinical use have seen the same pattern play out: tools that sound confident and coherent, that may even land on the right answer, yet reason their way there in ways that would get a junior doctor hauled into a supervision meeting.
AI can be brilliant. AI can also be a mirage. And in healthcare, the gap between the two isn’t academic - it’s clinical risk.
The Instability Epidemic
A recent Springer study tested six leading models on common inpatient decision points - including anticoagulation. The results were near coin-toss territory: on some high-stakes recommendations, the models split 50/50.
For those of us trained in medicine, the idea that a life-altering therapeutic decision could oscillate depending on nothing more than the random seed of an algorithm should set alarm bells ringing. Variance is not a minor inconvenience; it’s a threat to safe practice.
From X Flames to Courtrooms
This isn’t theoretical any more. In the US, class actions have been filed against UnitedHealth alleging that AI-driven coverage denials harmed patients. Closer to home, The Guardian has raised the question everyone is quietly wrestling with: who carries liability when an AI-influenced decision goes wrong?
What we’re seeing is the early stage of something that will define the next decade of digital health: responsibility without clarity. Clinicians worry about medico-legal exposure. Organisations worry about brand and operational risk. And patients, understandably, assume an AI recommendation is as stable as a blood test - when in fact it isn’t.
The Edgy Angle: AI as the Unreliable Narrator
What unsettles many teams I advise is not that models are sometimes wrong - humans are too - but how they’re wrong.
We assume ‘90% accuracy’ means robust reasoning. But beneath the surface, LLMs are closer to eloquent pattern-matchers than genuine thinkers. An arXiv analysis recently demonstrated that tiny prompt tweaks produced accuracy drops of up to 38%.
In practice, this means:
- A model can give the correct answer for the wrong reasons
- Two answers may sound equally polished while reflecting completely different chains of logic
- Minor wording changes can destabilise entire clinical pathways
When the model is effectively an unreliable narrator, the clinician must stay firmly in the driving seat.
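One way to make this instability visible is a small paraphrase audit: ask the same clinical question in several slightly different wordings, sample each a number of times, and measure how often the answers agree. The sketch below is illustrative only - `ask_model` is a hypothetical stand-in for a real LLM call, with built-in randomness to mimic the prompt sensitivity described above, not a real model.

```python
import random
import zlib
from collections import Counter

def ask_model(prompt: str, seed: int) -> str:
    """Hypothetical stand-in for an LLM call. Deterministic per
    (prompt, seed), but deliberately flips its answer on some runs
    to mimic prompt sensitivity."""
    rng = random.Random(zlib.crc32(prompt.encode()) ^ seed)
    return "anticoagulate" if rng.random() > 0.3 else "hold anticoagulation"

# The same clinical question, worded three slightly different ways.
paraphrases = [
    "72-year-old with AF, CHA2DS2-VASc 4: start anticoagulation?",
    "Should we anticoagulate a 72yo AF patient, CHA2DS2-VASc score 4?",
    "AF, age 72, CHA2DS2-VASc of 4 - is anticoagulation indicated?",
]

# Sample each wording 20 times and tally the recommendations.
tally = Counter(ask_model(p, seed) for p in paraphrases for seed in range(20))

# Agreement rate: the share of runs that gave the modal answer.
# Anything well below 100% on a question like this is a red flag.
agreement = max(tally.values()) / sum(tally.values())
print(tally, f"agreement={agreement:.0%}")
```

In a real audit the paraphrases would be written by clinicians and the calls made against your actual deployed model, but the principle is the same: measure the variance instead of assuming it away.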
Safeguarding Your Deployment - What Actually Works
This is where implementation matters. Within the AI strategy work I do with clinics, we’ve developed a set of stabilisation methods that dramatically reduce variance and reveal where models are drifting.
What helps:
- Multi-model sampling: Compare outputs across more than one model to identify consensus versus hallucination
- Structured re-prompting: Require the model to justify steps, not just the final answer
- Version locking: Ensure you’re not unintentionally switching between model updates with different behaviours
- Variance auditing: Track instability over time so shifts don’t go unnoticed
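The first of those methods - multi-model sampling - can be sketched in a few lines: query several models on the same case, tally the votes, and flag the case for human review when agreement falls below a threshold. Everything here is an assumption for illustration: the three `model_*` functions are hypothetical stand-ins for pinned, version-locked endpoints, and the 0.66 threshold is an example policy, not a clinical standard.

```python
from collections import Counter

# Hypothetical stand-ins for three separate model endpoints.
# In a real deployment these would be pinned, versioned API calls
# (version locking), so behaviour doesn't shift under your feet.
def model_a(case: str) -> str: return "anticoagulate"
def model_b(case: str) -> str: return "anticoagulate"
def model_c(case: str) -> str: return "hold anticoagulation"

def consensus_check(case: str, models, threshold: float = 0.66) -> dict:
    """Multi-model sampling: query each model, tally the answers,
    and flag the case for human review when agreement is below the
    threshold (an assumed policy value, not a clinical standard)."""
    votes = Counter(m(case) for m in models)
    answer, count = votes.most_common(1)[0]
    agreement = count / sum(votes.values())
    return {"answer": answer, "agreement": agreement,
            "needs_review": agreement < threshold}

result = consensus_check("72yo, AF, CHA2DS2-VASc 4",
                         [model_a, model_b, model_c])
print(result)
```

Logging each `result` over time gives you the variance audit for free: a drift in agreement rates is an early warning that a model update has changed behaviour before any clinician notices.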
A Final Thought
Reliable AI isn’t a luxury. It’s survival for organisations hoping to use these tools safely and credibly.