One post on X recently caught the attention of clinicians and technologists alike. Rohan Paul shared that large language models answering the same medical case flip-flopped 40% of the time. The clinical details didn’t change. The model did.
It struck a nerve because those of us who audit AI for real-world clinical use have seen the same pattern play out: tools that sound confident and coherent, that may even land on the right answer, yet reason their way there in ways that would get a junior doctor hauled into a supervision meeting.
AI can be brilliant. AI can also be a mirage. And in healthcare, the gap between the two isn’t academic - it’s clinical risk.
The Instability Epidemic
A recent Springer study tested six leading models on common inpatient decision points - including anticoagulation. The results were near coin-toss territory: on some high-stakes recommendations, the models split 50/50.
For those of us trained in medicine, the idea that a life-altering therapeutic decision could oscillate depending on nothing more than the random seed of an algorithm should set alarm bells ringing. Variance is not a minor inconvenience; it’s a threat to safe practice.
From X Flames to Courtrooms
This isn’t theoretical any more. In the US, class actions have been filed against UnitedHealth alleging that AI-driven coverage denials harmed patients. Closer to home, The Guardian has raised the question everyone is quietly wrestling with: who carries liability when an AI-influenced decision goes wrong?
What we’re seeing is the early stage of something that will define the next decade of digital health: responsibility without clarity. Clinicians worry about medico-legal exposure. Organisations worry about brand and operational risk. And patients, understandably, assume an AI recommendation is as stable as a blood test - when in fact it isn’t.
The Edgy Angle: AI as the Unreliable Narrator
What unsettles many teams I advise is not that models are sometimes wrong - humans are too - but how they’re wrong.
We assume ‘90% accuracy’ means robust reasoning. But beneath the surface, LLMs are closer to eloquent pattern-matchers than genuine thinkers. An arXiv analysis recently demonstrated that tiny prompt tweaks produced accuracy drops of up to 38%.
In practice, this means:
- A model can give the correct answer for the wrong reasons
- Two answers may sound equally polished while reflecting completely different chains of logic
- Minor wording changes can destabilise entire clinical pathways
When the model is effectively an unreliable narrator, the clinician must stay firmly in the driving seat.
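One way to make this instability visible is a small paraphrase audit: ask the same clinical question in several slightly different wordings, sample each a number of times, and measure how often the answers agree. The sketch below is illustrative only - `ask_model` is a hypothetical stand-in for a real LLM call, with built-in randomness to mimic the prompt sensitivity described above, not a real model.

```python
import random
import zlib
from collections import Counter

def ask_model(prompt: str, seed: int) -> str:
    """Hypothetical stand-in for an LLM call. Deterministic per
    (prompt, seed), but deliberately flips its answer on some runs
    to mimic prompt sensitivity."""
    rng = random.Random(zlib.crc32(prompt.encode()) ^ seed)
    return "anticoagulate" if rng.random() > 0.3 else "hold anticoagulation"

# The same clinical question, worded three slightly different ways.
paraphrases = [
    "72-year-old with AF, CHA2DS2-VASc 4: start anticoagulation?",
    "Should we anticoagulate a 72yo AF patient, CHA2DS2-VASc score 4?",
    "AF, age 72, CHA2DS2-VASc of 4 - is anticoagulation indicated?",
]

# Sample each wording 20 times and tally the recommendations.
tally = Counter(ask_model(p, seed) for p in paraphrases for seed in range(20))

# Agreement rate: the share of runs that gave the modal answer.
# Anything well below 100% on a question like this is a red flag.
agreement = max(tally.values()) / sum(tally.values())
print(tally, f"agreement={agreement:.0%}")
```

In a real audit the paraphrases would be written by clinicians and the calls made against your actual deployed model, but the principle is the same: measure the variance instead of assuming it away.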
Safeguarding Your Deployment - What Actually Works
This is where implementation matters. Within the AI strategy work I do with clinics, we’ve developed a set of stabilisation methods that dramatically reduce variance and reveal where models are drifting.
What helps:
- Multi-model sampling: Compare outputs across more than one model to identify consensus versus hallucination
- Structured re-prompting: Require the model to justify steps, not just the final answer
- Version locking: Ensure you’re not unintentionally switching between model updates with different behaviours
- Variance auditing: Track instability over time so shifts don’t go unnoticed
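The first of those methods - multi-model sampling - can be sketched in a few lines: query several models on the same case, tally the votes, and flag the case for human review when agreement falls below a threshold. Everything here is an assumption for illustration: the three `model_*` functions are hypothetical stand-ins for pinned, version-locked endpoints, and the 0.66 threshold is an example policy, not a clinical standard.

```python
from collections import Counter

# Hypothetical stand-ins for three separate model endpoints.
# In a real deployment these would be pinned, versioned API calls
# (version locking), so behaviour doesn't shift under your feet.
def model_a(case: str) -> str: return "anticoagulate"
def model_b(case: str) -> str: return "anticoagulate"
def model_c(case: str) -> str: return "hold anticoagulation"

def consensus_check(case: str, models, threshold: float = 0.66) -> dict:
    """Multi-model sampling: query each model, tally the answers,
    and flag the case for human review when agreement is below the
    threshold (an assumed policy value, not a clinical standard)."""
    votes = Counter(m(case) for m in models)
    answer, count = votes.most_common(1)[0]
    agreement = count / sum(votes.values())
    return {"answer": answer, "agreement": agreement,
            "needs_review": agreement < threshold}

result = consensus_check("72yo, AF, CHA2DS2-VASc 4",
                         [model_a, model_b, model_c])
print(result)
```

Logging each `result` over time gives you the variance audit for free: a drift in agreement rates is an early warning that a model update has changed behaviour before any clinician notices.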
A Final Thought
Reliable AI isn’t a luxury. It’s survival for organisations hoping to use these tools safely and credibly.