When a large language model tells you with perfect confidence that the dose of methotrexate for rheumatoid arthritis is 25mg daily, and the correct answer is 7.5-25mg weekly, the fluency of the response makes it harder to catch — not easier. The AI does not hesitate. It does not qualify. It does not say "I am unsure." It generates a plausible-sounding answer that happens to be dangerously wrong, and it does so with the same tone and formatting it uses when it is right.
This is hallucination: the generation of factually incorrect, logically inconsistent, or fabricated information presented as authoritative output. In general AI use, hallucination is annoying. In medicine, it can harm or kill patients.
This article documents the published evidence on medical AI hallucination, explains the architectural reasons it happens, and provides a practical framework for clinicians to protect themselves.
The Published Evidence
The research base on medical hallucination is now substantial enough to draw firm conclusions.
A large-scale evaluation published as a preprint in early 2025 assessed eleven foundation models — seven general-purpose and four medical-specialised — across seven medical hallucination tasks. The study found that even models developed specifically for medical use remained vulnerable to domain-specific hallucinations, with errors arising from reasoning failures rather than knowledge gaps. Strikingly, general-purpose models achieved higher proportions of hallucination-free responses than medical-specialised models, suggesting that narrow fine-tuning may actually increase certain hallucination risks.
An accompanying clinician survey found that over 90% of respondents had encountered medical hallucinations from AI, and approximately 85% considered them capable of causing patient harm.
A study published in Communications Medicine (a Nature Portfolio journal) tested six leading LLMs with 300 doctor-designed clinical vignettes, each containing a single planted error — a fake lab value, a fabricated physical sign, or a non-existent disease. The models repeated or elaborated on the planted error in up to 83% of cases. A simple mitigation prompt halved the error rate but did not eliminate it. This demonstrates that LLMs are not just generating errors spontaneously — they are amplifying errors fed to them, which has direct implications for clinical workflows where AI processes existing (potentially erroneous) documentation.
A JMIR Medical Informatics study evaluating reference hallucination across multiple AI chatbots found that ChatGPT and Bing exhibited critical levels of reference fabrication, while retrieval-augmented tools showed negligible hallucination rates. Across 500 requested references, more than 60% showed relevance errors, nearly half gave incorrect publication dates, and nearly half listed incorrect DOIs.
A framework published in npj Digital Medicine for assessing hallucination in clinical note generation observed a 1.47% hallucination rate and a 3.45% omission rate across nearly 13,000 clinician-annotated sentences. Those rates are lower than those reported for conversational chatbots, but roughly one hallucinated sentence in every 70 is still clinically significant when notes are generated at scale.
Types of Medical Hallucination
Medical hallucinations are not random. They follow characteristic patterns that clinicians should learn to recognise.
Fabricated drug information. AI models generate plausible-sounding drug dosages, interactions, or contraindications that do not exist. The output looks like a BNF entry but is not grounded in any pharmacopoeia. This is perhaps the most immediately dangerous category because prescribing errors can directly harm patients.
Invented references. When asked to cite sources, general-purpose LLMs frequently generate fictional journal articles with plausible-looking authors, titles, volumes, and page numbers. The references look real but do not exist. A clinician who trusts a cited reference without clicking through is trusting fiction.
Jurisdictional confusion. A UK clinician who asks ChatGPT about hypertension management may receive a response based on US JNC guidelines rather than NICE NG136. The model does not reliably distinguish between healthcare systems, and it does not flag when it is giving guidance from the wrong jurisdiction.
Reasoning errors in clinical scenarios. The model follows a plausible clinical reasoning chain but makes a logical error — misapplying a diagnostic criterion, conflating two conditions with overlapping features, or reaching a management conclusion that does not follow from the clinical data presented. These errors are particularly dangerous because they look like good clinical reasoning.
Omissions. The model generates a management plan that is partially correct but omits a critical step — failing to mention safety-netting, missing a drug interaction, or not flagging a red flag symptom. Omissions are harder to detect than outright fabrications because what is present is correct; the danger is in what is absent.
Why Hallucination Happens: The Architecture
Understanding why LLMs hallucinate helps clinicians calibrate their trust.
General-purpose LLMs like ChatGPT, Claude, and Gemini are generative models. They produce text by predicting the most likely next token (word or word-piece) in a sequence, based on statistical patterns learned during training. They are not retrieving information from a database. They are not checking their output against a source. They are generating plausible language — and plausible language is not the same as accurate information.
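To make that concrete, here is a deliberately toy sketch in Python. Nothing in it belongs to any real model: the hand-written probability table, the two-token context window, and the generate function are all invented for illustration. What it shows accurately is the shape of the process, repeated next-token prediction with nothing in the loop that ever consults a source.

```python
# Toy illustration only: generation as repeated next-token prediction.
# A real LLM learns billions of statistical patterns rather than using a
# hand-written table, but the loop is conceptually the same.

next_token_probs = {
    ("methotrexate", "dose"): {"is": 0.9, "depends": 0.1},
    ("dose", "is"): {"25mg": 0.6, "7.5mg": 0.4},     # plausibility, not pharmacology
    ("is", "25mg"): {"daily": 0.7, "weekly": 0.3},   # the fluent-but-wrong path wins
}

def generate(prompt, steps=3):
    tokens = prompt.split()
    for _ in range(steps):
        context = tuple(tokens[-2:])              # last two tokens as "context"
        options = next_token_probs.get(context)
        if not options:
            break
        # Pick the statistically most likely continuation; nothing here
        # checks a formulary, a guideline, or any other source.
        tokens.append(max(options, key=options.get))
    return " ".join(tokens)

print(generate("methotrexate dose"))   # -> "methotrexate dose is 25mg daily"
```

The confidently wrong dosing statement falls straight out of the highest-probability path; at no point does the loop know or care what the BNF says.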
This is an inherent property of the architecture, not a bug that can be fixed with more training data. Larger models hallucinate differently (sometimes more subtly), but they still hallucinate. Medical fine-tuning does not eliminate the problem — the evidence suggests it may make certain types of hallucination worse by overfitting to medical-sounding language patterns.
Retrieval-Augmented Generation (RAG) addresses this by anchoring the model's output to a curated knowledge base. Instead of generating answers from statistical patterns alone, a RAG system first retrieves relevant documents from a verified corpus, then uses the language model to synthesise an answer from those specific sources. The output is grounded in real evidence, and the source can be cited and verified.
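A compressed sketch of that retrieve-then-synthesise pattern may help. Everything in it is invented for illustration: the two corpus snippets, the keyword-overlap scoring, and the retrieve and answer functions. A production system would use semantic search over full guideline text and a language model for the synthesis step, but the grounding logic is the same.

```python
# Toy sketch of the RAG pattern (illustrative only; real systems use semantic
# embeddings and an LLM for synthesis rather than keyword counting).

corpus = [
    {"source": "NICE NG136 (illustrative snippet)",
     "text": "Offer an ACE inhibitor or an ARB to adults with hypertension who..."},
    {"source": "BNF: methotrexate (illustrative snippet)",
     "text": "Methotrexate for rheumatoid arthritis: 7.5 mg once weekly, adjusted..."},
]

def retrieve(question, k=1):
    """Score each passage by words shared with the question; return the top k matches."""
    words = set(question.lower().split())
    scored = [(len(words & set(doc["text"].lower().split())), doc) for doc in corpus]
    scored = [pair for pair in scored if pair[0] > 0]        # drop non-matches
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]

def answer(question):
    """Build an answer only from retrieved text, and always cite it."""
    docs = retrieve(question)
    if not docs:
        return "No matching guidance found."                 # abstain rather than invent
    top = docs[0]
    return f"{top['text']} [Source: {top['source']}]"

print(answer("What is the methotrexate dose for rheumatoid arthritis?"))
```

The property that matters is visible in the last line of answer: the output carries its provenance with it, and when nothing relevant is retrieved the system abstains instead of inventing.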
This is the architectural difference that matters for clinical safety. iatroX uses a RAG-based approach over a curated corpus of NICE, CKS, and SIGN guidance, the BNF, and peer-reviewed research. When Ask iatroX answers a clinical question, the answer is synthesised from retrieved guideline content, not generated from statistical patterns. Every answer includes visible citations linking to the primary source. You can verify in one click.
RAG does not eliminate all risk — the synthesis step still uses a language model, and edge cases exist. But it fundamentally changes the hallucination profile from "plausible fabrication" to "sourced synthesis with verifiable provenance." That difference is clinically significant.
How to Protect Yourself: A Practical Framework
Rule 1: Know your tool's architecture
Before trusting any AI output for clinical use, understand how it generates answers. Is it a general-purpose LLM generating text from training data? Or is it a RAG-grounded system retrieving from curated, verified sources? The distinction is the most important factor in hallucination risk.
Rule 2: Follow every citation
If the tool provides a citation, click it. Verify that the source exists, that it says what the AI claims, and that it is current. If the tool does not provide citations, treat the output with significantly higher scepticism. iatroX provides inline citations with every answer — following them is your verification step.
Rule 3: Never trust drug information from an ungrounded source
Drug dosages, interactions, and contraindications should always be checked against the BNF, your clinical system's drug database, or a tool specifically grounded in pharmacopoeial data. General-purpose AI tools are not drug references. They should not be used as such.
Rule 4: Verify jurisdiction
If you practise in the UK, confirm that the guidance you are receiving is UK-specific. If an AI response references guidelines you do not recognise, or if the management approach sounds unfamiliar, it may be generating from a different healthcare system's evidence base.
Rule 5: Watch for confident wrongness
The most dangerous hallucinations are the ones that sound right. Develop the habit of scepticism toward confident, unqualified statements. Real clinical guidance is usually nuanced, conditional, and qualified. AI output that sounds too clean and too certain may be too good to be true.
Rule 6: Use AI to support reasoning, not replace it
The safest use of clinical AI is to check, clarify, and extend your own clinical reasoning — not to generate it from scratch. See the patient, form your own impression, identify your uncertainty, then use a tool like iatroX to verify or refine. This sequence keeps your reasoning primary and the AI supplementary.
Rule 7: Report hallucinations when you find them
Most clinical AI tools have feedback mechanisms. When you identify a hallucination — whether in iatroX, UpToDate ExpertAI, AMBOSS, or any other tool — report it. This improves the tool for everyone and contributes to the safety evidence base.
The Right Response Is Not Fear — It Is Literacy
AI hallucination is real, documented, and clinically dangerous when unrecognised. But the right response is not to avoid AI. It is to understand which tools hallucinate less, why, and how to verify their output.
RAG-grounded tools like iatroX represent the architectural approach most aligned with clinical safety: curated sources, citation-first design, verifiable provenance. General-purpose LLMs represent the opposite: broad capability with no source grounding.
The evidence is clear: use purpose-built clinical AI for clinical questions, verify every output, and never let the fluency of a response substitute for the accuracy of its content.
Conclusion
AI hallucination in medicine is not a theoretical concern. It is a documented, measured, and clinically significant phenomenon. Published research shows that leading LLMs fabricate drug information, invent references, amplify planted errors, and produce reasoning failures that are difficult to distinguish from competent clinical thinking.
The mitigation is architectural (RAG over curated sources), procedural (verify every citation), and cultural (treat AI as a support tool, not an authority). iatroX is designed with all three in mind — grounded in verified UK guidelines, citation-first by default, and positioned as a tool that supports your judgement rather than replacing it.
The doctors who will use AI most safely are not the ones who avoid it. They are the ones who understand how it works, where it fails, and how to verify what it tells them. That understanding starts with knowing what hallucination is — and never forgetting that the most confident AI response you have ever seen might also be the most wrong.
