Introduction: the "stochastic parrot" problem
Let's be honest: tools like ChatGPT are incredible. For drafting a discharge summary, polishing a referral letter, or writing a difficult email, they are a superpower. But when you ask them for a clinical decision—like a drug dose or a guideline threshold—they become dangerous.
This isn't a bug; it's a feature of how they are built. General Large Language Models (LLMs) are "probabilistic." They predict the next most likely word in a sentence based on training data drawn from the entire internet. They don't know facts; they reproduce patterns. In AI research this is sometimes called being a "stochastic parrot": fluently repeating convincing-sounding sentences with no model of whether they are true.
In medicine, we don't need a probable answer. We need a deterministic one. We need to know exactly what the guideline says, not what the internet thinks it probably says.
The case study: hypertension (NICE NG136)
Let's put this to the test with a common clinical scenario.
The Prompt: "What is the blood pressure target for an 85-year-old with type 2 diabetes?"
The General LLM's Response (The Hallucination): A generic model might confidently tell you the target is <140/90 mmHg. It might even cite a source like "JNC 8" or an "ADA guideline."
- Why it's wrong: It has conflated US guidelines (JNC 8) with UK practice, or it has missed the age-specific nuance.
- The truth (NICE NG136): For adults over 80, the target is usually <150/90 mmHg (clinic) or <145/85 mmHg (HBPM/ABPM).
- The consequence: Aiming for <140/90 in an 85-year-old increases the risk of falls, AKI, and postural hypotension. A "95% correct" answer here isn't just wrong; it's potentially harmful.
The solution: what is "grounded" AI?
To fix this, we use a technique called Retrieval-Augmented Generation (RAG). Instead of answering from its training memory, a RAG system first searches a curated library of trusted documents, then composes its answer only from the passages it just retrieved. This fundamentally changes how the AI behaves.
- The Analogy: ChatGPT is like a medical student guessing the answer based on every book they've ever read, some of which are 10 years old or from the wrong country. iatroX is like a consultant walking over to the bookshelf, opening the specific page of the current NICE guideline, and reading the answer to you.
This is "grounded" AI. It doesn't generate an answer from its own memory; it retrieves the answer from a trusted document.
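To make the retrieve-then-cite pattern concrete, here is a deliberately minimal sketch in Python. The guideline snippets, word-overlap scoring, and function names are all illustrative assumptions, not iatroX's actual implementation; production systems use vector embeddings for retrieval and an LLM to phrase the final answer. The point is the shape of the pipeline: look up a trusted passage first, and attach its source to the reply.

```python
import re

# Toy "bookshelf": a tiny corpus of guideline snippets with their provenance.
# (Illustrative text paraphrasing NICE NG136 targets; not a substitute for the guideline.)
GUIDELINE_SNIPPETS = [
    {
        "source": "NICE NG136",
        "text": "For adults aged 80 and over, aim for a clinic blood pressure below 150/90 mmHg.",
    },
    {
        "source": "NICE NG136",
        "text": "For adults under 80, aim for a clinic blood pressure below 140/90 mmHg.",
    },
]


def tokenize(text: str) -> set[str]:
    """Lowercase word tokens, punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str) -> dict:
    """Return the snippet sharing the most words with the query (toy retrieval)."""
    query_words = tokenize(query)
    return max(GUIDELINE_SNIPPETS, key=lambda s: len(query_words & tokenize(s["text"])))


def answer(query: str) -> str:
    """Ground the reply in the retrieved snippet and always cite its source."""
    snippet = retrieve(query)
    return f'{snippet["text"]} [Source: {snippet["source"]}]'


print(answer("blood pressure target for a patient aged over 80"))
```

Note the design choice: the answer is assembled *from* the retrieved text and carries its citation with it, so the reader can verify provenance, which is exactly the property the next section argues for.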
Citation is king
The litmus test for any clinical AI is simple: provenance.
- If an AI gives you a number but cannot show you the link to the official document where it found it, do not trust it.
- iatroX is built on a citation-first principle. Every clinical answer comes with a direct link to the NICE guideline, CKS summary, or BNF monograph. This allows you to verify the information instantly, moving from "blind trust" to "trust and verify."
Clinical takeaway
AI is a tool, and like any tool, you must use the right one for the job.
- Use General LLMs (ChatGPT, Claude): For creative, drafting, and administrative tasks where "voice" matters more than "fact."
- Use Grounded AI (iatroX): For decision support, dosing, and guidelines where "fact" is the only thing that matters.
