Every week, a new AI tool launches claiming to "revolutionise healthcare." Today, it is ChatGPT Health. Tomorrow, it will be something else.
As a clinician, you don't need to be a data scientist to evaluate these tools, but you do need a functional "BS-detector." When a vendor says their AI "passed the USMLE" or "was built with 200 doctors," what does that actually mean for the patient sitting in front of you?
This guide moves beyond the marketing hype to teach you how to evaluate medical AI claims like a pro.
Why ‘we worked with doctors’ isn’t a safety guarantee
OpenAI’s launch of ChatGPT Health comes with a bold claim: the system was developed in collaboration with over 260 physicians across 60 countries.
While impressive, "collaboration" is not the same as "clinical validation."
- Collaboration often means physicians helped design the rubrics (the marking criteria) or provided feedback on tone.
- Validation means the system was tested in a real clinical environment and shown not to harm people.
The Lesson: A car designed by Formula 1 drivers can still crash if you put it on an icy road with no brakes. "Doctor-involvement" is a design input, not a safety output.
Five questions to ask any health AI vendor
Stop looking at the "accuracy" percentage. Ask these five functional questions instead.
- What’s the intended use? (Is it for "wellness coaching" or "diagnostic support"? If they blur this line, run.)
- What data does it see? (Can it actually see the patient's renal function, or is it guessing based on the text prompt?)
- How does it cite / trace claims? (Does it hallucinate a link, or retrieve a real document?)
- What happens in edge cases? (If a patient mentions "chest pain" in a "wellness" chat, does it escalate or ignore? A minimal escalation sketch follows this list.)
- How is it monitored post-launch? (Is there a "Yellow Card" system for AI errors?)
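To make the "edge case" question concrete, here is a minimal sketch of what escalation could look like in code. It is purely illustrative: the red-flag list, function name and wording are hypothetical, and real triage logic is far more sophisticated than keyword matching.

```python
# Hypothetical sketch: what "escalation" for an edge case could look like.
# The red-flag list and function name are illustrative, not any vendor's real logic;
# production triage uses far richer clinical rules than keyword matching.

RED_FLAGS = ["chest pain", "can't breathe", "slurred speech", "suicidal"]

def handle_wellness_message(message: str) -> str:
    """Return a reply, escalating instead of 'coaching' when a red flag appears."""
    lowered = message.lower()
    if any(flag in lowered for flag in RED_FLAGS):
        # Escalate: stop the wellness chat and signpost to urgent care.
        return (
            "This may need urgent medical attention. "
            "Please call 999 if symptoms are severe, or NHS 111 for advice, "
            "rather than continuing this chat."
        )
    # Otherwise, stay within the stated "wellness coaching" scope.
    return "Thanks for sharing. Here is some general wellbeing information..."

print(handle_wellness_message("I've had chest pain since this morning"))
```

The point to probe with the vendor is whether any such guard exists at all, and what it hands over to (NHS 111, 999, or a human clinician).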
Benchmarks: what they measure vs what they miss
You will hear a lot about HealthBench, OpenAI’s new evaluation framework. It uses 5,000+ multi-turn conversations graded by physician-written rubrics.
- What it measures: "Chat Etiquette." Did the AI ask the right follow-up questions? Did it avoid rude language? Did it refuse to answer dangerous queries?
- What it misses: Longitudinal Context. Real medicine isn't a single chat; it's a 10-year history. A model can score 100% on a benchmark question about "headaches" but fail to recognise that this patient has a history of subarachnoid haemorrhage.
The Golden Rule: "Test performance" (passing an exam) ≠ "Clinical safety" (managing risk).
Safety statements: how to read them
Every health AI comes with a disclaimer: "Not intended for diagnosis or treatment."
Do not treat this as a legal footer; treat it as a Scope Boundary.
- It means the tool has zero failure-handling for diagnostic error.
- If it misses a cancer diagnosis, the system is not "broken" according to its terms; it is functioning as intended because it wasn't supposed to be diagnosing.
The real-world pattern in primary care
The "lab-tested" safety rarely survives contact with the real world.
- The Lab: A distinct, clear query ("I have a headache").
- The Reality: A drunk patient at 2 AM typing a typo-riddled query into an app because they can't get an appointment.
We know that patients use these tools mostly "out-of-hours" when anxiety is high and access is low. This is exactly when "hallucination" risk transforms into "harm" risk.
Your practical scoring rubric
When you assess a tool for your practice, or advise a patient who is using one, score each of the five dimensions below from 0 to 5. A worked scoring sketch follows the table.
| Dimension | Score (0–5) | What 5/5 looks like |
|---|---|---|
| Clarity | | Explicitly states "I am an AI, not a doctor" in every turn. |
| Traceability | | Every medical claim has a clickable link to a recognised guideline (NICE/CKS). |
| UK Relevance | | Uses UK units (mmol/L), UK drug names (paracetamol), and UK pathways (NHS 111). |
| Failure Modes | | Transparently admits "I don't know" rather than guessing. |
| Data Handling | | Zero training on patient data; UK-hosted servers (or a clear BAA/DPA). |
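To show how the rubric works in practice, here is a minimal scoring sketch. The dimension names mirror the table; the example scores are invented for illustration, not a measurement of any real product.

```python
# Minimal sketch of tallying the rubric above. The example scores are invented;
# score each dimension 0-5 yourself for the tool you are assessing.

RUBRIC = ["Clarity", "Traceability", "UK Relevance", "Failure Modes", "Data Handling"]

# Hypothetical scores for a generic consumer chatbot.
scores = {
    "Clarity": 4,        # usually states it is an AI
    "Traceability": 1,   # links are often fabricated or dead
    "UK Relevance": 2,   # mixes mg/dL with mmol/L, defaults to US pathways
    "Failure Modes": 2,  # guesses rather than saying "I don't know"
    "Data Handling": 1,  # unclear training and data-residency terms
}

total = sum(scores[d] for d in RUBRIC)
print(f"Total: {total}/{len(RUBRIC) * 5}")
for dimension in RUBRIC:
    flag = "RED FLAG" if scores[dimension] <= 1 else "ok"
    print(f"{dimension:>14}: {scores[dimension]}/5  {flag}")
```

The sketch flags any dimension scored 0–1, because a single catastrophic weakness (for example, no traceability at all) is not offset by a high total.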
Where iatroX fits
Most consumer AIs fail the Traceability and UK Relevance tests. They are built on the "whole internet" (mostly US data) and prioritise conversational flow over factual rigour.
iatroX is built specifically for the Traceability dimension.
- We don't want to chat; we want to point.
- Our engine retrieves the specific UK guideline first, then answers.
- If we can't find a source, we don't generate an answer. (A simplified sketch of this retrieve-first pattern follows below.)
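To illustrate the general "retrieve first, answer second" pattern described above, here is a minimal sketch. It is not iatroX's actual engine: the guideline store, lookup and wording are placeholders, and a real system would use a proper retrieval index rather than a dictionary.

```python
# Minimal sketch of a retrieve-then-answer guard (not iatroX's real engine).
# The guideline "store" and lookup are placeholders for a real retrieval index.

GUIDELINE_STORE = {
    "hypertension": "NICE NG136: Hypertension in adults - diagnosis and management",
}

def answer_with_source(question: str) -> str:
    """Only answer when a source document can be retrieved; otherwise refuse."""
    topic = question.lower()
    source = next((doc for key, doc in GUIDELINE_STORE.items() if key in topic), None)
    if source is None:
        # No retrievable source: refuse rather than generate an unsupported answer.
        return "No UK guideline found for this question - no answer generated."
    # The answer is grounded in, and cites, the retrieved document.
    return f"Based on {source}: <summary of the relevant recommendation>"

print(answer_with_source("How should stage 1 hypertension be managed?"))
print(answer_with_source("What's the best crystal healing routine?"))
```

The design choice that matters is the refusal branch: if retrieval fails, the system returns nothing rather than an unsourced answer.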
Use the "rubric" above to compare iatroX vs General AI. You will find that while generic tools win on "chat," we win on "source."
Summary
A health AI is only as safe as its failure modes are predictable. Benchmarks like HealthBench show capability in a vacuum, but they rarely prove safety in messy, real-world primary care. Use the five-dimension rubric above (Clarity, Traceability, UK Relevance, Failure Modes, Data Handling) to judge whether a tool is a toy or a clinical asset.
