AI tools in 2026 are marketed like consumer software, but they are used for clinical cognition. When a vendor says their tool is "95% accurate," what does that actually mean? 95% accurate on what? A multiple-choice exam? A curated dataset? Or a messy Friday afternoon clinic?
We need a mental model to sort the hype from the proof. Here is a practical "Evidence Ladder" for medical AI, illustrated by two current examples: the PRECISE RCT (OpenEvidence) and the Real-World Evaluation (iatroX).
Level 1: Bench + vignette performance (what it’s good for)
This is the baseline. Before a tool is released, it is tested on "Benchmarks"—static sets of medical questions (like the USMLE or MedQA).
- The Claim: "Passed the medical board exam."
- The Reality: This proves the AI has knowledge, but not competence. A medical student can pass an exam but still be unsafe on a ward.
- Vignette Studies: The next step up. Researchers feed the AI 50 written case summaries ("vignettes") and grade the answers.
- Limitations: These are "perfect world" cases. They lack the noise, irrelevant detail, and missing data of real practice.
Level 2: Comparative evaluation studies (where PRECISE fits)
This is where the PRECISE trial sits, and why it is significant. It moves beyond "did it pass?" to "is it better than the alternative?"
The PRECISE Trial (OpenEvidence vs GPT-4)
- Design: A randomised, blinded, parallel-assignment study (NCT07037940).
- The Test: Internal medicine residents and attendings solve de-identified clinical cases in a 90-minute session.
- The Intervention: Half use OpenEvidence (a Retrieval-Augmented Generation tool that reads medical literature); half use standard GPT-4.
- The Rigour: It isn't just self-reported; the answers are scored by blinded evaluators using a validated reasoning rubric.
Why this matters: It strips away the "novelty bias." Because both arms use an AI assistant and clinicians are randomised between them, the trial tells us whether the specific tool (OpenEvidence) actually improves reasoning compared with a generic LLM, not just whether AI helps in general.
Level 3: Real-world adoption + usability (where iatroX fits)
An RCT proves a tool can work. Real-world evaluation proves it does work in the chaos of a clinic.
The iatroX Case Study (Tytler, 2025)
- Design: A formative evaluation of over 19,000 UK clinicians using the tool in live practice.
- The Signal (sanity-checked in the sketch after this list):
  - Adoption: 86.2% of users reported the tool was "useful" for their specific clinical workflow.
  - Retention: 93% stated they would use it again.
- The Value: This tier of evidence measures Product-Market Fit. It tells you if the tool is compatible with 10-minute appointments, NHS firewalls, and actual clinical questions (which are often messier than trial cases).
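Those headline percentages are survey proportions, so a quick confidence-interval check shows how much statistical wiggle room they carry. A minimal sketch, assuming (purely for illustration) that both figures are drawn from the full cohort of roughly 19,000 respondents; the actual denominators in Tytler (2025) may be smaller.

```python
import math

def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a reported proportion."""
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return centre - margin, centre + margin

# Hypothetical denominators: assume both figures come from ~19,000 respondents.
for label, p in [("useful", 0.862), ("would use again", 0.93)]:
    lo, hi = wilson_interval(p, 19_000)
    print(f"{label}: {p:.1%} (95% CI {lo:.1%} to {hi:.1%})")
```

At that sample size the intervals are about half a percentage point either side, so the open question is not statistical precision but selection: who chose to answer, and whether they represent all users.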
Level 4: Patient outcomes + safety monitoring (the hardest tier)
This is the "Gold Standard" we eventually need, but rarely see.
- The Metric: Not "did the doctor get the right diagnosis?", but "did the patient get better faster?" or "were fewer unnecessary tests ordered?"
- The Challenge: Proving that a search tool caused a patient to recover faster is incredibly difficult due to the number of variables in healthcare.
- The Future: Expect regulators (like the MHRA and the FDA) to demand "post-market surveillance" data as a proxy for this: tracking safety events and "AI Yellow Cards" (a minimal sketch of such a record follows below).
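What might such a record look like? Here is a minimal sketch of a local AI safety-event entry in the spirit of an "AI Yellow Card"; every field name here is a hypothetical illustration, not any regulator's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AISafetyEvent:
    """Hypothetical minimal record for local AI post-market surveillance."""
    tool_name: str               # product and version in use
    event_type: str              # e.g. "wrong_answer", "missed_red_flag", "hallucinated_citation"
    clinical_context: str        # de-identified free-text description
    harm_occurred: bool          # did the error reach the patient?
    detected_by_clinician: bool  # was the failure observable before acting on it?
    reported_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# Example: logging a caught error (no patient harm).
event = AISafetyEvent(
    tool_name="example-tool v1.2",
    event_type="hallucinated_citation",
    clinical_context="Query about anticoagulation in CKD; cited guideline did not exist.",
    harm_occurred=False,
    detected_by_clinician=True,
)
print(event)
```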
A clinician’s “2-minute procurement filter”
When a new tool lands in your inbox, run it through this filter (a rough code sketch follows the list).
The AI Procurement Box
- What claim is being made? Is it "workflow" (write my notes), "diagnosis" (tell me what it is), or "treatment" (prescribe this)? The higher the risk, the higher the evidence tier needed.
- What tier of evidence supports it? Don't accept Level 1 evidence (exam benchmarks) for a claim that needs Level 3/4 evidence (such as diagnosis).
- Is the evidence in your population? A US Internal Medicine trial (PRECISE) is good, but does it apply to a UK GP managing multimorbidity?
- Are failures observable? If the AI makes a mistake, will you see it? (High observability = Lower risk).
- What’s the governance plan? If it fails, who is responsible?
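For those who prefer their checklists executable, here is the filter as a small triage function. It is a sketch only: the mapping of claim types to minimum evidence tiers below is an assumption for illustration, not procurement guidance.

```python
# Minimal sketch of the "2-minute procurement filter" as a checklist.
# The tier mapping is a hypothetical illustration, not policy.

EVIDENCE_TIERS = {
    "bench/vignette": 1,
    "comparative evaluation": 2,
    "real-world adoption": 3,
    "patient outcomes": 4,
}

# Assumed minimum evidence tier per claim type (illustrative only).
REQUIRED_TIER = {"workflow": 1, "diagnosis": 3, "treatment": 4}

def procurement_filter(claim: str, evidence: str, in_your_population: bool,
                       failures_observable: bool, governance_plan: bool) -> list[str]:
    """Return a list of red flags; an empty list means 'worth a closer look'."""
    flags = []
    if EVIDENCE_TIERS.get(evidence, 0) < REQUIRED_TIER.get(claim, 4):
        flags.append(f"Evidence tier '{evidence}' is too low for a '{claim}' claim.")
    if not in_your_population:
        flags.append("Evidence not from your population or setting.")
    if not failures_observable:
        flags.append("Failures would be hard to spot in use.")
    if not governance_plan:
        flags.append("No clear owner if the tool fails.")
    return flags

# Example: a diagnostic claim supported only by exam benchmarks.
print(procurement_filter("diagnosis", "bench/vignette",
                         in_your_population=False,
                         failures_observable=True,
                         governance_plan=False))
```

Anything the function returns is a reason to push back on the vendor before piloting, not necessarily a reason to reject the tool outright.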
How to use PRECISE results responsibly when they publish
When the PRECISE results land, you will see headlines like "AI X is better than Doctors." Ignore them. Read the subgroups.
- Superiority is specific: If OpenEvidence wins, does it win for trainees (who need knowledge) or experts (who need recall)?
- Reasoning vs Safety: Superior reasoning scores do not automatically equal "safe for autonomous use."
- The Takeaway: If PRECISE shows superiority, it is a strong reason to trial the tool in your practice, not to assume universal benefit.
FAQ
What counts as “good evidence” for clinical AI? "Good evidence" depends on the claim. For a search tool, a comparative study (Level 2) showing it retrieves correct citations is good. For a diagnostic tool, you need real-world safety data (Level 3/4) showing it doesn't miss red flags.
Why are RCTs still rare for AI tools? They are expensive, slow, and the software often changes faster than the trial can be completed. "Simulation" RCTs like PRECISE are a smart middle ground.
What is PRECISE testing specifically? It is testing clinical reasoning. It measures whether having access to a specialised RAG tool allows a doctor to formulate a better differential diagnosis and plan than using a generic chatbot.
