For the last two years, the medical AI market has operated on "vibes"—software launches, press releases, and demo videos. But medicine is not software. In medicine, if you make a claim, you show the data.
That is why the PRECISE trial (Physician Response Evaluation with Contextual Insights vs. Standard Engines) is significant. It represents a shift from "feature lists" to "comparative evidence."
It is explicitly designed as a head-to-head evaluation of a retrieval-augmented clinical tool (OpenEvidence) against a general-purpose LLM (GPT-4) in supporting physician clinical reasoning.
This post is not a verdict—the results are still being processed—but a guide on how to interpret the design so you can read the paper like a clinician when it lands.
What is PRECISE (in plain English)?
The one-paragraph summary
PRECISE is a randomised, parallel-assignment study registered on ClinicalTrials.gov (NCT07037940). In this trial, internal medicine residents and attendings are randomly assigned to solve de-identified clinical cases in a single ~90-minute online session. Half use OpenEvidence, and half use standard ChatGPT (GPT-4). Their responses are then scored by blinded evaluators using a validated rubric to determine the quality of clinical reasoning.
Who is being studied?
The participants are internal medicine trainees and attendings.
- Why this matters: Decision support tools often provide the most value where uncertainty is highest—either due to lack of experience (trainees) or case complexity (attendings). Testing across this spectrum is crucial to see if the tool acts as a "leveller" or a crutch.
What’s being compared?
- OpenEvidence: A Retrieval-Augmented Generation (RAG) system. This means the AI searches a medical library first, then answers using only that retrieved context.
- GPT-4 (ChatGPT): A general-purpose LLM. It relies on its internal training data ("parametric memory") without necessarily looking up live medical sources. (A minimal sketch of the two workflows follows the note below.)
(Note: "PRECISE" is a common acronym in medicine. Do not confuse this with the PRECISE-DAPT cardiology trial or various oncology studies. This is specifically the AI Clinical Decision Support trial).
What outcomes actually matter (and what doesn’t)
Primary outcome: quality of clinical reasoning
The trial does not just measure "did they get the diagnosis right?" It measures reasoning. Blinded raters use a rubric to score the process:
- Did they identify the correct differential?
- Did they justify the management plan?
- Did they identify key discriminators?
Secondary outcomes: speed + confidence
The trial also measures Time to Completion and Physician Confidence.
- Watch out: "Speed" is a dangerous metric in AI. A tool that produces a wrong answer instantly is "fast," but clinically disastrous. Look for speed maintenance with quality improvement.
What the trial cannot prove (by design)
This is a "simulation" trial, not a "field" trial.
- It does not measure patient outcomes (morbidity/mortality).
- It does not measure long-term adoption (do doctors keep using it after day 1?).
- It does not measure safety events (actual harm).
The Takeaway: PRECISE tests decision quality under controlled conditions: an important rung on the evidence ladder, but not the whole ladder.
Why an RCT is a big deal for medical AI
In 2026, the critique of medical AI is that we have lowered the bar. We demand RCTs for a new statin, but accept a "blog post" for a new diagnostic algorithm.
PRECISE is an attempt to normalise RCT-grade evaluation for software. Even though it tests process (reasoning) rather than outcome (health), the rigour of blinding and randomisation separates it from the usual "we asked 5 doctors and they liked it" marketing.
The clinician’s “RCT reading checklist” for AI tools
When the results are published, do not just read the abstract. Use this checklist to stress-test the findings.
AI Trial Checklist
- [ ] Case Mix: Were the cases "routine" (e.g., simple pneumonia) where GPT-4 already excels, or "rare/complex" where RAG tools should win?
- [ ] Gold Standard: Who scored the answers? Were the raters experts in that specific sub-specialty?
- [ ] Blinding: Were the evaluators truly blind to which tool was used? (RAG tools often look different because they include citations—was this masked?).
- [ ] Prompting: Were participants trained on how to prompt? (Bad prompting can make a good tool look stupid).
- [ ] Verification Burden: Did using the tool increase the time spent checking citations? (A hidden cost of RAG).
- [ ] Net Effect: Did the tool reduce uncertainty, or just produce "plausible text" that the doctor still had to double-check elsewhere?
Where iatroX fits (quietly, as “UK evidence of perceived value”)
While PRECISE tests reasoning in a controlled US environment, we are building a complementary evidence base for the UK.
A parallel UK lens: formative evaluation + adoption signals
A recent preprint analysis of iatroX (Tytler, 2025) focused on real-world utility rather than simulated performance.
- The Signal: In a study of over 19,000 UK users, 86.2% reported the tool was "useful" for their clinical workflow, and 93% would use it again.
- The Difference: PRECISE asks "Do physicians reason better with this tool than with GPT-4?" Our evaluation asks "Does it solve a problem for a busy NHS clinician?"
Positioning: PRECISE tests comparative reasoning performance; iatroX’s early evaluation tests adoption, usability, and value perception in live workflows. Both are needed to build trust.
Practical takeaways
If you are a clinician: Don't wait for the p-value. Adopt a "Verify-and-Decide" workflow now. Treat AI output—whether from OpenEvidence, iatroX, or GPT-4—as a "consultant's suggestion," not a command. Always verify the citation.
If you are a buyer/lead: Define what "success" means for your trust. Is it speed? Is it safety? Or is it reducing the variation in decision-making between your junior and senior staff?
FAQ
What is the PRECISE trial testing? It is testing whether physicians using OpenEvidence (a retrieval-augmented AI) demonstrate better clinical reasoning and management decisions compared to physicians using standard GPT-4 in a randomised, blinded study.
Does PRECISE prove OpenEvidence is safe? No trial proves a tool is "safe" in all contexts. PRECISE supports inference on the quality of reasoning in simulated cases. Real-world safety requires post-market surveillance and clinician judgement.
What’s the difference between RAG tools and general LLMs? RAG (Retrieval-Augmented Generation) tools like OpenEvidence and iatroX look up reliable sources (textbooks, journals) before answering. General LLMs like standard ChatGPT rely on their internal training data, which makes them more prone to hallucination and outdated information.
