Executive summary
As artificial intelligence matures, the questions we ask of it are becoming more complex. It's one thing for an AI to answer a simple factual query, but can it handle the truly nuanced clinical problems that define modern medicine—cases involving multi-morbidity, conflicting data, or atypical presentations? The evidence is mixed and illuminating. A landmark 2024 randomised controlled trial (RCT) found that giving physicians access to a large language model (LLM) did not, on its own, improve their diagnostic reasoning. Interestingly, the LLM alone scored higher than the physicians in either arm of the trial, a finding that underscores the significant gap between raw AI capability and effective human-AI teaming (JAMA Network, PMC).
For UK clinicians and NHS leaders, this highlights a critical reality: trust in clinical AI cannot be assumed. It must be earned through a combination of technical mitigations, like Retrieval-Augmented Generation (RAG), and robust governance. Human factors, such as the well-documented risk of "automation bias" and the sensitivity of AI to the quality of clinical prompts, mean that trust must be conditional. The UK's guardrails—including the NICE Evidence Standards Framework (ESF), the MHRA's AI Airlock, and guidance from NHS England—provide a clear pathway for building this conditional trust and ensuring AI is adopted safely and effectively.
The question behind the question: what do we mean by “nuanced”?
In a clinical context, a "nuanced" problem is one that cannot be solved by a simple algorithm. It typically involves:
- Multi-morbidity: Managing a patient with several interacting long-term conditions.
- Atypical presentations: When a common disease presents in an uncommon way.
- Conflicting data: When lab results, imaging, and the clinical picture do not align perfectly.
- High context sensitivity: Decisions that must be heavily weighted by factors like patient frailty, polypharmacy, or social context.
These are precisely the situations where human cognitive biases, like anchoring on an initial diagnosis, can lead to error. While AI can help to counter these biases, the risk of automation bias—uncritically accepting an AI's output—can amplify the danger if the AI itself is wrong (qualitysafety.bmj.com).
What the literature actually says
- A key randomised trial (2024): This study, published in JAMA Network Open, is essential reading. It found that while an LLM alone was highly accurate, giving that same LLM to physicians did not improve their diagnostic scores. This suggests that the primary challenge is not the AI's raw knowledge, but creating a workflow and user interface that allows for effective human-AI collaboration.
- Prompt sensitivity & over-trust: Research from MIT has shown that even minor, non-clinical "perturbations" in a prompt—such as typos or informal phrasing—can degrade the quality of an AI's advice and even alter its care-seeking recommendations. This highlights how easily and unintentionally a clinician can steer an AI towards a less optimal output (MIT Media Lab).
- Variability in accuracy: Studies across different domains, such as pharmacy and clinical Q&A, continue to show that the accuracy of LLMs is highly variable. Reproducibility remains a concern, reinforcing the need for caution (ACCP Journals).
When to trust: a five-test framework for clinicians (“The 5 Ps”)
- Provenance: Is the AI's answer grounded in cited, authoritative sources? RAG-based systems that provide inline citations to trusted documents score highest here.
- Precision: Have you provided a precise, structured prompt with the necessary clinical data? Remember that poor-quality prompts lead to poor-quality outputs.
- Plausibility: Does the AI's reasoning make clinical sense? Does it acknowledge uncertainty or conflicting evidence? Beware of "fluent nonsense" and guard against your own automation bias.
- Performance Context: Has this specific tool been validated for this specific task in this specific setting? The JAMA trial warns us against assuming a powerful generic model will automatically perform well in any given clinical workflow.
- Policy & Assurance: Is the tool aligned with UK governance? This means mapping to the NICE ESF, participating where needed in the MHRA AI Airlock, and being referenced in NHS England’s guidance or the AI Knowledge Repository.
How trustworthy systems are built
- Curated corpus & hybrid retrieval: The most reliable systems restrict their knowledge to a vetted library of guidelines and use a combination of lexical and semantic search to find the most relevant information.
- RAG with abstention: They use a RAG architecture that forces the model to answer only from the retrieved passages and, crucially, to "abstain" or refuse to answer if its confidence is low (a minimal sketch of this pattern follows this list).
- Transparent uncertainty: They show their work, providing clickable citations and timestamps to allow for human verification.
- Operational safety: They have a documented plan for post-market monitoring and are designed with human-in-the-loop workflows, in line with WHO guidance for large multi-modal models (LMMs) and UK policy.
- Risk management: They apply established risk frameworks, such as the NIST AI Risk Management Framework's Generative AI Profile, alongside NHS-specific processes.
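As a concrete illustration of the curated-corpus, hybrid-retrieval and abstention ideas above, here is a minimal Python sketch. The corpus, document names, scoring weights and the `semantic_score` stub (which simply returns zero) are all hypothetical placeholders; a production system would use a vetted guideline library, a real embedding model and calibrated thresholds.

```python
from dataclasses import dataclass

@dataclass
class Passage:
    source: str   # e.g. a vetted guideline document from the curated corpus
    text: str

def lexical_score(query: str, passage: Passage) -> float:
    """Crude keyword-overlap score standing in for BM25-style lexical retrieval."""
    q_terms = set(query.lower().split())
    p_terms = set(passage.text.lower().split())
    return len(q_terms & p_terms) / max(len(q_terms), 1)

def semantic_score(query: str, passage: Passage) -> float:
    """Placeholder for embedding-based similarity; a real system would plug one in here."""
    return 0.0

def answer_with_abstention(query: str, corpus: list[Passage], threshold: float = 0.2) -> str:
    """Hybrid retrieval over the curated corpus, abstaining when confidence is below threshold."""
    scored = [(0.5 * lexical_score(query, p) + 0.5 * semantic_score(query, p), p)
              for p in corpus]
    best_score, best = max(scored, key=lambda pair: pair[0])
    if best_score < threshold:
        # Safety over fluency: refuse rather than answer beyond the retrieved evidence
        return "No confident answer found in the curated corpus; refer to a clinician."
    # The answer is restricted to the retrieved passage and carries its citation
    return f"{best.text} [Source: {best.source}]"

# Illustrative corpus and queries (names and content are hypothetical)
corpus = [
    Passage("NICE guideline (illustrative)",
            "Metformin is usually first-line therapy for type 2 diabetes"),
    Passage("Local policy (illustrative)",
            "Review renal function and frailty before intensifying treatment"),
]
print(answer_with_abstention("first line treatment for type 2 diabetes", corpus))
print(answer_with_abstention("amoxicillin dose for otitis media in children", corpus))
```

The design point is that the refusal path is explicit: below the threshold the system returns a referral message rather than a fluent guess, and every answer it does give carries the citation a human needs to verify it.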
Red flags: when not to trust the output
- High-stakes, low-data cases: For a crashing patient or a case with incomplete data, defer to established protocols and senior human review.
- No citations or outdated links: An answer without a source is an opinion, not evidence.
- A "one-true-answer" tone in a complex multi-morbidity context.
- Conflicts with local policy or patient preferences.
Practical playbooks (UK/NHS)
At the elbow (clinicians)
- Use a structured prompt (Context → Data → Task → Output format with citations); a template sketch follows this list.
- Verify every critical claim against the primary sources (e.g., NICE guidelines).
- Treat the AI as a "second reader" or a brainstorming partner. If you disagree with its output, document your reasoning. This helps to counter automation bias.
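Here is a minimal sketch of the Context → Data → Task → Output structure mentioned above; the function name, field contents and vignette are illustrative, not a validated prompt.

```python
def build_clinical_prompt(context: str, data: str, task: str) -> str:
    """Assemble a structured prompt: Context -> Data -> Task -> Output format.

    The fixed output-format section asks for citations and explicit uncertainty,
    which makes the answer easier to verify against primary sources such as NICE.
    """
    return (
        f"Context:\n{context}\n\n"
        f"Data:\n{data}\n\n"
        f"Task:\n{task}\n\n"
        "Output format:\n"
        "- A short, numbered differential or recommendation list\n"
        "- Cite the specific guideline (e.g. NICE/CKS/BNF) supporting each point\n"
        "- State uncertainty explicitly and flag any missing data you would need\n"
    )

# Illustrative use with a fictional vignette (not real patient data)
prompt = build_clinical_prompt(
    context="78-year-old with CKD stage 3, heart failure and polypharmacy; community setting.",
    data="eGFR 42, K+ 5.3 mmol/L, on ramipril and spironolactone; two weeks of worsening fatigue.",
    task="Suggest a prioritised differential and the next investigations to consider.",
)
print(prompt)
```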
At the programme level (trusts/ICSs)
- Start with low-risk, high-value tasks (like ambient scribing) under the official NHS England guidance.
- For higher-risk clinical reasoning tools, use the MHRA AI Airlock or formal research protocols to pilot them, with pre-registered outcomes.
- Publish your learnings—both positive and negative—to the NHS AI Knowledge Repository to build shared confidence and system-wide learning.
Buyer’s checklist
| Requirement | Why it matters | What “good” looks like | Evidence to request |
|---|---|---|---|
| Citations & Provenance | Prevents hallucinations | Inline links to NICE/CKS/BNF | Validation pack & sample audit trail |
| Task-specific Validation | Ensures external validity | Results on local-style vignettes | Study reports; RCT/observational data |
| Abstention & Uncertainty | Safety over fluency | Clear "no answer" pathways | Design documents; system logs |
| UK Assurance | Governance and safety | NICE ESF mapping; DTAC; AI Airlock status | Certificates & registration documents |
FAQs
- Can AI solve complex diagnostic puzzles today?
- Sometimes, but effective human-AI teaming is the real challenge. The latest RCT evidence shows that simply giving a clinician access to an LLM does not automatically improve their performance. Careful workflow integration is key.
- What makes an AI's answer trustworthy?
- A combination of factors: clear citations to authoritative UK sources, transparent reasoning, flags for uncertainty, and, most importantly, positive results from local validation studies. RAG architectures are a key technical enabler for this.
- What governance applies in the UK for these tools?
- The key frameworks are the NICE ESF for evidence standards, the MHRA AI Airlock for novel medical devices, and NHS England’s guidance and the AI Knowledge Repository for implementation, all underpinned by WHO ethics guidance.
