Executive overview
In 2025, artificial intelligence in healthcare is decisively moving from promising pilots to governed, real-world deployments across the NHS and globally. The conversation is no longer about whether AI will be used, but about how it can be used safely and effectively. New guidance from bodies such as the World Health Organization and NHS England now sets clear expectations for the responsible use of technologies ranging from ambient scribes to diagnostic aids (World Health Organization, NHS England).
This article provides a definitive map of the current landscape. We will explore what AI is good for in clinical practice today, where the critical risks around hallucinations, bias, and privacy lie, and the key regulatory pathways that clinicians and managers in the UK must navigate. This includes the NHS DTAC, the NICE Evidence Standards Framework, the MHRA AI Airlock, the landmark EU AI Act, and the US FDA's guidance on clinical decision support software (NHS Transformation Directorate, NICE, GOV.UK, European Parliament, U.S. Food and Drug Administration).
Definitions & taxonomy
To have a meaningful discussion, it's important to be precise with our terms.
- Artificial intelligence (AI) is the broad field of creating intelligent machines. Machine learning (ML) is a subset of AI where systems learn from data, and deep learning is a further subset that uses complex neural networks.
- Large language models (LLMs) are deep learning models trained on vast amounts of text. The latest large multimodal models (LMMs) are a significant evolution, as they can process and integrate information from text, images, and audio, which has profound implications for healthcare (World Health Organization).
- A key regulatory distinction exists between non-device CDS, which typically provides information for a clinician to interpret, and device CDS, which analyses patient-specific data to produce a risk score or diagnostic suggestion and therefore falls under medical device regulations (U.S. Food and Drug Administration).
What “good” looks like: the evidence baseline
The concept of clinical decision support is not new. Decades of literature on traditional CDS show consistent improvements in the process of care, such as better adherence to guidelines and safer prescribing (PMC, American College of Physicians Journals). However, the evidence also shows that adoption and real-world impact hinge on two critical factors: how well the tool fits into the existing clinical workflow, and the non-negotiable requirement for human oversight (psnet.ahrq.gov, Nature).
High-impact use cases (today)
- Imaging (breast screening): The NHS has launched a world-first trial that will use AI to analyse approximately 700,000 mammograms, with the aim of improving accuracy and workforce efficiency. This builds on UK studies that have already shown the viability of double-reading strategies with AI (GOV.UK, PMC).
- Dermatology triage: Skin Analytics’ DERM tool, which uses AI to assess skin lesions, has been conditionally recommended by NICE for use in the NHS while further evidence is gathered, with several real-world deployments already underway (NICE, PMC).
- Ambient scribing (documentation): Following a surge in adoption, NHS England has now issued specific guidance for the safe deployment of ambient scribe tools, covering the need for a clinical safety case, a Data Protection Impact Assessment (DPIA), and clear clinician oversight (NHS England).
- Evidence Q&A / knowledge retrieval: Retrieval-augmented generation (RAG) has emerged as the safest architecture for clinical Q&A, as it improves the grounding of AI answers in cited, verifiable sources, though outputs still require final clinical review (Nature). A minimal sketch of the pattern follows this list.
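To make that grounding concrete, here is a short Python sketch of the citation-first pattern in miniature. The corpus, the keyword-overlap scoring, and the threshold are invented stand-ins (a real deployment would index an approved clinical knowledge base and use a proper retrieval model); the shape is the point: retrieve, cite with a "last updated" date, and abstain when nothing relevant is found.

```python
from dataclasses import dataclass

@dataclass
class Source:
    title: str
    last_updated: str   # surfaced to the clinician with every answer
    text: str

# Hypothetical curated library; a real deployment would index approved local guidance.
CORPUS = [
    Source("Hypertension guideline (example)", "2025-03-01",
           "first line treatment for stage two hypertension includes an ace inhibitor"),
    Source("Anticoagulation guidance (example)", "2024-11-15",
           "review renal function before prescribing a direct oral anticoagulant"),
]

def retrieve(question: str, corpus: list[Source], k: int = 2) -> list[tuple[float, Source]]:
    """Toy keyword-overlap scoring; a stand-in for a proper retrieval model."""
    q_terms = set(question.lower().split())
    scored = [(len(q_terms & set(s.text.split())) / max(len(q_terms), 1), s) for s in corpus]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

def answer(question: str, min_score: float = 0.2) -> str:
    hits = retrieve(question, CORPUS)
    if not hits or hits[0][0] < min_score:
        # Abstain-when-unsure: no confident grounding means no answer.
        return "No sufficiently relevant source found; please consult the guideline directly."
    citations = "; ".join(f"{s.title} (updated {s.last_updated})" for _, s in hits)
    return f"Draft answer from retrieved passages. Sources: {citations}. Requires clinician review."

print(answer("what is first line treatment for hypertension"))
```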
Risks & how to mitigate them
- Hallucinations and omissions: The risk of an AI inventing facts or omitting key information is significant and well documented in 2024–25 evaluations. Mitigation requires a "human-in-the-loop" workflow, the mandatory display of citations, and "abstain-when-unsure" behaviours built into the AI (Nature, JMIR).
- Bias, privacy, and misuse: The WHO's guidance on large multimodal models outlines the essential principles of governance, transparency, data protection, and accountability needed to mitigate the risk of AI tools exacerbating health inequalities (World Health Organization).
- Operational risks: To avoid "alert fatigue" and workflow friction, CDS prompts should be pushed at key decision points (e.g., during order entry) rather than as generic banners. Clinician overrides of AI suggestions should be audited to improve the system.
Governance & regulation (UK/EU/US—practical)
Navigating the regulatory landscape is essential for safe procurement and deployment.
| Framework | Jurisdiction | Purpose |
| --- | --- | --- |
| NHS DTAC | UK (NHS buyers) | National baseline for procurement (cyber security, clinical safety, interoperability, usability). |
| NICE ESF | UK | Sets evidence standards for the clinical and economic value of digital health technologies. |
| MHRA AI Airlock | UK (MHRA) | A regulatory sandbox for AI as a Medical Device, allowing supervised real-world testing. |
| EU AI Act | EU | Landmark legislation classifying AI by risk; health AI often falls into "high-risk" categories with significant obligations. |
| FDA CDS guidance | US | Sets out the criteria that determine whether CDS is regulated as a medical device or treated as a non-device support tool. |
Architecture & integration patterns that work
- RAG “citation-first” stack: The safest architecture for informational AI is a RAG-based system that uses a curated library of sources, retrieves information at query time, and always shows links and source dates.
- Guardrails: Safe systems must include uncertainty handling, refusal modes for out-of-scope questions, safety filters, and a mandatory human sign-off for key clinical actions such as orders or letters (see the sketch after this list).
- Data & security: All systems must have robust encryption, access controls, and clear audit trails to meet the security and interoperability expectations of the NHS DTAC.
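As an illustration only, here is how those guardrails might be composed around a model call in Python. The scope list, confidence score, and model_call stub are hypothetical placeholders rather than any particular product's API; the point is the ordering: refuse out-of-scope requests, abstain when uncertain, write every step to the audit trail, and release nothing without a named clinician's sign-off.

```python
import json
import time
from typing import Callable

IN_SCOPE_TOPICS = {"documentation", "guideline_lookup"}   # illustrative scope list
CONFIDENCE_FLOOR = 0.7                                    # illustrative threshold

def _audit(event: dict, path: str) -> None:
    """Append-only audit trail so every decision the system makes can be reviewed."""
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")

def guarded_generate(topic: str,
                     prompt: str,
                     model_call: Callable[[str], tuple[str, float]],
                     audit_path: str = "audit.log") -> dict:
    """Wrap a (hypothetical) model call with scope, uncertainty, and sign-off guardrails."""
    event = {"ts": time.time(), "topic": topic, "status": None}

    if topic not in IN_SCOPE_TOPICS:                      # refusal mode for out-of-scope asks
        event["status"] = "refused_out_of_scope"
        _audit(event, audit_path)
        return {"output": None, "reason": "Out of scope for this tool."}

    draft, confidence = model_call(prompt)                # placeholder: returns (text, score)
    if confidence < CONFIDENCE_FLOOR:                     # uncertainty handling
        event["status"] = "abstained_low_confidence"
        _audit(event, audit_path)
        return {"output": None, "reason": "Model uncertain; escalate to a clinician."}

    event["status"] = "draft_pending_signoff"
    _audit(event, audit_path)
    # Nothing is filed or sent until a named clinician signs the draft off.
    return {"output": draft, "requires_human_signoff": True}

# Example: a stub model call standing in for the real generation step.
result = guarded_generate("documentation", "Summarise today's consultation",
                          model_call=lambda p: ("Draft clinic note...", 0.92))
```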
Procurement & due diligence checklist
- Has the vendor passed the DTAC assessment, and is its DTAC pack current?
- Is their evidence aligned with a clear NICE ESF tier?
- Are data flows and hosting arrangements clearly documented?
- Does the product show citations and "last-updated" dates by default?
- Has the product been tested in a sandbox like the AI Airlock or an equivalent?
- Is a comprehensive DPIA complete and available?
- Is there a clear training and change-management plan?
Evaluation plan & KPIs
- Safety/quality: Medication error intercepts; documentation error rates; guideline concordance.
- Performance: Time-to-answer; time-to-note; suggestion acceptance/override rates; citation click-through rate (see the sketch after this list).
- Equity/robustness: Performance stratified by population demographics; monitoring for "model drift" after guideline updates.
- Economic: Time saved per clinical encounter; cost per successful intervention.
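As a hedged sketch, the snippet below shows how some of these KPIs could be computed from routine event logs. The event schema here is invented for illustration and would need to match whatever the deployed system actually records.

```python
from collections import Counter

# Hypothetical event log: one record per AI suggestion surfaced to a clinician.
events = [
    {"suggestion_id": 1, "action": "accepted",   "citation_clicked": True},
    {"suggestion_id": 2, "action": "overridden", "citation_clicked": False},
    {"suggestion_id": 3, "action": "accepted",   "citation_clicked": False},
    {"suggestion_id": 4, "action": "accepted",   "citation_clicked": True},
]

def kpis(log: list[dict]) -> dict:
    """Acceptance/override rates and citation click-through for a reporting period."""
    actions = Counter(e["action"] for e in log)
    shown = len(log)
    return {
        "acceptance_rate": actions["accepted"] / shown,
        "override_rate": actions["overridden"] / shown,
        "citation_click_through": sum(e["citation_clicked"] for e in log) / shown,
    }

print(kpis(events))
# {'acceptance_rate': 0.75, 'override_rate': 0.25, 'citation_click_through': 0.5}
```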
Case spotlights
- Breast screening: The UK's national trial approach aims to prove both workforce efficiency and non-inferiority for early cancer detection.
- Dermatology triage: NICE's conditional recommendation for the DERM tool exemplifies an "evidence-generation-for-adoption" pathway for innovative AI.
- Ambient scribing: An NHS-compliant rollout requires a full clinical safety case, mandatory human verification of every note, and a clear audit trail.
Risk mitigations at a glance
- Hallucinations / wrong answers: Mitigate by preferring RAG systems with mandatory citations and requiring human verification before any clinical action.
- Alert fatigue: Mitigate by tuning thresholds and triggering advice at key decision points, like order signing, via CDS Hooks (an illustrative card follows this list).
- Bias & generalisability: Mitigate by following the DECIDE-AI reporting guidance when piloting and continuously monitoring for equity impacts.
- Over-automation: Mitigate by maintaining human sign-off and clear audit trails as per NHS guidance.
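To make the CDS Hooks point concrete, here is an illustrative card of the kind a service might return at the order-sign hook. The field names follow the public CDS Hooks card format, but the clinical content, URLs, and thresholds are invented placeholders, not a real service's output.

```python
import json

# Illustrative CDS Hooks response for an "order-sign" request: one targeted card at the
# moment of decision, rather than a generic banner elsewhere in the interface.
response = {
    "cards": [
        {
            "summary": "Renal function check advised before this prescription",
            "indicator": "warning",   # info | warning | critical
            "detail": ("Latest recorded eGFR is below the threshold in local guidance. "
                       "Consider dose adjustment or an alternative agent."),
            "source": {
                "label": "Local prescribing guidance (example)",
                "url": "https://example.nhs.uk/guidance/renal-dosing",   # placeholder URL
            },
            "links": [
                {"label": "Open full guideline",
                 "url": "https://example.nhs.uk/guidance",
                 "type": "absolute"},
            ],
        }
    ]
}

print(json.dumps(response, indent=2))
```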
FAQs
- What is the difference between large language models and large multimodal models?
- Large language models (LLMs) are trained primarily on text data. Large multimodal models (LMMs) are a newer evolution that can understand and process information from multiple types of input, such as text, images, and audio, all within a single model.
- Is artificial intelligence in healthcare “regulated” in the UK?
- Yes, through a multi-layered system: the DTAC provides a baseline for procurement into the NHS, the NICE ESF sets evidence standards, and the MHRA regulates any AI that functions as a medical device (AIaMD).
- Are ambient scribe tools allowed in the NHS?
- Yes, but only when deployed in line with specific NHS England guidance, which requires robust safeguards, a full clinical safety case, and diligent clinician oversight.
- Does retrieval-augmented generation (RAG) remove hallucinations?
- It significantly reduces them by grounding the AI in a set of facts, but it does not eliminate them entirely. Human review of outputs remains essential.
Closing
Generative artificial intelligence in healthcare is already demonstrating its value when it is citation-grounded, seamlessly integrated into clinical workflows, and governed by robust safety protocols. The most effective path to adoption is to start small with a single, well-defined use case, run a formal benefits-realisation plan, and scale only when both the clinical value and the safety standards, aligned with NHS guidance, have been clearly demonstrated.