DxGPT review 2025: is this AI the future of rare disease diagnosis in the NHS?

Executive summary

The “diagnostic odyssey” for patients with rare diseases remains a profound challenge for modern healthcare. In the UK and Europe, the journey from first symptoms to a confirmed diagnosis still averages a painful five to six years (EURORDIS-Rare Diseases Europe, Genomics Education Programme). This delay leads to significant preventable harm, patient distress, and inefficient use of NHS resources.

DxGPT, a GPT-4–based clinical assistant, has emerged as a potential tool to shorten this journey. It is designed to generate a ranked, reasoned top-five differential diagnosis to help clinicians counter common cognitive biases in complex, often paediatric, cases. Early evaluations report clinician-comparable accuracy, and a pilot in Spain’s public health system offers valuable lessons for any potential UK adoption. However, significant limitations around prompt sensitivity and the need for robust clinical guardrails remain. For the NHS, the path to adopting such a tool would require a rigorous journey through the established UK regulatory pathways, including the MHRA, NICE Early Value Assessment (EVA), and the DTAC procurement framework.

The rare disease “diagnostic odyssey”: why speed matters

There are over 7,000 known rare diseases, and while each is rare, collectively they affect 1 in 17 people in the UK. The long delay in diagnosis is not just a statistic; it represents years of uncertainty for families, multiple referrals, and often, a cascade of low-yield investigations. The conditions themselves are frequently low-prevalence, multi-system, and present with a sparse pattern of symptoms, making them prime candidates for a structured, AI-assisted approach that can help clinicians to "widen the net" and consider possibilities beyond their immediate experience.

What DxGPT is (and isn’t)

  • Definition & scope: DxGPT is a web application built on OpenAI's GPT-4, designed specifically to assist with differential diagnosis generation. It is currently positioned as being free to use for both clinicians and patients (dxgpt.app).
  • Design intent: The tool's core function is to take a clinical vignette and return five ranked and reasoned potential diagnoses. This is intended to mitigate common cognitive errors like anchoring bias (fixating on an early idea) and premature closure (ending the diagnostic process too soon).
  • Current availability: DxGPT was notably piloted by the Madrid Health Service (SERMAS) in its primary care centres starting in 2023, in a project developed with Fundación 29 and Microsoft.
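The design intent above (a clinical vignette in, five ranked and reasoned diagnoses out) can be sketched as a small prompt-construction and parsing helper. DxGPT's actual prompts and internals are not public, so everything below (the instruction wording, the `rank. diagnosis - reasoning` output format, and the helper names) is an illustrative assumption, not the tool's real implementation:

```python
# Illustrative sketch only: DxGPT's actual prompts and API are not public.
# Shows the general shape of a vignette-in, ranked-differential-out request
# to a GPT-4-class model, plus a parser for the numbered list it returns.

import re

def build_differential_prompt(vignette: str, n: int = 5) -> str:
    """Wrap a clinical vignette in an instruction asking for a ranked differential."""
    return (
        "You are assisting a clinician with a differential diagnosis.\n"
        f"Given the clinical vignette below, list the {n} most likely diagnoses, "
        "ranked from most to least likely, one per line as "
        "'rank. diagnosis - reasoning'. Consider rare and multi-system conditions.\n\n"
        f"Vignette:\n{vignette}\n"
    )

def parse_ranked_differential(model_output: str) -> list[str]:
    """Extract diagnosis names from lines shaped like 'rank. diagnosis - reasoning'."""
    diagnoses = []
    for line in model_output.splitlines():
        match = re.match(r"\s*\d+\.\s*([^-]+?)\s*-", line)
        if match:
            diagnoses.append(match.group(1).strip())
    return diagnoses

# Parsing a hypothetical (invented) model response:
sample_response = (
    "1. Fabry disease - angiokeratomas, acroparesthesia, proteinuria\n"
    "2. Hereditary angioedema - recurrent swelling without urticaria\n"
)
print(parse_ranked_differential(sample_response))  # ['Fabry disease', 'Hereditary angioedema']
```

Note how the instruction explicitly asks the model to "consider rare and multi-system conditions": the prompt-sensitivity findings discussed later are exactly why a structured template like this tends to outperform a sparse, lay-language description.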

Evidence snapshot: how well does DxGPT perform?

The most significant evaluation to date is a medRxiv pre-print study by Alvarez-Estapé et al. (2024). In a head-to-head comparison on complex paediatric and rare disease cases, the study reported that DxGPT's top-5 accuracy was similar to that of hospital clinicians. Performance improved significantly when the AI was given richer, more detailed clinical notes compared to simple case summaries (medRxiv, ResearchGate).

This finding aligns with a broader 2024 meta-analysis in Nature, which found no significant overall difference between the diagnostic accuracy of physicians and AI, underscoring that the performance of any clinical AI diagnosis tool is highly dependent on the specific clinical case mix and the quality of the input data.

Strengths in the clinical workflow

  • A cognitive forcing function: By providing a structured, ranked differential, the tool can act as a prompt to consider "don’t-miss" conditions and key clinical discriminators that may not be immediately apparent.
  • Support in paediatrics & multi-system cases: The existing evidence base for DxGPT is centred on complex paediatric cases, where a broad initial differential is often more valuable than deep expertise in a single system.
  • Learning for trainees: For fellows and registrars, the tool can be used in an educational setting to rehearse clinical reasoning against difficult vignettes, helping to tighten their diagnostic process and hand-offs to specialist genetics or metabolic clinics.

Known limitations and open questions

  • Prompt sensitivity: The performance of DxGPT can degrade significantly if the input prompts are sparse or written in lay language. Richer, structured prompts yield better results, a finding consistent across the AI literature (medRxiv, arXiv).
  • Safety behaviours: Some broader studies of LLMs have shown inconsistent or even unsafe "care-seeking" recommendations. This reinforces the principle that DxGPT must be treated as an assistive tool, not an autonomous one.
  • External scrutiny: The early SERMAS pilot in Spain faced media questions regarding its clinical validation and governance. These are valuable cautionary lessons for any potential NHS adoption, which would need a clear and transparent evaluation plan from the outset (Newtral).

Case study: Spain’s SERMAS pilot

The SERMAS pilot involved making the conversational assistant for rare diseases accessible to primary care clinicians. The project, built on Azure OpenAI with the non-profit Fundación 29, is a bold step in public-sector AI adoption. The key lesson for the NHS is that implementing such a tool at system scale requires clear validation protocols, deep clinician engagement, and proactive public communication to build trust and manage expectations.

Strategic assessment for the NHS: from sandbox to service

For a tool like DxGPT to be used in the UK, it would need to navigate a clear and rigorous regulatory and evidence pathway.

  • Regulatory route: As a tool that informs a diagnosis, it would be classified as a Software/AI as a Medical Device (SaMD/AIaMD) and fall under the MHRA framework. The MHRA AI Airlock provides a dedicated sandbox for testing novel algorithms in a supervised, real-world setting.
  • Evidence & adoption: The NICE Early Value Assessment (EVA) pathway is designed for promising technologies like this. It would allow for time-limited, conditional NHS use while prospective evidence is generated against key endpoints (e.g., time-to-diagnosis, referrals avoided, cost per correct diagnosis).
  • Procurement baseline: Before any NHS organisation could use the tool, it would need to meet the NHS Digital Technology Assessment Criteria (DTAC). The deploying organisation would also need to complete its own clinical safety case under the DCB0129/0160 standards.

Benefits & savings: where value could credibly accrue

  • Faster triage to Centres of Expertise: Routing patients to the right specialist centre earlier would shorten the "diagnostic odyssey", cutting repeat referrals and low-yield outpatient cycles.
  • Reduced low-yield investigations: By helping to narrow the diagnostic hypotheses earlier, the tool could reduce the number of tests performed per diagnosis.
  • Clinician time: There is potential to save clinician time in case synthesis and documentation, though this would need to be quantified in UK-specific pilots.

Pilot design for an ICS/region (90-day plan)

  • Settings: A pilot could be run in paediatric ambulatory genetics clinics or with complex GP referrals to specialist metabolic clinics.
  • Design: A stepped-wedge or A/B trial comparing DxGPT-assisted consultations versus usual care.
  • KPIs: Top-5 hit rate at the first specialist visit, time-to-diagnosis, number of unnecessary tests avoided, clinician trust/usability scores, and all safety-related events.
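The headline KPI above, the top-5 hit rate at the first specialist visit, is straightforward to compute from pilot records. A minimal sketch follows; the record field names (`ranked_differential`, `confirmed_diagnosis`) and the sample data are invented for illustration, not drawn from any real pilot:

```python
# Minimal sketch of the "top-5 hit rate" KPI from the pilot plan.
# Each case record pairs the tool's ranked differential with the
# confirmed diagnosis recorded at the first specialist visit.

def top_k_hit_rate(cases: list[dict], k: int = 5) -> float:
    """Fraction of cases whose confirmed diagnosis appears in the top-k differential."""
    if not cases:
        return 0.0
    hits = sum(
        1 for case in cases
        if case["confirmed_diagnosis"] in case["ranked_differential"][:k]
    )
    return hits / len(cases)

# Illustrative records (not real pilot data): one hit, one miss.
pilot_cases = [
    {"ranked_differential": ["Fabry disease", "Pompe disease"],
     "confirmed_diagnosis": "Fabry disease"},
    {"ranked_differential": ["Asthma", "GORD"],
     "confirmed_diagnosis": "Primary ciliary dyskinesia"},
]
print(top_k_hit_rate(pilot_cases))  # 0.5
```

The same structure extends naturally to the other KPIs (time-to-diagnosis deltas, tests avoided), which are simple aggregates over the same per-case records.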

Governance & safeguards

  • Human-in-the-loop: The clinician must remain in the loop with documented accountability. Unsupervised patient use for clinical decisions must be prohibited.
  • Guardrails: The tool must have mandatory source citation, clear uncertainty flags, and explicit "seek specialist advice" nudges.
  • Change control: The AI model and its version must be tracked to comply with MHRA expectations for significant software changes.

FAQs

  • Is DxGPT “as good as a doctor”?
    • No. Early studies show it has comparable top-5 accuracy in specific paediatric settings, but it remains an assistive tool that requires expert clinician oversight and interpretation.
  • Has it been used in a public health system?
    • Yes, the SERMAS health service in Madrid, Spain, piloted access for its primary care clinicians. Any UK adoption would need to proceed through the official MHRA, NICE, and DTAC pathways.
  • What about safety?
    • All LLMs carry a risk of missing or mis-prioritising advice. Any UK pilot would need to meticulously measure safety events and mandate clear "seek urgent care" prompts where appropriate.

Conclusion & call to action

DxGPT should be viewed as a fascinating and promising case study for how specialised AI can be used to counter cognitive biases in the specific, challenging domain of rare diseases. It is not, and should not be seen as, a replacement for clinical reasoning.

The next step for NHS innovation leads is clear: a tool with this potential warrants a formal UK pilot. This should be commissioned under the strict guardrails of the MHRA AI Airlock and the NICE EVA pathway, with a transparent evaluation protocol published to the NHS AI Knowledge Repository to ensure that any learnings—positive or negative—are shared for the benefit of the entire system.

