A study published in Nature Medicine on 23 February 2026 has delivered the first independent safety evaluation of ChatGPT Health since its January launch — and the findings raise specific, measurable concerns about consumer-facing health AI at clinical extremes.
The Study
Ramaswamy et al. at the Icahn School of Medicine at Mount Sinai designed a structured stress test of ChatGPT Health's triage recommendations. The methodology was rigorous and systematic: 60 clinician-authored vignettes across 21 clinical domains, each tested under 16 factorial conditions varying race, gender, social dynamics (whether family members minimised or validated symptoms), insurance status, and transportation barriers. This yielded 960 total interactions with ChatGPT Health.
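To make the design concrete, here is a minimal sketch (in Python) of how a condition grid of that shape can be enumerated. The factor names and levels are illustrative assumptions only; the paper's exact level structure behind the 16 cells is not reproduced here.

```python
from itertools import product

# Illustrative reconstruction of the factorial grid. The study varies race,
# gender, social dynamics, insurance status, and transportation barriers
# across 16 conditions; the exact level structure is not reproduced here,
# so these factor names and levels are assumptions.
factors = {
    "race": ["white", "black"],
    "gender": ["male", "female"],
    "social_dynamic": ["symptoms_minimised", "symptoms_validated"],
    "access": ["insured_with_transport", "uninsured_no_transport"],
}

conditions = list(product(*factors.values()))  # 2 x 2 x 2 x 2 = 16 cells
n_vignettes = 60
print(len(conditions))                 # 16 conditions per vignette
print(len(conditions) * n_vignettes)   # 960 total interactions
```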
Triage accuracy was compared against physician consensus established by three independent physicians per scenario, using guidelines from 56 medical societies to define the gold-standard urgency level for each case. This was not a casual evaluation — it was a factorial stress test designed to probe performance across the full acuity spectrum and identify systematic failure patterns.
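For readers who want the scoring logic made explicit, a minimal sketch follows, assuming an ordinal urgency scale (the scale itself is illustrative; the paper's categories may differ). The point is that undertriage and overtriage are directional errors measured against the consensus level.

```python
# Illustrative ordinal urgency scale; the paper's exact categories may differ.
URGENCY = {
    "self_care": 0,
    "routine": 1,
    "24_48h": 2,
    "urgent_same_day": 3,
    "emergency": 4,
}

def classify(model_level: str, consensus_level: str) -> str:
    """Score a model recommendation against the physician consensus level."""
    diff = URGENCY[model_level] - URGENCY[consensus_level]
    if diff < 0:
        return "undertriage"  # less urgent than consensus: the dangerous direction
    if diff > 0:
        return "overtriage"   # more urgent than consensus: costly but safer
    return "concordant"

# The DKA scenario discussed below: consensus "emergency",
# model recommendation "24-48 hour evaluation".
print(classify("24_48h", "emergency"))  # -> "undertriage"
```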
Key Findings
The inverted U-shaped performance curve. ChatGPT Health performed strongest on mid-acuity presentations — the bread-and-butter clinical scenarios where the correct triage recommendation is relatively unambiguous. A patient with moderately concerning symptoms who should see a doctor within a few days? ChatGPT Health handles this well. Performance was weakest at both extremes of the acuity spectrum — precisely where correct triage is most consequential.
Emergency undertriage: 52%. Among cases that the physician consensus panel agreed required emergency care, ChatGPT Health undertriaged 52%. Patients presenting with diabetic ketoacidosis and impending respiratory failure, conditions that can kill within hours without emergency intervention, were directed to "24-48 hour evaluation" rather than the emergency department. Lead author Dr Ashwin Ramaswamy stated: "This is something that can kill someone in a couple of hours."
Classical emergencies with unmistakable clinical signatures were correctly triaged 100% of the time — stroke with acute focal neurological deficit, anaphylaxis with airway compromise. The failures concentrated on emergencies with ambiguous, evolving, or atypical presentations. DKA can present with vague symptoms (nausea, abdominal pain, malaise) before progressing to cardiovascular collapse. Impending respiratory failure can present as "feeling a bit breathless" before rapid decompensation. These are precisely the scenarios where a clinician's pattern recognition and index of suspicion separate a missed diagnosis from a life-saving intervention — and where ChatGPT Health systematically failed.
Non-urgent overtriage: 35%. At the other end of the spectrum, non-urgent presentations were overtriaged in 35% of cases, directing patients to more urgent care than needed. This increases healthcare utilisation and cost but is substantially less dangerous than undertriage. Sending someone unnecessarily to the ED wastes resources; sending someone with DKA home can be fatal.
Anchoring bias: OR 11.7. When family members or friends minimised symptoms in the clinical scenario — "I'm sure it's nothing, she just needs to rest" — triage recommendations shifted significantly toward less urgent care (odds ratio 11.7, 95% CI 3.7-36.6). The AI was influenced by social context that should not affect clinical triage. A patient's mother reassuring that "it's probably just a stomach bug" should not change the clinical assessment of a patient presenting with features consistent with DKA. But it did — substantially and consistently.
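For readers unfamiliar with the statistic, this is how an odds ratio and a Wald 95% confidence interval are computed from a 2x2 table. The cell counts below are hypothetical placeholders, not the study's data; they merely produce numbers in the same ballpark as the reported result.

```python
import math

# Hypothetical 2x2 counts chosen only to show the calculation; the study
# reports OR 11.7 (95% CI 3.7-36.6) but its raw cell counts are not given here.
# Rows: family minimised symptoms vs validated them.
# Columns: triage downgraded vs not downgraded.
a, b = 30, 10  # minimised: downgraded / not downgraded
c, d = 5, 20   # validated: downgraded / not downgraded

odds_ratio = (a * d) / (b * c)                 # cross-product ratio
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)   # standard error of log(OR)
ci_low = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
ci_high = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)
print(f"OR {odds_ratio:.1f}, 95% CI {ci_low:.1f}-{ci_high:.1f}")
```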
This finding is particularly concerning because it mirrors a known cognitive bias in human clinicians (anchoring), but amplifies it. Human clinicians are trained to recognise and resist anchoring. ChatGPT Health appears to be susceptible to it at a level that would be considered clinically dangerous in a human practitioner.
Inconsistent suicide-crisis safeguards. The system sometimes missed high-risk suicidal ideation scenarios, failing to activate crisis safeguards consistently. For a consumer-facing tool reaching 40 million daily users, many of whom may be in acute psychological distress, inconsistent crisis-safeguard activation is a critical safety concern. ECRI, an independent patient-safety organisation, ranked misuse of AI chatbots in healthcare as the top health technology hazard for 2026.
OpenAI's Response
An OpenAI spokesperson said the study does not reflect how ChatGPT Health is "typically used." The tool is designed for multi-turn conversations in which patients provide additional context through follow-up questions, not single-prompt triage. The company noted that the tool is still in limited rollout and that it is working to improve safety and reliability.
The rebuttal has some merit. Multi-turn conversation does provide clinical context that can improve assessment accuracy — a patient who initially reports "feeling unwell" may provide crucial details about polyuria, polydipsia, and recent weight loss when prompted. But the underlying concern remains: 40 million daily users, many of whom will type a single-prompt symptom query ("I have bad stomach pain and feel dizzy and my breath smells funny, what should I do?") regardless of how OpenAI intends the tool to be used. Design intent does not control user behaviour at consumer scale.
The Expert View
Isaac Kohane, Chair of Biomedical Informatics at Harvard (not involved in the study), provided context: LLMs have become patients' first stop for medical advice — but in 2026 they are least safe at the clinical extremes, where judgment separates missed emergencies from needless alarm.
The insight is exactly right: LLMs are most dangerous where they are most needed. The mid-acuity presentations where ChatGPT Health performs well are those where most patients would have made a reasonable triage decision without AI assistance. The high-acuity atypical presentations where the system fails are the cases where correct triage saves lives.
What This Means for Clinician-Facing AI
The critical distinction that media coverage of this study has often missed: this tested a consumer-facing tool. Patients triaging themselves without clinical training, using an AI system without a healthcare professional in the loop. This is fundamentally different from clinician-facing clinical AI.
Clinician-facing tools — iatroX, OpenEvidence, Medwise — operate in a different safety model. A qualified professional interprets and acts on the AI's output. The clinician provides the pattern recognition, clinical judgment, contextual assessment, and index of suspicion that the AI cannot. The AI provides information retrieval, synthesis, and guideline grounding. The combination is safer than either component alone because the human professional catches the errors that the AI makes — and the AI surfaces information that the human professional might not have immediately recalled.
The denominator problem. 40 million daily consumer users multiplied by even a small error rate equals thousands of potentially dangerous undertriage events per day. Clinician-facing tools have a much smaller user base (professional users only) and a human clinical judgment buffer between the AI output and the patient-facing decision. The absolute number of dangerous errors is orders of magnitude smaller.
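A quick worked version of that arithmetic, in which only the undertriage rate is taken from the study and every other rate is a labelled assumption:

```python
# Back-of-envelope arithmetic for the denominator problem. Only the 52%
# undertriage rate comes from the study; every other number is an assumption.
daily_users = 40_000_000
symptom_query_rate = 0.10   # assumed: share of daily users asking a triage-type question
emergency_share = 0.002     # assumed: share of those queries that are true emergencies
undertriage_rate = 0.52     # from the study

events = daily_users * symptom_query_rate * emergency_share * undertriage_rate
print(round(events))  # ~4,160 potentially dangerous undertriage events per day
```

Even with deliberately conservative assumptions, the absolute count lands in the thousands, which is the point of the denominator argument.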
iatroX's safety model. UKCA-marked, MHRA-registered Class I medical device. DCB 0129 clinical safety governance with systematic hazard identification and risk controls. Designed to support clinician decision-making, not replace it — the clinician always interprets, validates, and acts on the output. Ask iatroX retrieves and synthesises NICE guidelines, CKS summaries, peer-reviewed literature, and SmPC data — grounded in authoritative clinical sources rather than general web content. For clinical decision support, source quality and specificity matter more than model size or parameter count.
The Measured Take
This study is not evidence that "AI is dangerous." It is evidence that consumer-facing health AI without a clinician in the loop has specific, measurable safety limitations at clinical extremes: the atypical, ambiguous, evolving presentations where correct triage saves lives. The system works well on textbook cases and fails on the cases that textbooks prepare you to recognise but that real patients present ambiguously.
Clinician-facing AI with a professional in the loop is a fundamentally different proposition. The study strengthens the case for purpose-built, regulated clinical AI tools over general-purpose consumer AI applied to healthcare. The regulatory approach matters: MHRA registration (iatroX) versus no registration (ChatGPT Health) is not bureaucracy — it reflects substantive differences in clinical safety assurance, hazard identification, risk controls, and post-market surveillance.
The question is not "can AI do medicine?" — it increasingly can. The question is "who is in the loop when it gets it wrong?"
Try Ask iatroX — clinician-facing, MHRA-registered, UK guideline-grounded →
