AI Tutors Can Make You Worse at Exams. Here's the Evidence — and How to Use One That Doesn't


Used as an answer machine, AI can reduce the cognitive effort that produces durable learning. And you will not notice, because it feels like it is working. The fluent explanation sounds helpful. The immediate comprehension feels productive. The relief of understanding replaces the discomfort of not knowing. But when the exam arrives and the AI is not there, neither is the knowledge.

The mitigation is not to avoid AI. It is to use a tool that makes you retrieve before it explains — a Socratic tutor that diagnoses your specific gap, withholds the answer until you have attempted it, and grounds every clinical claim in a verifiable source rather than generating from training patterns.

This article lays out the evidence. It is not comfortable reading if you currently paste the questions you got wrong into ChatGPT. But the evidence is clear, and understanding it will change how you revise.

The Strongest Evidence: The Wharton Maths RCT

Bastani et al. (2025), published in the Proceedings of the National Academy of Sciences, conducted a randomised controlled trial with nearly 1,000 high school maths students at a large school in Turkey during the 2023-2024 academic year. The study was approved by the University of Pennsylvania IRB and is one of the largest and most rigorous experimental evaluations of generative AI's impact on learning.

There were three groups: one given unrestricted GPT-4 access ("GPT Base", essentially a ChatGPT interface), one given a guardrailed tutor version ("GPT Tutor", designed with teacher input to provide hints and guidance rather than direct answers), and a control group with no AI access. The intervention ran over four 90-minute sessions covering approximately 15% of the semester's maths curriculum.

During AI-assisted practice sessions, both AI groups outperformed the control. GPT Base improved performance by 48%. GPT Tutor improved performance by an astonishing 127%. The AI clearly helped in the moment — dramatically so for the tutor version.

Then the AI was removed and students sat an exam on their own.

The GPT Base group performed 17% worse than the control group: significantly worse than students who never had AI at all. The GPT Tutor group performed comparably to the control; the authors report that the guardrails largely mitigated the harm.
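To make the size of that reversal concrete, here is a small arithmetic sketch. It indexes the control group's score at 100, which is an assumed unit chosen for illustration and not a figure from the paper; only the percentage changes come from the study as reported above.

```python
# Illustrative arithmetic only. The control score is indexed at 100:
# an assumed unit for comparison, not a number reported by the study.
CONTROL_INDEX = 100.0


def relative_score(percent_change: float, baseline: float = CONTROL_INDEX) -> float:
    """Score implied by a percentage change relative to the baseline."""
    return baseline * (100 + percent_change) / 100


practice_gpt_base = relative_score(48)     # +48% during AI-assisted practice
practice_gpt_tutor = relative_score(127)   # +127% during AI-assisted practice
exam_gpt_base = relative_score(-17)        # -17% on the unassisted exam

print(practice_gpt_base, practice_gpt_tutor, exam_gpt_base)
# prints: 148.0 227.0 83.0
```

On this index, the unrestricted group goes from well ahead of the control during practice to clearly behind it once the AI is taken away.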

The finding is stark: unrestricted AI access during practice can actively harm learning outcomes. The authors' explanation: students used GPT-4 as a "crutch", outsourcing the cognitive work of problem-solving to the model rather than engaging with the material themselves. When the crutch was removed, they could not perform independently. The guardrailed tutor version, which provided incremental hints rather than direct answers and preserved the student's cognitive effort, largely avoided this effect.

As lead author Hamsa Bastani explained: "If we use it sort of lazily and kind of outsource the work that we're supposed to be doing and completely trust the machine learning model, then that's when we could be in trouble."

It Is Not One Study

The Wharton RCT is the most rigorous evidence, but other research points in the same direction.

A 2025 study of undergraduates examined the effect of using ChatGPT as a study aid on longer-term retention. Students who used ChatGPT to help them study scored significantly lower on a surprise retention test administered 45 days later than students who used traditional study methods. The AI group did not report feeling less prepared: their subjective assessment of learning was comparable to the control group's. But their actual retention was measurably worse.

The 45-day timeframe is important because it reflects the gap between revision sessions and exam day that real trainees face. A study method that produces comprehension in the moment but fails at 45 days is not a study method — it is a comprehension exercise that masquerades as one.

The pattern across these studies is consistent: AI as an answer machine can improve immediate performance (you understand the explanation) while degrading long-term retention (you cannot retrieve the knowledge independently). The proposed mechanism is the same each time: the AI removes the effortful retrieval and error-correction that durable learning requires.

The Dangerous Part: The Felt-vs-Actual Gap

This is what makes the crutch effect insidious rather than merely disappointing. It is not that students notice they are learning less and persist anyway. It is that they cannot tell.

Students in these studies did not report feeling less prepared. They believed they had learned as much as their peers. Their subjective experience of studying with AI felt exactly as productive as studying without it — and in some cases more productive, because the AI made the material feel easier to understand.

The cognitive mechanism: reading a fluent AI explanation feels exactly like learning. The comprehension is genuine. The logic is clear. The explanation covers the topic thoroughly. The student finishes the session feeling that they have mastered the concept. But comprehension in the moment is not the same as retrieval under exam conditions days or weeks later. Recognising an explanation when you read it is a fundamentally different cognitive task from producing the knowledge from memory when prompted by a question stem.

The felt-vs-actual gap is the core danger. If students could detect that passive AI explanations were not producing durable learning, they would change their behaviour. But the illusion is perfect: the studying feels productive, the comprehension is real, and the failure only becomes visible when the exam reveals it — at which point it is too late to revise differently.

This is why tool design matters more than user intention. A trainee who intends to study actively but uses a tool that defaults to passive explanation will drift toward passive consumption — because the passive path feels equally productive and is far more comfortable. The design of the tool must enforce the active pattern.

Why This Happens: The Science

Two established principles from cognitive psychology explain the mechanism at a deeper level.

Retrieval practice (Roediger & Karpicke, 2006). The testing effect is one of the most robust findings in learning science. The act of pulling information out of memory — actively retrieving it rather than passively re-reading or receiving it — is one of the most powerful learning events available. Testing yourself on material produces stronger and more durable memory than re-studying the same material, even when re-study involves more time and feels more productive.

The mechanism: effortful retrieval strengthens the neural pathways connecting the cue (the question) to the target (the answer). Each retrieval attempt — successful or unsuccessful — reinforces the encoding. Failed retrieval followed by correction creates an error signal that updates the memory trace more effectively than simply reading the correct answer. The effort is not a side effect of learning. It is a primary cause.

For medical exam preparation, the implication is direct: answering practice questions and struggling to recall the answer produces more durable learning than reading explanations — even when the explanations are clear, correct, and comprehensive.

Desirable difficulty (Bjork & Bjork, 2011). Learning conditions that make performance feel harder in the moment — effortful retrieval, spacing between practice sessions, interleaving different topics, making and correcting errors — produce more durable long-term retention than conditions that feel easy and fluent. The difficulty is "desirable" because it drives the encoding processes that create lasting memory.

The counterintuitive finding: conditions that produce the best immediate performance (massed practice, blocked topics, immediate feedback, fluent explanations) often produce poorer long-term retention, while conditions that depress immediate performance (spaced practice, interleaved topics, delayed feedback, effortful retrieval) tend to produce better long-term retention.

When a trainee gets a question wrong and immediately receives a fluent AI explanation, two things happen that undermine durable learning. First, the retrieval opportunity is bypassed — instead of struggling to recall the correct reasoning, the trainee reads it passively. Second, the error-correction process is passive — instead of actively identifying and correcting the misconception through their own reasoning, the trainee absorbs the correction through reading. Both remove the desirable difficulty that would have produced durable encoding.

What a Learning-Safe AI Tutor Does Differently

A tutor designed around these evidence-based principles would incorporate four specific safeguards.

Question-first / Socratic design. Before explaining anything, the tutor diagnoses the trainee's specific misconception. It asks questions to understand what the trainee got wrong and why — not what the textbook says in general, but what this specific person misunderstood about this specific question. It withholds the answer until the trainee has attempted to reason through the problem. Every question the tutor asks is a retrieval event — and every retrieval event strengthens the memory trace, whether the trainee's answer is correct or not.

The Socratic approach is not merely conversational — it is pedagogically functional. Each question forces the trainee to search memory, produce a response, and engage with the concept before receiving feedback. The effort of producing a wrong answer and then having it corrected is more effective for learning than the ease of reading a correct explanation.
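The core of question-first design is a simple gate: no explanation until an attempt has been made. The sketch below is a minimal illustration of that gate, not any product's actual implementation; the class name `SocraticSession`, its methods, and the prompt text are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SocraticSession:
    """Hypothetical session that withholds the explanation until an attempt exists."""
    explanation: str
    attempts: list[str] = field(default_factory=list)

    def submit_attempt(self, answer: str) -> None:
        # Any non-empty attempt counts as a retrieval event, right or wrong.
        if answer.strip():
            self.attempts.append(answer.strip())

    def request_explanation(self) -> str:
        # Gate: with no attempt recorded, return a prompt instead of the answer.
        if not self.attempts:
            return "Have a go first: what do you think the answer is, and why?"
        return self.explanation


# Usage: the first request is refused, the second succeeds after an attempt.
session = SocraticSession(explanation="(vetted explanation text goes here)")
first = session.request_explanation()
session.submit_attempt("I think it is option B because of the drug interaction.")
second = session.request_explanation()
```

A real tutor would also diagnose the attempt before explaining; the point here is only that the explanation path is closed until the trainee has produced something.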

Source grounding. Every clinical claim the tutor makes should be attributable to a vetted source — a guideline (NICE, CKS, SIGN), a medicines reference (SmPC/eMC), or the exam's own official explanation and reference — rather than generated from training data patterns. This matters for two reasons. First, fluent fabrication in a medical context creates false clinical knowledge that may persist into practice — a safety concern that extends beyond exam performance. Second, grounded answers are verifiable — the trainee can check the source, which builds trust and supports deeper understanding.

Loop-closing retrieval. After the Socratic exchange, the system should test the trainee on the same concept — not immediately (which tests short-term memory, not durable learning) but after an appropriate delay (which tests whether the knowledge has been durably encoded). Feeding a related question from the question bank back into spaced repetition scheduling closes the loop between the tutoring session and long-term retention. Without this closure, the Socratic exchange improves understanding in the moment but does not guarantee durable recall.
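As a rough sketch of what "feeding a related question back into spaced repetition scheduling" could look like, the function below picks the next review date for a concept. The interval table and the function name `next_review` are assumptions made for illustration; they are not taken from the studies cited here or from any particular scheduler.

```python
from datetime import date, timedelta

# Assumed review gaps in days. Placeholder values for illustration only.
INTERVALS_DAYS = (1, 3, 7, 21)


def next_review(today: date, streak: int, recalled: bool) -> tuple[date, int]:
    """Return (next review date, updated streak) for one concept."""
    # A failed recall resets the streak so the concept comes back soon.
    new_streak = streak + 1 if recalled else 0
    # Clamp to the longest gap once the streak runs past the table.
    index = min(new_streak, len(INTERVALS_DAYS) - 1)
    return today + timedelta(days=INTERVALS_DAYS[index]), new_streak


# A concept just tutored after a wrong answer is retested the next day,
# not in the same session.
due, streak = next_review(date(2025, 1, 1), streak=0, recalled=False)
```

The design choice that matters is the delay: the earliest retest lands on a later day, so it measures recall from memory and not what is still fresh from the tutoring exchange.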

Confidence calibration. The felt-vs-actual gap is the core danger of passive AI study. A tutor that asks the trainee to rate their confidence before and after the session — and then tests whether that confidence is justified by actual performance — surfaces the gap rather than allowing it to persist undetected. Metacognitive scaffolding (helping trainees become aware of what they do and do not know) is itself a learning intervention that improves study strategy and exam performance.
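One simple way to put a number on the felt-vs-actual gap is to compare average stated confidence with actual accuracy. The sketch below assumes confidence is collected as a 0 to 100 rating per question; that scale and the function name `calibration_gap` are illustrative choices, not a description of any specific tool.

```python
def calibration_gap(confidence_pct: list[float], correct: list[bool]) -> float:
    """Mean stated confidence minus actual accuracy, in percentage points.

    Positive values indicate overconfidence; negative values, underconfidence.
    """
    if not confidence_pct or len(confidence_pct) != len(correct):
        raise ValueError("need one confidence rating per answered question")
    mean_confidence = sum(confidence_pct) / len(confidence_pct)
    accuracy = 100 * sum(correct) / len(correct)
    return mean_confidence - accuracy


# Made-up example: a trainee who feels about 80% sure but gets half right.
gap = calibration_gap([80, 90, 70, 80], [True, False, False, True])
print(gap)  # prints: 30.0
```

A gap of 30 points is exactly the kind of false security that stays invisible unless the tool asks for a confidence rating and then checks it against performance.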

Medical-Specific Validation

The general learning science is clear. But does it apply specifically to medical education?

The Dartmouth/Geisel School of Medicine study (published in npj Digital Medicine, November 2025) validated these principles for clinical learning. The research found that Socratic tutoring transforms a passive answer service into an active learning partner that promotes long-term retention. Critically, the study also found that the right design allows context-dependent mode switching: direct answers under exam time-pressure (when the trainee genuinely needs the information fast) but Socratic dialogue during regular study (when the goal is durable learning, not immediate information).

This is an important nuance. The evidence does not say "never give direct answers." It says "do not default to direct answers during the learning phase, because that is when the crutch effect operates." A well-designed tutor recognises when the trainee is studying (Socratic mode) versus when they are cramming (direct mode) — and defaults to the one that produces durable learning.
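The "default to Socratic, switch only on a clear signal" rule can be sketched in a few lines. How a real tutor detects cramming is not specified in the studies described above; the `days_to_exam` check and the one-day `cram_window_days` threshold below are assumptions for illustration.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    SOCRATIC = "socratic"  # question-first; the default
    DIRECT = "direct"      # straight explanation


def choose_mode(override_requested: bool = False,
                days_to_exam: Optional[int] = None,
                cram_window_days: int = 1) -> Mode:
    """Default to Socratic; switch to direct only on an explicit signal."""
    if override_requested:
        return Mode.DIRECT
    # Assumed heuristic: treat the final day(s) before the exam as cramming.
    if days_to_exam is not None and days_to_exam <= cram_window_days:
        return Mode.DIRECT
    return Mode.SOCRATIC
```

With no signal at all, `choose_mode()` returns `Mode.SOCRATIC`, which reflects the point of the paragraph above: direct answers are the exception, not the starting position.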

A separate study (Golchini et al., published on medRxiv, June 2025) demonstrated an open-source adaptive Socratic tutor specifically designed for clinical case-based reasoning, showing that the approach can be operationalised for medical education at scale — not just in theory, but in working systems that adapt to individual learner performance.

How to Tell If Your AI Tutor Is the Harmful Kind

Before trusting any AI study tool with your exam preparation, run it through these five checks.

Does it answer before you have attempted the question? If the tool provides the explanation before you have tried to retrieve the answer, it is removing the retrieval practice that produces durable learning. Every premature answer is a missed retrieval event.

Does it ever make you retrieve? Does the tool ask you questions — genuinely waiting for your response — rather than simply providing explanations? If it only explains, it is a passive tool regardless of how sophisticated, accurate, or well-structured the explanations are. Passive tools feel helpful. The evidence says they are not.

Is it grounded in verifiable sources, or generating from training data? For medical exam preparation, ungrounded generation creates a specific risk: false clinical knowledge that sounds authoritative, passes undetected, and may persist into clinical practice. A grounded tutor attributes its claims to identifiable sources that the trainee can verify.

Does it test you afterwards? If the tutoring session ends with an explanation and no follow-up retrieval event, the learning loop is not closed. Understanding an explanation is not the same as being able to retrieve the knowledge later. A closed loop includes a retrieval test after the explanation — ideally scheduled at an optimal spacing interval.

Does it help you see the gap between how confident you feel and how much you actually know? The felt-vs-actual gap is invisible without explicit calibration. If the tool does not ask you to rate your confidence and then test whether that confidence is justified, the gap goes undetected — and you revise in a state of false security.

iatroX as the Worked Example

The iatroX Socratic Tutor was built around these principles — not because "Socratic" is a marketing term, but because the evidence says passive explanation is the mode that quietly makes trainees worse at exams while feeling like it helps.

Socratic-first by default. The tutor asks you first. It diagnoses your specific misconception before teaching anything. It withholds the answer to force retrieval. Every exchange is a retrieval event that strengthens the memory trace.

Grounded in sources. Answers are anchored to NICE, CKS, SmPC/eMC, SIGN, NHS content, and the exam question's own official explanation and reference — not generated from training patterns. When the tutor makes a clinical claim, there is a source behind it.

Per-exam calibration. The tutor speaks differently for MRCGP AKT (primary care reasoning, guideline application) than for GPhC CRA (pharmaceutical calculations, medicines safety) or DTM&H (tropical differentials, parasite-specific management) or Italian SSM (Italian curriculum, Italian language). The register, depth, clinical examples, and pedagogical approach adapt to the specific exam.

"Just explain it" override. For legitimate crunch moments — the night before the exam, a concept revised five times that just needs confirming — the override exists. It provides direct explanation without the Socratic exchange. It is not the default, by design, because the evidence says the default should be question-first.

The tutor sits inside the iatroX Q-bank, bound to the specific question you just answered. It has the question stem, your answer, the correct answer, the official explanation, and the relevant guideline context. It is not a general chatbot with a medical skin. It is a tutor built on the question you got wrong, grounded in the sources the exam expects you to know.

Try the Socratic Tutor inside the iatroX Q-bank →
