What the Harvard AI Tutor Trial Really Showed (and What It Means for Question Banks)

A widely shared 2025 Harvard study found that a purpose-built AI tutor produced more than double the learning gains of an excellent active-learning class, in less time and with higher engagement. It is a genuinely striking result, but it is easy to misread. The gains did not come from a powerful model answering questions; they came from specific pedagogy deliberately built into the tutor. That distinction is the whole lesson, and it is the honest frame for thinking about AI in question banks. Here is what the trial actually showed.

Key takeaways

  • A Harvard RCT found a custom AI tutor more than doubled learning gains versus an excellent active-learning class.
  • Students also learned in less time, with higher engagement and motivation.
  • The gains came from built-in pedagogy, not raw model capability.
  • Unguided chatbot use has shown null or negative effects elsewhere; scaffolding is the active ingredient.
  • A tutor that makes you reason is pedagogically different from a copilot that explains on demand.

The study

The trial, by Kestin and colleagues, was published in Scientific Reports in June 2025. It ran in Harvard's Physical Sciences 2, the university's largest introductory physics class, in autumn 2023, with 194 students in a randomised crossover design. Each student learned one topic through an in-class active-learning session led by experienced instructors, and another through a custom AI tutor at home, so each student served as their own control across two consecutive weeks covering surface tension and fluid flow. This is a well-designed study, which is what makes its result worth taking seriously.
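To make "each student served as their own control" concrete, here is a minimal Python sketch of a two-period crossover allocation. It is an illustration only: the function name, the seed, and the even split into two arms are assumptions, and the paper's actual randomisation procedure is not reproduced here.

```python
import random
from typing import Dict, Iterable


def assign_crossover(student_ids: Iterable[int], seed: int = 0) -> Dict[int, Dict[str, str]]:
    """Give every student both conditions, in one of two opposite orders.

    Hypothetical helper: the even split and the seeding are assumptions,
    not details taken from the study.
    """
    rng = random.Random(seed)          # local RNG so the allocation is repeatable
    ids = list(student_ids)
    rng.shuffle(ids)
    half = len(ids) // 2

    schedule = {}
    for position, sid in enumerate(ids):
        if position < half:
            # Arm A: AI tutor for the first topic, classroom for the second
            schedule[sid] = {"surface tension": "AI tutor", "fluid flow": "in-class"}
        else:
            # Arm B: the mirror image of arm A
            schedule[sid] = {"surface tension": "in-class", "fluid flow": "AI tutor"}
    return schedule


if __name__ == "__main__":
    plan = assign_crossover(range(194), seed=1)
    print(plan[0])
```

Because every student appears in both conditions, differences between students (prior knowledge, motivation) affect both sides of the comparison rather than only one.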

The result

The finding was large. Students using the AI tutor learned more than twice as much as those in the active-learning class, and did so in less time, with a median of around 49 minutes against 75 minutes, while also reporting higher engagement and motivation. The comparison matters: this was not AI against a bad lecture, but AI against one of the best-evidenced classroom methods there is, taught in a class the researchers themselves described as already very well taught. Beating that bar by a factor of two is a serious result.

Why it worked

Here is the part that gets lost in the headline. The tutor, nicknamed PS2 Pal, was not just a chatbot with a good model behind it. It was instructed to reveal only one step at a time and never give away the full solution in a single message, to prompt students to attempt the problem themselves before revealing anything, and to keep responses brief to avoid cognitive overload. It was also supplied with correct solutions in advance to reduce the risk of hallucination. In other words, it was engineered to reproduce what a skilled tutor does: withhold the answer, make the learner work, and build understanding in steps. The pedagogy was the product.
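The constraints described above can be sketched as a small piece of control logic that sits around whatever model generates the wording. This is a toy Python illustration, not the PS2 Pal implementation: the class name, method names, and the algebra example are all hypothetical, and the brevity constraint is not modelled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SteppedTutor:
    """Toy step-gated tutor (hypothetical; not the study's actual code)."""

    question: str
    solution_steps: List[str]   # vetted steps supplied in advance, never generated
    revealed: int = 0           # number of steps shown so far
    attempted: bool = False     # has the learner tried since the last reveal?

    def submit_attempt(self, text: str) -> str:
        """Record a learner attempt; blank input does not count."""
        if not text.strip():
            return "Have a go first: what do you think the next step is?"
        self.attempted = True
        return "Thanks. Ask for the next step when you are ready."

    def next_step(self) -> str:
        """Reveal exactly one vetted step, and only after an attempt."""
        if self.revealed >= len(self.solution_steps):
            return "You have seen every step. Try the whole problem again unaided."
        if not self.attempted:
            return "Have a go first: what do you think the next step is?"
        step = self.solution_steps[self.revealed]
        self.revealed += 1
        self.attempted = False  # a fresh attempt is required before the next reveal
        return f"Step {self.revealed}: {step}"


if __name__ == "__main__":
    tutor = SteppedTutor(
        question="Solve 2x + 6 = 10",
        solution_steps=[
            "Subtract 6 from both sides: 2x = 4.",
            "Divide both sides by 2: x = 2.",
        ],
    )
    print(tutor.next_step())                  # refused: no attempt yet
    print(tutor.submit_attempt("subtract 6?"))
    print(tutor.next_step())                  # first step only
```

The point of the sketch is that the gating lives outside the model: the full solution is never available in a single call, and a reveal is impossible until the learner has typed something.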

The honest caveats

It is important not to overstate this. The study involved fewer than 200 introductory physics undergraduates at Harvard, across just two topics, with short-term tests, so the findings may not transfer wholesale to medical exam preparation, and the authors are appropriately measured about scope. It is strong, directional evidence that well-designed AI tutoring can work, not proof that any AI tutor will, or that it will replace teachers. Treat it as a compelling signal that pedagogy-plus-AI can be powerful, not as a settled result for medicine.

The design lesson

The generalisable takeaway is about design, not capability. The same period has produced cautionary evidence in the other direction: at least one well-designed study found that unguided use of a general chatbot for maths actually harmed achievement, because students used it to complete work without thinking. So the active ingredient is not the model; it is the scaffolding that forces engagement. An AI that hands over answers can reduce learning, while an AI that makes you reason can more than double it, using similar underlying technology. The design is what separates the two.

The landscape in 2026

This maps directly onto how AI now appears in question banks. Most banks have added AI copilots that explain on demand, and some are genuinely well built. AMBOSS AI Mode Learning, launched in February 2026, is a strong and well-integrated example that bundles explanations, Anki cards, and question sessions in one flow; we cover it in "AMBOSS AI Mode, three months on". Question-first Socratic tutoring is a different pedagogy from an explain-on-demand copilot, and the Harvard design principles line up with it point for point: withhold the full answer, prompt the learner to attempt first, reveal one step at a time, and ground responses in correct source material. iatroX's Socratic tutor is built on that question-first approach, with free sample questions to try at iatroX. The claim here is a difference in pedagogy, not a claim that one product is superior. For the underlying evidence on retrieval, see "Does spaced repetition actually work".

Frequently asked questions

What did the Harvard AI tutor study find?

It found that a custom-built AI tutor produced more than double the learning gains of an excellent active-learning class, in less time and with higher engagement, in a randomised crossover trial of 194 physics students published in 2025.

Does this prove AI tutors are better than teachers?

No. It is a small, short-term study in introductory physics at one university, so it is strong directional evidence, not proof, and it does not show AI replacing teachers across contexts, including medicine.

Why did the AI tutor work so well?

Because of its design, not the model. It was built to reveal one step at a time, withhold full solutions, prompt students to try first, and use supplied correct answers to avoid hallucination, reproducing what a skilled tutor does.

Is using ChatGPT to study the same thing?

Not necessarily. Unguided chatbot use has shown null or negative effects in some studies, because it lets learners skip the thinking. The benefit comes from scaffolding that forces engagement, not from raw chatbot access.

How does this relate to question-bank AI?

Most banks add copilots that explain on demand, some very well. Question-first Socratic tutoring is a different pedagogy that matches the Harvard design principles, making the learner reason rather than handing over answers.
