AI in Education · 9 min read · January 23, 2026

AI Tutoring: What the Research Actually Says

Headlines claim AI tutoring matches human instruction, but the Harvard study used custom AI with elite students compared against an active-learning class—not ChatGPT vs. human tutors. Here's what research actually shows and what questions to ask vendors.

The headline vs. reality gap: When students used standard ChatGPT for math practice, they scored 17% worse on tests than students who practiced without AI. But a carefully designed AI tutor helped Harvard students learn twice as much in less time. The difference? Pedagogical design—not AI sophistication.

AI tutoring headlines are misleading. The widely cited study by Harvard researchers, published in Scientific Reports and reported as showing AI tutoring "as effective as human instruction," used custom-designed AI (not ChatGPT) with Harvard undergraduates (not K-12 students), compared against active learning (not human tutors). A systematic review of 28 K-12 studies concludes AI tutoring systems "should be considered complementary tools rather than replacements for educators." Here's what research actually shows—and what questions to ask before buying.

What Did the Harvard Researchers' Study Actually Find?

In June 2025, researchers from Harvard's physics department published a randomized controlled trial in Scientific Reports (a Nature journal) showing AI tutoring outperformed traditional active learning in physics courses, with effect sizes between 0.73 and 1.3 standard deviations. The study data is publicly available on GitHub, which is a positive sign of research transparency. But three critical details change the interpretation of the headline-grabbing results.
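For readers less familiar with effect sizes: the figures reported are standardized mean differences (Cohen's d), i.e., the gap between group means divided by the pooled standard deviation. The notation below is generic and assumes the standard formulation; the paper's exact estimator may differ slightly:

$$ d = \frac{\bar{x}_{\text{AI}} - \bar{x}_{\text{active}}}{s_{\text{pooled}}}, \qquad s_{\text{pooled}} = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}} $$

On that scale, 0.73 to 1.3 means the AI-tutored group's average outcome sat roughly three-quarters to well over one full standard deviation above the active-learning average, a very large effect by education-research standards.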

| What headlines imply | What was actually tested |
| --- | --- |
| ChatGPT-style AI: a generic chatbot anyone can use | Custom "PS2 Pal" system: scaffolded questioning, targeted feedback, pre-written solutions, structured sequences |
| Typical students: K-12 or the general population | 194 Harvard undergraduates: highly motivated, academically strong, self-selected into a physics course |
| Human tutors: one-on-one expert instruction | Active-learning classroom: group instruction with peer work, not one-on-one tutoring |

The researchers themselves note the comparison was between "an AI tutor" and "an active learning class"—not between AI and human tutoring. The pedagogical design of the AI mattered enormously. As the study authors state, "AI chatbots are generally designed to be helpful, not to promote learning."

⚠️ The headline trap

Vendors cite this study to sell products that bear little resemblance to what was tested. Ask: "Is your system designed like the one in the Harvard study?" The answer is almost always no.

What Happens When Students Use Regular ChatGPT?

A University of Pennsylvania study published in PNAS (June 2025) tested nearly 1,000 Turkish high school students using AI for math practice. The results were sobering.

Students using standard ChatGPT solved 48% more practice problems correctly—but then scored 17% worse on the subsequent test than students who practiced without any AI assistance. The researchers titled their paper "Generative AI Can Harm Learning" and found students were using the chatbot as a "crutch," simply asking for answers rather than working through problems.

Even a modified "GPT Tutor" with guardrails designed to scaffold learning (giving hints rather than answers) produced no significant improvement over practicing alone. The researchers concluded that without careful design, "generative AI can substantially inhibit learning."

This finding matters because most commercial "AI tutoring" products are closer to ChatGPT than to Harvard's carefully engineered PS2 Pal system.

What Do Systematic Reviews Across Many Studies Show?

When researchers analyze many studies rather than single experiments, the picture is more nuanced.

A May 2025 systematic review examining 28 K-12 studies with 4,597 students found that effects of intelligent tutoring systems on learning are "generally positive but are found to be mitigated when compared to non-intelligent tutoring systems." AI tutoring works—but often not much better than well-designed non-AI alternatives.

The same review concluded that intelligent tutoring systems "should be considered complementary tools rather than replacements for educators." Most effective implementations combine AI with human guidance.

Notably, "none of the reviewed studies seriously considered AI ethics." Data privacy, algorithmic bias, and appropriate use questions are largely ignored in effectiveness research.

What Does Work? Evidence from ASSISTments

ASSISTments, a free math homework platform, provides one of the strongest evidence bases for AI-assisted learning in K-12. Two large randomized controlled trials earned it ESSA's highest rating, Tier 1 ("strong evidence").

A North Carolina study with nearly 6,000 seventh-graders found students using ASSISTments scored significantly higher on state math tests with an effect size of 0.10—equivalent to about 31% of a typical year's learning gains. Effects persisted one year after the intervention ended.
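One way to sanity-check that "31%" conversion: published norm tables put typical annual growth in middle school math at roughly 0.32 standard deviations, and the study's 0.10 effect divided by that benchmark lands at about 31%. The 0.32 benchmark is an assumption here, in line with common grade 7-8 math norms, not a number stated in the study:

$$ \frac{0.10\ \text{SD (intervention effect)}}{\approx 0.32\ \text{SD (typical annual growth, assumed benchmark)}} \approx 0.31 $$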

What makes ASSISTments different from generic AI tutoring? It provides immediate feedback on homework, gives teachers real-time reports on student performance, and integrates directly with classroom instruction. The AI assists teachers rather than replacing them.

When Does AI Tutoring Help?

Practice & Reinforcement: AI provides unlimited practice with immediate feedback, something that's hard for teachers to deliver at scale.
Standardized Content: math procedures, grammar rules, and basic facts have clear right/wrong answers that AI handles well.
Teacher Integration: AI handles drill while teachers use the resulting data to inform instruction and handle motivation and relationships.
Extended Time: AI doesn't judge, doesn't tire, and doesn't make struggling students feel like a burden.

The pattern: AI tutoring works best as a supplement, not a replacement—particularly for practice, reinforcement, and personalized pacing.

When Does AI Tutoring Fail—Or Cause Harm?

Without Human Guidance: students don't know what to work on, how to use the system, or when they're ready to move on.
For Developing Understanding: AI helps students get correct answers; it's less clear that it helps them understand why.
As a Replacement: AI tutoring alone rarely works; the UPenn study showed it can actually harm learning.
Open-Ended Work: AI struggles to coach reasoning that lacks clear right/wrong answers.

A student might improve test scores with AI tutoring while developing only shallow understanding. Performance gains don't always equal learning gains.

Before: "Let's adopt AI tutoring to solve our differentiation challenges."
After: "Let's use AI tutoring to provide additional practice time, while teachers focus on conceptual instruction and relationship-building."

What Questions Should You Ask AI Tutoring Vendors?

When vendors make effectiveness claims, demand specifics:

"What does your evidence show?" Require specific studies with specific populations. "Research shows..." isn't good enough. Which research? What populations? What comparison conditions? The Wharton researchers found that students overestimate their learning when using AI—so don't trust self-reported satisfaction.

"How was the AI designed pedagogically?" A chatbot isn't a tutor. What instructional principles are built in? How does it scaffold learning? How does it handle misconceptions? The Harvard PS2 Pal included pre-written expert solutions to avoid hallucinations—does the vendor's product?

"What teacher integration do you recommend?" If the answer is "none needed," be skeptical. The best evidence from ESSA-rated programs supports AI tutoring with teacher involvement and dashboard reporting.

"What data do you collect, and how is it protected?" AI tutoring systems collect extensive data about student performance, behavior, and patterns. Understand what's collected and who has access.

"What happens when students get stuck or frustrated?" Good systems detect and respond to student affect. Cheap ones just keep presenting problems regardless of student state.

💡 The pilot mindset

The best way to evaluate AI tutoring isn't reading vendor materials—it's running a small pilot with careful measurement. Let teachers try the tool with a few students and gather real data before scaling.
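If you want to make that measurement concrete, here is a minimal sketch of the comparison a pilot could run, assuming you have scores on an assessment not administered by the tutoring platform for a pilot group and a comparison group. The group names and numbers are illustrative, not from any study:

```python
# Minimal pilot-analysis sketch: compare scores on an EXTERNAL assessment
# (not administered by the AI platform) between pilot and comparison students.
# All data below is illustrative.
import statistics
from math import sqrt

def cohens_d(treatment: list[float], control: list[float]) -> float:
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(treatment), len(control)
    s1, s2 = statistics.stdev(treatment), statistics.stdev(control)
    pooled = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (statistics.mean(treatment) - statistics.mean(control)) / pooled

# Hypothetical post-test scores from a state-aligned assessment
pilot_scores = [72, 68, 81, 75, 69, 77, 73, 80]        # used the AI tutor
comparison_scores = [70, 66, 74, 71, 68, 72, 69, 75]   # business as usual

d = cohens_d(pilot_scores, comparison_scores)
print(f"Effect size (Cohen's d): {d:.2f}")
```

An effect near the 0.10-0.18 range seen in real-world K-12 studies would be a realistic, encouraging pilot result; a small informal pilot that appears to match the Harvard study's 0.73-1.3 should prompt questions about the measurement, not celebration. With samples this small, treat the number as directional, not conclusive.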

What's the Honest Bottom Line?

AI tutoring can work—not as magic, not as a teacher replacement, but as a supplement with genuine value for practice, reinforcement, and personalized pacing.

If you're considering AI tutoring tools:

  1. Look for systems designed with learning principles built in—not just chatbots with education branding
  2. Demand evidence from populations similar to yours—Harvard undergrads aren't typical K-12 students
  3. Require clear integration with teacher instruction—the evidence for standalone AI tutoring is weak
  4. Be skeptical of vendor claims—even well-designed AI tutors show modest effect sizes (roughly 0.10-0.18) in real-world K-12 settings
  5. Pilot before you scale—and measure outcomes not tied to the tutoring platform itself

The pedagogical design matters more than the AI sophistication. A well-designed non-AI system often outperforms a poorly designed AI one—and as the UPenn study showed, poorly designed AI can actually harm learning.


Frequently Asked Questions

Can ChatGPT work as a tutor?

Generic ChatGPT is designed to be helpful, not pedagogically effective. It gives answers rather than scaffolding learning. The UPenn research showed students using ChatGPT for math practice scored 17% worse on tests—they were using it as a "crutch" rather than engaging in productive struggle. Purpose-built AI tutoring systems with learning principles built in can outperform generic chatbots, but most commercial "AI tutoring" products are closer to ChatGPT than to what research studies actually tested.

What effect sizes should I look for in vendor research?

An effect size of 0.4 is often cited as the threshold for an educationally meaningful effect. The Harvard study showed 0.73-1.3, but under very specific conditions with elite students. Real-world K-12 implementations like ASSISTments show more modest but still meaningful effect sizes around 0.10-0.18. Be skeptical of any vendor claiming Harvard-level results with general K-12 populations using generic AI.

Should we use AI tutoring for students who are behind?

Potentially yes—for targeted practice on specific skills. But struggling students often need the relationship and motivation that human teachers provide. AI tutoring as catch-up support works better than AI tutoring as remediation replacement. The systematic review found AI tutoring most effective when combined with teacher involvement.

What about AI tutoring for advanced students?

Advanced students often need challenge and creativity that AI handles poorly. AI tutoring can provide extra practice, but enrichment usually requires human guidance for open-ended exploration.

How do we know if AI tutoring is actually helping our students learn?

Don't rely on the AI's own metrics. Compare performance on assessments not tied to the tutoring system. Look for transfer—can students apply learning in new contexts? Performance gains within the AI system don't always translate to genuine understanding. The UPenn study found students performed well during AI-assisted practice but worse on independent tests.


References

  1. AI Tutoring Outperforms Active Learning - Kestin et al., Scientific Reports (Nature), June 2025 | Study Data on GitHub
  2. Generative AI Without Guardrails Can Harm Learning - Bastani et al., PNAS, June 2025
  3. Systematic Review of AI-Driven ITS in K-12 - Létourneau et al., npj Science of Learning, May 2025
  4. ASSISTments ESSA Evidence - Evidence for ESSA
  5. Without Guardrails, Generative AI Can Harm Education - Knowledge at Wharton, August 2024
  6. Kids Who Use ChatGPT Do Worse on Tests - Hechinger Report, September 2024
  7. AI Tutor Helped Harvard Students Learn More Physics - Hechinger Report, September 2024
  8. The Algorithmic Turn: Emerging Evidence on AI Tutoring - Carl Hendrick, November 2025
  9. Supporting Middle School Math with Technology-Based Intervention - WestEd/ASSISTments Study, 2024

Benedict Rinne, M.Ed.

Founder of KAIAK. Helping international school leaders simplify operations with AI. Connect on LinkedIn

Want help building systems like this?

I help school leaders automate the chaos and get their time back.