Hourly · 2026-07-05 20:00 UTC

Textbook Compliance Was 10%. An AI Tutor Hit 90% — and Test Scores Rose With It.

A Dartmouth statistics course replaced optional readings with an AI-graded quiz platform called Phosphor. Voluntary engagement jumped from roughly 10% to 90%, and heavy users gained up to 1.30 standard deviations on the final exam.


When Dartmouth College deployed an optional AI tutoring platform called Phosphor in its Introductory Statistics course, the results upended expectations. Textbook reading compliance had been estimated at 10–15%; by contrast, 90.2% of the 151 enrolled students voluntarily adopted the platform.

The platform, which mixed AI-graded written-response quizzes with multiple-choice questions and was evaluated at this week's Intelligent Textbooks 2026 workshop, delivered measurable gains. Students who fully engaged with all 24 lessons and three cumulative Module Reviews scored 14.7 points higher on the final exam, an effect size of 1.30 standard deviations. After controlling for prior midterm performance, the advantage was still 0.71 SD, or roughly 8 points on a 100-point scale.
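The point and standard-deviation figures above can be cross-checked with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes that both effect sizes are standardized against the same final-exam standard deviation, which the article does not state.

```python
def implied_sd(point_gap: float, effect_size: float) -> float:
    """Exam-score standard deviation implied by a raw gap and its effect size."""
    return point_gap / effect_size


def points_from_effect(effect_size: float, sd: float) -> float:
    """Convert a standardized effect size back into exam points."""
    return effect_size * sd


# Figures reported in the article.
raw_gap_points = 14.7   # unadjusted final-exam advantage, in points
raw_effect = 1.30       # the same advantage, in standard deviations
adjusted_effect = 0.71  # advantage after controlling for midterm scores

# Assumption: one shared standard deviation underlies both effect sizes.
sd = implied_sd(raw_gap_points, raw_effect)
adjusted_points = points_from_effect(adjusted_effect, sd)

print(f"implied SD: {sd:.1f} points")                     # implied SD: 11.3 points
print(f"adjusted advantage: {adjusted_points:.1f} points")  # adjusted advantage: 8.0 points
```

Under that assumption, the implied standard deviation is about 11.3 points, which matches the article's "roughly 8 points" for the adjusted 0.71 SD advantage.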

But the learning mechanism was specific. When the course temporarily switched to multiple-choice-only quizzes for Module 2, the dose-response relationship vanished: completing more lessons predicted no additional gain. Constructed-response questions, graded by Claude Sonnet 4.6 against instructor rubrics, appear to have been the active ingredient. Students who passed all three cumulative Module Reviews, which required written answers, saw the single largest effect: 7.1 points on the final exam (d = 0.66).

A built-in retrieval-augmented generation (RAG) chat assistant was almost entirely ignored: it received 72 queries over the semester, and only 14 students used it more than once. Students told researchers that general-purpose LLMs were faster and more capable for their questions, and that the course content itself was sufficient without a separate chat interface.

The study is non-randomized, and author Jonah Bard flags self-selection as the central threat to causal interpretation. But the natural experiment created by the temporary switch to multiple-choice-only quizzes within the same course provides a cleaner signal that the format of assessment, not just the act of engaging, drove outcomes.

Bard positions the findings against prior research showing that unrestricted GPT-4 access, offered without guardrails, reduced student test performance by 17% once the tool was removed. Phosphor suggests a different path: embed AI inside structured, rubric-graded formative assessment, and students will not only show up but also learn.

Sources: Phosphor: Balancing Efficacy and Engagement in Interactive Texts (Bard, 2026)