AI & Learning

How Should We Measure a Tutor?

A tutor that keeps students busy is not the same as a tutor that helps them learn. We argue for measuring AI tutors by learning gains and transfer — and against the engagement metrics that quietly reward the wrong thing.

The EuraStudy Team · 8 min read · D·04
Fig. 01 · A gain is a difference, not a level: the only honest claim a tutor can make is how far it moved a learner from where they started.
Abstract

Claims that an AI tutor "works" are routinely supported by the wrong evidence: sessions started, minutes logged, messages exchanged, lessons marked complete. These engagement signals are easy to collect and easy to inflate, and a system optimised against them learns to be sticky rather than effective. We argue that the load-bearing outcomes for a tutor are learning gains (measured against a baseline) and transfer (the ability to apply what was learned to genuinely new problems). We sketch the methodology a credible claim requires: pre/post designs with a comparison condition, delayed post-tests that probe retention, transfer items that share deep structure but not surface form, and explicit attention to construct validity. The same standards should discipline how we describe our own product.

It is easy to make a tutor look good. Count the sessions a student starts, the minutes they stay, the messages they send, the lessons they tick off as complete, and you can draw a chart that climbs reassuringly to the right. None of those numbers answers the only question that matters, which is whether the student can now do something they could not do before. A tutor is not a destination a learner visits; it is a function that should change them. If we cannot see the change, we have not measured the tutor at all — we have measured its capacity to hold attention.

That distinction is the whole argument of this piece. The headline outcomes for any tutor, human or machine, are learning gains and transfer. Everything else (engagement, satisfaction, time-on-task) is at best a leading indicator and at worst an actively misleading one. The history of one-to-one tutoring sets the bar high: Bloom's much-cited estimate put individual tutoring around two standard deviations above conventional classroom instruction [1], and while later work has been more sober about how often that ceiling is reached, the better intelligent tutoring systems do post real, replicated gains over business-as-usual [2, 9, 13]. Those gains were never demonstrated by counting minutes. They were demonstrated by testing what students knew, before and after.

A gain is a difference, not a level

The first discipline is conceptual. A score is a level; a gain is a difference. A student who arrives already able to integrate by parts and leaves still able to do so has learned nothing from us, however high their score, however happy they were. The quantity a tutor is responsible for is the change between a baseline and a later measurement — and a change cannot be inferred from a single snapshot.

This is why a credible claim begins with a pre-test. Measure the relevant competence before the intervention, deliver the teaching, then measure again. The arithmetic difference is the raw gain; normalised gain, the fraction of the available headroom actually closed, is usually the more honest figure, because closing the last ten points of a topic is harder than closing the first ten. None of this is exotic. It is the ordinary machinery of education research [2, 9], and it is conspicuously absent from most product claims.
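As a minimal sketch of the two quantities, the snippet below computes a raw gain and a normalised gain for a single student. The function names, the 100-point scale, and the example scores are assumptions made for illustration; the normalised form used here (gain divided by the headroom left above the pre-test score) is one common definition, not the only one.

```python
from typing import Optional


def raw_gain(pre: float, post: float) -> float:
    """Arithmetic difference between the post-test and pre-test scores."""
    return post - pre


def normalised_gain(pre: float, post: float,
                    max_score: float = 100.0) -> Optional[float]:
    """Fraction of the available headroom (max_score - pre) that was closed.

    Returns None when the student started at or above the ceiling,
    because there is no headroom against which to express a gain.
    """
    headroom = max_score - pre
    if headroom <= 0:
        return None
    return (post - pre) / headroom


# Invented scores for two hypothetical students on a 100-point test.
for pre, post in [(40.0, 70.0), (80.0, 90.0)]:
    print(pre, post, raw_gain(pre, post), normalised_gain(pre, post))
```

In this invented example the first student gains 30 raw points and the second only 10, yet each closes half of the headroom that was available to them.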

A pre/post pair on its own is still weak, because students improve for reasons that have nothing to do with us: they mature, they study elsewhere, they get a good night's sleep before the post-test. To attribute the gain to the tutor we need a comparison condition — another group, or the same students on a matched topic, who did not receive the intervention. Without a counterfactual, "they improved" is a fact about time passing, not about teaching.
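One simple way to express the counterfactual is to subtract the comparison group's mean gain from the tutored group's mean gain. The sketch below does only that; the function names and the scores are invented for illustration, and a real study would also need a sensible assignment procedure and an estimate of uncertainty, neither of which is shown here.

```python
from statistics import mean
from typing import Sequence


def mean_gain(pre: Sequence[float], post: Sequence[float]) -> float:
    """Mean of per-student (post - pre) differences for one group."""
    if len(pre) != len(post):
        raise ValueError("pre and post must list the same students")
    return mean(b - a for a, b in zip(pre, post))


def gain_over_comparison(tutor_pre: Sequence[float],
                         tutor_post: Sequence[float],
                         comp_pre: Sequence[float],
                         comp_post: Sequence[float]) -> float:
    """Tutored group's mean gain minus the comparison group's mean gain."""
    return mean_gain(tutor_pre, tutor_post) - mean_gain(comp_pre, comp_post)


# Invented scores: both groups improve, but only part of the tutored
# group's improvement can be credited to the tutor.
tutored = ([40, 50, 60], [60, 70, 80])     # mean gain 20
comparison = ([40, 50, 60], [50, 60, 70])  # mean gain 10
print(gain_over_comparison(*tutored, *comparison))
```

With these made-up numbers the tutored group improves by 20 points on average, but the comparison group improves by 10 without the tutor, so only the remaining 10 points are a candidate for attribution.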

The vanity-metric trap

Why does any of this need arguing? Because the easy metrics are not merely uninformative; they pull in the wrong direction. The moment a number becomes a target, it stops measuring what it once did: Goodhart's law, imported wholesale into edtech [8]. Optimise a tutor to maximise time-on-task and you will get a tutor that is reluctant to let a student leave: it pads explanations, withholds the clean summary, manufactures one more practice item. Optimise for messages exchanged and you reward a chatty system over a clear one. Each of these makes the dashboard greener and the learning worse.

Fig. 02 · Two outcomes, two costs. Optimising for the cheap one (time-on-task) silently trades away the dear one (durable, transferable understanding).

The trap is seductive because engagement is cheap to collect and learning is expensive to measure. Time-on-task arrives for free in the logs; a transfer test has to be written, piloted, and scored. So the metric that is easy crowds out the metric that is right, and a whole field quietly agrees to be graded on attendance. The corrective is not to abolish engagement metrics — a tutor nobody opens teaches nobody — but to demote them to diagnostics. Engagement is a precondition for learning, never evidence of it.

A tutor optimised for time-on-task learns to be hard to leave. A tutor optimised for learning gains learns to make itself unnecessary. Only one of those is on the student's side.

Transfer is the real exam

Even an honest pre/post gain can flatter a tutor, because there is a cheap way to produce one: teach to the test. Drill a student on the exact items they will be re-tested on and the post-test will rise without any understanding having formed. The score moves; the competence does not. The defence against this is transfer: testing on problems that share the deep structure of what was taught but not its surface form [4, 10].

Transfer comes in degrees. Near transfer asks the student to apply a method to a new instance of the same problem type: a different function to differentiate, a different data set to test. Far transfer asks them to recognise the method's relevance in a context that does not announce it, such as seeing a related-rates problem hiding inside a word problem about a draining tank. Far transfer is hard to teach and hard to measure, and honesty requires saying so; the literature is candid that far transfer is rare and easily overclaimed [4]. But near transfer is both achievable and testable, and a tutor that cannot produce even near transfer has taught a ritual, not a concept.

The practical instruction is simple: the post-test must never be the practice set. Items should be isomorphic (same underlying principle, changed cover story and surface features), so that a student who merely memorised the worked examples scores no better than chance on the structure they were supposed to learn. This is also where good feedback earns its keep: feedback that explains why a step works builds the schema that transfers, whereas feedback that only reveals the answer builds a lookup table that does not [3, 11].
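The rule that the post-test must never be the practice set can be checked mechanically if items carry a little metadata. The sketch below is hypothetical throughout: the `Item` fields, the `audit_post_test` function, and the sample items are assumptions for illustration and do not describe any real question-bank schema. It flags post-test items that reuse a practice item, repeat a practice item's cover story, or test a principle that was never taught.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Item:
    item_id: str
    principle: str    # deep structure the item is meant to exercise
    cover_story: str  # surface form the student actually sees


def audit_post_test(practice: Sequence[Item],
                    post_test: Sequence[Item]) -> List[Tuple[str, str]]:
    """Return (item_id, reason) for each post-test item that is not a
    clean isomorph of the practice material."""
    practice_ids = {i.item_id for i in practice}
    practice_surface = {(i.principle, i.cover_story) for i in practice}
    taught = {i.principle for i in practice}
    problems = []
    for item in post_test:
        if item.item_id in practice_ids:
            problems.append((item.item_id, "reused practice item"))
        elif (item.principle, item.cover_story) in practice_surface:
            problems.append((item.item_id, "same cover story as a practice item"))
        elif item.principle not in taught:
            problems.append((item.item_id, "principle was never taught"))
    return problems


# Invented items for illustration.
practice = [Item("p1", "related rates", "ladder sliding down a wall")]
post_test = [
    Item("p1", "related rates", "ladder sliding down a wall"),  # reused
    Item("t1", "related rates", "ladder sliding down a wall"),  # same surface
    Item("t2", "related rates", "water draining from a tank"),  # clean isomorph
    Item("t3", "integration by parts", "area under a curve"),   # untaught
]
print(audit_post_test(practice, post_test))
```

A check like this only compares labels. Whether two cover stories really differ, and whether two items really share a principle, remains a judgement for whoever writes and reviews the items.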

Retention: the test you give later

There is one more way to be fooled, and it is a temporal one. Performance measured immediately after teaching is systematically inflated, because the material is still sitting in working memory. The conditions that make learning feel fast and easy in the moment often make it fragile; conversely, the "desirable difficulties" that feel effortful tend to produce learning that lasts [12]. A tutor that maximises the immediate post-test may be optimising for exactly the fluency that decays by the weekend.

Fig. 03 · What a defensible learning claim has to carry: a baseline, a comparison, a transfer task, and a delay before the post-test.

The remedy is a delayed post-test: re-measure after a deliberate gap of days or weeks. This is also why two of the most robust findings in the science of learning belong in any serious evaluation. Retrieval practice (having the student recall, not merely re-read) strengthens durable memory [5]. Spacing that retrieval out over time strengthens it further [6]. A tutor that schedules effortful recall across sessions and is then validated against a delayed, transfer-laden test is making a claim worth believing. A tutor validated only against an immediate re-quiz is, at best, measuring how warm the memory still is.
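To make the gap between an immediate and a delayed post-test concrete, one illustrative summary is the share of the immediate gain that is still present after the delay. The function name, the formula, and the scores below are assumptions chosen for this sketch, not an established metric from the cited literature.

```python
from typing import Optional


def retained_fraction(pre: float, immediate: float,
                      delayed: float) -> Optional[float]:
    """Share of the immediate gain still present at the delayed post-test.

    Returns None when there was no immediate gain to retain.
    """
    immediate_gain = immediate - pre
    if immediate_gain <= 0:
        return None
    return (delayed - pre) / immediate_gain


# Invented scores: a 40-point gain right after teaching,
# of which 20 points survive a delay of a few weeks.
print(retained_fraction(pre=40.0, immediate=80.0, delayed=60.0))
```

In this made-up case half of the immediate gain has decayed by the delayed test, which an evaluation relying only on the immediate re-quiz would never see.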

Validity: are we measuring the right thing at all?

Underneath every one of these designs sits a question that is easy to skip and fatal to get wrong: does the test actually measure the competence we claim to teach? This is construct validity, and it is the hinge on which the whole evaluation turns [7]. A multiple-choice item that can be solved by eliminating absurd distractors measures test-wiseness, not mathematics. A coding task auto-graded only on output measures whether the program ran, not whether the student understood it.

Validity is not a property a test has once and forever; it is a property of the inference we draw from a score, and it has to be argued case by case [7]. For a tutor evaluation this means stating plainly what the post-test is evidence of, and being willing to find that a gain on the instrument is not a gain in the construct. It is the least glamorous part of measurement and the part most often waved through, which is precisely why it deserves to be named.

Fig. 04 · The measurement we want to make is small and old-fashioned: did the explanation leave the student able to do something new, unaided?

What this means for us

We hold EuraStudy to these standards in how we describe it, which mostly means describing it carefully. The platform covers four exams — Austria's Matura and Germany's Abitur are live, France's Baccalauréat and Spain's Selectividad are on the waitlist — and rests on a reviewed bank of roughly three thousand questions, a serif editorial design language, a figure engine for honest diagrams, and motion that respects a reader's reduced-motion preference. Those are facts about what the system is. They are not, and we will not present them as, evidence of learning gains.

The honest position is that the outcomes this piece argues for — measured gains, demonstrated transfer, retention across a delay — are claims we have to earn with the right studies, not assert from usage logs. The reviewed question bank gives us the raw material for isomorphic transfer items and delayed re-tests; the discipline this article describes is the standard we intend those studies to meet. Until then, the responsible thing is to say what we have built and stay quiet about what we have not yet proven. A tutor worth measuring should, in the end, be able to show that the student no longer needs it — and we would rather be measured by that than by how long anyone stayed.

References

  1. Bloom, B. S. (1984). The 2 Sigma Problem: The Search for Methods of Group Instruction as Effective as One-to-One Tutoring. Educational Researcher, 13(6), 4–16.
  2. VanLehn, K. (2011). The Relative Effectiveness of Human Tutoring, Intelligent Tutoring Systems, and Other Tutoring Systems. Educational Psychologist, 46(4), 197–221.
  3. Hattie, J., & Timperley, H. (2007). The Power of Feedback. Review of Educational Research, 77(1), 81–112.
  4. Barnett, S. M., & Ceci, S. J. (2002). When and Where Do We Apply What We Learn? A Taxonomy for Far Transfer. Psychological Bulletin, 128(4), 612–637.
  5. Roediger, H. L., & Karpicke, J. D. (2006). Test-Enhanced Learning: Taking Memory Tests Improves Long-Term Retention. Psychological Science, 17(3), 249–255.
  6. Cepeda, N. J., Pashler, H., Vul, E., Wixted, J. T., & Rohrer, D. (2006). Distributed Practice in Verbal Recall Tasks: A Review and Quantitative Synthesis. Psychological Bulletin, 132(3), 354–380.
  7. Messick, S. (1995). Validity of Psychological Assessment: Validation of Inferences from Persons’ Responses and Performances as Scientific Inquiry into Score Meaning. American Psychologist, 50(9), 741–749.
  8. Goodhart, C. A. E. (1984). Problems of Monetary Management: The UK Experience. In Monetary Theory and Practice (pp. 91–121). Macmillan.
  9. Kulik, J. A., & Fletcher, J. D. (2016). Effectiveness of Intelligent Tutoring Systems: A Meta-Analytic Review. Review of Educational Research, 86(1), 42–78.
  10. Koedinger, K. R., Corbett, A. T., & Perfetti, C. (2012). The Knowledge-Learning-Instruction Framework: Bridging the Science-Practice Chasm to Enhance Robust Student Learning. Cognitive Science, 36(5), 757–798.
  11. Shute, V. J. (2008). Focus on Formative Feedback. Review of Educational Research, 78(1), 153–189.
  12. Bjork, R. A., & Bjork, E. L. (2011). Making Things Hard on Yourself, but in a Good Way: Creating Desirable Difficulties to Enhance Learning. In Psychology and the Real World (pp. 56–64). Worth.
  13. du Boulay, B. (2016). Recent Meta-Reviews and Meta-Analyses of AIED Systems. International Journal of Artificial Intelligence in Education, 26(1), 536–537.
