How a Machine Reads What You Know

Every adaptive tutor rests on a quiet act of inference: guessing the knowledge it cannot see from the answers it can. We trace that idea from Bayesian Knowledge Tracing to its deep-learning successors — and the honest places where the deeper model is not the better one.

The EuraStudy Team · 9 min read · D·01
Fig. 01 · Bayesian Knowledge Tracing as it really is: a two-state hidden Markov model. The state — known or not-known — is never observed; only the answer is, and slip and guess stand between the two.
Abstract

An adaptive learning system never observes what a student knows; it observes only what they answer. Knowledge tracing is the family of models that bridges that gap: it infers a latent, evolving competence from a stream of correct and incorrect responses, so the system can decide what to practise next and when a skill has been mastered. We trace the idea from Corbett and Anderson’s Bayesian Knowledge Tracing, a two-state hidden Markov model with four interpretable parameters, to Piech et al.’s Deep Knowledge Tracing, which replaced the per-skill model with a recurrent network and reported large gains in next-answer prediction. We then dwell on the less-cited corrective literature: that much of the deep model’s advantage narrows once the classical model is given the same information, and that a deep model can report a knowledge state a tutor should not act on, one that is non-monotonic and internally inconsistent. The argument is that for a tutor, the right model is not the best next-answer predictor but the one whose estimate of knowledge can be trusted and acted upon.

A tutor never sees what a student knows. Knowledge is not a quantity you can read off a face or a screen; it is a hidden state, and the only window onto it is behaviour — a sequence of answers, some right, some wrong, arriving one at a time. Everything an adaptive system does downstream of that window — choosing the next problem, deciding a topic is finished, flagging a gap before an exam — rests on a single quiet act of inference: estimating the knowledge it cannot see from the answers it can. The name for that inference is knowledge tracing, and it is the unglamorous machinery underneath every claim that a tutor "knows where you are."

This piece is about how that machine works, and about a disagreement at its heart. For two decades the standard answer was a small, transparent statistical model. In 2015 a neural network beat it on the one number the field had agreed to optimise, and a great deal of work pivoted to deep learning. The interesting part of the story is what happened next: the careful papers that asked whether the deep model was actually better, or merely better at the metric, and found the honest answer to be "it depends." For a tutor, the distinction is not academic.

A two-state guess

The model that defined the field is Bayesian Knowledge Tracing (BKT), introduced by Corbett and Anderson in 1995 for the Cognitive Tutors [1]. Its premise is almost austere. For a single skill, the learner is in one of two latent states (they either know it or they do not), and the model built around that state is a hidden Markov model: the state is unobserved, evolves over time, and is glimpsed only through the answers it emits [12]. Four numbers govern the whole thing, and each one means something a teacher would recognise:

  • p(L₀), the prior — the probability the skill is already known before the first question.
  • p(T), the transit or learning rate — the probability that, at any practice opportunity, an unknown skill becomes known. Crucially, classic BKT assumes no forgetting: once learned, a skill stays learned.
  • p(S), the slip — the probability of answering incorrectly despite knowing the skill.
  • p(G), the guess — the probability of answering correctly despite not knowing it.

The first two describe how knowledge changes; the last two describe how knowledge becomes an answer. Together they let the model do the only thing that matters: after every response, update its belief about the hidden state. The update is Bayes' rule, run forward over the sequence [1]. A correct answer raises the estimated probability of mastery, but not to certainty, because the student might have guessed. A wrong answer lowers it, but not to zero, because the student might have slipped. Then the learning rate nudges the estimate upward again, modelling the practice itself.
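Written out, the update described above takes the following form. This is a sketch in the article's own symbols; the notation P(L_t) for the mastery estimate before the t-th answer is an assumption introduced here for illustration.

```latex
\begin{aligned}
P(L_t \mid \text{correct}) &= \frac{P(L_t)\,\bigl(1 - p(S)\bigr)}{P(L_t)\,\bigl(1 - p(S)\bigr) + \bigl(1 - P(L_t)\bigr)\,p(G)} \\[4pt]
P(L_t \mid \text{wrong}) &= \frac{P(L_t)\,p(S)}{P(L_t)\,p(S) + \bigl(1 - P(L_t)\bigr)\,\bigl(1 - p(G)\bigr)} \\[4pt]
P(L_{t+1}) &= P(L_t \mid \text{answer}_t) + \bigl(1 - P(L_t \mid \text{answer}_t)\bigr)\,p(T)
\end{aligned}
```

The first two lines are Bayes' rule applied to the observed answer, with P(L_1) = p(L₀); the third is the learning step, which can only add probability because classic BKT has no forgetting term.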

Fig. 02 · The same four parameters, run forward. After each answer the posterior probability of mastery is updated by Bayes’ rule; a correct answer raises it, a wrong one lowers it, and the learning rate nudges it up even after a slip.

What you get is a curve, not a verdict: a posterior probability of mastery that climbs with correct answers, dips with wrong ones, and is never quite 0 or 1. That curve is the tutor's read of the learner. It is also why a single answer is such weak evidence, and why the model refuses to treat it as decisive.
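As a minimal sketch of how such a curve could be produced, the snippet below runs the update forward over a short answer sequence. The class and function names (`BKTParams`, `bkt_step`, `trace_mastery`) and the parameter values are illustrative assumptions, not values fitted to any data.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class BKTParams:
    p_init: float     # p(L0): prior probability the skill is already known
    p_transit: float  # p(T): chance an unknown skill becomes known at one opportunity
    p_slip: float     # p(S): chance of a wrong answer despite knowing the skill
    p_guess: float    # p(G): chance of a right answer despite not knowing it


def bkt_step(p_known: float, correct: bool, params: BKTParams) -> float:
    """Bayes update on one observed answer, followed by the learning step."""
    if correct:
        known = p_known * (1.0 - params.p_slip)
        unknown = (1.0 - p_known) * params.p_guess
    else:
        known = p_known * params.p_slip
        unknown = (1.0 - p_known) * (1.0 - params.p_guess)
    posterior = known / (known + unknown)
    # Classic BKT has no forgetting, so learning can only add probability.
    return posterior + (1.0 - posterior) * params.p_transit


def trace_mastery(answers: Iterable[bool], params: BKTParams) -> List[float]:
    """Return the mastery estimate after each answer in the sequence."""
    p_known = params.p_init
    curve = []
    for correct in answers:
        p_known = bkt_step(p_known, correct, params)
        curve.append(p_known)
    return curve


# Illustrative parameters only (assumed, not fitted).
EXAMPLE = BKTParams(p_init=0.3, p_transit=0.1, p_slip=0.1, p_guess=0.2)
curve = trace_mastery([True, True, False, True], EXAMPLE)
```

With these assumed values the estimate climbs over the two correct answers, dips after the wrong one, and recovers on the next correct answer, without ever reaching 0 or 1.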

Slip and guess

The two parameters that make BKT honest are also the two that make it subtle. Without slip and guess, an answer would be a transparent readout of knowledge: right means knows, wrong means doesn't. Real students break that mapping in both directions. A student who has genuinely mastered integration by parts will still drop a sign under time pressure — that is a slip. A student who has mastered nothing can still eliminate three absurd multiple-choice options and land on the fourth — that is a guess.

Fig. 03 · Why a single answer is weak evidence. Slip lets a student who knows answer wrong; guess lets a student who does not answer right. The model only ever sees the columns — the answer — and has to infer the rows: the hidden state behind it.

The model only ever observes the answer; it has to infer the state behind it. Slip and guess are the admission that the inference is uncertain, and they are what stops one stray wrong answer from wiping out a hard-won mastery estimate — a behaviour any teacher would call sensible and any naïve scoring scheme gets wrong.
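A worked example makes the restraint concrete. The numbers are assumptions chosen for illustration: a slip of 0.10, a guess of 0.20, and the Bayes step only (before any learning is added).

```latex
\begin{aligned}
\text{Estimate } 0.95 \text{, then a wrong answer:}\quad
P(L \mid \text{wrong}) &= \frac{0.95 \times 0.10}{0.95 \times 0.10 + 0.05 \times 0.80}
 = \frac{0.095}{0.135} \approx 0.70 \\[4pt]
\text{Estimate } 0.10 \text{, then a right answer:}\quad
P(L \mid \text{correct}) &= \frac{0.10 \times 0.90}{0.10 \times 0.90 + 0.90 \times 0.20}
 = \frac{0.09}{0.27} \approx 0.33
\end{aligned}
```

One stray error costs the strong student about a quarter of their estimate rather than all of it, and one lucky answer lifts the weak student to one-third rather than to mastery.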

But this is also where BKT can quietly fail. Fit those four parameters carelessly to data and they can drift into values that are statistically convenient and pedagogically absurd: a "slip" probability above one-half, or a "guess" so high that getting the answer right counts as evidence against knowing (which happens once guess exceeds one minus slip). Baker, Corbett and Aleven named this model degeneracy and showed it is common, recommending that slip and guess be bounded (typically below 0.5) and estimated with care rather than left to a blind fit [5]. A model whose parameters have stopped meaning what they say is no longer interpretable, and interpretability is most of why a tutor uses BKT at all.
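A fitted model can be screened for this failure with a few comparisons. The sketch below is a hypothetical helper, not a procedure taken from Baker, Corbett and Aleven; the 0.5 bound comes from the text above, and the third check follows from Bayes' rule, since a correct answer raises the estimate only when 1 − p(S) exceeds p(G).

```python
from typing import List


def degeneracy_problems(p_slip: float, p_guess: float, bound: float = 0.5) -> List[str]:
    """List the ways a fitted slip/guess pair has stopped meaning what it says."""
    problems = []
    if p_slip >= bound:
        problems.append("slip is at or above the bound: knowing the skill mostly yields wrong answers")
    if p_guess >= bound:
        problems.append("guess is at or above the bound: not knowing the skill mostly yields right answers")
    if p_guess >= 1.0 - p_slip:
        # With guess >= 1 - slip, a correct answer is no more likely from a
        # student who knows the skill than from one who does not.
        problems.append("a correct answer no longer raises the mastery estimate")
    return problems


print(degeneracy_problems(p_slip=0.1, p_guess=0.2))   # a sensible pair: no problems reported
print(degeneracy_problems(p_slip=0.6, p_guess=0.45))  # a degenerate pair
```

Bounding both parameters below 0.5 makes the third condition impossible, which is one way to read the recommendation.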

A wrong answer is not a verdict and a right answer is not a certificate. The whole apparatus of slip and guess exists to keep a tutor from over-reading a single response — the same restraint a good human tutor shows by instinct.

When to stop teaching

A knowledge estimate is only useful if it licenses a decision, and BKT's headline decision is when a skill has been learned well enough to move on. The Cognitive Tutors set a mastery threshold: keep giving a student practice on a skill until the model's estimated probability of mastery crosses 0.95, then stop [6]. This is mastery learning [11] made operational: every student practises a skill for as long as they, individually, need to, and no longer.

The threshold is where the inference earns its keep. Set it too low and the tutor declares victory over a fluke run of correct answers; set it too high and it drills a student who has plainly understood. The number 0.95 is a deliberate, conservative choice, and the fact that it is a single legible knob — one you can defend, audit, and explain to a teacher — is exactly the property the next generation of models would trade away.
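The stopping rule is short enough to sketch in full. The function name, signature, and parameter values below are assumptions for illustration; only the 0.95 threshold comes from the text.

```python
from typing import Iterable, Optional


def opportunities_to_mastery(
    answers: Iterable[bool],
    p_init: float,
    p_transit: float,
    p_slip: float,
    p_guess: float,
    threshold: float = 0.95,
) -> Optional[int]:
    """Count practice opportunities until the estimate crosses the threshold.

    Returns None if the answers run out before mastery is declared.
    """
    p_known = p_init
    if p_known >= threshold:
        return 0
    for count, correct in enumerate(answers, start=1):
        # Bayes update on the observed answer.
        if correct:
            known = p_known * (1.0 - p_slip)
            unknown = (1.0 - p_known) * p_guess
        else:
            known = p_known * p_slip
            unknown = (1.0 - p_known) * (1.0 - p_guess)
        posterior = known / (known + unknown)
        # Learning step (no forgetting in classic BKT).
        p_known = posterior + (1.0 - posterior) * p_transit
        if p_known >= threshold:
            return count
    return None


# Illustrative parameters (assumed, not fitted).
steady = opportunities_to_mastery([True] * 10, 0.3, 0.1, 0.1, 0.2)
struggling = opportunities_to_mastery([False] * 10, 0.3, 0.1, 0.1, 0.2)
```

With these assumed values an unbroken run of correct answers crosses 0.95 on the third opportunity, while a run of wrong answers never does, so the tutor keeps offering practice.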

Letting the network decide

In 2015, Piech and colleagues asked what happens if you stop hand-specifying the model and let a neural network learn the dynamics from data. Deep Knowledge Tracing (DKT) feeds the sequence of (skill, correct?) events into a recurrent network (an LSTM) and trains it to predict the probability of a correct answer on whatever comes next [2]. Where BKT keeps a separate two-state model for each skill, DKT keeps a single high-dimensional hidden state that carries everything the student has done, across all skills at once.
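The network itself is beyond a short example, but its input is not. A common way to present a (skill, correct?) event to a recurrent model is a one-hot vector of length twice the number of skills, so that a right and a wrong answer on the same skill occupy different slots. The sketch below assumes that scheme; the index layout and the function names are illustrative choices.

```python
from typing import List, Sequence, Tuple

Event = Tuple[int, bool]  # (skill index, answered correctly?)


def encode_interaction(skill: int, correct: bool, num_skills: int) -> List[float]:
    """One-hot encode a (skill, correct?) event into a vector of length 2 * num_skills.

    Assumed layout: slots [0, num_skills) mean "wrong on skill k",
    slots [num_skills, 2 * num_skills) mean "right on skill k".
    """
    if not 0 <= skill < num_skills:
        raise ValueError("skill index out of range")
    vector = [0.0] * (2 * num_skills)
    vector[skill + (num_skills if correct else 0)] = 1.0
    return vector


def next_answer_pairs(events: Sequence[Event]) -> List[Tuple[Event, Event]]:
    """Pair each event with the one that follows it: the input and the answer to predict."""
    return list(zip(events, events[1:]))


events = [(0, True), (2, False), (1, True)]
inputs = [encode_interaction(skill, correct, num_skills=3) for skill, correct in events]
pairs = next_answer_pairs(events)
```

A recurrent network would consume `inputs` one step at a time and be trained, at each step, to predict whether the next event in `pairs` is answered correctly.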

Fig. 04 · Two ways to model the same sequence. BKT fits one small, legible model per skill, independently. Deep Knowledge Tracing replaces them with a single recurrent network whose hidden state carries everything it has seen — more accurate, far less transparent.

That shared state is the whole point. BKT, by construction, cannot know that a student who has just mastered the chain rule is now better placed to learn implicit differentiation, because it treats every skill as an island. A recurrent network can pick up exactly that kind of cross-skill structure from the data, with no one telling it the skills are related. On the standard benchmarks, DKT reported a large jump in next-answer prediction, with area under the curve (AUC) rising from the high-0.6s typical of BKT to the mid-0.8s on the headline dataset [2], and the result was influential enough to turn knowledge tracing into a deep-learning subfield. Its attention-based successors, from self-attentive knowledge tracing [8] to context-aware attentive models [9], pushed the predictive numbers further still.
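It helps to see what that headline number measures. AUC is the probability that a randomly chosen correct answer was given a higher predicted score than a randomly chosen incorrect one, so 0.5 is chance and 1.0 is a perfect ranking. The pairwise sketch below is an illustration of the definition, not the evaluation code of any of the cited papers.

```python
from typing import Sequence


def auc(labels: Sequence[bool], scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed directly from its pairwise definition."""
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one correct and one incorrect answer")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # correct answer ranked above incorrect one
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Observed outcomes and a model's predicted probability of a correct answer.
observed = [True, True, False, False]
predicted = [0.8, 0.4, 0.6, 0.2]
score = auc(observed, predicted)
```

Note what the metric does not ask: whether the predicted probabilities, read as a knowledge state, behave sensibly from one step to the next.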

How deep is knowledge tracing?

And then the careful papers arrived. The first question was whether the comparison had been fair. Khajah, Lindsey and Mozer reran it under the title "How Deep is Knowledge Tracing?" and showed that much of DKT's advantage came not from depth but from information the classical model had simply been denied: forgetting, the student's individual ability [7], the recency of practice. Give BKT those same extensions and the gap all but closes; in their experiments the augmented classical model became, in the paper's own word, indistinguishable from the deep one [3]. A later, broader study reached a compatible verdict: deep models win clearly on large datasets with long, rich interaction histories, while on smaller or sparser data simpler methods, logistic-regression models with well-chosen features especially, are competitive or better. When deep learning is the best approach is an empirical question, not a default [10].
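To give a sense of how small such an extension can be, here is one common way to add forgetting to the classic update: a single extra parameter that lets a known skill lapse between opportunities. This is a sketch under that assumption and not necessarily the exact formulation used by Khajah, Lindsey and Mozer; the name `p_forget` is introduced here for illustration.

```python
def bkt_step_with_forgetting(
    p_known: float,
    correct: bool,
    p_transit: float,
    p_forget: float,
    p_slip: float,
    p_guess: float,
) -> float:
    """One BKT update in which a known skill can also be forgotten."""
    # Bayes update on the observed answer, exactly as in classic BKT.
    if correct:
        known = p_known * (1.0 - p_slip)
        unknown = (1.0 - p_known) * p_guess
    else:
        known = p_known * p_slip
        unknown = (1.0 - p_known) * (1.0 - p_guess)
    posterior = known / (known + unknown)
    # Transition: stay known unless forgotten, or become known by learning.
    return posterior * (1.0 - p_forget) + (1.0 - posterior) * p_transit


# Setting p_forget to zero recovers the classic model.
classic = bkt_step_with_forgetting(0.3, True, 0.1, 0.0, 0.1, 0.2)
forgetful = bkt_step_with_forgetting(0.3, True, 0.1, 0.05, 0.1, 0.2)
```

The model stays a two-state hidden Markov model with legible parameters; it simply stops assuming that knowledge, once acquired, is permanent.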

The second question was sharper, because it was about trust rather than accuracy. Yeung and Yeung showed that DKT, for all its predictive skill, can report a knowledge state a tutor should not act on. Two failures in particular: the model can fail to reconstruct its own input, lowering the estimated mastery of the very skill a student has just answered correctly, and its estimate of what a learner knows can wander non-monotonically over time, rising and falling between consecutive steps in ways no theory of learning would predict [4].
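Both failures can be checked from a model's outputs alone. The sketch below is a hypothetical diagnostic in the spirit of that critique, not the authors' regularisation loss: it assumes `predictions[t]` holds the per-skill predicted probability of a correct answer before event `t`, with one extra row for the state after the final event.

```python
from typing import List, Sequence, Tuple

Event = Tuple[int, bool]  # (skill index, answered correctly?)


def reconstruction_failures(
    events: Sequence[Event], predictions: Sequence[Sequence[float]]
) -> List[int]:
    """Steps where a correct answer LOWERED the prediction for that same skill."""
    failures = []
    for t, (skill, correct) in enumerate(events):
        before, after = predictions[t][skill], predictions[t + 1][skill]
        if correct and after < before:
            failures.append(t)
    return failures


def waviness(predictions: Sequence[Sequence[float]]) -> float:
    """Mean absolute change in predictions between consecutive steps, over all skills."""
    steps = len(predictions) - 1
    if steps < 1:
        raise ValueError("need at least two prediction rows")
    num_skills = len(predictions[0])
    total = sum(
        abs(later - earlier)
        for prev_row, next_row in zip(predictions, predictions[1:])
        for earlier, later in zip(prev_row, next_row)
    )
    return total / (steps * num_skills)


# Made-up outputs for two skills: the first correct answer on skill 0 lowers its estimate.
events = [(0, True), (0, True)]
predictions = [[0.5, 0.5], [0.4, 0.6], [0.7, 0.6]]
failed_steps = reconstruction_failures(events, predictions)
wobble = waviness(predictions)
```

A tutor could track both numbers alongside AUC; a model that scores well on prediction but fails often here is reporting a state that is hard to act on.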

Fig. 05 · The catch a tutor cannot ignore. A deep model can predict the next answer well while reporting a knowledge state that wobbles, even dropping the estimated mastery of a skill the student just got right. BKT’s Bayes update, by construction, always moves the estimate in the direction of the evidence.

This is the crux of the whole field for anyone building a tutor rather than topping a leaderboard. BKT's mastery estimate is legible and, given non-degenerate parameters, monotone in the evidence by construction: the Bayes step moves it up after a correct answer and down after a wrong one, and you can read its four parameters like a sentence. A deep model can be a better predictor of the next answer while being a worse estimate of knowledge: wobbly, opaque, occasionally self-contradicting. The two are not the same target, and a tutor that schedules practice, declares mastery, and tells a student they are ready is acting on the second. Yeung and Yeung's fix is to regularise the network toward consistency [4]; the deeper lesson is that predictive accuracy on the held-out answer is the wrong thing to optimise alone.

What we read from it

We treat knowledge tracing the way this literature recommends: as inference to be trusted, not just fit. EuraStudy is built on a reviewed bank of roughly three thousand exam-style questions, each tagged to a topic — the unit a knowledge-tracing model would treat as a skill — which is the raw material any tracing model needs: a clean stream of (skill, correct?) events to read. The discipline is in what we let the model conclude from them. A mastery estimate that schedules a student's practice, or tells them a topic is exam-ready, has to be interpretable and monotone — it has to behave like knowledge, not merely correlate with the next answer — and it has to be honest about its own uncertainty, which is exactly what slip and guess encode.

So the honest position is the same one this whole article argues toward. The most accurate next-answer predictor is not automatically the best tutor; a model you cannot read is a model you cannot responsibly act on; and a knowledge estimate is a claim about a person, to be made carefully or not at all. A tutor's job is not to predict your next answer. It is to know, well enough to act on, what you have actually learned — and to be able to show its working when it says so.

References

  1. Corbett, A. T., & Anderson, J. R. (1995). Knowledge Tracing: Modeling the Acquisition of Procedural Knowledge. User Modeling and User-Adapted Interaction, 4(4), 253–278.
  2. Piech, C., Bassen, J., Huang, J., Ganguli, S., Sahami, M., Guibas, L. J., & Sohl-Dickstein, J. (2015). Deep Knowledge Tracing. Advances in Neural Information Processing Systems (NIPS), 28, 505–513.
  3. Khajah, M., Lindsey, R. V., & Mozer, M. C. (2016). How Deep is Knowledge Tracing? Proceedings of the 9th International Conference on Educational Data Mining (EDM), 94–101.
  4. Yeung, C.-K., & Yeung, D.-Y. (2018). Addressing Two Problems in Deep Knowledge Tracing via Prediction-Consistent Regularization. Proceedings of the Fifth Annual ACM Conference on Learning at Scale (L@S), Article 5.
  5. Baker, R. S. J. d., Corbett, A. T., & Aleven, V. (2008). More Accurate Student Modeling through Contextual Estimation of Slip and Guess Probabilities in Bayesian Knowledge Tracing. Intelligent Tutoring Systems (ITS), 406–415.
  6. Corbett, A. (2001). Cognitive Computer Tutors: Solving the Two-Sigma Problem. User Modeling (UM), 137–147.
  7. Pardos, Z. A., & Heffernan, N. T. (2010). Modeling Individualization in a Bayesian Networks Implementation of Knowledge Tracing. User Modeling, Adaptation, and Personalization (UMAP), 255–266.
  8. Pandey, S., & Karypis, G. (2019). A Self-Attentive Model for Knowledge Tracing. Proceedings of the 12th International Conference on Educational Data Mining (EDM), 384–389.
  9. Ghosh, A., Heffernan, N., & Lan, A. S. (2020). Context-Aware Attentive Knowledge Tracing. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), 2330–2339.
  10. Gervet, T., Koedinger, K., Schneider, J., & Mitchell, T. (2020). When is Deep Learning the Best Approach to Knowledge Tracing? Journal of Educational Data Mining, 12(3), 31–54.
  11. Bloom, B. S. (1968). Learning for Mastery. Evaluation Comment, 1(2). UCLA Center for the Study of Evaluation of Instructional Programs.
  12. Rabiner, L. R. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2), 257–286.
