Twenty Questions
A good adaptive test can pin down what you know in a dozen questions, not fifty — because it chooses each one to be the most revealing it can ask. We trace the quiet mathematics of item response theory and computerized adaptive testing, from the shape of a single question to the loop that learns you in real time, and the places where adaptivity has to be reined in.
Hand a hundred students the same fifty-question test and you have, in a quiet way, asked each of them the wrong question many times over. A question far above a student tells you almost nothing you did not already know — they will miss it, as expected. A question far below them is just as silent — they will get it, as expected. The only genuinely revealing question is the one poised at the edge of what a student can do, where the outcome is close to a coin-flip and the answer actually carries news. A good examiner feels this and adjusts on the fly, easing off after a stumble, pressing harder after a fluent run. A computerized adaptive test does the same thing arithmetically, and it is one of the most quietly successful ideas in the measurement of learning.
The framing that does it justice is an old parlour game. A well-played round of twenty questions can single out one object from millions, because each question is chosen to split the remaining possibilities roughly in half. An adaptive test plays the same game against a single unknown — your ability — and this piece is about the mathematics that lets a machine play it well. It has two halves: a model of how a single question behaves, and a rule for choosing the next one. The first is item response theory (IRT); the second is the adaptive loop built on top of it. We will also be honest about the part the textbooks rush past — the constraints that keep a clever question-picker from being a foolish one.
The paradox of the shorter test
Start with the claim that makes the whole enterprise worth the trouble. A well-constructed adaptive test can reach the same precision as a conventional fixed-form test using roughly half the number of items, and sometimes far fewer, while measuring strong and weak students with equal care [8, 10]. That is not a free lunch; it is the result of refusing to waste questions. A fixed test spends most of its length asking things that are, for any particular student, foregone conclusions. Move the questions to where the outcome is uncertain for this student and each one suddenly earns its place.
To do that, two things are needed. You need to know, for any question, how a student of a given ability is likely to answer it — the shape of the question. And you need a principled way to say which unanswered question would tell you the most, right now, about the student in front of you. Item response theory supplies the first; the idea of information supplies the second.
A question has a shape
Item response theory rests on a single, almost austere, premise: a student’s performance is driven by one latent trait (call it ability, $\theta$) placed on a scale conventionally centred at zero with a standard deviation of one, so that in practice nearly everyone falls between $-3$ and $+3$ [4, 6]. Each question, in turn, is described by a few numbers, and together they give the probability $P(\theta)$ that a student of ability $\theta$ answers correctly. Plot that probability against $\theta$ and you get the item characteristic curve: a smooth S that rises from near-zero for the weakest students to near-certain for the strongest.
The models differ in how many knobs the curve has. The one-parameter, or Rasch, model gives every item the same shape and lets it differ only in difficulty $b$ [2]. The two-parameter (2PL) model adds a second knob, discrimination $a$:
$$P(\theta) = \frac{1}{1 + e^{-a(\theta - b)}}$$
The three-parameter (3PL) model adds a third, a floor $c$ for guessing [3]:
$$P(\theta) = c + (1 - c)\,\frac{1}{1 + e^{-a(\theta - b)}}$$
Each parameter means something a teacher would recognise. The difficulty $b$ slides the curve horizontally: it is the ability at which a 2PL item is answered correctly half the time (with guessing, the curve instead passes through $(1 + c)/2$ at $\theta = b$). The discrimination $a$ controls how steeply the curve rises through that midpoint, with a precise detail worth keeping: the maximum slope of a 2PL curve is exactly $a/4$, not $a$ [5]. A steep item sharply separates students just above and just below its difficulty; a shallow one barely distinguishes them. The guessing parameter $c$ lifts the lower tail off the floor: on a five-option multiple-choice item it sits near $0.2$, because even a student who knows nothing will sometimes land on the right answer.
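To make the three knobs concrete, here is a minimal Python sketch of the item characteristic curve. The function names `icc` and `slope` and the parameter values are illustrative assumptions, not part of any particular library; the check at the end simply confirms the midpoint and maximum-slope facts stated above.

```python
import math

def icc(theta: float, a: float = 1.0, b: float = 0.0, c: float = 0.0) -> float:
    """Probability of a correct answer under the 3PL model.

    With c = 0 this reduces to the 2PL curve.
    """
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (1.0 - c) * logistic

def slope(theta: float, a: float, b: float, c: float = 0.0, h: float = 1e-5) -> float:
    """Numerical slope of the curve at theta (central difference)."""
    return (icc(theta + h, a, b, c) - icc(theta - h, a, b, c)) / (2 * h)

# Hypothetical item: discrimination 1.6, difficulty 0.5.
p_mid_2pl = icc(0.5, a=1.6, b=0.5)           # 2PL evaluated at theta = b
p_mid_3pl = icc(0.5, a=1.6, b=0.5, c=0.2)    # same item with a guessing floor of 0.2
max_slope = slope(0.5, a=1.6, b=0.5)         # steepest point of the 2PL curve

print(p_mid_2pl, p_mid_3pl, max_slope)
```

At $\theta = b$ the 2PL probability is one half, the 3PL probability is $(1 + c)/2$, and the 2PL slope is $a/4$, which is what the three printed values show for this item.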
A small piece of history hides in that logistic curve. The original model was the normal-ogive, written with the cumulative normal distribution (Thurstone in the 1920s; Lord in 1952); the logistic was adopted later because it is far easier to compute, and Birnbaum brought it into the theory in 1968 [3]. To keep the two interchangeable, item parameters are often multiplied by a scaling constant $D = 1.702$, a value chosen (by Haley, in 1952) precisely so the logistic and normal curves never differ by more than about $0.01$. None of this is decoration: it is what lets a difficulty estimated under one model be read on the other’s scale. And it is the first sign of IRT’s real gift over older test theory: item properties and student ability live on the same scale, each estimable, in principle, independently of the particular sample of people or questions you happened to use [1, 4].
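The closeness of the two curves is easy to check numerically. The sketch below assumes the commonly quoted scaling constant of 1.702 and compares the standard normal CDF with the rescaled logistic over a fine grid; the grid range and step are arbitrary choices.

```python
import math

D = 1.702  # assumed scaling constant (the commonly quoted value)

def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def logistic(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

grid = [i / 1000 for i in range(-6000, 6001)]

# Largest vertical gap between the normal ogive and the logistic curve.
gap_scaled = max(abs(normal_cdf(x) - logistic(D * x)) for x in grid)
gap_unscaled = max(abs(normal_cdf(x) - logistic(x)) for x in grid)

print(f"with D: {gap_scaled:.4f}   without D: {gap_unscaled:.4f}")
```

With the constant in place the gap stays under 0.01 everywhere on the grid; without it the two curves drift apart by more than a tenth.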
What a question is worth
Here is the idea the whole article turns on. If you already have a rough estimate of a student’s ability, not every remaining question is equally useful, and you can say exactly how useful each one is. The measure is item information, and for a 2PL item it has a compact form:
$$I(\theta) = a^2\,P(\theta)\,[1 - P(\theta)]$$
The name is not a metaphor. It is Fisher information: R. A. Fisher’s notion of the precision a single observation buys you about an unknown quantity, whose reciprocal is the (asymptotic) variance with which you could estimate it [5]. Read the formula and the strategy of adaptive testing falls out of it. The product $P(\theta)[1 - P(\theta)]$ is largest when $P(\theta) = 0.5$, that is, when the item’s difficulty matches the student’s ability ($b = \theta$), the coin-flip question from the opening. Push the question too far above or below the student and $P(\theta)$ races toward $0$ or $1$, the product collapses, and the item tells you almost nothing. The discrimination enters as $a^2$, so a sharp item is worth a great deal (its information peak rises as $a^2/4$), but a glance at the curve shows that this tall peak is also narrow: a highly discriminating item is enormously informative about students near its difficulty and nearly useless about everyone else. (Guessing only makes this worse: a non-zero $c$ both lowers an item’s information and nudges its peak slightly above $b$, draining precision from exactly the low-ability students who are hardest to measure [5].)
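A short Python sketch makes these claims checkable. It uses the standard 3PL information formula, which reduces to $a^2 P(1 - P)$ when $c = 0$; the item parameters and the grid are hypothetical values chosen for illustration.

```python
import math

def p_correct(theta: float, a: float, b: float, c: float = 0.0) -> float:
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta: float, a: float, b: float, c: float = 0.0) -> float:
    """Fisher information of a 3PL item; equals a^2 * P * (1 - P) when c = 0."""
    p = p_correct(theta, a, b, c)
    return a ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

grid = [i / 100 for i in range(-400, 401)]

# Where does a sharp 2PL item (a = 2, b = 0) carry the most information?
peak_2pl = max(grid, key=lambda t: item_information(t, 2.0, 0.0))
peak_value_2pl = item_information(peak_2pl, 2.0, 0.0)   # should equal a^2 / 4

# The same item with a guessing floor of 0.2.
peak_3pl = max(grid, key=lambda t: item_information(t, 2.0, 0.0, 0.2))
peak_value_3pl = item_information(peak_3pl, 2.0, 0.0, 0.2)

# Three units away from the difficulty, the sharp item is the less useful one.
sharp_far = item_information(3.0, 2.0, 0.0)
shallow_far = item_information(3.0, 0.5, 0.0)

print(peak_2pl, peak_value_2pl, peak_3pl, peak_value_3pl, sharp_far, shallow_far)
```

The 2PL peak sits at the item’s difficulty with height $a^2/4$; adding guessing lowers the peak and shifts it above $b$; and far from its difficulty the steep item is worth less than a shallow one.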
So the question to ask next is not the hardest one, nor the one a syllabus lists first. It is the one whose difficulty sits closest to your current best guess of the student’s ability — the most informative question available. Everything else in adaptive testing is built on that single observation.
Precision you can see
Information has two more properties that turn it from an idea into an instrument. First, under the model’s assumption that answers are independent once ability is fixed, information simply adds up across the questions on a test:
$$I_{\text{test}}(\theta) = \sum_{i=1}^{n} I_i(\theta)$$
where $I_i(\theta)$ is the information of item $i$. Second, that total converts directly into precision. The standard error of the ability estimate is
$$\mathrm{SE}(\hat{\theta}) = \frac{1}{\sqrt{I_{\text{test}}(\theta)}}$$
so more information (better-targeted questions, or simply more of them) means a smaller error bar [4, 5]. This is the quiet revolution IRT works on the older classical test theory, which hands back a single reliability and a single standard error for the entire test, the same for a struggling student and a brilliant one. IRT instead makes precision a function of ability: a test can measure the middle of the range beautifully and the extremes poorly, and the test information curve shows you exactly where it is sharp and where it is blunt [4].
Now the engine is assembled. To drive the error bar down as fast as possible, spend each question where information is highest — which is to say, near the student’s ability — and watch the standard error fall with every well-aimed answer. That is the whole of adaptive testing in one sentence; the rest is making it work.
The loop that learns you
As an algorithm, computerized adaptive testing is a short loop [9, 10, 11]:
- Start. With no answers yet, seed the ability estimate at the population mean and choose a first item of middling difficulty.
- Select. Of the items not yet used, pick the one with the most information at the current estimate — the maximum-information rule.
- Administer and score the item.
- Re-estimate the ability from every answer so far.
- Stop when the standard error drops below a target (a variable-length test) or a fixed budget of items is spent; otherwise, loop back to select.
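The five steps above can be sketched as a small simulation. This is an illustration under stated assumptions rather than a production design: the item bank is randomly generated, the examinee is simulated from a known `true_theta`, the estimator is a grid-based posterior mean with a standard normal prior, and the posterior standard deviation stands in for the standard error. The names `run_cat`, `eap`, `se_target` and `max_items` are hypothetical.

```python
import math
import random

GRID = [g / 10 for g in range(-40, 41)]  # evaluation points for theta

def p_correct(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def information(theta, a, b):
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def eap(responses):
    """Posterior mean and SD of theta on GRID, with a standard normal prior."""
    weights = []
    for t in GRID:
        w = math.exp(-0.5 * t * t)  # unnormalised N(0, 1) prior
        for (a, b), correct in responses:
            p = p_correct(t, a, b)
            w *= p if correct else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)

def run_cat(bank, true_theta, rng, se_target=0.35, max_items=30):
    unused = list(bank)
    responses = []
    estimate, se = 0.0, 1.0  # Start: population mean, prior spread
    while unused and len(responses) < max_items and se > se_target:
        # Select: the unused item with the most information at the current estimate.
        item = max(unused, key=lambda ab: information(estimate, *ab))
        unused.remove(item)
        # Administer and score (here, a simulated answer).
        correct = rng.random() < p_correct(true_theta, *item)
        responses.append((item, correct))
        # Re-estimate from every answer so far.
        estimate, se = eap(responses)
    return estimate, se, len(responses)

rng = random.Random(0)
bank = [(rng.uniform(0.8, 2.0), rng.uniform(-3.0, 3.0)) for _ in range(300)]
estimate, se, n_used = run_cat(bank, true_theta=1.0, rng=rng)
print(f"estimate={estimate:.2f}  se={se:.2f}  items={n_used}")
```

The loop stops either when the error target is met or when the item budget runs out, which covers both the variable-length and fixed-length designs mentioned in the last step.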
The re-estimation step hides a genuine subtlety, and it is where the cold start bites. The natural estimator, maximum likelihood, has no finite value while a student’s answers are all correct or all wrong, and at the very first question they always are [4]. Adaptive tests therefore lean on Bayesian estimation, most often expected a posteriori (EAP): treat the population distribution as a prior and report the mean of the posterior after each answer. It is finite from the first item, needs no iteration, and was designed by Bock and Mislevy in 1982 for exactly this setting: running a live ability estimate on the slender computers of the day [7]. A related caution shapes the early items: when the estimate is still far from the truth, the local information at a wrong estimate can point to the wrong questions, so methods that weigh information more globally, such as Chang and Ying’s use of Kullback–Leibler information, behave more steadily at the start [11].
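The cold-start problem can be seen after a single correct answer. The sketch below assumes one 2PL item with $a = 1$, $b = 0$ and a standard normal prior, and evaluates everything on a coarse grid; the grid bounds are an arbitrary choice.

```python
import math

def p_correct(theta: float, a: float = 1.0, b: float = 0.0) -> float:
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

grid = [g / 10 for g in range(-60, 61)]

# Likelihood of one correct answer: it keeps rising with theta,
# so its maximum lands on whatever upper edge the grid happens to have.
likelihood = [p_correct(t) for t in grid]
mle_on_grid = grid[likelihood.index(max(likelihood))]

# EAP: weight the same likelihood by a standard normal prior and take the mean.
posterior = [math.exp(-0.5 * t * t) * lik for t, lik in zip(grid, likelihood)]
eap_estimate = sum(t * w for t, w in zip(grid, posterior)) / sum(posterior)

print(f"grid-bound 'MLE': {mle_on_grid}   EAP: {eap_estimate:.2f}")
```

Widen the grid and the likelihood maximum simply moves out with it, while the posterior mean stays put at a modest positive value: one right answer nudges the estimate up without sending it to infinity.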
Run the loop and the estimate marches toward the truth while its error bar contracts around it: a correct answer lifts the estimate, a wrong one lowers it, and each well-aimed question shaves more off the uncertainty than a randomly chosen one would. This is the source of the efficiency claim we began with: the test is confident about a student long before a fixed form of the same precision would have finished [8].
Where adaptivity has to be reined in
A test that only ever chased maximum information would be efficient and, in several ways, irresponsible. The interesting engineering of real systems is in the constraints layered on top of the simple rule.
The first is content. An exam must cover a blueprint (so many items of algebra, so many of geometry) and the most informative item at a given moment may be the wrong topic entirely. Constrained selection handles this by, in effect, assembling a complete test that satisfies every requirement and then administering its most informative item: the weighted-deviations method, and van der Linden’s elegant shadow-test approach, both keep the syllabus whole while still adapting [9].
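To show the tension between information and coverage, here is a deliberately simpler greedy rule, not the shadow-test or weighted-deviations method: pick the topic furthest behind its quota, then the most informative unused item within it. The blueprint, the item pool and the function name `pick_with_blueprint` are all hypothetical.

```python
import math

def information(theta, a, b):
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def pick_with_blueprint(theta, pool, blueprint, counts):
    """Greedy content balancing: choose the topic furthest below its quota,
    then the most informative unused item within that topic."""
    open_topics = [
        t for t in blueprint
        if counts.get(t, 0) < blueprint[t] and any(i["topic"] == t for i in pool)
    ]
    if not open_topics:
        return None  # blueprint satisfied, or no suitable items left
    topic = min(open_topics, key=lambda t: counts.get(t, 0) / blueprint[t])
    candidates = [i for i in pool if i["topic"] == topic]
    return max(candidates, key=lambda i: information(theta, i["a"], i["b"]))

blueprint = {"algebra": 2, "geometry": 1}  # hypothetical quotas
pool = [
    {"id": 1, "topic": "algebra", "a": 1.8, "b": 0.0},
    {"id": 2, "topic": "algebra", "a": 1.0, "b": 0.1},
    {"id": 3, "topic": "algebra", "a": 1.5, "b": 0.2},
    {"id": 4, "topic": "geometry", "a": 0.9, "b": 1.5},
]

counts, administered = {}, []
while True:
    item = pick_with_blueprint(0.0, pool, blueprint, counts)
    if item is None:
        break
    pool.remove(item)
    counts[item["topic"]] = counts.get(item["topic"], 0) + 1
    administered.append(item["id"])

print(administered, counts)
```

The geometry item is the least informative in the pool for a student at zero, yet it is administered second because the blueprint demands it. A shadow-test approach solves the same problem more carefully by optimising over whole tests rather than one topic at a time.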
The second is exposure and security. The maximum-information rule keeps reaching for the same handful of excellent items, which means those items are seen by almost everyone, and a question that everyone sees is a question that leaks. Exposure controls deliberately hold the best items back some fraction of the time (the Sympson–Hetter method [13]), and a-stratified designs save the most discriminating items for late in the test, when the estimate is good enough to deserve them, spreading the load across the bank [12]. The stakes are concrete: the Duolingo English Test reports a mean item-exposure rate well below the double-digit percentages typical of older operational adaptive tests, the difference between an item pool that stays secret and one that quietly ends up online [15].
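The core of a Sympson–Hetter-style control is a probabilistic gate on each selected item. In the sketch below every item carries an exposure parameter `k`; in practice these are tuned by simulation, whereas here they are hypothetical fixed values, and the fallback when every draw fails is an assumption of this example.

```python
import math
import random

def information(theta, a, b):
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def select_with_exposure_control(theta, pool, rng):
    """Walk the items from most to least informative and administer
    item i with probability k_i; otherwise pass to the next best."""
    ranked = sorted(pool, key=lambda i: information(theta, i["a"], i["b"]), reverse=True)
    for item in ranked:
        if rng.random() <= item["k"]:
            return item
    return ranked[-1]  # assumed fallback if every draw fails

pool = [
    {"id": "star", "a": 2.0, "b": 0.0, "k": 0.25},   # excellent item, held back often
    {"id": "good", "a": 1.4, "b": 0.1, "k": 0.60},
    {"id": "plain", "a": 0.9, "b": -0.2, "k": 1.00},
]

rng = random.Random(42)
trials = 10_000
picks = [select_with_exposure_control(0.0, pool, rng)["id"] for _ in range(trials)]
star_rate = picks.count("star") / trials
print(f"'star' administered to {star_rate:.1%} of simulated examinees")
```

Without the gate, the best item would be the first question for every examinee at this ability; with it, the item reaches roughly the fraction set by its `k`, and the rest of the load falls on its neighbours.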
The third, and the most important, is fairness. An item exhibits differential item functioning (DIF) when students of the same ability but different groups have different probabilities of answering it correctly, that is, when its characteristic curve depends on who is taking it, beyond the ability it is meant to measure. That is a validity problem before it is a fairness problem, and it is screened with care, by methods from the Mantel–Haenszel procedure to comparing the area between two groups’ curves [14, 1]. Adaptive delivery inherits the issue wholesale: a biased item is biased no matter how much information it carries, and an efficient test that quietly disadvantages a group has optimised the wrong thing.
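For a sense of what Mantel–Haenszel screening computes, the sketch below pools 2×2 tables across strata of examinees matched on total score and reports the common odds ratio together with its rescaling to the delta metric ($-2.35 \ln \alpha$). The counts are invented for illustration, and a real analysis would add a significance test.

```python
import math

def mantel_haenszel(strata):
    """strata: list of (A, B, C, D) counts per matched score level, where
    A, B = reference group right/wrong and C, D = focal group right/wrong."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    alpha = num / den                       # common odds ratio
    return alpha, -2.35 * math.log(alpha)   # odds ratio and delta-scale DIF

# Hypothetical item with identical success rates in both groups at each level.
fair_item = [(40, 10, 40, 10), (25, 25, 25, 25)]
# Hypothetical item on which the focal group does worse at the same score level.
biased_item = [(40, 10, 30, 20), (25, 25, 15, 35)]

alpha_fair, d_fair = mantel_haenszel(fair_item)
alpha_biased, d_biased = mantel_haenszel(biased_item)
print(f"fair: alpha={alpha_fair:.2f}, D-DIF={d_fair:.2f}")
print(f"biased: alpha={alpha_biased:.2f}, D-DIF={d_biased:.2f}")
```

An odds ratio of one means matched students succeed at the same rate regardless of group; a ratio above one, and hence a negative delta value, flags an item that is harder for the focal group than their ability would predict.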
It is worth saying plainly that adaptivity is a choice, not an upgrade. Some of the highest-stakes computerized exams are deliberately fixed-length — the United States medical licensing examination among them — while others, like the NCLEX nursing examination, stop only once the test is statistically confident a candidate is above or below the passing standard. The right design depends on whether you are placing someone on a scale or simply deciding which side of a line they fall.
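One common way to formalise "statistically confident" for a pass/fail decision is a confidence-interval rule: keep testing until the interval around the ability estimate no longer straddles the passing standard. The sketch below is a generic illustration with a hypothetical cut score and a 95% interval, not the published algorithm of any particular examination.

```python
def classification_decision(estimate: float, se: float, cut: float, z: float = 1.96) -> str:
    """Return 'pass', 'fail', or 'continue' depending on whether the
    confidence interval around the estimate clears the cut score."""
    lower, upper = estimate - z * se, estimate + z * se
    if lower > cut:
        return "pass"
    if upper < cut:
        return "fail"
    return "continue"

cut = 0.0  # hypothetical passing standard on the theta scale
clear_pass = classification_decision(0.9, 0.30, cut)
clear_fail = classification_decision(-1.1, 0.40, cut)
borderline = classification_decision(0.2, 0.30, cut)
print(clear_pass, clear_fail, borderline)
```

A candidate far from the line is classified after few items, while one sitting near it keeps receiving questions, which is why such tests vary so much in length from person to person.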
When the test learns to write itself
The classical pipeline has an expensive prerequisite hiding in plain sight: before an item can be used, its difficulty, discrimination and guessing must be calibrated by piloting it on hundreds of students. That is slow, costly, and a poor fit for a world that wants fresh questions constantly. The most striking recent development attacks exactly this. The Duolingo English Test uses machine learning to predict an item’s difficulty directly from its text, calibrating questions before a single examinee has seen them, and then delivers them adaptively; its machine-learned proficiency estimates line up in rank order with the classical IRT ability estimates [15]. Other work reaches deeper into the loop, learning the selection policy itself from data rather than trusting the fixed maximum-information rule: BOBCAT casts adaptive testing as a bilevel optimisation and learns a data-driven question-picker that shortens tests further [16].
There is also a quieter connection, and it leads straight back to a companion idea. Item response theory draws a snapshot: a student’s ability fixed at the moment of the test. Knowledge tracing, the subject of an earlier dispatch, draws the moving picture: competence changing as a student learns [17]. They are two views of the same latent thing, and recent work makes the bridge explicit, reading deep knowledge tracing as a kind of dynamic, multidimensional item response model [18]. Measurement and learning, it turns out, are not separate theories so much as the still photograph and the film of one quantity we can never see directly.
What we read from it
EuraStudy is built on a reviewed bank of roughly three thousand exam-style questions, each tagged to a topic — the raw material any of this machinery needs. The IRT lens is, more than anything, a discipline about what a question is worth. A question has no fixed value; its worth depends entirely on who is answering it. The most useful question is rarely the hardest or the easiest — it is the one balanced at the edge of what a student can currently do. And precision is not a single number stamped on a test but something you can point at, target, and stop chasing once you have enough of it.
The honest reading is that the method and its limits are inseparable. A shorter test is only a better test when it is pointed in the right direction; an efficient one is worthless if it is unfair, narrow, or insecure. The aim of measuring a student well was never to ask the most questions. It is to ask the right ones — and, like a good player of twenty questions, to know exactly when you have heard enough to stop.
References
- 1. Lord, F. M. (1980). Applications of Item Response Theory to Practical Testing Problems. Hillsdale, NJ: Lawrence Erlbaum Associates.
- 2. Rasch, G. (1960). Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen: Danish Institute for Educational Research (repr. 1980, University of Chicago Press).
- 3. Birnbaum, A. (1968). Some Latent Trait Models and Their Use in Inferring an Examinee’s Ability. In F. M. Lord & M. R. Novick, Statistical Theories of Mental Test Scores (pp. 397–479). Reading, MA: Addison-Wesley.
- 4. Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of Item Response Theory. Newbury Park, CA: Sage Publications.
- 5. Baker, F. B. (2001). The Basics of Item Response Theory (2nd ed.). College Park, MD: ERIC Clearinghouse on Assessment and Evaluation.
- 6. Embretson, S. E., & Reise, S. P. (2000). Item Response Theory for Psychologists. Mahwah, NJ: Lawrence Erlbaum Associates.
- 7. Bock, R. D., & Mislevy, R. J. (1982). Adaptive EAP Estimation of Ability in a Microcomputer Environment. Applied Psychological Measurement, 6(4), 431–444.
- 8. Weiss, D. J. (1982). Improving Measurement Quality and Efficiency with Adaptive Testing. Applied Psychological Measurement, 6(4), 473–492.
- 9. van der Linden, W. J., & Glas, C. A. W. (Eds.). (2010). Elements of Adaptive Testing. New York: Springer.
- 10. Wainer, H., Dorans, N. J., Eignor, D., Flaugher, R., Green, B. F., Mislevy, R. J., Steinberg, L., & Thissen, D. (2000). Computerized Adaptive Testing: A Primer (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
- 11. Chang, H.-H. (2015). Psychometrics Behind Computerized Adaptive Testing. Psychometrika, 80(1), 1–20.
- 12. Chang, H.-H., & Ying, Z. (1999). a-Stratified Multistage Computerized Adaptive Testing. Applied Psychological Measurement, 23(3), 211–222.
- 13. Sympson, J. B., & Hetter, R. D. (1985). Controlling Item-Exposure Rates in Computerized Adaptive Testing. Proceedings of the 27th Annual Meeting of the Military Testing Association, 973–977. San Diego, CA: Navy Personnel Research and Development Center.
- 14. Holland, P. W., & Wainer, H. (Eds.). (1993). Differential Item Functioning. Hillsdale, NJ: Lawrence Erlbaum Associates.
- 15. Settles, B., LaFlair, G. T., & Hagiwara, M. (2020). Machine Learning–Driven Language Assessment. Transactions of the Association for Computational Linguistics, 8, 247–263.
- 16. Ghosh, A., & Lan, A. (2021). BOBCAT: Bilevel Optimization-Based Computerized Adaptive Testing. Proceedings of the 30th International Joint Conference on Artificial Intelligence (IJCAI-21), 2410–2417.
- 17. Corbett, A. T., & Anderson, J. R. (1995). Knowledge Tracing: Modeling the Acquisition of Procedural Knowledge. User Modeling and User-Adapted Interaction, 4(4), 253–278.
- 18. Vie, J.-J., & Kashima, H. (2023). Deep Knowledge Tracing is an Implicit Dynamic Multidimensional Item Response Theory Model. Proceedings of the 31st International Conference on Computers in Education (ICCE).