AI & Learning

Twenty Questions

A good adaptive test can pin down what you know in a dozen questions, not fifty — because it chooses each one to be the most revealing it can ask. We trace the quiet mathematics of item response theory and computerized adaptive testing, from the shape of a single question to the loop that learns you in real time, and the places where adaptivity has to be reined in.

The EuraStudy Team · 13 min read · D·01
Fig. 01 · The item characteristic curve, drawn exactly. A single three-parameter item (a = 1.3, b = 0.4, c = 0.20): the difficulty b slides the curve left or right, the discrimination a sets how steeply it climbs through its midpoint, and the guessing floor c lifts the whole curve off zero. The probability of a correct answer is a smooth function of one hidden number — the student’s ability.
Abstract

A fixed test asks everyone the same questions, which means it asks almost everyone the wrong ones: too easy for the strong, too hard for the struggling, informative for neither. Computerized adaptive testing does what a skilled examiner does by instinct: it chooses each next question in light of the answers so far, homing in on a student’s ability with far fewer items. Underneath it sits item response theory, which models the probability of a correct answer as a function of a single latent ability and a few properties of the question: its difficulty, how sharply it discriminates, and the chance of a lucky guess. From those curves comes a precise notion of the information a question carries (maximal, it turns out, exactly when the question’s difficulty matches the student’s ability) and from that, an algorithm: estimate the ability, ask the most informative question, re-estimate, repeat, and stop when the estimate is precise enough. We follow that idea from Birnbaum’s logistic models and Lord’s information functions to the adaptive loop itself, then dwell on what the textbook account leaves out: the cold start, the need to cover a syllabus, the security cost of over-using the best questions, and the fairness of items that behave differently for equally able students. We close on the modern turn, where machine learning estimates a question’s difficulty before any human has piloted it, and on how this snapshot of ability connects to the moving picture that knowledge tracing draws.

Hand a hundred students the same fifty-question test and you have, in a quiet way, asked each of them the wrong question many times over. A question far above a student tells you almost nothing you did not already know — they will miss it, as expected. A question far below them is just as silent — they will get it, as expected. The only genuinely revealing question is the one poised at the edge of what a student can do, where the outcome is close to a coin-flip and the answer actually carries news. A good examiner feels this and adjusts on the fly, easing off after a stumble, pressing harder after a fluent run. A computerized adaptive test does the same thing arithmetically, and it is one of the most quietly successful ideas in the measurement of learning.

The framing that does it justice is an old parlour game. A well-played round of twenty questions can single out one object from millions, because each question is chosen to split the remaining possibilities roughly in half. An adaptive test plays the same game against a single unknown — your ability — and this piece is about the mathematics that lets a machine play it well. It has two halves: a model of how a single question behaves, and a rule for choosing the next one. The first is item response theory (IRT); the second is the adaptive loop built on top of it. We will also be honest about the part the textbooks rush past — the constraints that keep a clever question-picker from being a foolish one.

The paradox of the shorter test

Start with the claim that makes the whole enterprise worth the trouble. A well-constructed adaptive test can reach the same precision as a conventional fixed-form test using roughly half the number of items, and sometimes far fewer, while measuring strong and weak students with equal care [8, 10]. That is not a free lunch; it is the result of refusing to waste questions. A fixed test spends most of its length asking things that are, for any particular student, foregone conclusions. Move the questions to where the outcome is uncertain for this student and each one suddenly earns its place.

To do that, two things are needed. You need to know, for any question, how a student of a given ability is likely to answer it — the shape of the question. And you need a principled way to say which unanswered question would tell you the most, right now, about the student in front of you. Item response theory supplies the first; the idea of information supplies the second.

A question has a shape

Item response theory rests on a single, almost austere, premise: a student’s performance is driven by one latent trait (call it ability, $\theta$) placed on a scale conventionally centred at zero with a standard deviation of one, so that in practice nearly everyone falls between $\theta = -3$ and $\theta = +3$ [4, 6]. Each question, in turn, is described by a few numbers, and together they give the probability that a student of ability $\theta$ answers correctly. Plot that probability against $\theta$ and you get the item characteristic curve: a smooth S that rises from near-zero for the weakest students to near-certain for the strongest.

The models differ in how many knobs the curve has. The one-parameter, or Rasch, model gives every item the same shape and lets it differ only in difficulty [2]. The two-parameter (2PL) model adds a second knob, discrimination:

$$P(\theta) = \frac{1}{1 + e^{-a(\theta - b)}}$$

The three-parameter (3PL) model adds a third, a floor for guessing [3]:

$$P(\theta) = c + (1 - c)\,\frac{1}{1 + e^{-a(\theta - b)}}$$

Each parameter means something a teacher would recognise. The difficulty $b$ slides the curve horizontally: it is the ability at which a 2PL item is answered correctly half the time (with guessing, the curve instead passes through $(1+c)/2$ at $\theta = b$). The discrimination $a$ controls how steeply the curve rises through that midpoint. One detail is worth keeping precise: the maximum slope of a 2PL curve is exactly $a/4$, not $a$ [5]. A steep item sharply separates students just above and just below its difficulty; a shallow one barely distinguishes them. The guessing parameter $c$ lifts the lower tail off the floor: on a five-option multiple-choice item it sits near $0.2$, because even a student who knows nothing will sometimes land on the right answer.
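The three claims in that paragraph can be checked numerically. The following is a minimal sketch, not a calibrated implementation: the function name `icc` and the step size `h` are illustrative choices, and the item parameters are the ones quoted in the caption of Fig. 01.

```python
import math


def icc(theta: float, a: float = 1.0, b: float = 0.0, c: float = 0.0) -> float:
    """Probability of a correct answer under the 3PL model.

    With c = 0 this is the 2PL curve; with c = 0 and one shared a
    for every item it is the one-parameter curve.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# The item from Fig. 01: a = 1.3, b = 0.4, c = 0.20.
a, b, c = 1.3, 0.4, 0.20

p_at_b = icc(b, a, b, c)     # at theta = b the 3PL curve sits at (1 + c) / 2
p_low = icc(-6.0, a, b, c)   # far below b the curve flattens onto the floor c

# Central-difference slope of the matching 2PL curve (c = 0) at theta = b.
h = 1e-5
slope_2pl = (icc(b + h, a, b) - icc(b - h, a, b)) / (2 * h)

print(f"P(b) with guessing : {p_at_b:.3f}")      # 0.600, i.e. (1 + 0.2) / 2
print(f"lower tail         : {p_low:.3f}")
print(f"2PL slope at b     : {slope_2pl:.4f}")   # 0.3250, i.e. a / 4
```

Setting `c = 0` in the first call returns exactly 0.5, which is the sense in which $b$ marks the half-way point of a 2PL item.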

Fig. 02 · How two parameters move the curve. Left: raising the difficulty b shifts an item bodily to the right — the same shape, simply harder. Right: raising the discrimination a steepens the rise, so the item sorts students more sharply, but only across a narrow band of ability. Every curve is computed from the 2PL model, not eyeballed.

A small piece of history hides in that logistic curve. The original model was the normal ogive, written with the cumulative normal distribution (Thurstone in the 1920s; Lord in 1952); the logistic was adopted later because it is far easier to compute, and Birnbaum brought it into the theory in 1968 [3]. To keep the two interchangeable, the logistic exponent $a(\theta - b)$ is often multiplied by a scaling constant $D \approx 1.702$, a value chosen (by Haley, in 1952) precisely so the logistic and normal curves never differ by more than about $0.01$. None of this is decoration: it is what lets parameters estimated under one model be read on the other’s scale. And it is the first sign of IRT’s real gift over older test theory: item properties and student ability live on the same scale, each estimable, in principle, independently of the particular sample of people or questions you happened to use [1, 4].
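The size of that gap is easy to check by brute force. This sketch scans a grid of values and compares the standard normal distribution function with the logistic curve, with and without the constant; the grid range and spacing are arbitrary choices made for the illustration.

```python
import math

D = 1.702  # scaling constant quoted in the text


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function (the normal ogive)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def logistic(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


# Scan z from -6 to 6 in steps of 0.001 and record the largest gap.
grid = [i / 1000.0 for i in range(-6000, 6001)]
max_gap_scaled = max(abs(normal_cdf(z) - logistic(D * z)) for z in grid)
max_gap_unscaled = max(abs(normal_cdf(z) - logistic(z)) for z in grid)

print(f"largest gap with D    : {max_gap_scaled:.4f}")
print(f"largest gap without D : {max_gap_unscaled:.4f}")
```

With the constant the two curves stay within one hundredth of each other everywhere on the grid; without it the gap is more than ten times larger.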

What a question is worth

Here is the idea the whole article turns on. If you already have a rough estimate of a student’s ability, not every remaining question is equally useful — and you can say exactly how useful each one is. The measure is item information, and for a 2PL item it has a compact form:

$$I(\theta) = a^{2}\,P(\theta)\,\bigl(1 - P(\theta)\bigr)$$

The name is not a metaphor. It is Fisher information: R. A. Fisher’s notion of the precision a single observation buys you about an unknown quantity, defined as the reciprocal of the variance with which you could estimate it [5]. Read the formula and the strategy of adaptive testing falls out of it. The product $P(1-P)$ is largest when $P = 0.5$, that is, when the item’s difficulty matches the student’s ability ($\theta = b$), the coin-flip question from the opening. Push the question too far above or below the student and $P$ races toward $0$ or $1$, the product collapses, and the item tells you almost nothing. The discrimination enters as $a^2$, so a sharp item is worth a great deal (its information peak rises as $a^2/4$), but a glance at the curve shows that this tall peak is also narrow: a highly discriminating item is enormously informative about students near its difficulty and nearly useless about everyone else. (Guessing only makes this worse: a non-zero $c$ both lowers an item’s information and nudges its peak slightly above $b$, draining precision from exactly the low-ability students who are hardest to measure [5].)
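For readers who want to see where the compact form comes from, here is a short derivation. It starts from the general expression for the Fisher information carried by one right-or-wrong response, which is standard but is not stated in the text above, so treat it as a supporting sketch rather than part of the original argument.

```latex
\begin{aligned}
I(\theta) &= \frac{\bigl[P'(\theta)\bigr]^{2}}{P(\theta)\,\bigl(1 - P(\theta)\bigr)}
  && \text{(one binary response)} \\[4pt]
\text{2PL:}\quad P'(\theta) &= a\,P(\theta)\,\bigl(1 - P(\theta)\bigr)
  \;\Longrightarrow\;
  I(\theta) = a^{2}\,P(\theta)\,\bigl(1 - P(\theta)\bigr),
  \qquad I(b) = a^{2}\cdot\tfrac{1}{2}\cdot\tfrac{1}{2} = \frac{a^{2}}{4} \\[4pt]
\text{3PL:}\quad P'(\theta) &= \frac{a\,\bigl(P(\theta) - c\bigr)\bigl(1 - P(\theta)\bigr)}{1 - c}
  \;\Longrightarrow\;
  I(\theta) = a^{2}\,\frac{1 - P(\theta)}{P(\theta)}
  \left(\frac{P(\theta) - c}{1 - c}\right)^{2}
\end{aligned}
```

Setting $c = 0$ in the last line recovers the 2PL form. For $c > 0$ the 3PL information equals the 2PL information of the same $a$ and $b$ multiplied by a factor smaller than one, which is the loss of precision the parenthetical remark describes.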

Fig. 03 · What a question is worth. Each item’s information peaks exactly at the ability that matches its difficulty (θ = b) and falls away on either side; a more discriminating item is worth a great deal — but only to students near its difficulty (taller, and narrower). This is why an adaptive test asks you questions poised at the edge of what you can do.

So the question to ask next is not the hardest one, nor the one a syllabus lists first. It is the most informative question available: among items of similar discrimination, the one whose difficulty sits closest to your current best guess of the student’s ability. Everything else in adaptive testing is built on that single observation.
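As a small illustration of that choice, the sketch below scores a four-item bank at one ability estimate and picks the most informative item. The bank, its labels and the estimate `theta_hat` are invented for the example.

```python
import math


def p_2pl(theta: float, a: float, b: float) -> float:
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def info_2pl(theta: float, a: float, b: float) -> float:
    """Item information for a 2PL item: a^2 * P * (1 - P)."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)


# Hypothetical bank of (label, a, b) triples.
bank = [
    ("easy, sharp", 2.0, -1.5),
    ("medium, shallow", 0.8, 0.3),
    ("medium, sharp", 1.6, 0.5),
    ("hard, sharp", 2.0, 1.8),
]

theta_hat = 0.4  # current ability estimate (assumed for the example)

for label, a, b in bank:
    print(f"{label:16s} information at theta_hat = {info_2pl(theta_hat, a, b):.3f}")

best = max(bank, key=lambda item: info_2pl(theta_hat, item[1], item[2]))
print("ask next:", best[0])
```

The sharp item with difficulty near the estimate wins. The shallow item that sits just as close in difficulty scores lower than the sharp but distant hard item, which shows that discrimination matters as well as distance.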

Precision you can see

Information has two more properties that turn it from an idea into an instrument. First, under the model’s assumption that answers are independent once ability is fixed, information simply adds up across the questions on a test:

$$I(\theta) = \sum_i I_i(\theta)$$

Second, that total converts directly into precision. The standard error of the ability estimate is

$$\mathrm{SE}(\theta) = \frac{1}{\sqrt{I(\theta)}}$$

so more information (better-targeted questions, or simply more of them) means a smaller error bar [4, 5]. This is the quiet revolution IRT works on the older classical test theory, which hands back a single reliability and a single standard error for the entire test, the same for a struggling student and a brilliant one. IRT instead makes precision a function of ability: a test can measure the middle of the range beautifully and the extremes poorly, and the test information curve shows you exactly where it is sharp and where it is blunt [4].
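The two formulas combine into a few lines of code. This is a sketch under the 2PL model with a made-up five-item form; the names `total_information` and `standard_error` are illustrative.

```python
import math


def info_2pl(theta: float, a: float, b: float) -> float:
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def total_information(theta: float, items) -> float:
    """Test information: the sum of the item information functions."""
    return sum(info_2pl(theta, a, b) for a, b in items)


def standard_error(theta: float, items) -> float:
    """Standard error of the ability estimate: 1 / sqrt(information)."""
    return 1.0 / math.sqrt(total_information(theta, items))


# Hypothetical fixed form: five items of equal discrimination,
# with difficulties clustered around the middle of the scale.
form = [(1.2, b) for b in (-1.0, -0.5, 0.0, 0.5, 1.0)]

for theta in (-3.0, 0.0, 3.0):
    print(f"theta = {theta:+.1f}  SE = {standard_error(theta, form):.2f}")
```

The form measures the middle of the scale far more precisely than the extremes. Because the error is the reciprocal square root of the information, halving it takes four times the information, which is why well-aimed items matter more than extra ones.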

Fig. 04 · Precision you can see. Add the items’ information together and you get the test information curve (solid); its reciprocal square root is the standard error of the ability estimate (dashed). Where the test carries the most information it measures most precisely — and, unlike classical test theory, that precision is a function of ability, not one number stamped on the whole test.

Now the engine is assembled. To drive the error bar down as fast as possible, spend each question where information is highest — which is to say, near the student’s ability — and watch the standard error fall with every well-aimed answer. That is the whole of adaptive testing in one sentence; the rest is making it work.

The loop that learns you

As an algorithm, computerized adaptive testing is a short loop [9, 10, 11]:

  • Start. With no answers yet, seed the ability estimate at the population mean and choose a first item of middling difficulty.
  • Select. Of the items not yet used, pick the one with the most information at the current estimate — the maximum-information rule.
  • Administer and score the item.
  • Re-estimate the ability from every answer so far.
  • Stop when the standard error drops below a target (a variable-length test) or a fixed budget of items is spent; otherwise, loop back to select.
Fig. 05 · The adaptive loop. From a prior estimate, the test repeatedly selects the most informative unused item for the current estimate, administers it, and re-estimates the ability — then stops once the standard error is small enough. Every pass refines the same single number at the centre.

The re-estimation step hides a genuine subtlety, and it is where the cold start bites. The natural estimator, maximum likelihood, has no finite value while a student’s answers are all correct or all wrong, and after the very first question they always are [4]. Adaptive tests therefore lean on Bayesian estimation, most often expected a posteriori (EAP): treat the population distribution as a prior and report the mean of the posterior after each answer. It is finite from the first item, needs no iteration, and was designed by Bock and Mislevy in 1982 for exactly this setting: running a live ability estimate on the slender computers of the day [7]. A related caution shapes the early items: when the estimate is still far from the truth, the local information at a wrong estimate can point to the wrong questions, so methods that weigh information more globally, such as Chang and Ying’s use of Kullback–Leibler information, behave more steadily at the start [11].
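Both halves of that argument can be seen in a few lines. The sketch below assumes 2PL items and a standard normal prior, and approximates the posterior on an evenly spaced grid; the grid limits, the spacing and the function names are choices made for the illustration, not the quadrature scheme of any particular program.

```python
import math

GRID = [-4.0 + 0.1 * i for i in range(81)]  # evaluation points for theta


def p_2pl(theta: float, a: float, b: float) -> float:
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def log_likelihood(theta: float, responses) -> float:
    """Log-likelihood of a list of (a, b, correct) responses at theta."""
    total = 0.0
    for a, b, correct in responses:
        p = p_2pl(theta, a, b)
        total += math.log(p if correct else 1.0 - p)
    return total


def eap(responses):
    """Posterior mean and standard deviation of theta, N(0, 1) prior."""
    weights = []
    for theta in GRID:
        w = math.exp(-0.5 * theta * theta)        # unnormalised prior density
        w *= math.exp(log_likelihood(theta, responses))
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)


one_right = [(1.0, 0.0, True)]   # a single correct answer to a middling item
one_wrong = [(1.0, 0.0, False)]

# The likelihood keeps climbing as theta grows, so it has no finite maximum.
print([round(log_likelihood(t, one_right), 3) for t in (0.0, 2.0, 4.0)])

# The EAP estimate is finite from the first answer.
print("after one correct answer:", eap(one_right))
print("after one wrong answer  :", eap(one_wrong))
```

With no responses at all, `eap([])` returns the prior itself, which is the seeded starting estimate the loop's first step describes.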

Fig. 06 · Homing in. The ability estimate (the stepped line) starts at the population mean and is pulled toward the true ability as answers arrive; the shaded band is its standard error. Both are computed here by expected-a-posteriori estimation over the response sequence shown — a correct answer nudges the estimate up, a wrong one down, and the band narrows as information accumulates.

Run the loop and the estimate marches toward the truth while its error bar contracts around it: a correct answer lifts the estimate, a wrong one lowers it, and each well-aimed question shaves more off the uncertainty than a randomly chosen one would. This is the source of the efficiency claim we began with: the test is confident about a student long before a fixed form of the same precision would have finished [8].
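The loop can be sketched end to end in well under a hundred lines. Everything specific here is an assumption made for the demonstration: a simulated 2PL bank of 200 items, a simulated examinee with a known true ability, EAP scoring on a grid, a standard-error target of 0.35 and a cap of 30 items.

```python
import math
import random

GRID = [-4.0 + 0.1 * i for i in range(81)]  # evaluation points for theta


def p_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def info_2pl(theta, a, b):
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)


def eap(responses):
    """Posterior mean and SD of theta under a standard normal prior."""
    weights = []
    for theta in GRID:
        w = math.exp(-0.5 * theta * theta)
        for a, b, correct in responses:
            p = p_2pl(theta, a, b)
            w *= p if correct else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)


def run_cat(bank, answer, se_target=0.35, max_items=30):
    """bank: list of (a, b); answer: callable (a, b) -> bool."""
    responses, used = [], []
    theta_hat, se = eap(responses)          # Start: prior mean and prior SD
    while se > se_target and len(used) < min(max_items, len(bank)):
        # Select: the most informative unused item at the current estimate.
        candidates = [i for i in range(len(bank)) if i not in used]
        nxt = max(candidates, key=lambda i: info_2pl(theta_hat, *bank[i]))
        a, b = bank[nxt]
        # Administer and score.
        responses.append((a, b, answer(a, b)))
        used.append(nxt)
        # Re-estimate from every answer so far.
        theta_hat, se = eap(responses)
    return theta_hat, se, used


# Simulated bank and examinee (all values invented for the demonstration).
rng = random.Random(7)
bank = [(rng.uniform(0.8, 2.0), rng.uniform(-3.0, 3.0)) for _ in range(200)]
true_theta = 1.0


def simulated_answer(a, b):
    return rng.random() < p_2pl(true_theta, a, b)


theta_hat, se, used = run_cat(bank, simulated_answer)
print(f"items used: {len(used)}  estimate: {theta_hat:.2f}  SE: {se:.2f}")
```

Swapping the `max(...)` selection for `rng.choice(candidates)` turns this into a random-item baseline with the same stopping rule, which is a simple way to compare test lengths under the same assumptions.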

Where adaptivity has to be reined in

A test that only ever chased maximum information would be efficient and, in several ways, irresponsible. The interesting engineering of real systems is in the constraints layered on top of the simple rule.

The first is content. An exam must cover a blueprint (so many items of algebra, so many of geometry) and the most informative item at a given moment may be the wrong topic entirely. Constrained selection handles this by, in effect, assembling a complete test that satisfies every requirement and then administering its most informative item: the weighted-deviations method, and van der Linden’s elegant shadow-test approach, both keep the syllabus whole while still adapting [9].
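The shadow-test and weighted-deviations methods need an optimisation solver, so the sketch below swaps in something much simpler to show the basic tension: a greedy heuristic that first picks the content area furthest behind its quota and only then maximises information inside it. This is not either of the named methods, and the bank, topics and quotas are invented.

```python
import math


def info_2pl(theta, a, b):
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def next_item(bank, used, blueprint, theta_hat):
    """Greedy content balancing over a bank of (topic, a, b) tuples.

    blueprint maps each topic to the number of items it must contribute.
    Returns the index of the next item, or None once every quota is met.
    """
    given = {topic: 0 for topic in blueprint}
    for i in used:
        given[bank[i][0]] += 1
    # Topics that still owe items and still have unused items to offer.
    open_topics = [
        t for t in blueprint
        if given[t] < blueprint[t]
        and any(i not in used and bank[i][0] == t for i in range(len(bank)))
    ]
    if not open_topics:
        return None
    # Serve the topic that is proportionally furthest behind its quota.
    topic = min(open_topics, key=lambda t: given[t] / blueprint[t])
    candidates = [i for i in range(len(bank))
                  if i not in used and bank[i][0] == topic]
    return max(candidates, key=lambda i: info_2pl(theta_hat, bank[i][1], bank[i][2]))


# Hypothetical bank and blueprint.
bank = [
    ("algebra", 1.5, 0.0),
    ("algebra", 1.4, 0.2),
    ("geometry", 0.9, 0.1),
    ("geometry", 1.2, -2.0),
]
blueprint = {"algebra": 1, "geometry": 1}

first = next_item(bank, [], blueprint, 0.0)
second = next_item(bank, [first], blueprint, 0.0)
unconstrained = max((i for i in range(len(bank)) if i != first),
                    key=lambda i: info_2pl(0.0, bank[i][1], bank[i][2]))
print(first, second, unconstrained)
```

After the first algebra item, pure maximum information would reach for the second algebra item, while the blueprint forces a less informative geometry item instead. That gap is the price of coverage that the more sophisticated methods try to minimise.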

The second is exposure and security. The maximum-information rule keeps reaching for the same handful of excellent items, which means those items are seen by almost everyone, and a question that everyone sees is a question that leaks. Exposure controls deliberately hold the best items back some fraction of the time (the Sympson–Hetter method [13]), and a-stratified designs save the most discriminating items for late in the test, when the estimate is good enough to deserve them, spreading the load across the bank [12]. The stakes are concrete: the Duolingo English Test reports a mean item-exposure rate near 0.1%, against the double-digit percentages typical of older operational adaptive tests. That is the difference between an item pool that stays secret and one that quietly ends up online [15].
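Both ideas reduce to small changes in the selection step. The sketch below shows only the administration-time part of each: a probabilistic filter in the style of Sympson and Hetter, and a split of the bank into discrimination strata. The exposure parameters `k_values` are invented here; in the full Sympson–Hetter procedure they are tuned by repeated simulation, and an item that is passed over is set aside for that examinee.

```python
import math
import random


def info_2pl(theta, a, b):
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def select_with_exposure_control(bank, used, theta_hat, k, rng):
    """Walk unused items from most to least informative; administer item i
    with probability k[i], otherwise move on to the next candidate."""
    ranked = sorted(
        (i for i in range(len(bank)) if i not in used),
        key=lambda i: info_2pl(theta_hat, *bank[i]),
        reverse=True,
    )
    for i in ranked:
        if rng.random() < k[i]:
            return i
    # Fallback so the test can continue if every candidate was held back.
    return ranked[-1] if ranked else None


def a_strata(bank, n_strata):
    """Split item indices into strata of ascending discrimination."""
    order = sorted(range(len(bank)), key=lambda i: bank[i][0])
    size = math.ceil(len(order) / n_strata)
    return [order[s:s + size] for s in range(0, len(order), size)]


def select_in_stratum(bank, stratum, used, theta_hat):
    """Within a stratum, pick the unused item whose difficulty is closest."""
    candidates = [i for i in stratum if i not in used]
    return min(candidates, key=lambda i: abs(bank[i][1] - theta_hat))


demo_bank = [(2.0, 0.0), (1.8, 0.1), (0.7, 0.0), (0.9, -0.2)]  # hypothetical (a, b)
k_values = [0.0, 1.0, 1.0, 1.0]  # hypothetical: item 0 is always held back
rng = random.Random(0)

chosen = select_with_exposure_control(demo_bank, [], 0.0, k_values, rng)
strata = a_strata(demo_bank, 2)
early_pick = select_in_stratum(demo_bank, strata[0], [], 0.0)
print(chosen, strata, early_pick)
```

In the demonstration the single best item is held back, so the filter falls through to the runner-up, and the first stratum contains only the two least discriminating items, which is where an a-stratified test would begin.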

The third, and the most important, is fairness. An item exhibits differential item functioning (DIF) when students of the same ability but different groups have different probabilities of answering it correctly, that is, when its characteristic curve depends on who is taking it, beyond the ability it is meant to measure. That is a validity problem before it is a fairness problem, and it is screened with care, by methods from the Mantel–Haenszel procedure to comparing the area between two groups’ curves [14, 1]. Adaptive delivery inherits the issue wholesale: a biased item is biased no matter how much information it carries, and an efficient test that quietly disadvantages a group has optimised the wrong thing.
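The Mantel–Haenszel statistic is simple enough to compute by hand. The sketch below pools 2×2 tables across matched ability levels into a common odds ratio and converts it to the delta scale commonly used for reporting; the counts are invented for the illustration, and a real screening would add a significance test and many more strata.

```python
import math


def mantel_haenszel(strata):
    """Common odds ratio across matched ability strata.

    Each stratum is a tuple of counts for one ability level:
    (reference correct, reference wrong, focal correct, focal wrong).
    Returns (alpha, delta), where delta = -2.35 * ln(alpha).
    """
    num = den = 0.0
    for ref_right, ref_wrong, foc_right, foc_wrong in strata:
        n = ref_right + ref_wrong + foc_right + foc_wrong
        if n == 0:
            continue  # an empty stratum contributes nothing
        num += ref_right * foc_wrong / n
        den += ref_wrong * foc_right / n
    alpha = num / den
    return alpha, -2.35 * math.log(alpha)


# Illustrative counts only: two ability strata, 50 students per group in each.
no_dif = [(30, 20, 30, 20), (40, 10, 40, 10)]
with_dif = [(40, 10, 30, 20), (45, 5, 40, 10)]

print(mantel_haenszel(no_dif))    # odds ratio 1, delta 0: the groups match
print(mantel_haenszel(with_dif))  # odds ratio above 1, delta below 0
```

An odds ratio above one, and so a negative delta, means that among equally able students the reference group is more likely to answer correctly, which flags the item for review.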

It is worth saying plainly that adaptivity is a choice, not an upgrade. Some of the highest-stakes computerized exams are deliberately fixed-length, the United States Medical Licensing Examination among them, while others, like the NCLEX nursing examination, stop only once the test is statistically confident a candidate is above or below the passing standard. The right design depends on whether you are placing someone on a scale or simply deciding which side of a line they fall on.
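A pass-or-fail stopping rule of that kind can be sketched as a confidence-interval check. This is a generic illustration, not the published rule of any particular examination: the cut score, the multiplier `z` and the function name are all assumptions.

```python
def classification_decision(theta_hat: float, se: float,
                            cut: float, z: float = 1.96) -> str:
    """Stop with a decision once the interval theta_hat +/- z * se
    lies entirely on one side of the cut score."""
    lower = theta_hat - z * se
    upper = theta_hat + z * se
    if lower > cut:
        return "pass"
    if upper < cut:
        return "fail"
    return "continue"


# Hypothetical cut score of 0.0 on the theta scale.
print(classification_decision(1.0, 0.30, cut=0.0))    # pass
print(classification_decision(0.2, 0.30, cut=0.0))    # continue
print(classification_decision(-1.0, 0.30, cut=0.0))   # fail
```

A candidate far from the line is classified after few items, while one sitting near it keeps being tested, which is the opposite of a rule that stops at a fixed standard error.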

When the test learns to write itself

The classical pipeline has an expensive prerequisite hiding in plain sight: before an item can be used, its difficulty, discrimination and guessing must be calibrated by piloting it on hundreds of students. That is slow, costly, and a poor fit for a world that wants fresh questions constantly. The most striking recent development attacks exactly this. The Duolingo English Test uses machine learning to predict an item’s difficulty directly from its text, calibrating questions before a single examinee has seen them, and then delivers them adaptively; its machine-learned proficiency estimates line up with the classical IRT ability estimates at a rank correlation around 0.96 [15]. Other work reaches deeper into the loop, learning the selection policy itself from data rather than trusting the fixed maximum-information rule: BOBCAT casts adaptive testing as a bilevel optimisation and learns a data-driven question-picker that shortens tests further [16].

There is also a quieter connection, and it leads straight back to a companion idea. Item response theory draws a snapshot: a student’s ability fixed at the moment of the test. Knowledge tracing, the subject of a companion dispatch, draws the moving picture: competence changing as a student learns [17]. They are two views of the same latent thing, and recent work makes the bridge explicit, reading deep knowledge tracing as a kind of dynamic, multidimensional item response model [18]. Measurement and learning, it turns out, are not separate theories so much as the still photograph and the film of one quantity we can never see directly.

What we read from it

EuraStudy is built on a reviewed bank of roughly three thousand exam-style questions, each tagged to a topic — the raw material any of this machinery needs. The IRT lens is, more than anything, a discipline about what a question is worth. A question has no fixed value; its worth depends entirely on who is answering it. The most useful question is rarely the hardest or the easiest — it is the one balanced at the edge of what a student can currently do. And precision is not a single number stamped on a test but something you can point at, target, and stop chasing once you have enough of it.

The honest reading is that the method and its limits are inseparable. A shorter test is only a better test when it is pointed in the right direction; an efficient one is worthless if it is unfair, narrow, or insecure. The aim of measuring a student well was never to ask the most questions. It is to ask the right ones — and, like a good player of twenty questions, to know exactly when you have heard enough to stop.

References

  1. Lord, F. M. (1980). Applications of Item Response Theory to Practical Testing Problems. Hillsdale, NJ: Lawrence Erlbaum Associates.
  2. Rasch, G. (1960). Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen: Danish Institute for Educational Research (repr. 1980, University of Chicago Press).
  3. Birnbaum, A. (1968). Some Latent Trait Models and Their Use in Inferring an Examinee’s Ability. In F. M. Lord & M. R. Novick, Statistical Theories of Mental Test Scores (pp. 397–479). Reading, MA: Addison-Wesley.
  4. Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of Item Response Theory. Newbury Park, CA: Sage Publications.
  5. Baker, F. B. (2001). The Basics of Item Response Theory (2nd ed.). College Park, MD: ERIC Clearinghouse on Assessment and Evaluation.
  6. Embretson, S. E., & Reise, S. P. (2000). Item Response Theory for Psychologists. Mahwah, NJ: Lawrence Erlbaum Associates.
  7. Bock, R. D., & Mislevy, R. J. (1982). Adaptive EAP Estimation of Ability in a Microcomputer Environment. Applied Psychological Measurement, 6(4), 431–444.
  8. Weiss, D. J. (1982). Improving Measurement Quality and Efficiency with Adaptive Testing. Applied Psychological Measurement, 6(4), 473–492.
  9. van der Linden, W. J., & Glas, C. A. W. (Eds.). (2010). Elements of Adaptive Testing. New York: Springer.
  10. Wainer, H., Dorans, N. J., Eignor, D., Flaugher, R., Green, B. F., Mislevy, R. J., Steinberg, L., & Thissen, D. (2000). Computerized Adaptive Testing: A Primer (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
  11. Chang, H.-H. (2015). Psychometrics Behind Computerized Adaptive Testing. Psychometrika, 80(1), 1–20.
  12. Chang, H.-H., & Ying, Z. (1999). a-Stratified Multistage Computerized Adaptive Testing. Applied Psychological Measurement, 23(3), 211–222.
  13. Sympson, J. B., & Hetter, R. D. (1985). Controlling Item-Exposure Rates in Computerized Adaptive Testing. Proceedings of the 27th Annual Meeting of the Military Testing Association, 973–977. San Diego, CA: Navy Personnel Research and Development Center.
  14. Holland, P. W., & Wainer, H. (Eds.). (1993). Differential Item Functioning. Hillsdale, NJ: Lawrence Erlbaum Associates.
  15. Settles, B., LaFlair, G. T., & Hagiwara, M. (2020). Machine Learning–Driven Language Assessment. Transactions of the Association for Computational Linguistics, 8, 247–263.
  16. Ghosh, A., & Lan, A. (2021). BOBCAT: Bilevel Optimization-Based Computerized Adaptive Testing. Proceedings of the 30th International Joint Conference on Artificial Intelligence (IJCAI-21), 2410–2417.
  17. Corbett, A. T., & Anderson, J. R. (1995). Knowledge Tracing: Modeling the Acquisition of Procedural Knowledge. User Modeling and User-Adapted Interaction, 4(4), 253–278.
  18. Vie, J.-J., & Kashima, H. (2023). Deep Knowledge Tracing is an Implicit Dynamic Multidimensional Item Response Theory Model. Proceedings of the 31st International Conference on Computers in Education (ICCE).
