AI & Learning

Withholding the Answer

A system that hands over the answer is not teaching. We argue that the central design problem for a machine tutor is not how to explain, but when and how much to withhold.

The EuraStudy Team · 11 June 2026 · 9 min read · D·02

Fig. 01 · The tutoring loop with a fading scaffold. Help is dense at t₀ and is deliberately withdrawn toward tₙ, transferring regulation of the task from tutor to learner.

Abstract

The arrival of fluent generative models has made it trivially easy to build a tutor that answers any question instantly and completely. We argue that this capability, applied naïvely, is pedagogically counterproductive: a system optimised to resolve uncertainty on demand short-circuits the very cognitive work through which durable learning occurs. Drawing on scaffolding theory, the literature on formative assessment, and research on desirable difficulties and productive failure, we propose that the central design problem for machine tutors is not how to explain, but when and how much to withhold. We outline a principled stance on adaptive sequencing, argue that tutors must be evaluated on transfer and retention rather than engagement, and discuss the limits of present evidence.

Introduction: the answer is not the point

A tutor that always answers is, in an important sense, no tutor at all. The defining capability of a contemporary language model (producing a fluent, correct-sounding resolution to almost any posed question) is precisely the capability that, deployed without discipline, undermines learning. When a learner who is stuck receives the complete solution, the episode terminates: the desirable difficulty that would have driven encoding is removed, the retrieval that would have consolidated memory never occurs, and the learner leaves with the comfortable, and usually mistaken, impression of understanding [1].

This is not a complaint about correctness. The model may be entirely right. The problem is one of cognitive division of labour: a tutor that performs the difficult thinking on the learner's behalf produces a fluent transcript and an unchanged mind. The literature on this is older than the technology. Decades of work on worked examples, feedback, and self-explanation converge on a single uncomfortable observation: the conditions which make studying feel efficient are frequently the conditions under which least is later retained [2].

We take as our premise that the purpose of a tutor is to change what the learner can do unaided, later, on a different problem. Every design decision in a machine tutor should be evaluated against that end. In what follows we set out the pedagogical foundations that bear on this — scaffolding and the zone of proximal development, the formative role of feedback, and the counter-intuitive value of struggle — and then ask what they imply for systems we can now actually build.

Background: scaffolding and the zone of proximal development

Vygotsky's notion of the zone of proximal development (ZPD) names the gap between what a learner can accomplish independently and what they can accomplish with the guidance of a more capable other [3]. Instruction is effective, on this account, precisely when it operates inside that zone: tasks below it are merely consolidating, tasks above it produce only confusion, and tasks within it are where assisted performance today becomes independent performance tomorrow.

The operational counterpart of the ZPD is scaffolding, the term Wood, Bruner and Ross introduced for the temporary support an expert supplies so that a novice can complete a task otherwise beyond them [4]. The decisive and frequently forgotten feature of a scaffold is that it is temporary. The expert tutor calibrates support to the learner's current competence and then deliberately withdraws it as competence grows, a process sometimes called fading and the principle that Fig. 01 illustrates. The skill of good tutoring lies less in the giving of help than in its graduated removal.

A scaffold that is never removed is not a scaffold. It is a crutch — and the learner does not notice the difference until it is taken away.

This reframes the central design question. The interesting variable in a machine tutor is not the quality of its explanations (models are already extraordinary explainers) but its policy of withdrawal: how it decides, turn by turn, to offer less than it could. Bloom's celebrated finding that one-to-one tutoring can move the average student roughly two standard deviations above conventional classroom instruction is often invoked as the aspiration for these systems [5]. It is worth recalling what that human tutor actually did: not deliver more content, but continuously diagnose, prompt, and step back.
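One way to make a policy of withdrawal concrete is a small ladder of support levels that the tutor climbs down as the learner succeeds. The sketch below is illustrative only: the level names, the `FadingPolicy` class, and the rule of fading after two consecutive successes are assumptions made for this example, not a description of any deployed system.

```python
from dataclasses import dataclass

# Hypothetical support levels, ordered from most help to least.
LEVELS = ["worked_step", "targeted_hint", "open_prompt", "silence"]


@dataclass
class FadingPolicy:
    """Illustrative scaffold that fades as the learner succeeds."""
    level: int = 0       # index into LEVELS; 0 is the densest help
    streak: int = 0      # consecutive successes at the current level
    fade_after: int = 2  # assumed threshold, not an empirical value

    def update(self, succeeded: bool) -> str:
        """Record one attempt and return the support to offer next."""
        if succeeded:
            self.streak += 1
            if self.streak >= self.fade_after and self.level < len(LEVELS) - 1:
                self.level += 1  # withdraw one rung of support
                self.streak = 0
        else:
            self.streak = 0
            if self.level > 0:
                self.level -= 1  # restore one rung after a failure
        return LEVELS[self.level]


policy = FadingPolicy()
attempts = [True, True, True, True, False, True, True]
trace = [policy.update(ok) for ok in attempts]
print(trace)
```

The point of the design is that the default direction is downward: support is restored only on evidence of failure, and even then by a single rung.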

Formative against summative: the timing of feedback

A feedback signal can serve two distinct functions. Summative assessment certifies what a learner has achieved at the end of a period: the mark, the report, the grade. Formative assessment, by contrast, is feedback gathered and returned during learning in order to change its course. Black and Wiliam's influential synthesis argued that strengthening formative practice produces some of the largest gains available in education, but only when the feedback is acted upon rather than merely received [6].

Machine tutors are, in principle, formative instruments of unprecedented density: they can return feedback on every keystroke. This is an opportunity and a hazard in equal measure. Feedback is not uniformly beneficial. A meta-analysis by Kluger and DeNisi found that a substantial minority of feedback interventions actually depress performance, particularly when they direct attention to the self rather than to the task [7]. The relevant distinctions are captured in the framework below: feedback varies in its timing and in its specificity, and the two interact.

Fig. 02 · A framework for feedback, crossing timing (immediate to delayed) with specificity (mere verification to elaboration). Effective tutoring tends toward elaborative, well-timed guidance — neither a bare verdict nor a deferred grade.

The evidence on timing is genuinely subtle and resists slogans. Immediate verification helps a learner abandon a wrong procedure before it is rehearsed; yet a controlled delay can be superior for retention and transfer, because it forces the learner to hold and reconstruct their reasoning rather than outsource error-detection to the system [8]. The honest conclusion is that there is no universally optimal latency. A competent tutor treats timing as a decision variable conditioned on the task, the error, and the learner's stage, not as a fixed setting of "instant."

The case for restraint: productive struggle and desirable difficulties

Two convergent lines of research justify a tutor's restraint. The first is Bjork's programme on desirable difficulties: manipulations that slow acquisition and depress immediate performance (spacing practice, interleaving topics, requiring effortful retrieval rather than rereading) reliably improve long-term retention and transfer [9]. The difficulty is desirable precisely because it is difficult; remove it, and you remove the benefit. A tutor that smooths every obstacle is optimising the wrong curve.

Fig. 03 · Schematic, not data. The answer-giving tutor produces a quick rise in apparent fluency that plateaus on a delayed test; under scaffolded restraint the learner progresses more slowly, pays an early cost in difficulty, and overtakes at the crossover. The variable plotted is unaided performance, the only one that matters.

The second is Kapur's work on productive failure: learners who first attempt to solve a problem they have not been taught to solve (and who genuinely struggle, even fail) subsequently learn the canonical method more deeply than peers given direct instruction from the outset [10]. The failed attempt is not wasted; it activates and differentiates the prior knowledge into which the eventual explanation must be assimilated. The implication for a tutor is stark: the most valuable moment is often the one before it speaks.

This must be stated with care, because restraint can curdle into mere unhelpfulness. The relevant qualification comes from the worked-example effect and cognitive load theory: for genuine novices, unguided problem-solving can overwhelm working memory and teach nothing, and studying worked examples is then more efficient than struggling [11]. The two findings are reconciled by sequence and by the learner's state. Struggle is productive when the learner has enough prior knowledge to make the attempt meaningful; it is merely destructive when they do not. A principled tutor therefore does not adopt restraint as an ideology. It estimates whether this learner, on this task, is positioned to benefit from struggle, and only then withholds.
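The reconciliation described above amounts to a conditional rule: the form of support depends on an estimate of the learner's relevant prior knowledge. A minimal sketch follows, assuming such an estimate is available as a probability. The function name `choose_support`, the three labels, and both thresholds are hypothetical placeholders, not values drawn from the cited studies.

```python
def choose_support(p_prior_knowledge: float,
                   novice_below: float = 0.3,
                   ready_above: float = 0.6) -> str:
    """Map an estimate of relevant prior knowledge to a form of support.

    The thresholds are placeholders chosen for illustration only.
    """
    if not 0.0 <= p_prior_knowledge <= 1.0:
        raise ValueError("p_prior_knowledge must lie in [0, 1]")
    if p_prior_knowledge < novice_below:
        # Genuine novice: unguided struggle risks overload, so show a worked example.
        return "worked_example"
    if p_prior_knowledge < ready_above:
        # Partial knowledge: let the learner attempt, but keep hints available.
        return "attempt_with_hints"
    # Enough prior knowledge for struggle to be productive: withhold.
    return "withhold"


for estimate in (0.1, 0.45, 0.9):
    print(estimate, choose_support(estimate))
```

In practice the estimate itself is uncertain, so a rule like this is only as trustworthy as the model that feeds it.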

Fig. 04 · The zone of proximal development. Below the zone, tasks merely consolidate; above it, they overwhelm. A tutor earns its keep by working in the middle band — and by pushing the inner boundary outward, so that what required help becomes independent.

Adaptive sequencing: what it can and cannot responsibly do

If restraint must be conditioned on the learner's state, the tutor needs some estimate of that state. This is the proper domain of adaptive sequencing: selecting the next task, hint, or review to keep the learner inside their ZPD. The tradition is mature. Knowledge-tracing models, from Bayesian formulations to their deep-learning successors, infer the probability that a learner has mastered a skill from their pattern of responses [12]; spacing algorithms schedule review at the expanding intervals that retention research recommends [13].
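The Bayesian formulation mentioned above can be written in a few lines. The sketch below follows the standard knowledge-tracing update: condition the mastery probability on the observed response, then apply a learning transition. The function name `bkt_update` and all parameter values are illustrative assumptions, not fitted estimates.

```python
def bkt_update(p_mastery: float, correct: bool,
               p_learn: float = 0.1,
               p_guess: float = 0.2,
               p_slip: float = 0.1) -> float:
    """One step of Bayesian knowledge tracing for a single skill.

    p_guess: chance of a correct answer without mastery.
    p_slip:  chance of an incorrect answer despite mastery.
    p_learn: chance of acquiring the skill during this practice step.
    """
    if correct:
        numerator = p_mastery * (1 - p_slip)
        evidence = numerator + (1 - p_mastery) * p_guess
    else:
        numerator = p_mastery * p_slip
        evidence = numerator + (1 - p_mastery) * (1 - p_guess)
    posterior = numerator / evidence           # Bayes' rule on the response
    return posterior + (1 - posterior) * p_learn  # learning transition


p = 0.3  # assumed initial mastery probability
for response in [True, True, False, True]:
    p = bkt_update(p, response)
    print(f"response={response} -> p_mastery={p:.3f}")
```

Note how much work the guess and slip parameters do: a single correct answer moves the estimate substantially, yet the same answer is also what a lucky guess looks like.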

What such systems can responsibly do is real and valuable: maintain an appropriate level of challenge, surface a concept for review at the moment it is about to be forgotten, and avoid both the boredom of trivial tasks and the futility of impossible ones. What they cannot responsibly do is overclaim. A mastery estimate is a probabilistic inference from sparse, noisy behaviour, not a measurement of a mind. It is confounded by guessing, by slips, by the gulf between a fluent answer and genuine understanding, and by the simple fact that performing a skill in one narrow context predicts little about performing it in another.

The responsible posture, then, is calibrated humility. Adaptive sequencing should be treated as a prior over what to try next, continuously revised by evidence, and not as a verdict on the learner. The failure mode to fear is a confident model that narrows a learner's path on thin evidence — sequencing them away from challenge in the name of personalisation, and thereby denying the very difficulties from which durable learning is built.
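To illustrate what treating an estimate as a revisable prior might look like, the sketch below keeps a Beta distribution over a learner's success rate on a skill and reports its spread alongside its mean. This is a deliberately simple stand-in for a real learner model; the uniform Beta(1, 1) prior, the function names, and the `max_std` cutoff are assumptions made for the example.

```python
import math


def beta_posterior(successes: int, failures: int,
                   prior_a: float = 1.0, prior_b: float = 1.0):
    """Return (mean, std) of the Beta posterior over the success rate."""
    a = prior_a + successes
    b = prior_b + failures
    mean = a / (a + b)
    std = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, std


def evidence_is_thin(successes: int, failures: int,
                     max_std: float = 0.1) -> bool:
    """True when the posterior is too wide to justify narrowing the path.

    max_std is an arbitrary illustrative cutoff.
    """
    _, std = beta_posterior(successes, failures)
    return std > max_std


for s, f in [(1, 2), (3, 0), (18, 2)]:
    mean, std = beta_posterior(s, f)
    print(f"{s} right, {f} wrong: mean={mean:.2f}, std={std:.2f}, "
          f"thin={evidence_is_thin(s, f)}")
```

A sequencer built this way would keep offering challenge after one or two failures, because three observations do not yet amount to a verdict.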

Evaluation: measure the tutor by what the learner keeps

A design philosophy is only as honest as the metric it is willing to be judged by. Here the field is in danger, because the metrics that are easiest to collect are the ones least connected to learning. Engagement (time on platform, sessions per week, messages exchanged, satisfaction ratings) is a vanity metric in this domain [14]. A tutor that hands over answers will score superbly on every one of them: it is pleasant, fast, and frictionless, and it teaches nothing. Optimising a tutor for engagement actively selects for the answer-giving failure mode.

Engagement is the metric a tutor optimises when it has given up on teaching. Measure transfer, or measure nothing.

The defensible criteria are those the learning sciences have long insisted upon. The first is the learning gain measured on a delayed, unaided assessment: performance without the tutor present, after enough time has passed for fluency to fade and only what was encoded to remain. The second, and more demanding, is transfer: the ability to apply what was learned to problems superficially unlike those practised, which is the strongest available proxy for understanding as opposed to pattern-matching [15]. Where it can be measured, we should also attend to metacognitive outcomes, that is, whether the learner has grown better at judging what they do and do not know, since calibration of confidence is itself a transferable skill [16]. A tutor that improves immediate accuracy while degrading the learner's self-knowledge has done harm that no engagement curve will reveal.
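Two of these criteria can be computed with very little machinery. The sketch below uses the normalised gain (improvement as a fraction of available headroom) for the delayed assessment and the Brier score for confidence calibration. Both are common choices rather than the only defensible ones, and the function names and example figures are invented for illustration.

```python
def normalised_gain(pre: float, delayed_post: float) -> float:
    """Fraction of available headroom gained; scores are proportions in [0, 1]."""
    if pre >= 1.0:
        raise ValueError("no headroom: pre-test score is already at ceiling")
    return (delayed_post - pre) / (1.0 - pre)


def brier_score(confidences, outcomes) -> float:
    """Mean squared gap between stated confidence and correctness (0 is best).

    confidences: learner's stated probability of being right, per item.
    outcomes:    1 if the item was answered correctly, else 0.
    """
    if len(confidences) != len(outcomes) or len(confidences) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum((c - o) ** 2 for c, o in zip(confidences, outcomes))
    return total / len(confidences)


# Invented example: 40% on the pre-test, 70% on an unaided test weeks later.
print(normalised_gain(0.4, 0.7))
# Invented example: an overconfident learner on four items.
print(brier_score([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0]))
```

Neither number requires the tutor to be present at test time, which is the property that distinguishes them from engagement metrics.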

Fig. 05 · Two tutors compared across the dimensions that matter. The answer-giving tutor optimises the felt experience of a session; the teaching tutor optimises what survives it. The two are frequently in tension — which is the whole design problem.

Discussion and limitations

We have argued that the central design problem for a machine tutor is the disciplined withholding of help, and that this stance follows from a coherent body of pedagogical evidence. Several limitations temper the claim, and we state them plainly.

First, much of the cited evidence predates the technology and was established in human or simple computer-based settings; whether the effects survive at scale, in the hands of a fluent conversational model, is an empirical question that the field is only beginning to answer rigorously. Generalisation should be cautious. Second, restraint has a cost. Productive struggle shades into unproductive frustration, and the boundary is individual; a misjudged scaffold can demoralise as easily as a misjudged answer can spoil. The same intervention is desirable for one learner and destructive for another, which is exactly why calibration — not ideology — must govern it. Third, our preferred metrics, delayed transfer and metacognitive calibration, are expensive and slow to collect, and the field's commercial incentives push relentlessly toward the cheap engagement proxies we have criticised. There is a real risk that what is easy to measure will continue to crowd out what is worth measuring.

Finally, we have written as though the goal of the tutor were uncontested. It is not. Whether a system should optimise for measured learning, for the learner's autonomy and self-direction, or for some negotiated balance is a question of values that no evaluation metric can settle. Our position is narrow and, we hope, defensible: whatever a tutor is for, it cannot be for handing over answers, because that is the one thing we can say with confidence does not teach. The machine that can answer everything must learn, like every good teacher before it, the harder discipline of when to stay silent.

References

1. Koedinger, K. R., & Aleven, V. (2007). Exploring the assistance dilemma in experiments with cognitive tutors. Educational Psychology Review, 19(3), 239–264.
2. Bjork, R. A., & Bjork, E. L. (2011). Making things hard on yourself, but in a good way: Creating desirable difficulties to enhance learning. In Psychology and the Real World (pp. 56–64). Worth.
3. Vygotsky, L. S. (1978). Mind in Society: The Development of Higher Psychological Processes. Harvard University Press.
4. Wood, D., Bruner, J. S., & Ross, G. (1976). The role of tutoring in problem solving. Journal of Child Psychology and Psychiatry, 17(2), 89–100.
5. Bloom, B. S. (1984). The 2 sigma problem: The search for methods of group instruction as effective as one-to-one tutoring. Educational Researcher, 13(6), 4–16.
6. Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education, 5(1), 7–74.
7. Kluger, A. N., & DeNisi, A. (1996). The effects of feedback interventions on performance: A historical review and meta-analysis. Psychological Bulletin, 119(2), 254–284.
8. Shute, V. J. (2008). Focus on formative feedback. Review of Educational Research, 78(1), 153–189.
9. Soderstrom, N. C., & Bjork, R. A. (2015). Learning versus performance: An integrative review. Perspectives on Psychological Science, 10(2), 176–199.
10. Kapur, M. (2008). Productive failure. Cognition and Instruction, 26(3), 379–424.
11. Sweller, J., van Merriënboer, J. J. G., & Paas, F. (1998). Cognitive architecture and instructional design. Educational Psychology Review, 10(3), 251–296.
12. Corbett, A. T., & Anderson, J. R. (1995). Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4(4), 253–278.
13. Cepeda, N. J., Pashler, H., Vul, E., Wixted, J. T., & Rohrer, D. (2006). Distributed practice in verbal recall tasks: A review and quantitative synthesis. Psychological Bulletin, 132(3), 354–380.
14. Reich, J. (2020). Failure to Disrupt: Why Technology Alone Can't Transform Education. Harvard University Press.
15. Barnett, S. M., & Ceci, S. J. (2002). When and where do we apply what we learn? A taxonomy for far transfer. Psychological Bulletin, 128(4), 612–637.
16. Dunlosky, J., Rawson, K. A., Marsh, E. J., Nathan, M. J., & Willingham, D. T. (2013). Improving students' learning with effective learning techniques. Psychological Science in the Public Interest, 14(1), 4–58.
