Verifier-Calibrated On-Policy Distillation: A Practical Algorithm for Teaching Models Without Making Them Forget

Most post-training methods can be reduced to one practical question: what distribution are we moving the model toward? This republished AIML Labs note argues that verifier-calibrated on-policy distillation is a useful hybrid when you want new capability without broad forgetting.

A language model is a probability distribution over sequences. Supervised fine-tuning, reinforcement learning, and on-policy distillation all reshape that distribution in different ways.

The core proposal is simple: let the student generate the trajectory, let a verifier score whether the trajectory succeeded, and use teacher logits only as dense guidance inside that verified frame.

That gives you the locality of on-policy learning, the density of distillation, and a cleaner answer to the credit-assignment problem than a single end-of-trajectory reward can provide.

The distribution question

Every post-training method is a choice about where probability mass should move.

SFT

Pulls the model toward a fixed external dataset. Strong for task format and cold-start behavior, but broad token pressure can damage older capabilities.

RL

Updates from the model's own samples, which keeps learning local to the policy it already visits, but the reward signal is usually sparse.

On-policy distillation

Keeps on-policy locality while adding dense teacher supervision on the prefixes the student actually produced.
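
One way to make the three choices concrete is to write each as an objective. The forms below are the standard textbook versions, not formulas taken from the original note. The notation is an assumption for illustration: \pi_\theta is the student, \pi_T the teacher, \mathcal{D} a fixed demonstration set, R an outcome reward. The KL direction shown for on-policy distillation (student to teacher) is one common choice, not something the note specifies.

```latex
\begin{align*}
\text{SFT:}\quad & \min_\theta \; \mathbb{E}_{(x,y)\sim \mathcal{D}} \Big[ -\sum_{t} \log \pi_\theta(y_t \mid x, y_{<t}) \Big] \\
\text{RL:}\quad & \max_\theta \; \mathbb{E}_{x,\; y\sim \pi_\theta(\cdot \mid x)} \big[ R(x, y) \big] \\
\text{On-policy distillation:}\quad & \min_\theta \; \mathbb{E}_{x,\; y\sim \pi_\theta(\cdot \mid x)} \Big[ \sum_{t} \mathrm{KL}\big( \pi_\theta(\cdot \mid x, y_{<t}) \,\|\, \pi_T(\cdot \mid x, y_{<t}) \big) \Big]
\end{align*}
```

The difference is in the expectation: SFT averages over an external dataset, while the other two average over the student's own samples. On-policy distillation then swaps the single scalar reward for a per-token term.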

Core algorithm

Verifier-Calibrated On-Policy Distillation

Sample from the student, score with a verifier, use teacher logits as dense hints, clip dangerous updates, and replay older skills on-policy so the model learns from the states it actually visits.

1. Use a small amount of SFT only for task shape

If the base model cannot follow the task interface, teach the format first with a short, low-learning-rate SFT pass. Do not use this phase to force deep capability transfer.

2. Generate rollouts from the current student

For each prompt, sample multiple completions from the current student policy. The point is to train on the states the student really visits, including imperfect prefixes.
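
A minimal sketch of this step, assuming a toy stand-in for the student. Here `toy_student`, `sample_rollout`, and `sample_group` are hypothetical names invented for illustration; in a real loop the policy call would be your model's sampling API.

```python
import random

def toy_student(prefix):
    """Hypothetical student policy: maps a token prefix to next-token probabilities."""
    if len(prefix) >= 3:
        return {"<eos>": 1.0}
    return {"a": 0.5, "b": 0.3, "<eos>": 0.2}

def sample_rollout(policy, rng, max_len=8):
    """Sample one completion token by token from the current policy."""
    tokens = []
    for _ in range(max_len):
        probs = policy(tuple(tokens))
        choices, weights = zip(*probs.items())
        tok = rng.choices(choices, weights=weights, k=1)[0]
        if tok == "<eos>":
            break
        tokens.append(tok)
    return tokens

def sample_group(policy, prompt_ids, k, seed=0):
    """Draw k rollouts per prompt; every prefix in them is one the student visited."""
    rng = random.Random(seed)
    return {p: [sample_rollout(policy, rng) for _ in range(k)] for p in prompt_ids}

groups = sample_group(toy_student, ["prompt-1", "prompt-2"], k=4)
```

Sampling several completions per prompt also gives a natural per-prompt baseline: rollouts can be compared with their siblings instead of with an absolute reward scale.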

3. Score each rollout with a verifier

Use a reward that checks true task success. For code repair, the verifier should reward passing tests while penalizing unnecessary rewrites, extra complexity, and unrelated edits.
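
As a rough sketch of such a code-repair verifier: the function names, the edit budget, and the penalty size below are all illustrative assumptions, not values from the original note. Test results are passed in as counts, so running the test suite is left out.

```python
import difflib

def changed_line_count(original, patched):
    """Count lines added or removed between two source strings."""
    diff = difflib.ndiff(original.splitlines(), patched.splitlines())
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))

def verifier_reward(tests_passed, tests_total, original, patched,
                    edit_budget=4, edit_penalty=0.05):
    """Reward the test pass rate, minus a penalty for edits beyond a small budget.

    edit_budget and edit_penalty are made-up knobs for illustration.
    """
    if tests_total == 0:
        return 0.0
    pass_rate = tests_passed / tests_total
    excess = max(0, changed_line_count(original, patched) - edit_budget)
    return max(0.0, pass_rate - edit_penalty * excess)

buggy = "def add(a, b):\n    return a - b\n"
minimal_fix = "def add(a, b):\n    return a + b\n"
full_rewrite = "\n".join(f"x{i} = {i}" for i in range(10))

small = verifier_reward(3, 3, buggy, minimal_fix)
large = verifier_reward(3, 3, buggy, full_rewrite)
```

Both patches pass every test here, but the minimal fix scores higher because the rewrite touches far more lines than the budget allows. Line counts are a crude proxy for unnecessary rewrites; a production verifier would likely need something more careful.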

4. Query the teacher on the student's trajectory

Run the teacher on each student prefix and collect token logits as dense hints. The teacher can be a stronger model, a specialist model, or the same model with privileged context.
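
One way to turn teacher logits into a dense per-token signal is a KL divergence at every position of the student's own rollout. The sketch below assumes you already have aligned logit vectors from both models at each student prefix, and it uses the student-to-teacher (reverse) direction. That direction is an assumption for illustration; the note does not fix it.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def per_token_reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at each position of a student-generated sequence.

    Both arguments are lists of per-position logit vectors over the same vocabulary,
    evaluated on the same student prefixes.
    """
    kls = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        s_lp, t_lp = log_softmax(s_row), log_softmax(t_row)
        kls.append(sum(math.exp(s) * (s - t) for s, t in zip(s_lp, t_lp)))
    return kls

# Toy logits: the models disagree at position 0 and agree at position 1.
student = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
teacher = [[0.0, 2.0, 0.0], [0.0, 0.0, 0.0]]
signal = per_token_reverse_kl(student, teacher)
```

The result has one number per token, which is the density a single end-of-trajectory reward lacks. On its own it says nothing about whether a disagreement matters for the task.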

5. Calibrate token updates with outcome success

Do not blindly imitate the teacher's style preferences. Use verifier success to decide where dense teacher guidance is allowed to shape the student update.
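
The note does not spell out a formula for this step, so the following is only one possible reading. The verifier sets the sign and total size of the update through a group-relative advantage, and the teacher signal only decides how that credit is spread across tokens. The function name and the normalization are assumptions made for illustration.

```python
def calibrated_token_advantages(reward, group_rewards, token_signal, eps=1e-8):
    """Spread a verifier-derived advantage over tokens using teacher disagreement.

    reward:        verifier score of this rollout
    group_rewards: verifier scores of all rollouts sampled for the same prompt
    token_signal:  non-negative per-token teacher signal (for example a per-token KL)
    """
    baseline = sum(group_rewards) / len(group_rewards)
    advantage = reward - baseline  # the sign comes from the verifier, not the teacher
    n = len(token_signal)
    total = sum(token_signal)
    if total <= eps:
        weights = [1.0] * n  # no teacher preference: fall back to uniform credit
    else:
        weights = [n * s / total for s in token_signal]  # weights average to 1
    return [advantage * w for w in weights]

# A rollout that passed while half of its group failed.
credits = calibrated_token_advantages(1.0, [1.0, 0.0, 0.0, 1.0], [0.1, 0.3, 0.0, 0.4])
```

Under this reading, a rollout that fails verification never receives positive credit on any token, however much the teacher likes its phrasing. The weights are unbounded here, which is the problem the clipping step in the algorithm is meant to address.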

6. Clip aggressive token pressure

Bound the influence of teacher logits and use trust-region style clipping so one high-KL stylistic token does not yank the student policy far off course.
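
A sketch of both bounds for a single token, under stated assumptions. The teacher signal is taken to be a signed per-token quantity (for example, teacher log-probability minus student log-probability of the sampled token) and is capped before use. The policy update then goes through the familiar PPO-style clipped surrogate. The names `beta`, `signal_cap`, and `ratio_eps` and their default values are illustrative, not taken from the note.

```python
import math

def clip(x, lo, hi):
    """Clamp x into the closed interval [lo, hi]."""
    return max(lo, min(hi, x))

def clipped_token_objective(logp_new, logp_old, advantage, teacher_signal,
                            beta=0.1, signal_cap=2.0, ratio_eps=0.2):
    """Per-token objective (to maximize) with bounded teacher pressure.

    logp_new / logp_old: log-prob of the sampled token under the current policy
                         and under the policy that generated the rollout
    advantage:           verifier-derived advantage for this token
    teacher_signal:      signed dense hint from the teacher (assumed form)
    """
    bounded = clip(teacher_signal, -signal_cap, signal_cap)  # cap teacher influence
    shaped_adv = advantage + beta * bounded
    ratio = math.exp(logp_new - logp_old)
    unclipped = ratio * shaped_adv
    clipped = clip(ratio, 1.0 - ratio_eps, 1.0 + ratio_eps) * shaped_adv
    return min(unclipped, clipped)  # pessimistic bound, as in PPO

# An extreme teacher preference contributes no more than the cap allows.
extreme = clipped_token_objective(-1.0, -1.0, 1.0, teacher_signal=50.0)
capped = clipped_token_objective(-1.0, -1.0, 1.0, teacher_signal=2.0)
```

The cap limits how much any single teacher preference can add, and the ratio clip limits how far one update can move the policy on that token even when the shaped advantage is large.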

7. Replay older skills on-policy

Mix in prompts from prior tasks and sample their completions from the current student as well. Retention then follows the same on-policy logic as the new task, instead of relying only on static historical demonstrations.
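
A small sketch of what this could look like in a batching routine. The key detail is that only the prompts of older tasks are stored; completions are regenerated by the current student each time. The function names, the `replay_fraction` value, and the `student_generate` callable are hypothetical.

```python
import random

def build_prompt_batch(new_prompts, replay_prompts, batch_size,
                       replay_fraction=0.25, seed=0):
    """Mix new-task prompts with prompts from older tasks (prompts only)."""
    rng = random.Random(seed)
    n_replay = min(len(replay_prompts), round(batch_size * replay_fraction))
    n_new = batch_size - n_replay
    batch = [("new", p) for p in rng.choices(new_prompts, k=n_new)]
    batch += [("replay", p) for p in rng.sample(replay_prompts, n_replay)]
    rng.shuffle(batch)
    return batch

def on_policy_replay(batch, student_generate):
    """Regenerate every completion, old tasks included, from the current student."""
    return [(tag, prompt, student_generate(prompt)) for tag, prompt in batch]

batch = build_prompt_batch(
    new_prompts=["fix bug A", "fix bug B", "fix bug C"],
    replay_prompts=["summarize X", "translate Y", "solve Z"],
    batch_size=8,
)
# Placeholder generator; a real loop would call the student model here.
rollouts = on_policy_replay(batch, student_generate=lambda p: f"<student output for {p}>")
```

The replayed rollouts can then go through the same verifier and teacher scoring as the new-task rollouts, so old and new skills are trained under one loop.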

Why this matters

SFT often forgets because every demonstrated token becomes a target, whether it was task-critical or incidental. A style token, a formatting habit, and a genuinely decisive reasoning step can all receive the same direct gradient pressure.

RL forgets less because the model trains on its own samples, but outcome rewards are sparse. The model learns which completions won, but not necessarily which token-level decisions made them win.

Plain distillation is dense but can over-teach style. A strong teacher might have large token-level preferences around phrases, formatting, or reasoning cadence that are not actually responsible for task success. That is why verifier calibration is the key move.

Guardrails

Dense teacher guidance only helps when the surrounding loop is honest.

Teacher logits are not the same thing as task importance. A large teacher-student KL at a token can reflect style, not substance.

The verifier must reward the thing you actually want. A weak verifier will miscalibrate the whole loop.

Training only on teacher trajectories leaves the student unprepared for its own mistakes.

Replay should preserve older behavior from the student distribution, not just from archived static datasets.

Linked credits

SFT, RL, and On-Policy Distillation Through a Distributional Lens

On-Policy Distillation

RL's Razor: Why Online Reinforcement Learning Forgets Less

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

MiniLLM: Knowledge Distillation of Large Language Models

Source note

This post is republished on AIML Labs from the original ChatGPT AiML article.

Original source: chatgptaiml.com/articles/verifier-calibrated-on-policy-distillation