AIML Labs Research Note
June 16, 2026 / 18 min read
Verifier-Calibrated On-Policy Distillation: A Practical Algorithm for Teaching Models Without Making Them Forget
Most post-training methods can be reduced to one practical question: what distribution are we moving the model toward? This republished AIML Labs note argues that verifier-calibrated on-policy distillation is a useful hybrid when you want new capability without broad forgetting.
A language model is a probability distribution over sequences. Supervised fine-tuning, reinforcement learning, and on-policy distillation all reshape that distribution in different ways.
The core proposal is simple: let the student generate the trajectory, let a verifier score whether the trajectory succeeded, and use teacher logits only as dense guidance inside that verified frame.
That gives you the locality of on-policy learning, the density of distillation, and a cleaner answer to the credit-assignment problem than a single end-of-trajectory reward can provide.
The distribution question
Every post-training method is a choice about where probability mass should move.
SFT
Pulls the model toward a fixed external dataset. Strong for task format and cold-start behavior, but because every token in every demonstration is a training target, the update can degrade older capabilities.
RL
Updates from the model's own samples, which keeps learning local to the policy it already visits, but the reward signal is usually sparse.
On-policy distillation
Keeps on-policy locality while adding dense teacher supervision on the prefixes the student actually produced.
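For readers who prefer symbols, the three options above can be written as objectives. This is a sketch in standard notation, not a formula from the original note. The symbols are assumptions: \pi_\theta is the student, \pi_T the teacher, \mathcal{D} a fixed demonstration dataset, \mathcal{P} a prompt distribution, and R a scalar outcome reward. The reverse-KL direction in the last line is one common choice, also an assumption.

```latex
\begin{aligned}
\mathcal{L}_{\text{SFT}}(\theta) &= -\,\mathbb{E}_{(x,y)\sim\mathcal{D}} \sum_{t} \log \pi_\theta(y_t \mid x, y_{<t}) \\
J_{\text{RL}}(\theta) &= \mathbb{E}_{x\sim\mathcal{P},\; y\sim\pi_\theta(\cdot\mid x)} \big[ R(x, y) \big] \\
\mathcal{L}_{\text{OPD}}(\theta) &= \mathbb{E}_{x\sim\mathcal{P},\; y\sim\pi_\theta(\cdot\mid x)} \sum_{t} D_{\mathrm{KL}}\big( \pi_\theta(\cdot\mid x, y_{<t}) \,\|\, \pi_T(\cdot\mid x, y_{<t}) \big)
\end{aligned}
```

The first line takes its expectation over external data, while the second and third take it over the student's own samples. The second supplies one scalar per sequence, and the third supplies one term per token.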
Core algorithm
Verifier-Calibrated On-Policy Distillation
Sample from the student, score with a verifier, use teacher logits as dense hints, clip dangerous updates, and replay older skills on-policy so the model learns from the states it actually visits.
1. Use a small amount of SFT only for task shape
If the base model cannot follow the task interface, teach the format first with a short, low-learning-rate SFT pass. Do not use this phase to force deep capability transfer.
2. Generate rollouts from the current student
For each prompt, sample multiple completions from the current student policy. The point is to train on the states the student really visits, including imperfect prefixes.
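As a rough illustration of sampling a group of rollouts per prompt, here is a minimal sketch with a toy tabular policy standing in for the student. The names `TOY_STUDENT`, `sample_rollout`, `sample_group`, and the prompt id are hypothetical and exist only for this example.

```python
import random

# Hypothetical toy student: maps a prefix (tuple of tokens) to
# next-token probabilities. A real student would be a language model.
TOY_STUDENT = {
    (): {"fix": 0.6, "rewrite": 0.4},
    ("fix",): {"line": 0.7, "file": 0.3},
    ("rewrite",): {"file": 1.0},
}

def sample_rollout(policy, rng, max_len=8):
    """Sample one completion token by token from the policy's own distribution."""
    prefix = ()
    while prefix in policy and len(prefix) < max_len:
        dist = policy[prefix]
        tokens = list(dist)
        token = rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]
        prefix += (token,)
    return prefix

def sample_group(policy, prompt_id, k, seed=0):
    """Draw k rollouts for one prompt; each may end up on a different prefix."""
    rng = random.Random(seed)
    return [(prompt_id, sample_rollout(policy, rng)) for _ in range(k)]

group = sample_group(TOY_STUDENT, prompt_id="bug-17", k=4)
for prompt_id, rollout in group:
    print(prompt_id, rollout)
```

Because the tokens come from the student's own distribution, imperfect prefixes such as `("rewrite",)` appear in the training data in proportion to how often the student actually produces them.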
3. Score each rollout with a verifier
Use a reward that checks true task success. For code repair, the verifier should reward passing tests while penalizing unnecessary rewrites, extra complexity, and unrelated edits.
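One way to picture such a verifier for code repair is sketched below. It assumes the test result arrives as a boolean from some external test runner, and it uses changed-line count as a crude proxy for unnecessary rewrites. The function names, `free_lines`, and `penalty_per_line` are illustrative assumptions, not values from the original note.

```python
import difflib

def changed_line_count(original: str, patched: str) -> int:
    """Count lines touched by the patch, using difflib's line-level opcodes."""
    a, b = original.splitlines(), patched.splitlines()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return sum(
        max(i2 - i1, j2 - j1)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    )

def verifier_score(original, patched, tests_passed,
                   free_lines=1, penalty_per_line=0.1):
    """Return a reward in [0, 1]: zero on failing tests, reduced for large edits."""
    if not tests_passed:
        return 0.0
    # Lines beyond the free allowance are treated as unnecessary edits (assumption).
    excess = max(0, changed_line_count(original, patched) - free_lines)
    return max(0.0, 1.0 - penalty_per_line * excess)

buggy = "def add(a, b):\n    return a - b\n"
minimal_fix = "def add(a, b):\n    return a + b\n"
big_rewrite = (
    "def add(a, b):\n"
    "    total = 0\n"
    "    for v in (a, b):\n"
    "        total += v\n"
    "    return total\n"
)

print(verifier_score(buggy, minimal_fix, tests_passed=True))
print(verifier_score(buggy, big_rewrite, tests_passed=True))
print(verifier_score(buggy, big_rewrite, tests_passed=False))
```

Both patches would pass the tests, but the larger rewrite scores lower. A production verifier would likely need richer signals than line counts, such as complexity metrics or a check that edits stay within the failing function.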
4. Query the teacher on the student's trajectory
Run the teacher on each student prefix and collect token logits as dense hints. The teacher can be a stronger model, a specialist model, or the same model with privileged context.
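To make "dense hints" concrete, the sketch below turns student and teacher logits at each position of a student rollout into one divergence value per token. The note does not say which divergence to use, so the reverse KL (student relative to teacher) here is an assumption, and the logits are made-up numbers over a three-token vocabulary.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) for a single position."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s, t))

def per_token_hints(student_steps, teacher_steps):
    """One divergence value per position of the student's own rollout."""
    return [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]

# Illustrative logits: both models are scored on the SAME student-generated prefixes.
student_steps = [[2.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
teacher_steps = [[2.0, 0.0, 0.1], [3.0, 0.0, 0.0]]

hints = per_token_hints(student_steps, teacher_steps)
print(hints)
```

In this toy case the two models nearly agree at the first position and disagree sharply at the second, so the second token would carry most of the distillation signal.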
5. Calibrate token updates with outcome success
Do not blindly imitate the teacher's style preferences. Use verifier success to decide where dense teacher guidance is allowed to shape the student update.
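The note does not spell out a formula for this calibration, so the following is only one possible reading: scale each rollout's per-token teacher terms by that rollout's verifier score, and suppress them when the score falls below a threshold. The function name, `threshold`, and `floor` are hypothetical.

```python
def calibrated_distill_terms(token_kls, verifier_score, threshold=0.5, floor=0.0):
    """Scale per-token teacher terms by outcome success (one possible scheme).

    token_kls: per-token divergence values for a single rollout.
    verifier_score: scalar in [0, 1] for that same rollout.
    """
    # Below the threshold the rollout is treated as unverified, so teacher
    # guidance is reduced to `floor` (0.0 disables it entirely).
    gate = verifier_score if verifier_score >= threshold else floor
    return [gate * kl for kl in token_kls]

token_kls = [0.2, 1.0, 0.05]
print(calibrated_distill_terms(token_kls, verifier_score=0.9))
print(calibrated_distill_terms(token_kls, verifier_score=0.1))
```

Other schemes are plausible, for example weighting by a group-relative advantage instead of the raw score. The point of the sketch is only that the outcome signal sits in front of the dense signal.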
6. Clip aggressive token pressure
Bound the influence of teacher logits and use trust-region-style clipping so that a single high-KL stylistic token cannot pull the student policy far off course.
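A minimal sketch of both kinds of bound follows. The per-token cap and the PPO-style clipped surrogate are stand-ins chosen for illustration. The note names neither, and `cap` and `eps` are assumed values.

```python
def clip_token_pressure(token_kls, cap=1.0):
    """Cap each per-token teacher term so no single token dominates the loss."""
    return [min(kl, cap) for kl in token_kls]

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style clipped objective for one token.

    ratio: new-policy probability divided by old-policy probability.
    advantage: signed estimate of how good the sampled token was.
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Taking the minimum removes any incentive to move the ratio outside the band.
    return min(ratio * advantage, clipped_ratio * advantage)

print(clip_token_pressure([0.1, 5.0, 0.4]))
print(clipped_surrogate(ratio=1.5, advantage=1.0))
print(clipped_surrogate(ratio=0.5, advantage=-1.0))
```

With a hard cap, a token whose divergence exceeds `cap` contributes a constant and therefore no gradient. A softer alternative would scale such tokens down rather than zeroing their gradient.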
7. Replay older skills on-policy
Mix in prior tasks and sample their completions from the current student as well. This grounds retention in the same on-policy logic as the rest of the loop, rather than relying only on static historical demonstrations.
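One simple way to arrange this is to store only the prompts of older tasks and mix them into each batch at a fixed ratio, so that every completion, new or old, is sampled fresh from the current student. The sketch below assumes a hypothetical `replay_fraction` and prompt lists.

```python
import random

def build_prompt_batch(new_prompts, replay_prompts, batch_size,
                       replay_fraction=0.25, seed=0):
    """Mix new-task and older-task prompts; completions are sampled later."""
    rng = random.Random(seed)
    n_replay = min(len(replay_prompts), round(batch_size * replay_fraction))
    n_new = batch_size - n_replay
    batch = [("new", p) for p in rng.choices(new_prompts, k=n_new)]
    batch += [("replay", p) for p in rng.sample(replay_prompts, k=n_replay)]
    rng.shuffle(batch)
    return batch

new_prompts = ["repair: off-by-one in loop", "repair: wrong operator"]
replay_prompts = ["summarize changelog", "translate docstring", "explain regex"]

batch = build_prompt_batch(new_prompts, replay_prompts, batch_size=8)
for kind, prompt in batch:
    print(kind, prompt)
```

Note that no stored completions appear anywhere in the batch. The replay prompts pass through the same rollout, verifier, and teacher steps as the new-task prompts.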
Why this matters
SFT often forgets because every demonstrated token becomes a target, whether it was task-critical or incidental. A style token, a formatting habit, and a genuinely decisive reasoning step can all receive the same direct gradient pressure.
RL forgets less because the model trains on its own samples, but outcome rewards are sparse: the model learns which completions succeeded, not necessarily which token-level decisions made them succeed.
Plain distillation is dense but can over-teach style. A strong teacher might have large token-level preferences around phrases, formatting, or reasoning cadence that are not actually responsible for task success. That is why verifier calibration is the key move: the outcome signal decides where the teacher's dense guidance is allowed to shape the student.
Guardrails
Dense teacher guidance only helps when the surrounding loop is honest.
Source note
This post is republished on AIML Labs from the original ChatGPT AiML article.
Original source: chatgptaiml.com/articles/verifier-calibrated-on-policy-distillation