Fast and Effective On-policy Distillation from Reasoning Prefixes
Daily Information Dashboard · 2026-02-16
2026-02-16T23:28:54Z
Published
AI Summary
The paper proposes applying on-policy distillation only to the prefix of the student's output and terminating sampling early. On math and cross-domain benchmarks this matches full OPD while cutting training compute by 2x to 47x, improving the efficiency and scalability of training long-reasoning models.
- Conventional on-policy distillation samples full student trajectories on the fly and supervises them token by token, which makes training expensive for long responses.
- The authors find that the effective training signal is often concentrated in the prefix of each output, and that even a short teacher-generated prefix significantly steers the student toward the correct answer.
- The method restricts the distillation objective to prefixes of student-generated outputs and terminates sampling early during distillation, cutting wasted computation (see the sketch after this list).
- On AI-for-Math and out-of-domain benchmarks, on-policy prefix distillation matches the performance of full OPD.
- Training FLOPs drop by roughly 2x to 47x relative to full OPD, a clear efficiency advantage.
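As a concrete illustration, here is a minimal PyTorch sketch of the prefix-only objective the bullets describe, assuming HuggingFace-style `student` and `teacher` causal LMs (with `generate` and `.logits` outputs). The function name, prefix length, and the choice of forward KL are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def prefix_opd_loss(student, teacher, prompt_ids, prefix_len=256, temperature=1.0):
    """On-policy prefix distillation, sketched for HuggingFace-style causal LMs.

    Sampling is terminated early (after `prefix_len` new tokens) and the
    token-level teacher KL is applied only to that student-generated prefix.
    """
    # 1) Early-terminated on-policy sampling: cap generation at the prefix
    #    length instead of decoding a full chain-of-thought.
    with torch.no_grad():
        seq_ids = student.generate(
            prompt_ids, max_new_tokens=prefix_len, do_sample=True
        )

    # 2) Score the sampled sequence under both models (next-token logits).
    student_logits = student(seq_ids).logits[:, :-1]
    with torch.no_grad():
        teacher_logits = teacher(seq_ids).logits[:, :-1]

    # 3) Mask so the loss covers only generated prefix tokens, not the prompt.
    #    Logit position i predicts token i+1, hence the one-step shift.
    gen_mask = torch.zeros_like(seq_ids, dtype=torch.bool)
    gen_mask[:, prompt_ids.shape[1]:] = True
    gen_mask = gen_mask[:, 1:]

    # 4) Token-level KL(teacher || student); the paper's exact divergence
    #    (forward vs. reverse KL) is not specified in this excerpt.
    log_p_s = F.log_softmax(student_logits / temperature, dim=-1)
    p_t = F.softmax(teacher_logits / temperature, dim=-1)
    kl_per_token = (p_t * (torch.log(p_t.clamp_min(1e-9)) - log_p_s)).sum(-1)
    return kl_per_token[gen_mask].mean()
```

Because the trajectory is sampled from the student but truncated, both the decoding cost and the teacher/student scoring cost scale with the prefix length rather than the full response length.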
#arXiv #paper #Research/Papers #On-policy Distillation #AI-for-Math
Excerpt
On-policy distillation (OPD), which samples trajectories from the student model and supervises them with a teacher at the token level, avoids relying solely on verifiable terminal rewards and can yield better generalization than off-policy distillation. However, OPD requires expensive on-the-fly sampling of the student policy during training, which substantially increases training cost, especially for long responses. Our initial analysis shows that, during OPD, training signals are often concentrated in the prefix of each output, and that even a short teacher-generated prefix can significantly help the student produce the correct answer. Motivated by these observations, we propose a simple yet effective modification of OPD: we apply the distillation objective only to prefixes of student-generated outputs and terminate each sampling early during distillation. Experiments on a suite of AI-for-Math and out-of-domain benchmarks show that on-policy prefix distillation matches the performance of full OPD while reducing training FLOP by 2x-47x.
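For intuition on where savings of this magnitude can come from, a back-of-envelope sketch (the lengths below are hypothetical, not figures from the paper): if per-token decode cost is roughly constant, capping sampling at a short prefix shrinks sampling compute by about the ratio of full response length to prefix length.

```python
# Back-of-envelope only; both lengths are assumed, not taken from the paper.
full_len, prefix_len = 8192, 512  # avg. full response vs. sampled prefix
print(f"~{full_len / prefix_len:.0f}x fewer sampled tokens per example")  # ~16x
```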