How On-Policy Distillation Changes LLM Weights

In this AI Research Roundup episode, Alex discusses the paper 'Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation'.

On-policy distillation (OPD) is a popular post-training method that combines on-policy student trajectories with dense teacher supervision, but its effects on model parameters have remained poorly understood. This paper analyzes several language and vision-language model pairs and finds that OPD updates are surprisingly small, coordinate-sparse, and concentrated in the Feed-Forward Network (FFN) modules. The researchers show that training only this discovered sparse subnetwork recovers almost all of the full-training performance. The study also finds that these updates are spectrally concentrated and fall primarily on coordinates where the source weights are close to zero, meaning OPD retains unique geometric signatures of on-policy post-training. Finally, the authors find that adaptive optimizers such as AdamW remain crucial compared with SGD, as dense teacher supervision preserves essential momentum and scale structures.

Paper URL: https://arxiv.org/pdf/2606.13657

#AI #MachineLearning #DeepLearning #LLM #ModelDistillation #PostTraining #Optimization
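To make the "coordinate-sparse" claim in the description concrete, here is a minimal sketch of how one might measure how many coordinates of a weight matrix actually move between a source checkpoint and a trained one, and how to pick out the most-changed coordinates as a candidate sparse subnetwork. This is an illustration only, not the paper's procedure: the function names `update_sparsity` and `top_k_mask`, the tolerance `tol`, and the toy 8x8 matrices are all assumptions made for this example.

```python
import numpy as np

def update_sparsity(w_before, w_after, tol=1e-6):
    """Fraction of coordinates whose absolute change is at most tol (hypothetical metric)."""
    delta = np.abs(w_after - w_before)
    return float(np.mean(delta <= tol))

def top_k_mask(w_before, w_after, k):
    """Boolean mask marking the k coordinates with the largest absolute change."""
    delta = np.abs(w_after - w_before).ravel()
    idx = np.argpartition(delta, -k)[-k:]  # indices of the k largest changes
    mask = np.zeros(delta.shape, dtype=bool)
    mask[idx] = True
    return mask.reshape(w_before.shape)

# Toy stand-in for a source weight matrix and its post-training counterpart.
rng = np.random.default_rng(0)
w0 = rng.normal(size=(8, 8))
w1 = w0.copy()
w1[0, :4] += 0.1  # toy "update": only 4 of the 64 coordinates change

sparsity = update_sparsity(w0, w1)  # 60 of 64 coordinates are unchanged
mask = top_k_mask(w0, w1, k=4)      # candidate subnetwork of changed coordinates
print(sparsity, int(mask.sum()))
```

In a real setting the same comparison would be run per parameter tensor across two checkpoints, and a mask like this could be used to restrict which coordinates receive gradient updates; the threshold and the choice of k are tuning decisions not specified in the description.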

Video: How On-Policy Distillation Changes LLM Weights, from the AI Research Roundup channel