
DPO Killed The Reward Model — And Matched RLHF On Every Benchmark

RLHF needed three extra models. DPO threw away the reward model and still matched PPO-based RLHF on most alignment benchmarks.

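Classic RLHF stacks three networks on top of your policy. A reward model. A critic. And a frozen reference copy. Direct Preference Optimization (DPO) collapses that to one supervised loss, keeping only the frozen reference. The trick is a sign flip: the chosen answer enters the loss with a plus sign, the rejected one with a minus.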

Each training row is a pair: a chosen answer and a rejected one. For each, DPO computes one log-probability ratio: the policy's log-probability minus the frozen reference model's. Then it pushes the chosen ratio up and the rejected ratio down, with a scale factor beta setting how hard. That's it. Binary cross-entropy on a preference pair. No reward model. No PPO. No critic.
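Here is a minimal sketch of that loss for a single pair, in plain Python. It assumes you already have the summed sequence log-probabilities from the policy and from the frozen reference. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative choices, not taken from the video or from any particular library.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair (hypothetical helper, for illustration).

    Each argument is the summed log-probability of a full answer under
    either the trainable policy or the frozen reference model.
    """
    # One log-probability ratio per answer, measured against the reference.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp

    # The "sign flip": chosen counts positively, rejected negatively.
    margin = beta * (chosen_ratio - rejected_ratio)

    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# At the start of training the policy equals the reference,
# so both ratios are zero and the loss is log(2).
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Raising the chosen log-prob and lowering the rejected one reduces the loss.
better = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

In a real training loop the same arithmetic would run on batched tensors with autograd, and only the policy's log-probabilities would receive gradients.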

On most alignment benchmarks, DPO matches PPO. Same quality. A fraction of the moving parts. The loss is just two log-probability ratios and a subtraction.

Honest limit: DPO is sensitive to preference data quality. The loss only sees the gap between the chosen and rejected ratios, so it can suppress the chosen answer too, as long as the rejected one falls faster. Tune beta carefully and curate pairs.
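One way to see that limit is to feed the loss some made-up numbers. The sketch below is an assumption-laden toy: the log-ratio values are hypothetical, and `pair_loss` is an illustrative helper, not a library function. It shows the loss dropping even though the chosen answer has become less likely than under the reference, and how beta rescales the same gap.

```python
import math

def pair_loss(chosen_ratio, rejected_ratio, beta):
    """-log(sigmoid(beta * (chosen_ratio - rejected_ratio))), stable form.

    Ratios are log-probability ratios: policy log-prob minus reference log-prob.
    """
    margin = beta * (chosen_ratio - rejected_ratio)
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical starting point: policy equals reference, both ratios are zero.
before = pair_loss(0.0, 0.0, beta=0.1)

# Hypothetical later step: the chosen answer LOST 2 nats of log-probability,
# but the rejected answer lost 10, so the gap still widened.
after = pair_loss(-2.0, -10.0, beta=0.1)

# The loss went down even though the chosen answer got less likely.
loss_improved = after < before

# Beta scales the same gap: larger beta means a lower loss here,
# and therefore a weaker push to keep widening the gap.
sweep = {beta: pair_loss(-2.0, -10.0, beta) for beta in (0.01, 0.1, 0.5)}
```

Because only the difference between the two ratios enters the loss, tracking the chosen log-probability on its own during training is a cheap way to notice this drift.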

Monday-morning bridge: if you're aligning a model on human preferences, reach for DPO before PPO. Three extra models become one loss and a frozen reference.

Video "DPO Killed The Reward Model — And Matched RLHF On Every Benchmark" from the channel Adam Rosler