DPO Killed The Reward Model — And Matched RLHF On Every Benchmark
Classic RLHF needs three extra models. DPO drops the reward model and the critic, and still matches it on most alignment benchmarks.
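Classic RLHF stacks three networks on top of your policy. A reward model. A critic. And a frozen reference copy. Direct Preference Optimization collapses that to one supervised loss on the policy, keeping only the frozen reference. The trick: the policy's own log-probability ratio against the reference stands in for the reward.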
Each training row is a pair: a chosen answer and a rejected one. For each, DPO computes the log-probability ratio of the policy against the frozen reference model. Then it pushes the chosen ratio up and the rejected ratio down. That's it. Binary cross-entropy on a preference pair. No reward model. No PPO. No critic.
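Written out, the loss for one pair looks like the formula below. The symbols are notation assumed here, not taken from the video: x is the prompt, y_w the chosen answer, y_l the rejected one, \pi_\theta the policy being trained, \pi_{\mathrm{ref}} the frozen reference, \sigma the logistic sigmoid, and \beta a positive scaling constant.

```latex
\mathcal{L}_{\mathrm{DPO}}(x, y_w, y_l)
  = -\log \sigma\!\left(
      \beta \left[
        \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right]
    \right)
```

The first fraction is the chosen ratio and the second is the rejected ratio. Minimizing the loss widens the gap between them.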
On most alignment benchmarks, DPO matches PPO. Same quality. A fraction of the moving parts. The loss is just two log-ratios and a subtraction.
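To show how small that loss is, here is a minimal sketch in plain Python. It assumes you already have the summed log-probability of each answer under the policy and under the reference. The function name, argument names, and the default beta of 0.1 are illustrative choices, not something the video specifies.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single preference pair (illustrative sketch)."""
    # Log-ratio of policy to reference for each answer.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp

    # The subtraction: how much more the policy favors chosen over rejected,
    # relative to the reference, scaled by beta.
    x = beta * (chosen_ratio - rejected_ratio)

    # -log(sigmoid(x)) == log(1 + exp(-x)), computed in a numerically stable way.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))
```

When the policy equals the reference, both ratios are zero and the loss is log 2. It falls as the chosen ratio rises above the rejected one. In real training the same arithmetic would run on batched tensors, with gradients flowing only through the policy terms.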
Honest limit: DPO is sensitive to preference data quality. And because the loss only sees the gap between the two ratios, it can push the chosen answer's probability down too, as long as the rejected one falls faster. Tune beta carefully and curate pairs.
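A small numeric sketch of that limit, using made-up log-ratios (policy log-probability minus reference log-probability) rather than real model outputs. The helper name and every number below are assumptions for illustration only.

```python
import math


def dpo_loss_from_ratios(chosen_ratio, rejected_ratio, beta=0.1):
    """DPO loss given the two log-ratios directly (illustrative sketch)."""
    x = beta * (chosen_ratio - rejected_ratio)
    # Stable form of -log(sigmoid(x)).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Start of training: the policy equals the reference, so both ratios are 0.
before = dpo_loss_from_ratios(0.0, 0.0)

# Hypothetical later step: the chosen answer LOST probability (ratio -3),
# but the rejected answer lost more (ratio -6), so the gap still grew.
after = dpo_loss_from_ratios(-3.0, -6.0)

# Same ratios under different beta values: beta sets how strongly a given
# gap is rewarded.
small_beta = dpo_loss_from_ratios(-3.0, -6.0, beta=0.01)
large_beta = dpo_loss_from_ratios(-3.0, -6.0, beta=1.0)
```

Here `after` is lower than `before` even though the chosen answer became less likely, which is why the chosen log-probability is worth tracking alongside the loss. The same gap yields a much lower loss at a large beta than at a small one.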
Monday-morning bridge: if you're aligning a model on human preferences, reach for DPO before PPO. Three models become one loss.
Video "DPO Killed The Reward Model — And Matched RLHF On Every Benchmark" from the channel Adam Rosler
Video information
May 10, 2026, 7:06:08
00:00:50