DPO Killed The Reward Model — And Matched RLHF On Every Benchmark

RLHF needed three extra models. DPO threw away the reward model and the critic, and still matched it on most alignment benchmarks.

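Classic RLHF stacks three networks on top of your policy. A reward model. A critic. And a frozen reference copy. Direct Preference Optimization collapses that to one supervised loss and keeps only the frozen reference. The trick: the policy's own log-probability ratio against the reference stands in for the reward.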

Each training row is a pair: a chosen answer and a rejected one. DPO computes one log-probability ratio for each, against the frozen reference model. Then it pushes the chosen ratio up and pushes the rejected ratio down. That's it. Cross-entropy on a preference pair. No reward model. No PPO. No critic.
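As a rough sketch of the step above, the per-pair loss fits in a few lines of plain Python. The names `dpo_pair_loss` and `log_sigmoid`, the argument names, and the default `beta=0.1` are illustrative assumptions, not taken from the video. The inputs are assumed to be the summed token log-probabilities of each full answer under the trainable policy and under the frozen reference.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Log-probability ratio of each answer against the frozen reference.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Binary cross-entropy on "chosen beats rejected", scaled by beta.
    margin = beta * (chosen_ratio - rejected_ratio)
    return -log_sigmoid(margin)


# Hypothetical log-probabilities for one preference pair.
start = dpo_pair_loss(-12.0, -15.0, -12.0, -15.0)   # policy equals reference
better = dpo_pair_loss(-10.0, -18.0, -12.0, -15.0)  # chosen up, rejected down
print(start, better)
```

When the policy still equals the reference, both ratios are zero and the loss is log 2, about 0.693. Raising the chosen ratio or lowering the rejected one moves the loss below that starting point.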

On most alignment benchmarks, DPO matches PPO. Same quality. A fraction of the moving parts. The loss is just two log-ratios, a subtraction, and a log-sigmoid.

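Honest limit: DPO is sensitive to preference data quality, and it can push down the probability of the chosen answer too, because the loss only rewards the gap between chosen and rejected. Tune beta carefully (it sets how far the policy may drift from the reference) and curate pairs.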
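The suppression risk comes from the loss seeing only the margin between the two ratios. A small sketch with made-up numbers illustrates this. The helper `dpo_pair_loss` and every value here are hypothetical. It shows the loss improving even though the chosen answer has become less likely than under the reference, and how a larger beta reacts more sharply to the same margin.

```python
import math


def dpo_pair_loss(chosen_ratio, rejected_ratio, beta):
    # -log(sigmoid(beta * margin)), written as softplus(-beta * margin).
    margin = beta * (chosen_ratio - rejected_ratio)
    return math.log1p(math.exp(-margin))


# Ratios are log pi(y|x) - log pi_ref(y|x); the values are made up.
healthy = dpo_pair_loss(chosen_ratio=1.0, rejected_ratio=-1.0, beta=0.1)
suppressed = dpo_pair_loss(chosen_ratio=-1.0, rejected_ratio=-5.0, beta=0.1)

# The second case scores a lower loss although the chosen answer lost probability.
print(healthy, suppressed)

# Same ratios, different beta: a larger beta drives the loss down faster.
for beta in (0.05, 0.1, 0.5):
    print(beta, dpo_pair_loss(-1.0, -5.0, beta))
```

Monitoring the chosen log-probability during training, not only the loss, is one way to notice this drift.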

Monday-morning bridge: if you're aligning a model on human preferences, reach for DPO before PPO. Three extra models become one loss and a frozen reference.

Video information

Channel: Adam Rosler
Published: May 10, 2026, 7:06:08
Duration: 00:00:50
