Direct Preference Optimization (DPO) explained: Bradley-Terry model, log probabilities, math

In this video I will explain Direct Preference Optimization (DPO), an alignment technique for language models introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model".
I start by introducing language models and how they are used for text generation. After briefly introducing the topic of AI alignment, I review Reinforcement Learning (RL), a topic that is necessary to understand the reward model and its loss function.
I derive step by step the loss function of the reward model under the Bradley-Terry model of preferences, a derivation that is missing in the DPO paper.
Using the Bradley-Terry model, I build the loss of the DPO algorithm, explaining its mathematical derivation and giving intuition on how it works.
In the last part, I describe how to use the loss practically, that is, how to calculate the log probabilities using a Transformer model, by showing how it is implemented in the Hugging Face library.
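
To make the last two steps concrete, here is a minimal, dependency-free Python sketch of how per-sequence log probabilities and the DPO loss can be computed. It is not the Hugging Face implementation: the function names, the nested-list layout of `logits`, and the default `beta=0.1` are assumptions chosen for illustration, and a real training loop would run the same arithmetic on batched tensors.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def sequence_log_prob(logits, token_ids, response_mask):
    """Sum of log p(token | previous tokens) over response positions only.

    logits[t] holds the scores for the token at position t + 1, so logits
    and labels are shifted by one, as in a causal language model.
    """
    total = 0.0
    for t in range(len(token_ids) - 1):
        if response_mask[t + 1]:  # skip prompt (and padding) tokens
            total += log_softmax(logits[t])[token_ids[t + 1]]
    return total

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(x)) = log(1 + exp(-x)), written in an overflow-safe form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical toy input: vocabulary of 4 tokens, uniform logits,
# first token is the prompt, the remaining two are the response.
toy_logits = [[0.0] * 4 for _ in range(3)]
toy_log_prob = sequence_log_prob(toy_logits, [1, 2, 3], [0, 1, 1])
toy_loss = dpo_loss(-1.0, -5.0, -2.0, -2.0)
```

Each of the four arguments to `dpo_loss` is a sequence log probability of the kind `sequence_log_prob` returns: the chosen and rejected responses, each scored once by the policy being trained and once by the frozen reference model. When the policy and the reference agree, the margin is zero and the loss equals log 2; it drops below that as the policy raises the chosen response relative to the rejected one.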

DPO paper: Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S. and Finn, C., 2024. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36. https://arxiv.org/abs/2305.18290

If you're interested in how to derive the optimal solution to the RL constrained optimization problem, I highly recommend the following paper (Appendix A, equation 36):
Peng XB, Kumar A, Zhang G, Levine S. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177. 2019 Oct 1. https://arxiv.org/abs/1910.00177
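
For orientation, the constrained problem and its closed-form solution can be summarized as follows. This is a sketch in the notation of the DPO paper (policy \pi_\theta, reference model \pi_ref, reward r, temperature \beta, preferred and rejected responses y_w and y_l), not the full derivation given in the references above.

```latex
% KL-constrained reward maximization
\[
\max_{\pi_\theta} \;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)} \big[ r(x, y) \big]
- \beta \, \mathbb{D}_{\mathrm{KL}} \big[ \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \big]
\]

% Its optimal solution, with partition function Z(x)
\[
\pi^{*}(y \mid x) = \frac{1}{Z(x)} \, \pi_{\mathrm{ref}}(y \mid x) \exp\!\Big( \tfrac{1}{\beta} \, r(x, y) \Big),
\qquad
Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x) \exp\!\Big( \tfrac{1}{\beta} \, r(x, y) \Big)
\]

% Solving for the reward
\[
r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)
\]

% Substituting into the Bradley-Terry likelihood: the \beta \log Z(x) terms cancel
\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
- \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[
\log \sigma \left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right) \right]
\]
```

Because the Bradley-Terry model depends only on the difference between two rewards for the same prompt, the intractable term Z(x) never has to be computed.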

Slides PDF: https://github.com/hkproj/dpo-notes

Chapters
00:00:00 - Introduction
00:02:10 - Intro to Language Models
00:04:08 - AI Alignment
00:05:11 - Intro to RL
00:08:19 - RL for Language Models
00:10:44 - Reward model
00:13:07 - The Bradley-Terry model
00:21:34 - Optimization Objective
00:29:52 - DPO: deriving its loss
00:41:05 - Computing the log probabilities
00:47:27 - Conclusion

Video "Direct Preference Optimization (DPO) explained: Bradley-Terry model, log probabilities, math" from the channel Umar Jamil

Video information

Date: 14 April 2024, 10:20:58

Duration: 00:48:46

Channel: Umar Jamil

