Notebook 3: Reward Modeling — Part 2 of 2 | The Frontier Path

Part 2 of 2 of the complete Notebook 3 (Reward Modeling) walkthrough from The Frontier Path — every concept built from scratch and explained out loud.

▶ IN THIS PART
00:21 Reading The Curves
01:02 Dig Into The Scores
01:40 Length Bias
02:15 Sycophancy
02:54 Reward Hacking
03:33 More Ways It Breaks
04:13 Pairwise Vs Pointwise
04:51 The KL Leash
05:27 Overoptimization Laws
06:11 RM Vs LLM-As-Judge
06:52 RLAIF & The Trend
07:35 What Makes An RM Good
08:21 Production Scale
09:04 Monitor & Retrain
10:25 You Now Get Reward Models

▶ FULL SERIES (all 2 parts in order)
https://www.youtube.com/playlist?list=PLTotE_hCoIRw

▶ RUN IT YOURSELF (free + MIT)
Notebook: https://github.com/mootvstherubric-l/frontier-ml-toolkit/blob/main/01-rlhf/notebooks/03-reward-modeling.ipynb
Colab: https://colab.research.google.com/github/mootvstherubric-l/frontier-ml-toolkit/blob/main/01-rlhf/notebooks/03-reward-modeling.ipynb

Representative scenarios, not any company's real questions. AI-generated.

#machinelearning #mlinterview #frontierai #aiengineering #deeplearning

Questions? DM @mootvstherubric on Instagram: https://instagram.com/mootvstherubric

Video: Notebook 3: Reward Modeling — Part 2 of 2 | The Frontier Path, from the channel moot-vs-the-rubric