Notebook 3: Reward Modeling — Part 2 of 2 | The Frontier Path

Part 2 of 2 of the complete Notebook 3 (Reward Modeling) walkthrough from The Frontier Path — every concept built from scratch and explained out loud.

▶ IN THIS PART
00:21 Reading The Curves
01:02 Dig Into The Scores
01:40 Length Bias
02:15 Sycophancy
02:54 Reward Hacking
03:33 More Ways It Breaks
04:13 Pairwise vs Pointwise
04:51 The KL Leash
05:27 Overoptimization Laws
06:11 RM vs LLM-as-Judge
06:52 RLAIF & The Trend
07:35 What Makes an RM Good
08:21 Production Scale
09:04 Monitor & Retrain
10:25 You Now Get Reward Models

▶ FULL SERIES (all 2 parts in order)
https://www.youtube.com/playlist?list=PLTotE_hCoIRw

▶ RUN IT YOURSELF (free + MIT)
Notebook: https://github.com/mootvstherubric-l/frontier-ml-toolkit/blob/main/01-rlhf/notebooks/03-reward-modeling.ipynb
Colab: https://colab.research.google.com/github/mootvstherubric-l/frontier-ml-toolkit/blob/main/01-rlhf/notebooks/03-reward-modeling.ipynb

representative scenarios, not any company's real questions. ai-generated.

#machinelearning #mlinterview #frontierai #aiengineering #deeplearning

questions? dm @mootvstherubric on instagram: https://instagram.com/mootvstherubric

Video information

Yesterday, 10:38:19

00:11:05

moot-vs-the-rubric