AI Is No Longer A Black Box.
You've heard that AI is a black box - that nobody, not even the people building it, knows what's actually happening inside. Over the last few years, that's started to change.
This is a clear, illustrated tour through mechanistic interpretability: the young field reverse-engineering models like Claude and GPT to figure out what they're really doing. You'll see the "Golden Gate Bridge" feature researchers found inside Claude, the tiny circuit GPT-2 uses to finish a sentence about John and Mary, and the honest limits of what we can read off these models today - including why "find the circuit for lying" is so much harder than it sounds.
No math. No jargon. Just the actual ideas, slowed down enough to see.
------------------------------
CHAPTERS
------------------------------
0:00 AI is a black box (or is it?)
0:39 What a neuron actually is
1:58 The plan that didn't work
2:55 Superposition - why neurons are a mess
3:53 The Golden Gate Bridge inside Claude
5:26 How researchers find features (sparse autoencoders)
6:28 Hacking Claude's brain
7:36 The catch: features aren't enough
8:05 John, Mary, and the first real circuit
9:56 Can we find a circuit for lying?
11:50 Where mechanistic interpretability stands today
------------------------------
SOURCES
------------------------------
Mapping the Mind of a Large Language Model - Anthropic, May 2024
The accessible blog post version of the Golden Gate Bridge feature work.
https://www.anthropic.com/research/mapping-mind-language-model
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet - Templeton et al., Anthropic, 2024
The full technical paper behind the feature work shown in the video.
https://transformer-circuits.pub/2024/scaling-monosemanticity/
Golden Gate Claude - Anthropic, May 2024
The live demo Anthropic ran for 24 hours, with the Golden Gate Bridge feature cranked up.
https://www.anthropic.com/news/golden-gate-claude
Toy Models of Superposition - Elhage et al., Anthropic, 2022
The paper that introduced the concept of superposition we walk through at 2:55.
https://transformer-circuits.pub/2022/toy_model/index.html
Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small - Wang, Variengien, Conmy, Shlegeris, Steinhardt, 2022
The "John gave a drink to Mary" circuit paper.
https://arxiv.org/abs/2211.00593
------------------------------
GOING DEEPER
------------------------------
Tracing the Thoughts of a Large Language Model - Anthropic, March 2025
The natural next step: tracing entire reasoning chains, not just single features.
https://www.anthropic.com/research/tracing-thoughts-language-model
On the Biology of a Large Language Model - Lindsey et al., Anthropic, 2025
Attribution graphs applied to Claude 3.5 Haiku for poetry, multilingual reasoning, hallucinations, and refusals.
https://transformer-circuits.pub/2025/attribution-graphs/biology.html
Concrete Steps to Get Started in Mechanistic Interpretability - Neel Nanda
If you want to actually do this yourself.
https://www.neelnanda.io/mechanistic-interpretability/getting-started
------------------------------
ABOUT GOOD ROBOTS
------------------------------
Good Robots takes the most important ideas in AI safety and slows them down enough that anyone can actually see what's going on. Not dumbed down. Just explained the way they deserve to be.
If that sounds like your kind of channel:
- Like the video so YouTube shows it to more curious people
- Subscribe for the next breakdown
- Drop a comment with the AI safety idea you want broken down next
#MechanisticInterpretability #AISafety #Anthropic
Video information
May 14, 2026, 2:48:43
00:13:18
