AI Is No Longer A Black Box.

You've heard that AI is a black box - that nobody, not even the people building it, knows what's actually happening inside. Over the last few years, that's started to change.

This is a clear, illustrated tour through mechanistic interpretability: the young field reverse-engineering models like Claude and GPT to figure out what they're really doing. You'll see the "Golden Gate Bridge" feature researchers found inside Claude, the tiny circuit GPT-2 uses to finish a sentence about John and Mary, and the honest limits of what we can read off these models today - including why "find the circuit for lying" is so much harder than it sounds.

No math. No jargon. Just the actual ideas, slowed down enough to see.

------------------------------
CHAPTERS
------------------------------
0:00 AI is a black box (or is it?)
0:39 What a neuron actually is
1:58 The plan that didn't work
2:55 Superposition - why neurons are a mess
3:53 The Golden Gate Bridge inside Claude
5:26 How researchers find features (sparse autoencoders)
6:28 Hacking Claude's brain
7:36 The catch: features aren't enough
8:05 John, Mary, and the first real circuit
9:56 Can we find a circuit for lying?
11:50 Where mechanistic interpretability stands today

------------------------------
SOURCES
------------------------------
Mapping the Mind of a Large Language Model - Anthropic, May 2024
The accessible blog post version of the Golden Gate Bridge feature work.
https://www.anthropic.com/research/mapping-mind-language-model

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet - Templeton et al., Anthropic, 2024
The full technical paper behind the feature work shown in the video.
https://transformer-circuits.pub/2024/scaling-monosemanticity/

Golden Gate Claude - Anthropic, May 2024
The 24-hour public demo in which Anthropic cranked up the Golden Gate Bridge feature.
https://www.anthropic.com/news/golden-gate-claude

Toy Models of Superposition - Elhage et al., Anthropic, 2022
The paper that lays out the concept of superposition we walk through at 2:55.
https://transformer-circuits.pub/2022/toy_model/index.html

Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small - Wang, Variengien, Conmy, Shlegeris, Steinhardt, 2022
The "John gave a drink to Mary" circuit paper.
https://arxiv.org/abs/2211.00593

------------------------------
GOING DEEPER
------------------------------
Tracing the Thoughts of a Large Language Model - Anthropic, March 2025
The natural next step: tracing entire reasoning chains, not just single features.
https://www.anthropic.com/research/tracing-thoughts-language-model

On the Biology of a Large Language Model - Lindsey et al., Anthropic, 2025
Attribution graphs applied to Claude 3.5 Haiku for poetry, multilingual reasoning, hallucinations, and refusals.
https://transformer-circuits.pub/2025/attribution-graphs/biology.html

Concrete Steps to Get Started in Transformer Mechanistic Interpretability - Neel Nanda
If you want to actually do this yourself.
https://www.neelnanda.io/mechanistic-interpretability/getting-started

------------------------------
ABOUT GOOD ROBOTS
------------------------------
Good Robots takes the most important ideas in AI safety and slows them down enough that anyone can actually see what's going on. Not dumbed down. Just explained the way they deserve to be.

If that sounds like your kind of channel:
- Like the video so YouTube shows it to more curious people
- Subscribe for the next breakdown
- Drop a comment with the AI safety idea you want broken down next

#MechanisticInterpretability #AISafety #Anthropic

Date: May 14, 2026, 2:48:43
Duration: 00:13:18
Channel: Good Robots

Правообладателям

Жалоба на материал Недопустимый материал Нарушение авторских прав

Комментарии

Другие видео канала