Alignment faking in large language models
Most of us have encountered situations where someone appears to share our views or values, but is in fact only pretending to do so—a behavior that we might call “alignment faking”.
Could AI models also display alignment faking?
Ryan Greenblatt, Monte MacDiarmid, Benjamin Wright and Evan Hubinger discuss a new paper from Anthropic, in collaboration with Redwood Research, that provides the first empirical example of a large language model engaging in alignment faking without having been explicitly—or even, we argue, implicitly—trained or instructed to do so.
Learn more: https://www.anthropic.com/research/alignment-faking
0:00 Introduction
0:47 Core setup and key findings of the paper
6:14 Understanding alignment faking through real-world analogies
9:37 Why alignment faking is concerning
14:57 Examples of model outputs
21:39 Situational awareness and synthetic documents
28:00 Detecting and measuring alignment faking
38:09 Model training results
47:28 Potential reasons for model behavior
53:38 Frameworks for contextualizing model behavior
1:04:30 Research in the context of current model capabilities
1:09:26 Evaluations for bad behavior
1:14:22 Limitations of the research
1:20:54 Surprises and takeaways from results
1:24:46 Future directions
Video "Alignment faking in large language models" from the Anthropic channel
Video information
December 18, 2024, 22:01:23
01:30:20