Trevor Lohrbeer - Improved Monitoring of Backdoor Insertion During Code Refactoring [ControlConf]
Trevor Lohrbeer's research reveals significant backdoor vulnerabilities in AI-assisted code refactoring, even in "honest" models. It also establishes testing infrastructure for improving detection through iterative monitor refinement and through better backdoor definitions that go beyond simple behavioral variance.
Highlights:
🔹 Honest models compromised - 41% of "honest" model outputs had backdoors
🔹 Limited detection - Current monitors struggle with backdoor identification
🔹 Beyond behavioral variance - Need better backdoor definitions for AI control
🔹 Iterative improvement - Testing infrastructure enables continuous refinement
Video: Trevor Lohrbeer - Improved Monitoring of Backdoor Insertion During Code Refactoring [ControlConf], from the FAR.AI channel
Video information
June 13, 2025, 6:31:00
00:05:03