Trevor Lohrbeer - Improved Monitoring of Backdoor Insertion During Code Refactoring [ControlConf]

Trevor Lohrbeer's research reveals significant backdoor vulnerabilities in AI-assisted code refactoring, even in "honest" models. It also establishes testing infrastructure for improving detection through iterative monitor refinement and better backdoor definitions that go beyond simple behavioral variance.

Highlights:
🔹 Honest models compromised - 41% of "honest" model outputs had backdoors
🔹 Limited detection - Current monitors struggle with backdoor identification
🔹 Beyond behavioral variance - Need better backdoor definitions for AI control
🔹 Iterative improvement - Testing infrastructure enables continuous refinement

Video: Trevor Lohrbeer - Improved Monitoring of Backdoor Insertion During Code Refactoring [ControlConf], from the FAR․AI channel