We Taught an AI to Forget

We trained a 1.5B model to actively manage a fixed memory buffer using Supervised Fine-Tuning followed by Group Relative Policy Optimization (GRPO). Built on Meta's OpenEnv framework, our agent decides every single step — Add, Remove, or Noop — across hundreds of incoming messages with slots available. The result: both GRPO variants hit positive reward per step while the base model bleeds negative. F1 jumps from 6% to 78% on the hardest episodes. This isn't a bigger context window. It's a smarter one. Submitted for the OpenEnv Hackathon 2026. GitHub and HuggingFace Space linked below.

Видео We Taught an AI to Forget канала Mathew

Комментарии отсутствуют

Информация о видео

26 апреля 2026 г. 16:09:06

00:01:05

Mathew

Правообладателям

Жалоба на материал Недопустимый материал Нарушение авторских прав

Комментарии

Другие видео канала

We Taught an AI to Forget

We Taught an AI to Forget #hackathon #pytorch #meta #huggingface #scalerschooloftechnology