The Alchemy of Retention: Rewriting the AI Data Bottleneck
Disclaimer: This video is generated with Google's NotebookLM.
SwallowCode and SwallowMath: Rewriting Pre-training Data for LLM Excellence
https://arxiv.org/pdf/2505.02881
The researchers introduce SwallowCode and SwallowMath, two high-quality, openly licensed datasets designed to improve the performance of large language models in mathematical reasoning and program synthesis. Rather than relying on filtering alone, the study uses a "transform-and-retain" methodology in which an LLM systematically rewrites existing public data to ensure stylistic consistency, algorithmic efficiency, and self-contained logic. SwallowCode refines Python snippets through a multi-stage pipeline focused on syntax validation and readability, while SwallowMath reformats mathematical problems into clear, step-by-step explanations. Empirical results show that models pre-trained on these rewritten corpora significantly outperform those trained on standard datasets such as Stack-Edu, with substantial gains on benchmarks such as HumanEval and GSM8K. By releasing the datasets, prompts, and code, the authors aim to bridge the "data quality gap" and give the open community a reproducible framework for advanced data curation.
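The "transform-and-retain" idea for Python snippets can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `is_valid_python` and `transform_and_retain` are hypothetical, the `rewrite` callable stands in for an LLM rewriting step, and falling back to the original snippet when a rewrite no longer parses is an assumption made here for illustration.

```python
import ast
from typing import Callable, Iterable, Iterator


def is_valid_python(source: str) -> bool:
    """Return True if the snippet parses as Python source."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    return True


def transform_and_retain(
    snippets: Iterable[str],
    rewrite: Callable[[str], str],
) -> Iterator[str]:
    """Drop snippets that fail syntax validation, rewrite the rest.

    `rewrite` is a hypothetical stand-in for an LLM call. If the rewritten
    text no longer parses, the original snippet is kept instead (an
    assumption of this sketch, not a detail taken from the paper).
    """
    for snippet in snippets:
        if not is_valid_python(snippet):
            continue  # syntax-validation stage: discard unparseable input
        rewritten = rewrite(snippet)
        yield rewritten if is_valid_python(rewritten) else snippet
```

In practice the `rewrite` argument would wrap a model call with a style or optimization prompt; any function from string to string works for testing the control flow.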
#ai #research
Video "The Alchemy of Retention: Rewriting the AI Data Bottleneck" from the channel Vinh Nguyen
Video information
April 22, 2026, 14:18:21
00:04:49