The Alchemy of Retention: Rewriting the AI Data Bottleneck

Disclaimer: This video was generated with Google's NotebookLM.

SwallowCode and SwallowMath: Rewriting Pre-training Data for LLM Excellence

https://arxiv.org/pdf/2505.02881

The researchers introduce SwallowCode and SwallowMath, two high-quality, openly licensed datasets designed to improve the performance of large language models in mathematical reasoning and program synthesis. Moving beyond traditional data filtering, the study utilizes a "transform-and-retain" methodology where an LLM systematically rewrites existing public data to ensure stylistic consistency, algorithmic efficiency, and self-contained logic. SwallowCode refines Python snippets through a multi-stage pipeline focused on syntax validation and readability, while SwallowMath reformats mathematical problems into clear, step-by-step explanations. Empirical results show that models pre-trained on these rewritten corpora significantly outperform those trained on standard datasets like Stack-Edu, achieving substantial gains on benchmarks such as HumanEval and GSM8K. By releasing the datasets, prompts, and code, the authors aim to bridge the "data quality gap" and provide the open community with a reproducible framework for advanced data curation.
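
The summary above names syntax validation as one stage of the SwallowCode pipeline but does not say how it is done. The following is a minimal sketch of what such a filter might look like, assuming the check is a plain parse with Python's standard `ast` module. The function names `is_valid_python` and `filter_valid` and the sample snippets are hypothetical illustrations, not taken from the paper or its released code.

```python
import ast


def is_valid_python(source: str) -> bool:
    """Return True if `source` parses as Python without a syntax error.

    Assumption: a successful ast.parse() is treated as "syntactically valid".
    The actual pipeline may use a different check.
    """
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        # ValueError covers inputs such as null bytes on some Python versions.
        return False
    return True


def filter_valid(snippets):
    """Keep only the snippets that pass the syntax check (hypothetical helper)."""
    return [s for s in snippets if is_valid_python(s)]


# Illustrative inputs: one well-formed function and one with a syntax error.
samples = [
    "def add(a, b):\n    return a + b\n",
    "def broken(:\n    pass\n",
]
kept = filter_valid(samples)
print(f"kept {len(kept)} of {len(samples)} snippets")
```

A filter like this only discards code that cannot be parsed. The readability and rewriting stages described in the summary would run on whatever survives it.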

#ai #research

Video "The Alchemy of Retention: Rewriting the AI Data Bottleneck" from the channel Vinh Nguyen

ai research large language model llm agent machine learning deep learning
