Silent Errors in Large-Scale LLM Training by Cyril Meurillon & Devin O'Kelley
GPU cluster reliability is a growing challenge as AI models and the clusters that host them grow to unprecedented scale. Insidious errors such as Silent Data Corruptions (SDCs) are particularly difficult to address due to their highly elusive and non-deterministic nature, and their effect on large-scale LLM training and inference is poorly understood.
In this talk, we will present how NVIDIA is leveraging its deep expertise in GPUs and AI to tackle this challenge holistically, from silicon to data centers. We will go over the work we are doing to improve our understanding of these complex errors and their effect in real-world, at-scale AI cluster deployments, and the solutions we are developing to help researchers, cluster builders, and the industry protect against SDCs.
Learn more about the @Scale conference here: https://atscaleconference.com/events/scale-data-ai-infra/
Video: Silent Errors in Large-Scale LLM Training by Cyril Meurillon & Devin O'Kelley, from the @Scale channel
Video information
June 27, 2025, 23:53:17
00:18:24