The Evolution of DeepSeek V3.2 AI | DeepSeek V3.2 Architecture

Resources:
DeepSeek-V3 Technical Report: https://arxiv.org/abs/2412.19437
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models: https://arxiv.org/abs/2512.02556
A Technical Tour of the DeepSeek Models from V3 to V3.2: https://magazine.sebastianraschka.com/p/technical-deepseek
This video covers the engineering and architectural innovations behind DeepSeek-V3, a Mixture-of-Experts (MoE) large language model with 671B total parameters that achieved strong benchmark results. Much of the effort went into training efficiency. The team introduced an FP8 mixed-precision training framework and validated it at this scale, which required fine-grained quantization and higher-precision accumulation on CUDA Cores. To manage the heavy communication overhead of cross-node MoE training on the H800 GPU cluster, the custom DualPipe algorithm was designed to overlap computation with communication, bringing all-to-all communication overhead close to zero. The video also details two architectural features: Multi-Head Latent Attention (MLA), which reduces memory use, and an auxiliary-loss-free load-balancing strategy that balances expert load without an extra loss term and leaves room for greater expert specialization. Together, this algorithm and hardware co-design allowed pre-training to be completed economically in 2.664M H800 GPU hours, producing a model that is competitive with top proprietary systems such as GPT-4o on various benchmarks.
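To make the auxiliary-loss-free load-balancing idea more concrete, here is a minimal toy sketch in plain Python. It assumes the bias-based scheme described in the DeepSeek-V3 Technical Report: each expert carries a bias that is added to its affinity score only when choosing the top-k experts, and after each step the bias is lowered for overloaded experts and raised for underloaded ones. The function names, the synthetic skewed scores, and all constants (`gamma`, expert count, token count) are hypothetical choices for illustration and are not taken from the DeepSeek code base.

```python
import random


def route_top_k(scores, bias, k):
    """Select k experts by biased score. The bias affects selection only;
    in a real MoE layer the gate weights would still use the raw scores."""
    ranked = sorted(range(len(scores)),
                    key=lambda i: scores[i] + bias[i],
                    reverse=True)
    return ranked[:k]


def update_bias(bias, load, gamma):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    for i, n in enumerate(load):
        if n > mean_load:
            bias[i] -= gamma
        elif n < mean_load:
            bias[i] += gamma


def imbalance(load):
    """Ratio of the busiest expert's load to the mean load (1.0 is perfect)."""
    return max(load) / (sum(load) / len(load))


def simulate(num_experts=8, k=2, tokens_per_step=512, steps=200,
             gamma=0.01, seed=0):
    rng = random.Random(seed)
    # Hypothetical skew: lower-indexed experts get systematically higher scores,
    # so routing starts out unbalanced.
    skew = [0.5 * (num_experts - i) / num_experts for i in range(num_experts)]
    bias = [0.0] * num_experts
    history = []
    for _ in range(steps):
        load = [0] * num_experts
        for _ in range(tokens_per_step):
            scores = [rng.random() + skew[i] for i in range(num_experts)]
            for expert in route_top_k(scores, bias, k):
                load[expert] += 1
        history.append(load)
        update_bias(bias, load, gamma)
    return history, bias


if __name__ == "__main__":
    history, bias = simulate()
    print("imbalance at first step:", round(imbalance(history[0]), 3))
    print("imbalance at last step: ", round(imbalance(history[-1]), 3))
    print("learned biases:", [round(b, 2) for b in bias])
```

In this toy setup no balancing term is added to any loss. The only feedback is the per-expert bias update, which is why the approach is called auxiliary-loss-free.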

Video "The Evolution of DeepSeek V3.2 AI | DeepSeek V3.2 Architecture" from the Awesome Research channel