
GenAI Futures. Part-1. LLM Architecture Evolution. https://www.bytegoose.com

GenAI Futures. LLM Architecture Evolution.
LLM Architecture Timeline:
2019:
Emergence of GPT-2: Marks the beginning of the modern era of Large Language Models (LLMs), establishing a foundational architecture that would persist with remarkable structural similarity for years.
Early LLM Architectures: These models typically employ Multi-Head Attention (MHA), GELU activation functions, and absolute positional embeddings. The original Transformer placed normalization layers after the attention and FeedForward modules (Post-LN/Post-Norm); GPT-2 itself already places them before those modules (Pre-LN).
2020:
Analysis of Pre-LN (Pre-Norm): Research indicates that placing normalization layers before the attention and FeedForward modules gives better-behaved gradients at initialization and allows effective training without careful learning rate warm-up.
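The difference between the two placements is only where the normalization sits relative to the residual connection. The following is a minimal sketch in plain Python, not taken from the video: `toy_sublayer` is a hypothetical stand-in for an attention or FeedForward module, and the normalization omits the learned scale and shift.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for an attention or FeedForward module."""
    return [0.5 * v + 1.0 for v in x]

def post_ln_step(x, sublayer):
    # Post-LN: add the residual first, then normalize the sum.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln_step(x, sublayer):
    # Pre-LN: normalize only the sublayer input; the residual path is left untouched.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

x = [1.0, 2.0, 3.0, 4.0]
post_out = post_ln_step(x, toy_sublayer)
pre_out = pre_ln_step(x, toy_sublayer)
```

In the Pre-LN variant the input `x` reaches the output through an unnormalized identity path, which is one common explanation for its better-behaved gradients.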
Undated (Pre-DeepSeek V3):
Introduction of Multi-Head Latent Attention (MLA): Not explicitly dated in the sources, but MLA predates DeepSeek V3; it was introduced with the predecessor DeepSeek-V2 (2024).
Introduction of Shared Expert in MoE: The idea of a shared expert that is always active in Mixture-of-Experts (MoE) modules, which improves overall modeling performance, appears in earlier DeepSeek and DeepSpeedMoE papers.
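To make the shared-expert idea concrete, here is a rough sketch of an MoE forward pass for a single scalar "token". It is an illustration under stated assumptions, not DeepSeek's implementation: the function names are hypothetical, experts are plain callables, and renormalizing the top-k router weights is one possible design choice.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_scores, routed_experts, shared_expert, top_k=2):
    """Combine one always-active shared expert with the top_k routed experts."""
    probs = softmax(router_scores)
    # Select the top_k routed experts by router probability.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)  # assumption: renormalize the selected weights
    out = shared_expert(x)  # the shared expert sees every token, regardless of routing
    for i in chosen:
        out += (probs[i] / norm) * routed_experts[i](x)
    return out, chosen

# Four toy routed experts (scale by 1..4) and one toy shared expert (scale by 10).
routed = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
shared = lambda x: 10.0 * x
out, chosen = moe_forward(1.0, [0.0, 0.0, 5.0, 5.0], routed, shared, top_k=2)
```

Only the selected experts run for a given token, which is why the number of active parameters can be far below the total parameter count.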
2023:
QK-Norm in Vision Transformers: The concept of QK-Norm (an RMSNorm layer within the Multi-Head Attention module applied to queries and keys) is first described in a 2023 paper.
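As a small illustration of QK-Norm, the sketch below applies RMSNorm to a query and a key before the scaled dot product. It assumes a single head, plain Python lists, and no learned gain; the function names are illustrative rather than taken from any library.

```python
import math

def rms_norm(x, eps=1e-6):
    """Scale a vector by its root-mean-square (learned gain omitted)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def attention_logit(q, k, use_qk_norm=True):
    """Scaled dot-product logit for one query/key pair of a single head."""
    if use_qk_norm:
        q, k = rms_norm(q), rms_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))

# Deliberately large activations to show the effect on the logit.
q = [100.0, -200.0, 300.0, 50.0]
k = [80.0, -150.0, 250.0, 10.0]
raw = attention_logit(q, k, use_qk_norm=False)
normed = attention_logit(q, k, use_qk_norm=True)
```

With normalization, the logit magnitude is bounded by the square root of the head dimension no matter how large the raw activations grow, which is one way to see why QK-Norm helps training stability.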
Undated (Pre-2024):
Transition from LayerNorm to RMSNorm: Many LLMs, including Llama, Gemma, and later OLMo 2, transition from using LayerNorm to the simplified RMSNorm.
Evolution of Positional Embeddings: Positional embeddings evolve from absolute to rotary (RoPE).
Shift from Multi-Head Attention (MHA) to Grouped-Query Attention (GQA): GQA emerges as a standard, more computationally and parameter-efficient alternative to MHA.
Adoption of SwiGLU Activation Functions: More efficient SwiGLU activation functions replace GELU in many architectures.
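The efficiency gain of GQA over MHA comes from several query heads sharing one key/value head. The sketch below shows that mapping and the resulting per-token KV-cache size; the head counts and head dimension are illustrative values, not the configuration of any specific model, and the function names are hypothetical.

```python
def kv_head_for_query_head(q_head, n_q_heads, n_kv_heads):
    """Map a query head to the key/value head shared by its group."""
    assert n_q_heads % n_kv_heads == 0, "query heads must divide evenly into groups"
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

def kv_cache_values_per_token(n_kv_heads, head_dim):
    """Number of key plus value entries stored per token, per layer."""
    return 2 * n_kv_heads * head_dim

n_q_heads, head_dim = 8, 64  # illustrative sizes only
mha = kv_cache_values_per_token(n_kv_heads=8, head_dim=head_dim)  # MHA: one KV head per query head
gqa = kv_cache_values_per_token(n_kv_heads=2, head_dim=head_dim)  # GQA: four query heads per KV head
mapping = [kv_head_for_query_head(h, n_q_heads, 2) for h in range(n_q_heads)]
```

In this toy setup the KV cache shrinks by a factor of four, and the key/value projection matrices shrink by the same factor.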
2024:
December 2024: DeepSeek V3 Introduction: The DeepSeek V3 architecture is introduced, featuring Multi-Head Latent Attention (MLA) and Mixture-of-Experts (MoE) layers, including a shared expert. It has 671 billion total parameters but activates only 37 billion per token, because the MoE router selects a small subset of experts for each token.
Llama 3 Release: A 405 billion parameter model is released.
Llama 4 Release: The sources place it in the 2024-2025 period; the release came in April 2025.
2025:
January 2025: DeepSeek R1 Release: A reasoning model built upon the DeepSeek V3 architecture is released, garnering considerable attention and leading to widespread adoption of DeepSeek V3.
OLMo 2 Models Release: The Allen Institute for AI releases the OLMo 2 series of models. These models are notable for their architectural design choices around normalization, specifically a variant of Post-LN and the inclusion of QK-Norm, both contributing to training stability. OLMo 2 initially uses traditional MHA.
Gemma 3 (Mentioned as Contemporary): Google's Gemma 3 is highlighted as a contemporary model that uses sliding window attention for computational cost reduction and has a large vocabulary size.
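Sliding window attention, mentioned for Gemma 3 above, restricts each token to a fixed number of recent positions instead of the full context. A minimal sketch of the corresponding causal mask follows; the window size and sequence length are arbitrary illustration values, and this is not Gemma's actual implementation.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]

mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    # "x" marks an allowed position, "." a masked one.
    print("".join("x" if allowed else "." for allowed in row))
```

Each row allows at most `window` positions, so the attention cost per token stays constant as the sequence grows rather than scaling with its length.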
Cast of Characters
GPT (Generative Pre-trained Transformer): An influential foundational LLM architecture that emerged prior to 2019 and whose basic structure later models still closely follow.
GPT-2: An early iteration of the GPT architecture, released in 2019, signifying the beginning of the modern LLM era discussed in the sources.
DeepSeek-V2: A predecessor to DeepSeek V3, which notably incorporated Multi-Head Latent Attention (MLA).
DeepSeek V3: A monumental 671-billion-parameter LLM architecture introduced in December 2024. It is distinguished by its use of Multi-Head Latent Attention (MLA) and Mixture-of-Experts (MoE) with a shared expert, enabling high efficiency despite its large size.
DeepSeek R1: A reasoning model released in January 2025, built upon the DeepSeek V3 architecture.
Llama (Series of Models): A prominent series of LLMs, including Llama 3 (405B parameters, released in 2024) and Llama 4 (placed in the 2024-2025 period by the sources; released in April 2025).
OLMo (Series of Models): LLMs developed by the non-profit Allen Institute for AI. They are known for their transparency regarding training data and code.
OLMo 2: A specific model in the OLMo series released in January 2025, notable for its unique normalization layer placements (a variant of Post-LN) and the inclusion of QK-Norm to enhance training stability. It initially utilized traditional Multi-Head Attention (MHA).
Gemma (Series of Models): Google's series of LLMs, including Gemma 3, known for strong performance, a large vocabulary size, and a focus on models around the 27B parameter size. Gemma 3 specifically uses sliding window attention for efficiency.
Allen Institute for AI: A non-profit organization responsible for the development of the OLMo series of models, known for their commitment to transparency in LLM development.
Google: The developer of the Gemma series of LLMs.

Video: GenAI Futures. Part-1. LLM Architecture Evolution. https://www.bytegoose.com, from the Byte Goose AI channel.