- Популярные видео
- Авто
- Видео-блоги
- ДТП, аварии
- Для маленьких
- Еда, напитки
- Животные
- Закон и право
- Знаменитости
- Игры
- Искусство
- Комедии
- Красота, мода
- Кулинария, рецепты
- Люди
- Мото
- Музыка
- Мультфильмы
- Наука, технологии
- Новости
- Образование
- Политика
- Праздники
- Приколы
- Природа
- Происшествия
- Путешествия
- Развлечения
- Ржач
- Семья
- Сериалы
- Спорт
- Стиль жизни
- ТВ передачи
- Танцы
- Технологии
- Товары
- Ужасы
- Фильмы
- Шоу-бизнес
- Юмор
GenAI Futures. Part-1. LLM Architecture Evolution. https://www.bytegoose.com
GenAI Futures. LLM Architecture Evolution.
LLM architecture Timeline:
2019:
Emergence of GPT-2: Marks the beginning of the modern era of Large Language Models (LLMs), establishing a foundational architecture that would persist with remarkable structural similarity for years.
Early LLM Architectures: These models typically employ Multi-Head Attention (MHA) and GELU activation functions, along with absolute positional embeddings. Normalization layers are commonly placed after attention and FeedForward modules (Post-LN/Post-Norm).
2020:
Introduction of Pre-LN (Pre-Norm): Research indicates that placing normalization layers before attention and FeedForward modules leads to more stable gradients during initialization and effective training without meticulous learning rate warm-up.
Undated (Pre-DeepSeek V2):
Introduction of Multi-Head Latent Attention (MLA) concept: While not explicitly dated, MLA existed prior to DeepSeek-V2, suggesting its development before late 2023/early 2024.
Introduction of Shared Expert in MoE: The concept of a shared expert that is consistently active in Mixture-of-Experts (MoE) modules, enhancing overall modeling performance, is introduced in earlier DeepSeek and DeepSpeedMoE papers.
2023:
QK-Norm in Vision Transformers: The concept of QK-Norm (an RMSNorm layer within the Multi-Head Attention module applied to queries and keys) is first described in a 2023 paper.
Undated (Pre-2024):
Transition from LayerNorm to RMSNorm: Many LLMs, including Llama, Gemma, and later OLMo 2, transition from using LayerNorm to the simplified RMSNorm.
Evolution of Positional Embeddings: Positional embeddings evolve from absolute to rotational (RoPE).
Shift from Multi-Head Attention (MHA) to Grouped-Query Attention (GQA): GQA emerges as a standard, more computationally and parameter-efficient alternative to MHA.
Adoption of SwiGLU Activation Functions: More efficient SwiGLU activation functions replace GELU in many architectures.
2024:
December 2024:DeepSeek V3 Introduction: The DeepSeek V3 architecture is introduced, featuring Multi-Head Latent Attention (MLA) and Mixture-of-Experts (MoE) layers, including a shared expert. It boasts 671 billion total parameters, but only activates 37 billion per inference step due to MoE.
Llama 3 Release: A 405 billion parameter model is released.
Llama 4 Release: (Mentioned as 2024-2025, implying its release in this period).
2025:
January 2025:DeepSeek R1 Release: A reasoning model built upon the DeepSeek V3 architecture is released, garnering considerable attention and leading to widespread adoption of DeepSeek V3.
OLMo 2 Models Release: The Allen Institute for AI releases the OLMo 2 series of models. These models are notable for their architectural design choices around normalization, specifically a variant of Post-LN and the inclusion of QK-Norm, both contributing to training stability. OLMo 2 initially uses traditional MHA.
Gemma 3 (Mentioned as Contemporary): Google's Gemma 3 is highlighted as a contemporary model that uses sliding window attention for computational cost reduction and has a large vocabulary size.
Cast of Characters
GPT (Generative Pre-trained Transformer): An influential foundational LLM architecture that emerged prior to 2019, establishing a long-lasting structural similarity in subsequent models.
GPT-2: An early iteration of the GPT architecture, released in 2019, signifying the beginning of the modern LLM era discussed in the sources.
DeepSeek-V2: A predecessor to DeepSeek V3, which notably incorporated Multi-Head Latent Attention (MLA).
DeepSeek V3: A monumental 671-billion-parameter LLM architecture introduced in December 2024. It is distinguished by its use of Multi-Head Latent Attention (MLA) and Mixture-of-Experts (MoE) with a shared expert, enabling high efficiency despite its large size.
DeepSeek R1: A reasoning model released in January 2025, built upon the DeepSeek V3 architecture.
Llama (Series of Models): A prominent series of LLMs, including Llama 3 (405B parameters, released in 2024) and Llama 4 (2024-2025).
OLMo (Series of Models): LLMs developed by the non-profit Allen Institute for AI. They are known for their transparency regarding training data and code.
OLMo 2: A specific model in the OLMo series released in January 2025, notable for its unique normalization layer placements (a variant of Post-LN) and the inclusion of QK-Norm to enhance training stability. It initially utilized traditional Multi-Head Attention (MHA).
Gemma (Series of Models): Google's series of LLMs, including Gemma 3, known for strong performance, a large vocabulary size, and a focus on models around the 27B parameter size. Gemma 3 specifically uses sliding window attention for efficiency.
Allen Institute for AI: A non-profit organization responsible for the development of the OLMo series of models, known for their commitment to transparency in LLM development.
Google: The developer of the Gemma series of LLMs.
Видео GenAI Futures. Part-1. LLM Architecture Evolution. https://www.bytegoose.com канала Byte Goose AI.
LLM architecture Timeline:
2019:
Emergence of GPT-2: Marks the beginning of the modern era of Large Language Models (LLMs), establishing a foundational architecture that would persist with remarkable structural similarity for years.
Early LLM Architectures: These models typically employ Multi-Head Attention (MHA) and GELU activation functions, along with absolute positional embeddings. Normalization layers are commonly placed after attention and FeedForward modules (Post-LN/Post-Norm).
2020:
Introduction of Pre-LN (Pre-Norm): Research indicates that placing normalization layers before attention and FeedForward modules leads to more stable gradients during initialization and effective training without meticulous learning rate warm-up.
Undated (Pre-DeepSeek V2):
Introduction of Multi-Head Latent Attention (MLA) concept: While not explicitly dated, MLA existed prior to DeepSeek-V2, suggesting its development before late 2023/early 2024.
Introduction of Shared Expert in MoE: The concept of a shared expert that is consistently active in Mixture-of-Experts (MoE) modules, enhancing overall modeling performance, is introduced in earlier DeepSeek and DeepSpeedMoE papers.
2023:
QK-Norm in Vision Transformers: The concept of QK-Norm (an RMSNorm layer within the Multi-Head Attention module applied to queries and keys) is first described in a 2023 paper.
Undated (Pre-2024):
Transition from LayerNorm to RMSNorm: Many LLMs, including Llama, Gemma, and later OLMo 2, transition from using LayerNorm to the simplified RMSNorm.
Evolution of Positional Embeddings: Positional embeddings evolve from absolute to rotational (RoPE).
Shift from Multi-Head Attention (MHA) to Grouped-Query Attention (GQA): GQA emerges as a standard, more computationally and parameter-efficient alternative to MHA.
Adoption of SwiGLU Activation Functions: More efficient SwiGLU activation functions replace GELU in many architectures.
2024:
December 2024:DeepSeek V3 Introduction: The DeepSeek V3 architecture is introduced, featuring Multi-Head Latent Attention (MLA) and Mixture-of-Experts (MoE) layers, including a shared expert. It boasts 671 billion total parameters, but only activates 37 billion per inference step due to MoE.
Llama 3 Release: A 405 billion parameter model is released.
Llama 4 Release: (Mentioned as 2024-2025, implying its release in this period).
2025:
January 2025:DeepSeek R1 Release: A reasoning model built upon the DeepSeek V3 architecture is released, garnering considerable attention and leading to widespread adoption of DeepSeek V3.
OLMo 2 Models Release: The Allen Institute for AI releases the OLMo 2 series of models. These models are notable for their architectural design choices around normalization, specifically a variant of Post-LN and the inclusion of QK-Norm, both contributing to training stability. OLMo 2 initially uses traditional MHA.
Gemma 3 (Mentioned as Contemporary): Google's Gemma 3 is highlighted as a contemporary model that uses sliding window attention for computational cost reduction and has a large vocabulary size.
Cast of Characters
GPT (Generative Pre-trained Transformer): An influential foundational LLM architecture that emerged prior to 2019, establishing a long-lasting structural similarity in subsequent models.
GPT-2: An early iteration of the GPT architecture, released in 2019, signifying the beginning of the modern LLM era discussed in the sources.
DeepSeek-V2: A predecessor to DeepSeek V3, which notably incorporated Multi-Head Latent Attention (MLA).
DeepSeek V3: A monumental 671-billion-parameter LLM architecture introduced in December 2024. It is distinguished by its use of Multi-Head Latent Attention (MLA) and Mixture-of-Experts (MoE) with a shared expert, enabling high efficiency despite its large size.
DeepSeek R1: A reasoning model released in January 2025, built upon the DeepSeek V3 architecture.
Llama (Series of Models): A prominent series of LLMs, including Llama 3 (405B parameters, released in 2024) and Llama 4 (2024-2025).
OLMo (Series of Models): LLMs developed by the non-profit Allen Institute for AI. They are known for their transparency regarding training data and code.
OLMo 2: A specific model in the OLMo series released in January 2025, notable for its unique normalization layer placements (a variant of Post-LN) and the inclusion of QK-Norm to enhance training stability. It initially utilized traditional Multi-Head Attention (MHA).
Gemma (Series of Models): Google's series of LLMs, including Gemma 3, known for strong performance, a large vocabulary size, and a focus on models around the 27B parameter size. Gemma 3 specifically uses sliding window attention for efficiency.
Allen Institute for AI: A non-profit organization responsible for the development of the OLMo series of models, known for their commitment to transparency in LLM development.
Google: The developer of the Gemma series of LLMs.
Видео GenAI Futures. Part-1. LLM Architecture Evolution. https://www.bytegoose.com канала Byte Goose AI.
Комментарии отсутствуют
Информация о видео
20 июля 2025 г. 4:10:20
00:18:19
Другие видео канала





![[Transformers] LLM Transformers - The Essential LLM technical guide. Transformer Explained.](https://i.ytimg.com/vi/5UExFKL743o/default.jpg)


![[GLM-5.0 Model] From Vibe Coding to Agentic Engineering The Power of Scalable Reinforcement Learning](https://i.ytimg.com/vi/Yx3bLY2PBwE/default.jpg)

![[Explainable AI] Statistical Mechanics of Explainable Artificial Intelligence. Hebbian Neural Nets.](https://i.ytimg.com/vi/NeXd55usAe0/default.jpg)


![[NVIDIA Cosmos] Why Real-World Foundation Models? Next AI VLM & LLM Paradigm for Physical World](https://i.ytimg.com/vi/KQFqS5p1U7U/default.jpg)
![[GEPA] LLM prompt tuning: Reflective Prompt Evolution for Efficient LLM Optimization (Genetic-Pareto](https://i.ytimg.com/vi/0pP3VmIs8-E/default.jpg)






