GenAI Futures. Part-1. LLM Architecture Evolution. https://www.bytegoose.com

GenAI Futures. LLM Architecture Evolution.
LLM architecture Timeline:
2019:
Emergence of GPT-2: Marks the beginning of the modern era of Large Language Models (LLMs), establishing a foundational architecture that would persist with remarkable structural similarity for years.
Early LLM Architectures: These models typically employ Multi-Head Attention (MHA) and GELU activation functions, along with absolute positional embeddings. Normalization layers are commonly placed after attention and FeedForward modules (Post-LN/Post-Norm).
2020:
Introduction of Pre-LN (Pre-Norm): Research indicates that placing normalization layers before attention and FeedForward modules leads to more stable gradients during initialization and effective training without meticulous learning rate warm-up.
Undated (Pre-DeepSeek V2):
Introduction of Multi-Head Latent Attention (MLA) concept: While not explicitly dated, MLA existed prior to DeepSeek-V2, suggesting its development before late 2023/early 2024.
Introduction of Shared Expert in MoE: The concept of a shared expert that is consistently active in Mixture-of-Experts (MoE) modules, enhancing overall modeling performance, is introduced in earlier DeepSeek and DeepSpeedMoE papers.
2023:
QK-Norm in Vision Transformers: The concept of QK-Norm (an RMSNorm layer within the Multi-Head Attention module applied to queries and keys) is first described in a 2023 paper.
Undated (Pre-2024):
Transition from LayerNorm to RMSNorm: Many LLMs, including Llama, Gemma, and later OLMo 2, transition from using LayerNorm to the simplified RMSNorm.
Evolution of Positional Embeddings: Positional embeddings evolve from absolute to rotational (RoPE).
Shift from Multi-Head Attention (MHA) to Grouped-Query Attention (GQA): GQA emerges as a standard, more computationally and parameter-efficient alternative to MHA.
Adoption of SwiGLU Activation Functions: More efficient SwiGLU activation functions replace GELU in many architectures.
2024:
December 2024:DeepSeek V3 Introduction: The DeepSeek V3 architecture is introduced, featuring Multi-Head Latent Attention (MLA) and Mixture-of-Experts (MoE) layers, including a shared expert. It boasts 671 billion total parameters, but only activates 37 billion per inference step due to MoE.
Llama 3 Release: A 405 billion parameter model is released.
Llama 4 Release: (Mentioned as 2024-2025, implying its release in this period).
2025:
January 2025:DeepSeek R1 Release: A reasoning model built upon the DeepSeek V3 architecture is released, garnering considerable attention and leading to widespread adoption of DeepSeek V3.
OLMo 2 Models Release: The Allen Institute for AI releases the OLMo 2 series of models. These models are notable for their architectural design choices around normalization, specifically a variant of Post-LN and the inclusion of QK-Norm, both contributing to training stability. OLMo 2 initially uses traditional MHA.
Gemma 3 (Mentioned as Contemporary): Google's Gemma 3 is highlighted as a contemporary model that uses sliding window attention for computational cost reduction and has a large vocabulary size.
Cast of Characters
GPT (Generative Pre-trained Transformer): An influential foundational LLM architecture that emerged prior to 2019, establishing a long-lasting structural similarity in subsequent models.
GPT-2: An early iteration of the GPT architecture, released in 2019, signifying the beginning of the modern LLM era discussed in the sources.
DeepSeek-V2: A predecessor to DeepSeek V3, which notably incorporated Multi-Head Latent Attention (MLA).
DeepSeek V3: A monumental 671-billion-parameter LLM architecture introduced in December 2024. It is distinguished by its use of Multi-Head Latent Attention (MLA) and Mixture-of-Experts (MoE) with a shared expert, enabling high efficiency despite its large size.
DeepSeek R1: A reasoning model released in January 2025, built upon the DeepSeek V3 architecture.
Llama (Series of Models): A prominent series of LLMs, including Llama 3 (405B parameters, released in 2024) and Llama 4 (2024-2025).
OLMo (Series of Models): LLMs developed by the non-profit Allen Institute for AI. They are known for their transparency regarding training data and code.
OLMo 2: A specific model in the OLMo series released in January 2025, notable for its unique normalization layer placements (a variant of Post-LN) and the inclusion of QK-Norm to enhance training stability. It initially utilized traditional Multi-Head Attention (MHA).
Gemma (Series of Models): Google's series of LLMs, including Gemma 3, known for strong performance, a large vocabulary size, and a focus on models around the 27B parameter size. Gemma 3 specifically uses sliding window attention for efficiency.
Allen Institute for AI: A non-profit organization responsible for the development of the OLMo series of models, known for their commitment to transparency in LLM development.
Google: The developer of the Gemma series of LLMs.

Видео GenAI Futures. Part-1. LLM Architecture Evolution. https://www.bytegoose.com канала Byte Goose AI.

Generative AI Futures Podcast Series GenAI LLM VLM

Комментарии отсутствуют

Информация о видео

20 июля 2025 г. 4:10:20

00:18:19

Byte Goose AI.

Теги

Правообладателям

Жалоба на материал Недопустимый материал Нарушение авторских прав

Комментарии

Другие видео канала

GenAI Futures. Part-1. LLM Architecture Evolution. https://www.bytegoose.com

TOP AI STARTUPS (HIRING LIST). 30 top AI-native startups all hiring generalists

Agentic Reinforcement Learning (RL) for Large Language Models (LLM).Markov Decision Processes (MDPs)

Text-to-LoRA: Instant Transformer Adaptation. Supervised fine-tuning (SFT), Hyper Networks for LLMs

"V-JEPA Learns Intuitive Physics Through Representation Prediction. Model: Vision-JEPA Architecture.

[Transformers] LLM Transformers - The Essential LLM technical guide. Transformer Explained.

Reinforcement Learning from Human Feedback (RLHF) - Explained in 10 minutes.

Machines that invent. Flow Matching vs. Diffusion: Mastering ODEs and SDEs in Generative Modeling

[GLM-5.0 Model] From Vibe Coding to Agentic Engineering The Power of Scalable Reinforcement Learning

LTX-2 Explained: The New Open-Source Standard for Synchronized AI Video and Audio

[Explainable AI] Statistical Mechanics of Explainable Artificial Intelligence. Hebbian Neural Nets.

Geometric Deep Learning. Group Theory, Differential Geometry, and Topology. Math insights. Review.

Modal: Serverless AI Infrastructure in Python. Generative AI infra as Code.

[NVIDIA Cosmos] Why Real-World Foundation Models? Next AI VLM & LLM Paradigm for Physical World

[GEPA] LLM prompt tuning: Reflective Prompt Evolution for Efficient LLM Optimization (Genetic-Pareto

A Roadmap of Generative Al in Physics

Google AI Co-Scientist vs Sakana AI: 10 Years of Research in 48–72 Hours. Compute scaling.

HRM: A human brain-inspired replacement to Chain-of-Thought

Diffusion Transformers with Representation Autoencoders VAE (e.g., DINO, SigLIP, MAE) with (RAEs)

Kosmos AI: Reproducing and Generating Scientific Discoveries. Generative AI for Science.

AI in Insurance 2025. Generative AI for Insurance, Agentic, LLMs, AI Forecasting cat insurance

Sparser-Faster LLMs: Breaking the Compute Wall with ReLU and TwELL CUDA Architecture. SAE models.