Coding LLaMA 2 from scratch in PyTorch - KV Cache, Grouped Query Attention, Rotary PE, RMSNorm

Full coding of LLaMA 2 from scratch, with full explanation, including Rotary Positional Embedding, RMS Normalization, Multi-Query Attention, KV Cache, Grouped Query Attention (GQA), the SwiGLU Activation function and more!
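
As a rough illustration of one of the components listed above, here is a minimal sketch of RMS Normalization. It is written in plain Python rather than PyTorch so that it runs without dependencies, and it is not the code from the video: the function name `rms_norm`, the `gamma` gain vector, and the `eps` default are assumptions made for this example.

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    """Divide x by its root mean square, then apply a per-feature gain.

    Unlike LayerNorm, the mean is not subtracted; only the scale is normalized.
    `gamma` and `eps` are illustrative names, not taken from the video.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [g * v * inv_rms for v, g in zip(x, gamma)]

# With a gain of 1 everywhere, the output has a root mean square close to 1.
normed = rms_norm([3.0, 4.0], [1.0, 1.0])
```

In a PyTorch model the same computation would typically operate on the last dimension of a tensor, with `gamma` as a learnable parameter.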

I explain the most commonly used inference strategies: Greedy, Beam Search, Temperature Scaling, Random Sampling, Top K, and Top P.
I also explain the math behind the Rotary Positional Embedding, with step-by-step proofs.
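
To make the sampling strategies named above more concrete, the following is a minimal sketch of temperature scaling combined with Top P (nucleus) sampling. It uses only the Python standard library instead of PyTorch, and the helper names `softmax`, `top_p_filter`, and `sample_top_p` are hypothetical, chosen for this example rather than taken from the video or the repository.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p):
    """Keep the smallest set of most-probable tokens whose cumulative
    probability reaches p, and renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return {i: probs[i] / cumulative for i in kept}

def sample_top_p(logits, p=0.9, temperature=1.0, rng=random):
    """Sample one token index from the nucleus of the distribution."""
    nucleus = top_p_filter(softmax(logits, temperature), p)
    tokens = list(nucleus)
    weights = [nucleus[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

logits = [2.0, 1.0, 0.1, -1.0]
nucleus = top_p_filter(softmax(logits), 0.8)          # keeps token indices 0 and 1
token = sample_top_p(logits, p=0.8, rng=random.Random(0))
```

Greedy decoding corresponds to always picking the index with the largest logit, and Top K would replace the cumulative-probability cutoff with a fixed number of kept tokens.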

Repository with PDF slides: https://github.com/hkproj/pytorch-llama
Download the weights from: https://github.com/facebookresearch/llama

Prerequisites:
1) Transformer explained: https://www.youtube.com/watch?v=bCz4OMemCcA
2) LLaMA explained: https://www.youtube.com/watch?v=Mn_9W1nCFLo

Chapters
00:00:00 - Introduction
00:01:20 - LLaMA Architecture
00:03:14 - Embeddings
00:05:22 - Coding the Transformer
00:19:55 - Rotary Positional Embedding
01:03:50 - RMS Normalization
01:11:13 - Encoder Layer
01:16:50 - Self Attention with KV Cache
01:29:12 - Grouped Query Attention
01:34:14 - Coding the Self Attention
02:01:40 - Feed Forward Layer with SwiGLU
02:08:50 - Model weights loading
02:21:26 - Inference strategies
02:25:15 - Greedy Strategy
02:27:28 - Beam Search
02:31:13 - Temperature
02:32:52 - Random Sampling
02:34:27 - Top K
02:37:03 - Top P
02:38:59 - Coding the Inference

Video: Coding LLaMA 2 from scratch in PyTorch - KV Cache, Grouped Query Attention, Rotary PE, RMSNorm, from the channel Umar Jamil

deep learning pytorch ai ml machine learning paper review llama llm large language model coding coding from scratch kv-cache grouped query attention swiglu rmsnorm rotary positional embeddings kv cache

Video information

September 3, 2023, 8:56:57

03:04:11

Umar Jamil

Other videos from this channel

LongNet: Scaling Transformers to 1,000,000,000 tokens: Python Code + Explanation

ML Interpretability: feature visualization, adversarial example, interp. for language models

Flash Attention derived and coded from first principles with Triton (Python)

Direct Preference Optimization (DPO) explained: Bradley-Terry model, log probabilities, math

Coding a Multimodal (Vision) Language Model from scratch in PyTorch with full explanation

Distributed Training with PyTorch: complete tutorial with cloud infrastructure and code

LoRA: Low-Rank Adaptation of Large Language Models - Explained visually + PyTorch code from scratch

Coding Stable Diffusion from scratch in PyTorch

How diffusion models work - explanation and code!

Quantization explained with PyTorch - Post-Training Quantization, Quantization-Aware Training

Reinforcement Learning from Human Feedback explained with math derivations and the PyTorch code.

Titans: Learning to Memorize at Test Time

Retrieval Augmented Generation (RAG) Explained: Embedding, Sentence BERT, Vector Database (HNSW)

Mamba and S4 Explained: Architecture, Parallel Scan, Kernel Fusion, Recurrent, Convolution, Math

BERT explained: Training, Inference, BERT vs GPT/LLamA, Fine tuning, [CLS] token

Coding a Transformer from scratch on PyTorch, with full explanation, training and inference.

LLaMA explained: KV-Cache, Rotary Positional Embedding, RMS Norm, Grouped Query Attention, SwiGLU

Paper: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Mistral / Mixtral Explained: Sliding Window Attention, Sparse Mixture of Experts, Rolling Buffer

Kolmogorov-Arnold Networks: MLP vs KAN, Math, B-Splines, Universal Approximation Theorem