Attention is all you need (Transformer) - Model explanation (including math), Inference and Training

A complete explanation of all the layers of a Transformer model, including Multi-Head Self-Attention and Positional Encoding, with all the matrix multiplications and a full description of the training and inference process.

Paper: Attention is all you need - https://arxiv.org/abs/1706.03762

Slides PDF: https://github.com/hkproj/transformer-from-scratch-notes

Chapters
00:00 - Intro
01:10 - RNNs and their problems
08:04 - Transformer Model
09:02 - Maths background and notations
12:20 - Encoder (overview)
12:31 - Input Embeddings
15:04 - Positional Encoding
20:08 - Single Head Self-Attention
28:30 - Multi-Head Attention
35:39 - Query, Key, Value
37:55 - Layer Normalization
40:13 - Decoder (overview)
42:24 - Masked Multi-Head Attention
44:59 - Training
52:09 - Inference

Video: Attention is all you need (Transformer) - Model explanation (including math), Inference and Training, from the channel Umar Jamil

transformer deep learning pytorch ai ml machine learning attention is all you need

Video information

May 28, 2023, 12:46:54

00:58:04

Umar Jamil

Other videos from this channel

LongNet: Scaling Transformers to 1,000,000,000 tokens: Python Code + Explanation

ML Interpretability: feature visualization, adversarial example, interp. for language models

Direct Preference Optimization (DPO) explained: Bradley-Terry model, log probabilities, math

Coding LLaMA 2 from scratch in PyTorch - KV Cache, Grouped Query Attention, Rotary PE, RMSNorm

Flash Attention derived and coded from first principles with Triton (Python)

Coding a Multimodal (Vision) Language Model from scratch in PyTorch with full explanation

Distributed Training with PyTorch: complete tutorial with cloud infrastructure and code

LoRA: Low-Rank Adaptation of Large Language Models - Explained visually + PyTorch code from scratch

Coding Stable Diffusion from scratch in PyTorch

Quantization explained with PyTorch - Post-Training Quantization, Quantization-Aware Training

How diffusion models work - explanation and code!

Reinforcement Learning from Human Feedback explained with math derivations and the PyTorch code.

Titans: Learning to Memorize at Test Time

Mamba and S4 Explained: Architecture, Parallel Scan, Kernel Fusion, Recurrent, Convolution, Math

Retrieval Augmented Generation (RAG) Explained: Embedding, Sentence BERT, Vector Database (HNSW)

BERT explained: Training, Inference, BERT vs GPT/LLamA, Fine tuning, [CLS] token

Coding a Transformer from scratch on PyTorch, with full explanation, training and inference.

LLaMA explained: KV-Cache, Rotary Positional Embedding, RMS Norm, Grouped Query Attention, SwiGLU

Paper: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Mistral / Mixtral Explained: Sliding Window Attention, Sparse Mixture of Experts, Rolling Buffer

Variational Autoencoder - Model, ELBO, loss function and maths explained easily!