Coding a Transformer from scratch on PyTorch, with full explanation, training and inference.

In this video I show how to code a Transformer model from scratch using PyTorch. I highly recommend watching my previous video to understand the underlying concepts, but I also review them in this video while coding. All of the code is mine, except for the attention visualization function that plots the chart, which I found online on Harvard University's website.

Paper: Attention is all you need - https://arxiv.org/abs/1706.03762

The full code is available on GitHub: https://github.com/hkproj/pytorch-transformer
It also includes a Colab Notebook so you can train the model directly on Colab.

Chapters
00:00:00 - Introduction
00:01:20 - Input Embeddings
00:04:56 - Positional Encodings
00:13:30 - Layer Normalization
00:18:12 - Feed Forward
00:21:43 - Multi-Head Attention
00:42:41 - Residual Connection
00:44:50 - Encoder
00:51:52 - Decoder
00:59:20 - Linear Layer
01:01:25 - Transformer
01:17:00 - Task overview
01:18:42 - Tokenizer
01:31:35 - Dataset
01:55:25 - Training loop
02:20:05 - Validation loop
02:41:30 - Attention visualization
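
As a rough illustration of two of the building blocks named in the chapters (positional encodings and the attention computation at the core of multi-head attention), here is a minimal sketch that follows the formulas in the paper. It is not taken from the video or its repository: the function names, tensor shapes, the `-1e9` fill value, and the mask convention (0 means "do not attend") are all assumptions made for this example.

```python
import math
import torch


def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Return a (seq_len, d_model) tensor of sinusoidal encodings.

    Hypothetical helper; assumes d_model is even.
    """
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    # 1 / 10000^(2i / d_model), computed in log space
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe


def scaled_dot_product_attention(q, k, v, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V, with an optional mask (assumed: 0 = masked out)."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # A large negative score becomes ~0 probability after the softmax
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = scores.softmax(dim=-1)
    return weights @ v, weights


# Usage: a batch of 2 sequences, 5 tokens each, model dimension 8
x = torch.randn(2, 5, 8)
x = x + sinusoidal_positional_encoding(5, 8)   # broadcast over the batch
causal_mask = torch.tril(torch.ones(5, 5))     # each token sees itself and earlier tokens
out, weights = scaled_dot_product_attention(x, x, x, causal_mask)
```

A full multi-head layer would additionally project the input into queries, keys and values with linear layers and split them across heads before this step; that part is left out here to keep the sketch short.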

Video "Coding a Transformer from scratch on PyTorch, with full explanation, training and inference." from the channel Umar Jamil

pytorch coding deep learning ai transformer

Video information

May 26, 2023, 4:01:05

02:59:24

Umar Jamil

Other videos from this channel

LongNet: Scaling Transformers to 1,000,000,000 tokens: Python Code + Explanation

ML Interpretability: feature visualization, adversarial example, interp. for language models

Direct Preference Optimization (DPO) explained: Bradley-Terry model, log probabilities, math

Coding LLaMA 2 from scratch in PyTorch - KV Cache, Grouped Query Attention, Rotary PE, RMSNorm

Flash Attention derived and coded from first principles with Triton (Python)

Coding a Multimodal (Vision) Language Model from scratch in PyTorch with full explanation

Distributed Training with PyTorch: complete tutorial with cloud infrastructure and code

Coding Stable Diffusion from scratch in PyTorch

LoRA: Low-Rank Adaptation of Large Language Models - Explained visually + PyTorch code from scratch

Quantization explained with PyTorch - Post-Training Quantization, Quantization-Aware Training

How diffusion models work - explanation and code!

Reinforcement Learning from Human Feedback explained with math derivations and the PyTorch code.

Mamba and S4 Explained: Architecture, Parallel Scan, Kernel Fusion, Recurrent, Convolution, Math

Titans: Learning to Memorize at Test Time

Retrieval Augmented Generation (RAG) Explained: Embedding, Sentence BERT, Vector Database (HNSW)

BERT explained: Training, Inference, BERT vs GPT/LLamA, Fine tuning, [CLS] token

LLaMA explained: KV-Cache, Rotary Positional Embedding, RMS Norm, Grouped Query Attention, SwiGLU

Paper: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Mistral / Mixtral Explained: Sliding Window Attention, Sparse Mixture of Experts, Rolling Buffer

Variational Autoencoder - Model, ELBO, loss function and maths explained easily!