How LLMs Understand Text: Tokenization, Encoding, and BPE Explained | LLMs From Scratch Part 1

This is Part 1 of a series on how Large Language Models like GPT, Gemini, and other transformer-based AI systems work under the hood.

In this lecture, we start from the beginning: how raw text becomes something a machine learning model can process. We cover the big picture, data pre-processing, tokenization, encoding, vocabulary design, character vs word tokenization, and Byte-Pair Encoding (BPE), the sub-word tokenization approach used in many modern language models.

This episode is meant to build the foundation for later parts of the series, where we’ll move into self-supervised training, embeddings, positional encodings, transformer blocks, attention, feedforward networks, loss, back-propagation, fine-tuning, and inference.

If you’re a student, software engineer, or AI enthusiast trying to understand how LLMs actually work beyond the hype, this series is for you.

Reference article: https://decodelm.com/articles/f4c7ea89-46ae-4d8c-825a-78b09e0bd330

Timestamps:
0:00 Introduction
2:20 What are LLMs
4:45 The big picture - Inference and Training
10:40 Training data and pre-processing
13:30 Representing data - Tokens and Vocabulary
16:30 Character tokenization and encoding
21:35 Word tokenization
23:40 Context window
26:50 Character vs word tokenization
32:50 Byte-Pair Encoding (BPE)
36:30 BPE pseudocode and example
47:10 Why BPE
50:15 Conclusion

#LLM #LargeLanguageModels #MachineLearning #ArtificialIntelligence #NLP #Tokenization #BytePairEncoding

Video: How LLMs Understand Text: Tokenization, Encoding, and BPE Explained | LLMs From Scratch Part 1, from the channel Vishal Sahoo