Ep 39: Quantization — Running AI on Your Phone | LLM Mastery Podcast
Here's what you need to know about quantization and running AI on your phone:
- Quantization reduces the number of bits per parameter. Going from FP32 (32 bits) to int4 (4 bits) shrinks a model's weights by 8x, often with surprisingly small quality loss.
- Int8 is the "free lunch" of compression. For most models, int8 quantization preserves over 99% of quality while halving memory from FP16 and quartering it from FP32.
- PTQ is fast; QAT is better. Post-training quantization is quick and easy. Quantization-aware training requires more compute but produces higher-quality quantized models.
- Larger quantized models beat smaller precise models. A 70B model at int4 usually outperforms a 13B model at FP16, despite broadly comparable memory footprints (roughly 35 GB versus 26 GB for the weights alone).
- Quantization enabled the open-source LLM revolution. Without quantization, running powerful language models on consumer hardware would be impractical for most users.
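To make the numbers in the list above concrete, here is a minimal Python sketch. It is not from the episode: the helper names (`weight_memory_gb`, `quantize_int8`, `dequantize_int8`) are hypothetical, the memory estimate counts weights only (no scales, activations, or KV cache), and the int8 scheme shown is plain symmetric per-tensor rounding, which is only one of several ways post-training quantization can be done.

```python
# Illustrative sketch only. Helper names and the symmetric per-tensor scheme
# are assumptions for this example, not the method described in the episode.

BITS_PER_PARAM = {"fp32": 32, "fp16": 16, "int8": 8, "int4": 4}


def weight_memory_gb(num_params: float, fmt: str) -> float:
    """Approximate weight storage in GB (10**9 bytes).

    Ignores quantization scales, activations, and any runtime overhead.
    """
    return num_params * BITS_PER_PARAM[fmt] / 8 / 1e9


def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step, then clamp to the int8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]


if __name__ == "__main__":
    # Weight-only footprints for the two models compared in the list.
    print("70B @ int4:", weight_memory_gb(70e9, "int4"), "GB")
    print("13B @ fp16:", weight_memory_gb(13e9, "fp16"), "GB")

    # Round-trip a few made-up weights and show the per-weight error.
    weights = [0.5, -1.0, 0.25, 0.003]
    q, scale = quantize_int8(weights)
    restored = dequantize_int8(q, scale)
    for w, r in zip(weights, restored):
        print(f"{w:+.4f} -> {r:+.4f} (error {abs(w - r):.5f})")
```

Under these assumptions, the round-trip error for each weight is bounded by half of one quantization step (`scale / 2`), which gives some intuition for why int8 tends to cost little quality. Real toolchains typically use finer-grained scales (per channel or per group) and calibration data, so treat this as intuition rather than a production recipe.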
Up next: We've now covered the major transformer architectures (GPT, T5, ViT) and the key tricks for making them practical (distillation, quantization). Next, we'll shift our focus to how these models are actually trained and fine-tuned for specific tasks — the techniques that turn a general-purpose language model into a specialized assistant, code generator, or creative tool.
---
Series: LLM Mastery Podcast | Module: Foundations
138 episodes taking you from zero to production with LLMs.
#AI #LLM #MachineLearning #Podcast #SoftwareEngineering #Foundations
Carlos Hernandez | roclas.com
Video "Ep 39: Quantization — Running AI on Your Phone | LLM Mastery Podcast" from the channel carlos Hernandez
Video information
March 19, 2026, 16:28:09
00:21:00