Ep 39: Quantization — Running AI on Your Phone | LLM Mastery Podcast

Here's what you need to know about quantization and running AI on your phone:

- Quantization reduces the number of bits per parameter. Going from FP32 (32 bits) to int4 (4 bits) shrinks a model by 8x with often surprisingly small quality loss.
- Int8 is the "free lunch" of compression. For most models, int8 quantization preserves over 99% of quality while halving memory from FP16 and quartering it from FP32.
- PTQ is fast; QAT is better. Post-training quantization is quick and easy. Quantization-aware training requires more compute but produces higher-quality quantized models.
- Larger quantized models beat smaller precise models. A 70B model at int4 usually outperforms a 13B model at FP16, despite comparable weight memory (roughly 35 GB versus 26 GB).
- Quantization enabled the open-source LLM revolution. Without quantization, running powerful language models on consumer hardware would be impractical.
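
To make the arithmetic behind these takeaways concrete, here is a minimal Python sketch. It is an illustration, not code from the episode: the helper names (`model_size_gb`, `quantize_int8`, `dequantize`) are hypothetical, the size estimate counts weights only (no activations, KV cache, or per-group scale metadata), and the quantizer is a simple symmetric per-tensor scheme chosen for clarity rather than any particular library's method.

```python
def model_size_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_param / 8 / 1e9


def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization (assumed scheme for illustration).

    Maps each float to an integer in [-127, 127] using one shared scale.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the int8 codes."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    # Compare weight memory for the configurations mentioned in the list.
    for label, params, bits in [
        ("70B at int4", 70e9, 4),
        ("13B at FP16", 13e9, 16),
        ("7B at FP32", 7e9, 32),
        ("7B at int4", 7e9, 4),
    ]:
        print(f"{label}: {model_size_gb(params, bits):.1f} GB")

    # Round-trip a few example weights and show the per-weight error.
    weights = [0.5, -1.0, 0.25, 0.0]
    codes, scale = quantize_int8(weights)
    restored = dequantize(codes, scale)
    for w, r in zip(weights, restored):
        print(f"original={w:+.4f} restored={r:+.4f} error={abs(w - r):.5f}")
```

With this scheme, the round-trip error per weight is bounded by half the scale step, which is one way to see why int8 tends to lose so little. Real deployments typically use per-channel or per-group scales and lower bit widths, which this sketch does not cover.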

Up next: We've now covered the major transformer architectures (GPT, T5, ViT) and the key tricks for making them practical (distillation, quantization). Next, we'll shift our focus to how these models are actually trained and fine-tuned for specific tasks — the techniques that turn a general-purpose language model into a specialized assistant, code generator, or creative tool.

---
Series: LLM Mastery Podcast | Module: Foundations
138 episodes taking you from zero to production with LLMs.

#AI #LLM #MachineLearning #Podcast #SoftwareEngineering #Foundations

Carlos Hernandez | roclas.com
