I Made The Smallest (And Dumbest) Image Generation Model

I Made the Smallest and Dumbest Image Generation Model (Under 2GB)

What happens when you try to squeeze Stable Diffusion into less VRAM than a Chrome tab? You get something beautifully broken and surprisingly functional. In this video, I take you through two insane experiments: compressing Stable Diffusion to 1.5GB of VRAM using aggressive quantization, and building a completely different architecture using Residual Quantized VAEs that... well, let's just say it needed some help.

🚀 What You'll Learn:
• How LoRA lets you train with 10M parameters instead of 860M
• The brutal truth about Q4_K quantization (4 bits per weight!)
• Why a tiny 15-25M parameter UNet saved everything
• Training on the LAPIS dataset for aesthetic-driven art generation

🛠️ Part 1 - Extreme Stable Diffusion Compression:
• LoRA fine-tuning for memory-efficient training
• Q4_K quantization: 4 bits per weight with minimal quality loss
• 8-bit CLIP quantization as a compromise
• CPU offloading the VAE decoder
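
For the curious, here is roughly what the LoRA idea looks like in plain Python. This is a minimal sketch, not the training code from the video: the `LoRALinear` class, the layer sizes, the rank, and the random stand-in weights are all assumptions made for illustration. The point is that the big weight matrix stays frozen and only two small low-rank factors are trained.

```python
import random

def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A.

    Hypothetical illustration class, not taken from any library.
    """

    def __init__(self, d_in, d_out, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        # Frozen "pretrained" weight (random stand-in for this sketch).
        self.W = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # Trainable factors: A starts small and random, B starts at zero,
        # so the adapter initially leaves the layer's output unchanged.
        self.A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)                    # frozen path
        delta = matvec(self.B, matvec(self.A, x))   # low-rank path
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def frozen_params(self):
        return len(self.W) * len(self.W[0])

layer = LoRALinear(d_in=64, d_out=64, rank=4)
x = [1.0] * 64
y = layer.forward(x)
print("frozen:", layer.frozen_params(), "trainable:", layer.trainable_params())
```

With these toy sizes the adapter trains 512 values next to 4,096 frozen ones. The same ratio argument is what makes a roughly 10M-parameter adapter on an 860M-parameter UNet plausible.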

🎨 Part 2 - The RQ-VAE Experiment:
• Residual Vector Quantization for progressive image compression
• Four-layer quantization: structure → edges → textures → details
• The refinement UNet rescue mission
• L1 + Perceptual loss for sharp, high-quality outputs
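
The four-layer idea above can be sketched as a tiny residual quantizer. This is an assumption-heavy toy: the codebooks are random stand-ins rather than learned ones, and the vector size, codebook size, and function names (`rq_encode`, `rq_decode`) are made up for illustration. It only shows the mechanism: each level quantizes whatever the previous levels failed to capture.

```python
import random

def nearest(codebook, v):
    """Index of the codebook entry closest to v (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - x) ** 2 for c, x in zip(codebook[i], v)),
    )

def rq_encode(x, codebooks):
    """Quantize x level by level; each level encodes the remaining residual."""
    residual = list(x)
    codes = []
    for cb in codebooks:
        idx = nearest(cb, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return codes, residual

def rq_decode(codes, codebooks):
    """Sum the selected entry from every level to rebuild the vector."""
    out = [0.0] * len(codebooks[0][0])
    for idx, cb in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out

rng = random.Random(0)
dim, levels, size = 8, 4, 16  # assumed toy sizes
# Random stand-in codebooks whose entries shrink with depth (coarse to fine).
codebooks = [
    [[rng.gauss(0, 1.0 / 2 ** lvl) for _ in range(dim)] for _ in range(size)]
    for lvl in range(levels)
]
x = [rng.gauss(0, 1) for _ in range(dim)]
codes, residual = rq_encode(x, codebooks)
x_hat = rq_decode(codes, codebooks)
print("codes per vector:", codes)
```

Whatever is left in `residual` after the last level is exactly the detail the decoder never sees, which is the gap the refinement UNet is there to fill.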

📊 The Numbers:
• Stable Diffusion: 4GB → 1.5GB (62.5% reduction)
• UNet: 860M params → 330MB quantized
• CLIP: 123M params → 250MB (8-bit)
• VAE: Moved to CPU (saved 160MB VRAM)
• RQ-VAE: 256×256 image → 256 tokens → sharp output
• Refinement UNet: 15-25M parameters
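
A quick sanity check on some of these figures. The 4GB and 1.5GB values, the 256×256 image, the 256 tokens, and the four quantization layers come from the description; the 16× downsampling factor and the 1,024-entry codebook size are assumptions picked so the arithmetic lines up (256 tokens could also be counted differently, for example across layers).

```python
import math

# Figures stated in the description.
vram_before_gb = 4.0
vram_after_gb = 1.5
image_side = 256
levels = 4

# Assumptions for this sketch (not stated in the description).
downsample = 16        # encoder shrinks each side by this factor
codebook_size = 1024   # entries per codebook

reduction = (vram_before_gb - vram_after_gb) / vram_before_gb
grid_side = image_side // downsample
tokens = grid_side * grid_side
combos_per_position = codebook_size ** levels
bits_per_position = levels * math.log2(codebook_size)

print(f"VRAM reduction: {reduction:.1%}")
print(f"token grid: {grid_side}x{grid_side} = {tokens} positions")
print(f"code combinations per position: {combos_per_position:,}")
print(f"bits per position: {bits_per_position:.0f}")
```

Under these assumptions, four stacked codebooks give each position over a trillion possible combinations while storing only four small indices.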

🎭 The LAPIS Dataset:
This project uses the LAPIS dataset - a carefully curated collection of artworks rated by individuals with diverse aesthetic tastes. The outputs lean toward abstract, interpretive art rather than photorealism. This isn't about replacing human creativity, but exploring patterns of aesthetic inspiration, much like artists have always studied art before them.

🔧 Technical Deep Dives:
• Why Q4_K beats naive 4-bit quantization
• Block-wise quantization with scale factors
• The importance of keeping time embeddings in FP16
• How RQ-VAE's exponential combinations work (4 tables = trillion+ codes)
• Why the decoder can only guess smooth averages from heavy compression
• Training residual networks to predict high-frequency details
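
To make the block-wise point concrete, here is a simplified sketch of 4-bit quantization with a per-block scale and minimum. It is not the actual Q4_K storage layout (which is more involved); the function names, the 32-weight block size, and the toy weights with one large outlier are all assumptions for illustration.

```python
def quantize_block(block):
    """Map floats to 4-bit integers (0..15) using a per-block scale and minimum."""
    lo, hi = min(block), max(block)
    scale = (hi - lo) / 15 if hi > lo else 1.0  # guard against constant blocks
    q = [min(15, max(0, round((w - lo) / scale))) for w in block]
    return q, scale, lo

def dequantize_block(q, scale, lo):
    """Recover approximate floats from 4-bit codes."""
    return [lo + scale * v for v in q]

def roundtrip(weights, block_size):
    """Quantize and dequantize in independent blocks of block_size weights."""
    out = []
    for i in range(0, len(weights), block_size):
        q, scale, lo = quantize_block(weights[i:i + block_size])
        out.extend(dequantize_block(q, scale, lo))
    return out

# Toy weights: 63 small values in [-0.1, 0.1] plus a single outlier of 8.0.
weights = [((i * 37) % 21 - 10) / 100 for i in range(63)] + [8.0]

def max_err(block_size, first_n=32):
    """Worst absolute error over the first first_n weights."""
    restored = roundtrip(weights, block_size)
    return max(abs(a - b) for a, b in zip(weights[:first_n], restored[:first_n]))

print("one scale for all 64 weights:", max_err(block_size=64))
print("one scale per 32 weights:   ", max_err(block_size=32))
```

With a single scale, the outlier stretches the 16 available levels across a huge range and the small weights collapse onto one value. With per-block scales, only the block containing the outlier pays that price.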

🔔 Don't Miss Out!
If you enjoyed watching me torture AI models in the name of aggressive optimization, SMASH that subscribe button! Your support means everything and helps me create more gloriously educational tech content.

📚 Resources & Code:
• Full training scripts (link in pinned comment)

🎵 Music Credits:
Windmill Isle (Day) - Sonic Unleashed [OST]
I do not own this music and do not intend to infringe its copyright; to my knowledge, the owners of this material originally released it free of charge.

🙏 Special Thanks:
• Colab for the Compute
• LAPIS dataset creators

💬 Let's Discuss:
• What's your experience with model quantization?
• Have you tried training on limited hardware?
• What compression techniques should I try next?
• Drop your questions in the comments!

---

This model won't win any benchmarks. But if you're curious about what happens when you push compression to its absolute limits and aren't afraid to try architectures that "shouldn't work," you're in the right place.

🎯 Keywords:
#AI #StableDiffusion #ImageGeneration #Quantization #LoRA #MachineLearning #RQVAE #VectorQuantization #DeepLearning #ModelCompression #TextToImage #ArtificialIntelligence #PyTorch #Diffusion #VAE #UNet #AIArt #TechExperiment #Programming #ComputerScience

Video "I Made The Smallest (And Dumbest) Image Generation Model" from the channel Codeically
