How to Engineer AI Inference Systems [Philip Kiely] - 766
In this episode, Philip Kiely, head of AI education at Baseten, joins us to unpack the fast-evolving discipline of inference engineering. We explore why inference has become the stickiest and most critical workload in AI, how it blends GPU programming, applied research, and large-scale distributed systems, and where the line sits between inference and model serving. Philip shares how research can move to production in hours rather than months, and why understanding the "knobs" of inference (batching, quantization, speculation, and KV-cache reuse) lets teams design better products and SLAs. We trace the inference maturity journey from closed APIs to dedicated deployments and in-house platforms, discuss GPU lifecycles, and survey today's runtime landscape, including vLLM, SGLang, and TensorRT-LLM. Finally, we look ahead to agents and multimodality, making the case for specialized, workload-specific runtimes when performance and efficiency matter most.
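For listeners who want to see what two of those knobs look like in practice, here is a minimal sketch using vLLM's offline API (one of the open-source runtimes discussed in the episode). The checkpoint name is just an illustrative example, and exact parameter names can vary across vLLM releases:

```python
# A minimal sketch of two inference "knobs" from the episode, using vLLM.
from vllm import LLM, SamplingParams

# Quantization: "awq" loads 4-bit AWQ weights, trading a little accuracy
# for lower memory use and higher throughput.
# KV-cache reuse: enable_prefix_caching lets requests that share a common
# prompt prefix (e.g., a long system prompt) reuse cached attention states.
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",  # example checkpoint, swap in your own
    quantization="awq",
    enable_prefix_caching=True,
)

prompts = [
    "Explain KV cache reuse in one sentence.",
    "Explain speculative decoding in one sentence.",
]
params = SamplingParams(temperature=0.0, max_tokens=64)

# Continuous batching needs no configuration: the engine schedules these
# requests into shared GPU batches, adding and removing sequences as they finish.
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```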
🗒️ For the full list of resources for this episode, visit the show notes page: https://twimlai.com/go/766.
🔔 Subscribe to our channel for more great content just like this: https://youtube.com/twimlai?sub_confirmation=1
🗣️ CONNECT WITH US!
===============================
Subscribe to the TWIML AI Podcast: https://twimlai.com/podcast/twimlai/
Follow us on Twitter: https://twitter.com/twimlai
Follow us on LinkedIn: https://www.linkedin.com/company/twimlai/
Join our Slack Community: https://twimlai.com/community/
Subscribe to our newsletter: https://twimlai.com/newsletter/
Want to get in touch? Send us a message: https://twimlai.com/contact/
📖 CHAPTERS
===============================
00:00 - Introduction
03:40 - Why is inference the most important AI workload?
06:21 - Inference vs model serving
07:18 - Inference challenges
09:57 - The research-to-production timeline for inference
13:41 - Reasons to care about inference engineering
15:49 - Considerations in build vs buy decisions
22:08 - Product maturity cycle
27:14 - GPU lifecycles in inference maturity
32:14 - LLM-assisted inference
36:46 - Specialized inference optimization for agents and multimodal models
47:21 - Open-source runtimes: vLLM, SGLang, and TensorRT-LLM
49:50 - Specialized AI hardware
51:24 - Future trends and predictions
52:36 - Where to find the inference engineering book
🔗 LINKS & RESOURCES
===============================
Inference Engineering Book - https://www.baseten.co/inference-engineering/
Baseten - https://www.baseten.co/
📸 Camera: https://amzn.to/3TQ3zsg
🎙️ Microphone: https://amzn.to/3t5zXeV
🚦 Lights: https://amzn.to/3TQlX49
🎛️ Audio Interface: https://amzn.to/3TVFAIq
🎚️ Stream Deck: https://amzn.to/3zzm7F5