Why LLM Inference Is Memory-Bound, Not Compute-Bound

The limiting factor in LLM inference isn't compute. It's how fast you can move weights from DRAM to the chip.
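
A rough back-of-the-envelope sketch of why this holds during decoding, assuming every weight must be read from memory once per generated token. The model size, precision, and bandwidth figures below are hypothetical illustrations, not numbers from the interview, and the estimate ignores KV-cache and activation traffic.

```python
def decode_tokens_per_second_upper_bound(num_params, bytes_per_param, mem_bandwidth_bytes_per_s):
    """Upper bound on single-stream decode speed if weight reads are the only cost.

    Each generated token requires streaming all weights from memory once,
    so throughput cannot exceed bandwidth / total weight bytes.
    """
    weight_bytes = num_params * bytes_per_param
    return mem_bandwidth_bytes_per_s / weight_bytes


def decode_arithmetic_intensity(batch_size, bytes_per_param):
    """Approximate FLOPs performed per byte of weights read in one decode step.

    Assumes about 2 FLOPs (one multiply, one add) per parameter per token,
    with the weights read once and shared across the whole batch.
    """
    return 2 * batch_size / bytes_per_param


# Hypothetical example: 7e9 parameters, 2 bytes each (16-bit weights),
# on a device with 1e12 bytes/s of memory bandwidth.
bound = decode_tokens_per_second_upper_bound(7e9, 2, 1e12)
intensity = decode_arithmetic_intensity(batch_size=1, bytes_per_param=2)
print(f"decode upper bound: {bound:.1f} tokens/s")
print(f"arithmetic intensity at batch size 1: {intensity:.1f} FLOPs/byte")
```

Under these assumptions the chip performs about one floating-point operation per byte it loads, which is far below what modern accelerators can sustain per byte of bandwidth, so the arithmetic units sit idle waiting on memory. Batching more requests raises the intensity because the same weights are reused across the batch.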

In this interview, CTO Mathias Lechner speaks with Piotr Mazurek from Liquid AI's inference team about what's actually happening when an LLM handles a request: the prefill/decode distinction, multi-GPU parallelism strategies, and how to choose between inference frameworks like vLLM, SGLang, and TensorRT-LLM depending on latency and throughput requirements.

Liquid AI builds foundation models designed for efficiency and performance across a range of deployment contexts. This series features Mathias in conversation with researchers and engineers across the company.

Subscribe to follow every episode: https://www.youtube.com/@liquid-ai-inc

Careers at Liquid AI: https://www.liquid.ai/careers

Video "Why LLM Inference Is Memory-Bound, Not Compute-Bound" from the Liquid AI channel