Building a Real-Time Inference Stack on AMD Instinct GPUs
Speakers
Gaël Delalleau, Founder and CEO, Kog
Augustin Verneuil, GPU engineer, Kog
Talk Abstract: In this talk, we share our vision for real-time generative AI and the techniques we developed to achieve the fastest LLM inference on GPU to date, with a generation speed of 2500 tokens/s per request. We first showcase our end-to-end stack optimized for minimal latency on AMD hardware, spanning model re-architecting, a monokernel implementation, and topology-aware algorithms. In the second part, we focus on one of the defining challenges of megakernels: intra-GPU grid synchronization barriers and reduce/gather primitives. Using a chiplet-aware approach grounded in deep hardware insight, we decrease the overhead from 1.5 µs to 600 ns.
Find the resources you need to develop using AMD products: https://www.amd.com/en/developer.html
Join the Developer Community: https://devcommunity.amd.com/
Join the Developer Discord server: https://discord.gg/amd-dev
***
© 2026 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, EPYC, ROCm, and AMD Instinct and combinations thereof are trademarks of Advanced Micro Devices, Inc.
Video "Building a Real-Time Inference Stack on AMD Instinct GPUs" from the AMD Developer Central channel
Video information
Date: May 14, 2026, 21:53:03
Duration: 00:22:16