Building a Real-Time Inference Stack on AMD Instinct GPUs

Speakers
Gaël Delalleau, Founder and CEO, Kog
Augustin Verneuil, GPU engineer, Kog

Talk Abstract: In this talk, we share our vision for real-time generative AI and the techniques we developed to achieve the fastest LLM inference on GPU ever, with a generation speed of 2500 tokens/s per request. In the first part, we showcase our end-to-end stack optimized for minimal latency on AMD hardware, spanning model re-architecting, a monokernel implementation, and topology-aware algorithms. In the second part, we focus on one of the defining challenges of megakernels: intra-GPU grid synchronization barriers and reduce/gather primitives. Using a chiplet-aware approach grounded in deep hardware insight, we decrease the overhead from 1.5 µs to 600 ns.

Find the resources you need to develop using AMD products: https://www.amd.com/en/developer.html

Join the Developer Community: https://devcommunity.amd.com/

Join the Developer Discord server: https://discord.gg/amd-dev

***

© 2026 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, EPYC, ROCm, and AMD Instinct and combinations thereof are trademarks of Advanced Micro Devices, Inc.

Video: Building a Real-Time Inference Stack on AMD Instinct GPUs, from the AMD Developer Central channel