MTP (Multi-Token Prediction): 2x Faster Token Generation on AMD Strix Halo & Radeon 9700 AI Pro
Multi-Token Prediction (MTP) is one of the most practical ways to speed up local token generation. In this video, I break down how the MTP architecture works, how it acts as a built-in replacement for traditional speculative decoding without needing a separate draft model, and why it performs best on highly structured output like code generation.
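To make the draft-and-verify idea concrete, here is a minimal toy sketch in Python. It is not how llama.cpp implements MTP: `target_next` and `draft_next` are hypothetical stand-ins (a deterministic rule instead of a real model, and a cheap predictor playing the role of the MTP heads), and verification is shown as a loop where a real engine would use one batched forward pass. It only illustrates why greedy verification leaves the output unchanged while cutting the number of main-model passes.

```python
from typing import List, Tuple

Token = int

def target_next(ctx: List[Token]) -> Token:
    """Hypothetical 'main model': a toy deterministic rule, not a real LLM."""
    return (ctx[-1] * 3 + len(ctx)) % 11

def draft_next(ctx: List[Token]) -> Token:
    """Hypothetical cheap predictor (stand-in for MTP heads).
    It deliberately guesses wrong whenever len(ctx) is a multiple of 4."""
    guess = target_next(ctx)
    return (guess + 1) % 11 if len(ctx) % 4 == 0 else guess

def greedy_decode(prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: one main-model pass per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out

def draft_and_verify(prompt: List[Token], n_new: int, k: int = 3) -> Tuple[List[Token], int]:
    """Draft k tokens cheaply, then keep only what the main model agrees with.
    Returns the tokens and the number of verification passes used."""
    out = list(prompt)
    goal = len(prompt) + n_new
    verify_passes = 0
    while len(out) < goal:
        # 1. Draft up to k tokens ahead with the cheap predictor.
        draft: List[Token] = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2. Verify (a single batched pass in a real engine).
        verify_passes += 1
        for guess in draft:
            expected = target_next(out)
            out.append(expected)  # the main model's token is always what is kept
            if guess != expected or len(out) >= goal:
                break  # first mismatch (or length limit) ends this round
        else:
            # Every draft was accepted: the same pass also yields one bonus token.
            out.append(target_next(out))
    return out, verify_passes

tokens, passes = draft_and_verify([1, 2], n_new=20, k=3)
print(tokens == greedy_decode([1, 2], 20), passes)
```

The more often the drafts are accepted, the fewer passes are needed, which is consistent with the point that highly structured output such as code benefits most.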
I walk through the recent integration of MTP into llama.cpp and show how to run it locally using Qwen 3.6. I also share benchmarks comparing performance on AMD Strix Halo and Radeon AI PRO R9700 GPUs.
Join the AMD AI Developer Program for free cloud credits, expert access, and premium AI training—everything you need to build, optimize, and scale on AMD.
Start building today - https://www.amd.com/en/developer/ai-dev-program.html
Timestamps:
00:00 | Introduction
01:03 | Prompt Processing / Token Generation
02:45 | Speculative Decoding
04:34 | Multi-Token Prediction (MTP)
06:44 | Where MTP Works Best
08:07 | Using MTP in llama.cpp
11:46 | Benchmarks
16:15 | Conclusion
Links & Resources:
Strix Halo Toolboxes & Tutorials: https://strix-halo-toolboxes.com
Buy Me a Coffee: https://buymeacoffee.com/dcapitella
llama.cpp MTP PR 22673: https://github.com/ggml-org/llama.cpp/pull/22673
MTP GGUFs (Qwen 3.6 27B): https://huggingface.co/ggml-org/Qwen3.6-27B-MTP-GGUF
MTP GGUFs (Qwen 3.6 35B-A3B): https://huggingface.co/ggml-org/Qwen3.6-35B-A3B-MTP-GGUF
Benchmark Script (mtp-bench.py): https://gist.github.com/am17an/228edfb84ed082aa88e3865d6fa27090
Reference Papers & Theory:
Speculative Decoding Paper (Leviathan, Kalman, Matias - Google, 2023): https://arxiv.org/abs/2211.17192
Multi-Token Prediction Paper (Gloeckle et al. - Meta, 2024): https://arxiv.org/abs/2404.19737
DeepSeek-V3 Technical Report: https://arxiv.org/abs/2412.19437
Sebastian Raschka MTP Architecture Gallery: https://sebastianraschka.com/llm-architecture-gallery/mtp/
Community Explanations:
Devsplainers - MTP Explanation: https://www.youtube.com/watch?v=aLq9DModnaw
Hardware Used:
- Platform 1: AMD Strix Halo Framework (Unified Memory)
- Platform 2: 2x AMD Radeon AI PRO R9700 32GB (Discrete PCIe Setup)
Video "MTP (Multi-Token Prediction): 2x Faster Token Generation on AMD Strix Halo & Radeon 9700 AI Pro" from the channel Donato Capitella
Video information
May 19, 2026, 1:56:51
00:17:50