
INT8 finally beats FP8 on consumer GPUs — Fused INT8 GEMM kernel

A fused INT8 GEMM kernel keeps W8A8 matrix multiplies in 8-bit on the GPU's tensor cores, so a quantized model finally runs as fast as INT8 promises.

The usual INT8 kernel quietly converts the weights back to 16-bit before it multiplies, so the fast INT8 tensor cores never switch on, and the model can end up slower than FP8. This fused Triton kernel does the matmul as int8×int8 into int32 on the tensor cores and folds the dequantization into the epilogue, running each GEMM 2.8–4.2× faster with no measurable quality loss.
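The arithmetic behind "int8×int8 into int32, dequantize in the epilogue" can be sketched on the CPU with NumPy. This is not the Triton kernel from the source and says nothing about its speed; it only illustrates the order of operations. The function names are hypothetical, and symmetric per-row activation scales with per-column weight scales are an assumption, since the description does not say which quantization scheme the kernel uses.

```python
import numpy as np

def quantize_rows(x):
    """Symmetric per-row INT8 quantization (assumed scheme): int8 values plus float32 scales."""
    max_abs = np.abs(x).max(axis=1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 127.0, 1.0).astype(np.float32)
    q = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
    return q, scale

def w8a8_fused(a_q, a_scale, w_q, w_scale):
    """Multiply the 8-bit integers, accumulate in int32, dequantize once at the end."""
    acc = a_q.astype(np.int32) @ w_q.astype(np.int32)   # integer accumulation
    return acc.astype(np.float32) * a_scale * w_scale    # epilogue: row scale x column scale

def w8a8_dequant_first(a_q, a_scale, w_q, w_scale):
    """The slower pattern described above: widen to floating point before multiplying."""
    a = a_q.astype(np.float32) * a_scale
    w = w_q.astype(np.float32) * w_scale
    return a @ w

# Hypothetical shapes: activations (M, K), weights (K, N).
rng = np.random.default_rng(0)
activations = rng.standard_normal((16, 64)).astype(np.float32)
weights = rng.standard_normal((64, 32)).astype(np.float32)

a_q, a_scale = quantize_rows(activations)      # scales have shape (M, 1)
w_q_t, w_scale_t = quantize_rows(weights.T)    # quantize each weight column
w_q, w_scale = w_q_t.T, w_scale_t.T            # back to (K, N), scales (1, N)

fused = w8a8_fused(a_q, a_scale, w_q, w_scale)
reference = w8a8_dequant_first(a_q, a_scale, w_q, w_scale)
```

Both paths produce the same result up to floating-point rounding, because the scales factor out of the sum over K. That is why the dequantization can be moved after the integer matmul without changing the output.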

Full explainer (interactive): https://learnaivisually.com/g/fused-int8-gemm-tensor-cores
Source: https://arxiv.org/abs/2606.14598

Learn AI & GPUs visually — free interactive courses at learnaivisually.com

#INT8 #GPU #TensorCores #Quantization #LLM

Video from the Learn AI Visually channel.