INT8 finally beats FP8 on consumer GPUs — Fused INT8 GEMM kernel

A fused INT8 GEMM kernel keeps W8A8 matrix multiplies in 8-bit on the GPU's tensor cores, so a quantized model finally runs as fast as INT8 promises.

The usual INT8 kernel quietly converts the weights back to 16-bit before it multiplies, so the fast INT8 tensor cores never switch on, and the model can end up slower than FP8. This fused Triton kernel performs the matmul as int8×int8 with int32 accumulation on the tensor cores and folds the dequantization into the epilogue, for a reported 2.8–4.2× speedup per GEMM with no measurable quality loss.
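To make the arithmetic concrete, here is a minimal NumPy sketch of what a W8A8 GEMM computes: int8 operands, int32 accumulation, and dequantization applied once at the end. It is a CPU reference for the math only, not the Triton kernel from the source. The symmetric quantization scheme, the per-row activation scales, the per-column weight scales, and the helper names `quantize_symmetric` and `w8a8_gemm_reference` are all assumptions made for illustration.

```python
import numpy as np

def quantize_symmetric(x, axis):
    """Symmetric int8 quantization with one scale per slice along `axis` (assumed scheme)."""
    max_abs = np.max(np.abs(x), axis=axis, keepdims=True)
    scale = (np.maximum(max_abs, 1e-8) / 127.0).astype(np.float32)
    q = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
    return q, scale

def w8a8_gemm_reference(a_q, a_scale, w_q, w_scale):
    """Hypothetical reference: int8 x int8 products accumulated in int32, then dequantized."""
    # Main loop of the GEMM: operands stay 8-bit, the accumulator is 32-bit.
    acc = a_q.astype(np.int32) @ w_q.astype(np.int32)
    # "Epilogue": one multiply by the row scale and the column scale per output element.
    return acc.astype(np.float32) * a_scale * w_scale

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 64)).astype(np.float32)   # activations, M x K
w = rng.standard_normal((64, 8)).astype(np.float32)   # weights, K x N

a_q, a_scale = quantize_symmetric(a, axis=1)  # one scale per activation row, shape (4, 1)
w_q, w_scale = quantize_symmetric(w, axis=0)  # one scale per weight column, shape (1, 8)

out = w8a8_gemm_reference(a_q, a_scale, w_q, w_scale)
ref = a @ w  # full-precision result for comparison
rel_err = np.linalg.norm(out - ref) / np.linalg.norm(ref)
print(f"relative error vs FP32: {rel_err:.4f}")
```

Because the scales factor out of the sum over K, the weights never need to be converted back to 16-bit before the multiply. A fused GPU kernel can apply the same two scale multiplies to the int32 accumulator just before writing the output tile.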

Full explainer (interactive): https://learnaivisually.com/g/fused-int8-gemm-tensor-cores
Source: https://arxiv.org/abs/2606.14598

Learn AI & GPUs visually — free interactive courses at learnaivisually.com

#INT8 #GPU #TensorCores #Quantization #LLM

AI LLM on-device

Video information

June 16, 2026, 2:40:55

00:01:03

Learn AI Visually
