DFlash on GTX 1060: Can Dense AI Models Cheat VRAM Like MoE?

DFlash speculative decoding on llama.cpp took Qwen3-8B from 12 tok/s to 40 tok/s on a GTX 1060 6GB, but only when the dense model fit fully on the GPU.

The last video proved that a 35B MoE model can run on a 6GB GPU using expert offloading. It was not perfect, but it worked: around 17 tok/s on old consumer hardware.

So the obvious question was:

Can we make it faster?

This video tests DFlash on the same GTX 1060 setup.

First, I tried it on the 35B MoE offload setup. It got slower.

Then I tried a 27B dense model. It still could not fit fully on the GPU, so layers had to be offloaded to the CPU again, and performance stayed poor.

Then I switched to Qwen3-8B, a dense model that fits fully on the GTX 1060 with room for the DFlash drafter.

That is where things changed.

Baseline: around 12 tok/s
With DFlash: around 40 tok/s (roughly 3.3x)
Peak: the best runs briefly touched around 60 tok/s

But the real lesson is the catch:

DFlash is not a VRAM cheat.

It is a throughput multiplier.

If your model is slow because layers are offloaded to CPU, DFlash cannot magically fix that. The target model still has to verify the draft tokens, and if verification is slow, the speedup collapses.
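
One way to see why offloading undermines the speedup is a toy cost model. This is only a sketch: it is not how llama.cpp or DFlash measures anything, and the function names, parameters, and every timing value are hypothetical. It assumes a fully GPU-resident target can verify a whole draft block in roughly the time of one normal decode step, while a CPU-offloaded target pays a verification cost that grows with the number of tokens checked.

```python
def spec_tokens_per_sec(t_decode, t_draft, verify_factor, tokens_per_step):
    """Toy throughput of speculative decoding, in tokens per second.

    t_decode:        seconds for the target to decode one token normally
    t_draft:         seconds for the drafter to propose one block
    verify_factor:   block verification time as a multiple of t_decode
    tokens_per_step: average tokens emitted per draft + verify step
    """
    step_time = t_draft + verify_factor * t_decode
    return tokens_per_step / step_time


def speedup(t_decode, t_draft, verify_factor, tokens_per_step):
    """Ratio of speculative throughput to plain decoding throughput."""
    baseline = 1.0 / t_decode
    spec = spec_tokens_per_sec(t_decode, t_draft, verify_factor, tokens_per_step)
    return spec / baseline


# Hypothetical values, not measurements. The GPU case is picked to land in
# the same ballpark as the 12 -> 40 tok/s result described above.
gpu_resident = speedup(t_decode=0.08, t_draft=0.01,
                       verify_factor=1.1, tokens_per_step=4.0)

# Hypothetical offloaded target: slow per-token decode, and verifying a
# block costs several decode steps because the CPU layers are the bottleneck.
offloaded = speedup(t_decode=0.5, t_draft=0.01,
                    verify_factor=6.0, tokens_per_step=4.0)

print(f"GPU-resident speedup: {gpu_resident:.2f}x")
print(f"Offloaded speedup:    {offloaded:.2f}x")
```

With these made-up inputs the GPU-resident case comes out well above 1x and the offloaded case falls below 1x, which is the same shape as the 35B MoE and 27B dense attempts: the drafter is fine, but the verify step eats the gain.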

The other surprise: the task matters.

On coding tasks, draft acceptance hit around 53%, which is what produced the jump to around 40 tok/s.

On creative writing, acceptance dropped to around 9%, and performance fell back near baseline.

Same model. Same drafter. Same GPU.

Different task, completely different result.
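
To get a feel for how strongly acceptance drives the result, here is a minimal sketch using the standard textbook estimate for sequential speculative decoding. It assumes each drafted token is accepted independently with the same probability and that a step stops at the first rejection. That is a simplification: DFlash drafts in blocks, and the acceptance percentage reported by the tool may be computed differently, so this is not expected to reproduce the measured tok/s. The draft length of 15 is a hypothetical value chosen for illustration.

```python
def expected_tokens_per_step(accept_prob, draft_len):
    """Expected tokens emitted per draft + verify step.

    Assumes independent per-token acceptance with probability accept_prob
    and a stop at the first rejected token. The target model always
    contributes one token of its own, so the result is never below 1.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be between 0 and 1")
    if accept_prob == 1.0:
        return float(draft_len + 1)
    # Geometric series: 1 + p + p^2 + ... + p^draft_len
    return (1.0 - accept_prob ** (draft_len + 1)) / (1.0 - accept_prob)


DRAFT_LEN = 15  # hypothetical draft length, not taken from the video

coding = expected_tokens_per_step(0.53, DRAFT_LEN)    # ~53% acceptance
creative = expected_tokens_per_step(0.09, DRAFT_LEN)  # ~9% acceptance

print(f"coding:   {coding:.2f} tokens per step")
print(f"creative: {creative:.2f} tokens per step")
```

Under this toy model, 9% acceptance yields barely more than one token per step, so almost all of the drafter's work is thrown away. That matches the observation that creative writing fell back near baseline.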

So can dense AI models cheat VRAM like MoE does?

Not really.

MoE offloading is a memory trick.
DFlash speculative decoding is a throughput trick.

And if the model fits fully on GPU, that throughput trick can be extremely powerful.

⏱ Timestamps
00:00 Cold open: 12 → 40 tok/s
00:36 The question from the 35B MoE video
01:26 The catch
01:47 Why most speculative-decoding demos do not apply here
02:26 Attempt 1: 35B MoE + DFlash
03:17 Attempt 2: 27B dense + DFlash
04:03 The realization
04:29 Qwen3-8B + DFlash
05:41 Subscribe + Discord
05:56 Why the small dense model worked
07:20 M1 Mac test
07:45 Task-asymmetry test
08:06 Why coding wins and creative writing loses
09:22 Final results
09:58 Hot take
10:22 Cheat sheet
10:28 Wrap

🔗 Links
• DFlash llama.cpp PR: https://github.com/ggml-org/llama.cpp/pull/22105
• DFlash paper: https://arxiv.org/abs/2602.06036
• Target model — Qwen3-8B Q4_K_M GGUF: https://huggingface.co/unsloth/Qwen3-8B-GGUF
• DFlash drafter for Qwen3-8B: https://huggingface.co/z-lab/Qwen3-8B-DFlash-b16
• Full build + run config: see the pinned comment
• Previous video — 35B MoE on 6GB GPU: https://youtu.be/8F_5pdcD3HY

💬 Discord
Someone in the Discord already pulled 80 tok/s on a 3080 Ti with a similar setup. Join here: discord.gg/XgBzczAWs

⚠️ Caveats
• DFlash PR is still in draft, so behavior may change before merge.
• Apple Silicon / M1 path is still WIP. I saw inconsistent results.
• CUDA is the safer path for now.
• Tested on Pascal / GTX 1060 6GB. Newer GPUs will likely perform better.
• DFlash helps decode speed, not prompt processing.
• This is not a replacement for VRAM. If the target model is heavily CPU-offloaded, the speedup can disappear.
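
A quick way to sanity-check the last caveat before trying DFlash is to add up what has to live in VRAM at the same time. The helper below is a hypothetical sketch, not part of llama.cpp, and all sizes in the example calls are placeholders. Substitute the real file sizes of your target GGUF and drafter, plus your own estimate for KV cache and runtime overhead.

```python
def fits_on_gpu(vram_gb, target_gb, drafter_gb, kv_cache_gb, reserve_gb=0.5):
    """Return (fits, free_gb) for a fully GPU-resident target plus drafter.

    reserve_gb is an assumed allowance for the CUDA context, scratch
    buffers, and anything else the driver keeps on the card.
    """
    needed_gb = target_gb + drafter_gb + kv_cache_gb + reserve_gb
    free_gb = vram_gb - needed_gb
    return free_gb >= 0, free_gb


# Placeholder sizes for a small quantized dense model on a 6 GB card.
small_fits, small_free = fits_on_gpu(vram_gb=6.0, target_gb=4.5,
                                     drafter_gb=0.5, kv_cache_gb=0.25)

# Placeholder sizes for a much larger dense model on the same card.
large_fits, large_free = fits_on_gpu(vram_gb=6.0, target_gb=16.0,
                                     drafter_gb=0.5, kv_cache_gb=0.25)

print(f"small model fits: {small_fits} (free {small_free:.2f} GB)")
print(f"large model fits: {large_fits} (free {large_free:.2f} GB)")
```

If the check fails, the target will need CPU offloading, and that is the situation where the speedup can disappear.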

Channel: Codacus — local AI experiments on real consumer hardware.

#localai #llamacpp #speculativeDecoding #dflash #qwen3 #qwen #localllm #gtx1060 #consumerGPU #AIonOldHardware

Video information

May 15, 2026, 4:32:25

00:11:30

Codacus
