Does llama.cpp's speculative actually work?

Question | Help

I've played around with their speculative example for a while now, with both Code Llama and Phi, and noticed at best a near-negligible speed boost; often speculative ended up being slower than the model alone. Anyone have similar experiences?

u/vasileer

Speculative decoding worked well for me when the two models had about a 10x size difference (a 7B draft with a 70B target).
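
For reference, a pairing like that would be invoked along these lines with llama.cpp's speculative example (the filenames here are just placeholders; the flags are the same ones used in the runs further down this thread):

speculative -m llama-2-70b.Q4_K_M.gguf -md llama-2-7b.Q4_K_M.gguf -p "Write a quicksort function in Python." -n 256 --temp 0 -ngld 99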

u/Particular-Guard774

Interesting. How much of a speed boost did you see with 7B and 70B?

u/a_beautiful_rhind

Speculative has never been good for me. Plus, what I'm short on is VRAM, so running a second model seems funny.

u/a_slay_nub

It's been a while, but I noticed the same thing with exllama. Maybe there are other settings that need to be tuned to actually get a speed increase?

u/Particular-Guard774

Oh, that's not a bad idea. I think I've already tried switching up the temperature, the number of drafted tokens, and the token count, but none of them had a significant effect. Any idea what else I could try changing?

u/Calcidiol

I GUESS try looking at the llama.cpp GitHub issues and discussions. Usually someone does benchmarking or use-case testing for proposed pull requests/features, so if there are identified use cases where it should be better in some way, someone should have commented on them, tested them, and benchmarked for regressions/improvements.

u/Particular-Guard774

Smart, I'll take a look

u/Adventurous_Doubt_70

If your system's bottleneck is not memory bandwidth but arithmetic throughput (e.g. you are inferring on a slow CPU paired with high-bandwidth DDR5 RAM), then speculative decoding won't help.

u/Particular-Guard774

I'm using an M3 Max MacBook Pro. Do you think the CPU is too slow on it? Also, if that's the problem, what models could I use to get it to work?

u/Adventurous_Doubt_70

In this particular case, it's unlikely that your CPU is too slow. My suggestion would be to use a smaller model for guidance. A 1:10 parameter ratio is empirically a good starting point.

In essence, speculative decoding takes advantage of an arithmetic-memory imbalance. If your CPU speed is on par with your memory bandwidth, you will gain no speedup. Apple SoCs have much higher memory bandwidth compared to an average consumer-grade DDR5 x86 machine. It's hard to tell which side is the bottleneck.
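
A rough way to check is the bandwidth ceiling: for a memory-bound run, decode speed is roughly memory bandwidth divided by the size of the quantized weights. Taking ballpark figures (an M3 Max is specced at roughly 300-400 GB/s depending on the configuration, and a Q4_K_M 33B GGUF is around 20 GB), the big model would top out somewhere around 15-20 t/s on its own, while a ~1 GB 1.3B draft could in principle decode an order of magnitude faster. How close your measured single-model t/s gets to that ceiling gives a hint about which side you're on.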

u/Particular-Guard774

Okay, I kind of get what you're saying. I tried DeepSeek Coder 1.3B and 33B to no avail, so your assumption is probably right on the dot. Do you have any recommendations on resources/videos I can check out to get a better idea of how and when this works?

u/Adventurous_Doubt_70

I personally read the original paper to understand how and when it works. If you have some ML/DL background, I would recommend reading it yourself.
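
If I remember it correctly, the key result is compact: if the target accepts each drafted token with probability a (assumed independent) and you draft g tokens per step, the expected number of tokens produced per target forward pass is (1 - a^(g+1)) / (1 - a). For example, a = 0.8 and g = 5 gives roughly 3.7 tokens per pass, which is the ceiling on the speedup before subtracting the cost of running the draft model.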

u/CKtalon

Despite sharing the same tokenizer, Phi was trained in quite a different way, so its output behavior might be very different from Llama's.

u/koflerdavid

The assistant model should be much, much smaller than the main model. Using a 1B draft for a 7B target is already too wasteful, since the assistant model will be wrong a lot of the time. It might be a good idea to distill one from the main model.
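
Rough numbers to illustrate, assuming decode is memory-bound (per-token cost scales with model size) and a batched verification pass costs about as much as one regular target token: drafting 5 tokens with a model 1/7 the size of the target already burns about 0.7 of a target pass before verification even starts, whereas at a 1:25 ratio (1.3B vs 33B) it is only about 0.2. The thinner that overhead, the more of the acceptance gain you actually keep.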

u/Chromix_

Speculative decoding works fine when using suitable and compatible models. It increases the tokens/s I get by about 3x. I run the small draft model on the GPU and the big main model on the CPU (due to lack of VRAM).

speculative -m ..\models\deepseek-coder-33b-instruct_Q8_0.gguf -md ..\models\deepseek-coder-1.3b-instruct_Q8_0.gguf --temp 0 -f prompt-deepseek-coder-instruct.txt -ngld 55 -n 512

This runs with these stats:

encoded 144 tokens in 3.761 seconds, speed: 38.292 t/s
decoded 224 tokens in 43.265 seconds, speed: 5.177 t/s

n_draft = 5
n_predict = 224
n_drafted = 220
n_accept = 179
accept = 81.364%

When I run the big model on its own with the maximum offload that my GPU supports: main -m ..\models\deepseek-coder-33b-instruct_Q8_0.gguf --temp 0 -f prompt-deepseek-coder-instruct.txt -ngl 10 -c 1024

I get this: eval time = 138174.00 ms / 227 runs ( 608.70 ms per token, 1.64 tokens per second)

So, speculative decoding will result in a huge speed boost when used as intended. The draft model should be a high-quality quant like Q8 or Q6_K with imatrix. Going lower hurts the acceptance rate and thus the speed.
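
For what it's worth, these numbers are roughly self-consistent: plugging accept ≈ 81% and n_draft = 5 into the acceptance formula from the speculative decoding paper gives an expected yield of about (1 - 0.81^6) / (1 - 0.81) ≈ 3.8 tokens per target pass (under its independence assumption), so the measured ~3x over the 1.64 t/s baseline is close to what that acceptance rate allows once the draft overhead is subtracted.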

u/Particular-Guard774

Interesting. I'm working with Code Llama 7B and 34B right now, so the size difference could be the issue. I tried downloading a higher-quality quant for the 7B, but it didn't seem to make a difference. Is it important to figure out whether to run a model on the GPU or CPU, and if so, how do I find the threshold for my own machine?

u/Chromix_

If you can fully offload the draft model to your GPU then you'll definitely get better speed. If I don't offload at all, my 38 t/s drops to 24 t/s, which increases the processing time by about 7 seconds.

There are two numbers you need to look at: the accept percentage and the encoding speed. If the first one is too low, you need a better draft model (or quant); if the second one is too low, you need a smaller draft model (or to offload it to the GPU).

u/Particular-Guard774

Okay, I see. I tried running DeepSeek Coder 33B with the 1.3B as the draft, offloaded it, and set the temperature to 0. I did see some speed improvement over previous speculative runs, but even with a pretty good encoding speed and accept rate it's still slower than main. Any idea what else I could try?

Alone:

./main -m deepseek-coder-33b-instruct.Q4_K_M.gguf -p "Q: What are the planets in the solar system? A:" -r "Q:"

llama_print_timings:        load time =     732.41 ms
llama_print_timings:      sample time =       0.84 ms /    62 runs   (    0.01 ms per token, 74162.68 tokens per second)
llama_print_timings: prompt eval time =     365.15 ms /    14 tokens (   26.08 ms per token,    38.34 tokens per second)
llama_print_timings:        eval time =    5033.40 ms /    61 runs   (   82.51 ms per token,    12.12 tokens per second)
llama_print_timings:       total time =    5470.43 ms /    75 tokens

With speculative:

./speculative -m deepseek-coder-33b-instruct.Q4_K_M.gguf -md deepseek-coder-1.3b-instruct.Q6_K.gguf -p "Q: What are the planets of the solar system? A: " -n 200 -ngld 200 --temp 0

encoded   15 tokens in    0.477 seconds, speed:   31.430 t/s
decoded  201 tokens in   17.655 seconds, speed:   11.385 t/s

n_draft   = 5
n_predict = 201
n_drafted = 205
n_accept  = 159
accept    = 77.561%

draft:

llama_print_timings:        load time =      76.72 ms
llama_print_timings:      sample time =     297.39 ms /     1 runs   (  297.39 ms per token,     3.36 tokens per second)
llama_print_timings: prompt eval time =   15553.61 ms /    96 tokens (  162.02 ms per token,     6.17 tokens per second)
llama_print_timings:        eval time =    1423.79 ms /   164 runs   (    8.68 ms per token,   115.19 tokens per second)
llama_print_timings:       total time =   18132.45 ms /   260 tokens

target:

llama_print_timings:        load time =     756.08 ms
llama_print_timings:      sample time =       3.61 ms /   201 runs   (    0.02 ms per token, 55678.67 tokens per second)
llama_print_timings: prompt eval time =   15682.18 ms /   261 tokens (   60.08 ms per token,    16.64 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   18245.41 ms /   262 tokens

ggml_metal_free: deallocating
ggml_metal_free: deallocating

u/Chromix_

Ah, you're using an Apple laptop. Regular RAM is pretty fast on that one. GPU RAM might be shared though, which could drag down the speed. I'm not sure. Maybe someone else with a recent Apple device can share their stats.

Maybe you'll see an improvement when using the Q8 version of the 33B model instead (running at the same speed as the Q4 that you've used for the test).
