Does llama.cpp's speculative actually work?

Question | Help

I've played around with their speculative example for a while now, with both Code Llama and Phi, and noticed at best a near-negligible speed boost; often speculative ended up being slower than the model alone. Anyone have similar experiences?

u/vasileer

Speculative decoding worked well for me when the two models had about a 10x size difference (a 7B draft with a 70B target).
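
For reference, a pairing like that would be invoked along these lines with llama.cpp's speculative example (the filenames here are just placeholders; the flags are the same ones used in the runs further down this thread):

speculative -m llama-2-70b.Q4_K_M.gguf -md llama-2-7b.Q4_K_M.gguf -p "Write a quicksort function in Python." -n 256 --temp 0 -ngld 99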

u/Particular-Guard774

Interesting. How much of a speed boost did you see with 7B and 70B?

u/a_beautiful_rhind

Speculative has never been good for me. Plus, what I'm short on is VRAM, so running a second model seems funny.

u/a_slay_nub

It's been a while, but I noticed the same thing with exllama. Maybe there are other settings that need to be tuned to actually get a speed increase?

u/Particular-Guard774

Oh, that's not a bad idea. I think I've already tried switching up the temperature, the number of drafted tokens, and the token count, but none of them had a significant effect. Any idea what else I could try changing?

u/Calcidiol

I GUESS try looking at the llama.cpp GitHub issues and discussions. Usually someone does benchmarking or use-case testing for proposed pull requests/features, so if there are identified use cases where it should be better in some way, someone should have commented on them, tested them, and benchmarked for regressions/improvements.

u/Particular-Guard774

Smart, I'll take a look

u/Adventurous_Doubt_70

If your system's bottleneck is not memory bandwidth but arithmetic throughput (e.g. you are inferring on a slow CPU paired with high-bandwidth DDR5 RAM), then speculative decoding won't help.

u/Particular-Guard774

I'm using an M3 Max MacBook Pro. Do you think the CPU is too slow on it? Also, if that's the problem, what models could I use to get it to work?

u/Adventurous_Doubt_70

In this particular case, it's unlikely that your CPU is too slow. My suggestion would be to use a smaller model for guidance. A 1:10 parameter ratio is empirically a good starting point.

In essence, speculative decoding takes advantage of an arithmetic-memory imbalance. If your CPU speed is on par with your memory bandwidth, you will gain no speedup. Apple SoCs have much higher memory bandwidth compared to an average consumer-grade DDR5 x86 machine. It's hard to tell which side is the bottleneck.
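
A rough way to check is the bandwidth ceiling: for a memory-bound run, decode speed is roughly memory bandwidth divided by the size of the quantized weights. Taking ballpark figures (an M3 Max is specced at roughly 300-400 GB/s depending on the configuration, and a Q4_K_M 33B GGUF is around 20 GB), the big model would top out somewhere around 15-20 t/s on its own, while a ~1 GB 1.3B draft could in principle decode an order of magnitude faster. How close your measured single-model t/s gets to that ceiling gives a hint about which side you're on.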

u/Particular-Guard774

Okay, I kind of get what you're saying. I tried DeepSeek Coder 1.3B and 33B to no avail, so your assumption is probably right on the dot. Do you have any recommendations on resources/videos I can check out to get a better idea of how and when this works?

u/Adventurous_Doubt_70

I personally read the original paper to understand how and when it works. If you have some ML/DL background, I would recommend reading it yourself.
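
If I remember it correctly, the key result is compact: if the target accepts each drafted token with probability a (assumed independent) and you draft g tokens per step, the expected number of tokens produced per target forward pass is (1 - a^(g+1)) / (1 - a). For example, a = 0.8 and g = 5 gives roughly 3.7 tokens per pass, which is the ceiling on the speedup before subtracting the cost of running the draft model.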

u/CKtalon

Despite sharing the same tokenizer, Phi was trained in quite a different way, so its output behavior might be very different from Llama's.

u/koflerdavid

The assistant model should be much, much smaller than the main model. Using a 1B draft for a 7B target is already too wasteful, since the assistant model will be wrong a lot of the time. It might be a good idea to distill one from the main model.
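
Rough numbers to illustrate, assuming decode is memory-bound (per-token cost scales with model size) and a batched verification pass costs about as much as one regular target token: drafting 5 tokens with a model 1/7 the size of the target already burns about 0.7 of a target pass before verification even starts, whereas at a 1:25 ratio (1.3B vs 33B) it is only about 0.2. The thinner that overhead, the more of the acceptance gain you actually keep.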

u/Chromix_

Speculative decoding works fine when using suitable and compatible models. It increases the tokens/s I get by about 3x. I run the small draft model on the GPU and the big main model on the CPU (due to lack of VRAM).

speculative -m ..\models\deepseek-coder-33b-instruct_Q8_0.gguf -md ..\models\deepseek-coder-1.3b-instruct_Q8_0.gguf --temp 0 -f prompt-deepseek-coder-instruct.txt -ngld 55 -n 512

This runs with these stats:

encoded 144 tokens in 3.761 seconds, speed: 38.292 t/s
decoded 224 tokens in 43.265 seconds, speed: 5.177 t/s

n_draft = 5
n_predict = 224
n_drafted = 220
n_accept = 179
accept = 81.364%

When I run the big model on its own with the maximum offload that my GPU supports: main -m ..\models\deepseek-coder-33b-instruct_Q8_0.gguf --temp 0 -f prompt-deepseek-coder-instruct.txt -ngl 10 -c 1024

I get this: eval time = 138174.00 ms / 227 runs ( 608.70 ms per token, 1.64 tokens per second)

So, speculative decoding will result in a huge speed boost when used as intended. The draft model should be a high-quality quant like Q8 or Q6_K with imatrix. Going lower hurts the acceptance rate and thus the speed.
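
For what it's worth, these numbers are roughly self-consistent: plugging accept ≈ 81% and n_draft = 5 into the acceptance formula from the speculative decoding paper gives an expected yield of about (1 - 0.81^6) / (1 - 0.81) ≈ 3.8 tokens per target pass (under its independence assumption), so the measured ~3x over the 1.64 t/s baseline is close to what that acceptance rate allows once the draft overhead is subtracted.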

u/Particular-Guard774

Interesting. I'm working with Code Llama 7B and 34B right now, so the size difference could be the issue. I tried downloading a higher-quality quant for the 7B, but it didn't seem to make a difference. Is it important to figure out whether to run a model on the GPU or CPU, and if so, how do I find the threshold for my own machine?

u/Chromix_

If you can fully offload the draft model to your GPU then you'll definitely get better speed. If I don't offload at all, my 38 t/s drops to 24 t/s, which increases the processing time by about 7 seconds.

There are two numbers you need to look at: the accept percentage and the encoding speed. If the first one is too low, you need a better draft model (or quant); if the second one is too low, you need a smaller draft model (or to offload it to the GPU).

u/Particular-Guard774

Okay, I see. I tried running DeepSeek Coder 33B with the 1.3B as the draft, offloaded it, and set the temperature to 0. I did see some speed improvement over previous speculative runs, but even with a pretty good encoding speed and accept rate it's still slower than main. Any idea what else I could try?

Alone:

./main -m deepseek-coder-33b-instruct.Q4_K_M.gguf -p "Q: What are the planets in the solar system? A:" -r "Q:"

llama_print_timings:        load time =     732.41 ms
llama_print_timings:      sample time =       0.84 ms /    62 runs   (    0.01 ms per token, 74162.68 tokens per second)
llama_print_timings: prompt eval time =     365.15 ms /    14 tokens (   26.08 ms per token,    38.34 tokens per second)
llama_print_timings:        eval time =    5033.40 ms /    61 runs   (   82.51 ms per token,    12.12 tokens per second)
llama_print_timings:       total time =    5470.43 ms /    75 tokens

With speculative:

./speculative -m deepseek-coder-33b-instruct.Q4_K_M.gguf -md deepseek-coder-1.3b-instruct.Q6_K.gguf -p "Q: What are the planets of the solar system? A: " -n 200 -ngld 200 --temp 0

encoded   15 tokens in    0.477 seconds, speed:   31.430 t/s
decoded  201 tokens in   17.655 seconds, speed:   11.385 t/s

n_draft   = 5
n_predict = 201
n_drafted = 205
n_accept  = 159
accept    = 77.561%

draft:

llama_print_timings:        load time =      76.72 ms
llama_print_timings:      sample time =     297.39 ms /     1 runs   (  297.39 ms per token,     3.36 tokens per second)
llama_print_timings: prompt eval time =   15553.61 ms /    96 tokens (  162.02 ms per token,     6.17 tokens per second)
llama_print_timings:        eval time =    1423.79 ms /   164 runs   (    8.68 ms per token,   115.19 tokens per second)
llama_print_timings:       total time =   18132.45 ms /   260 tokens

target:

llama_print_timings:        load time =     756.08 ms
llama_print_timings:      sample time =       3.61 ms /   201 runs   (    0.02 ms per token, 55678.67 tokens per second)
llama_print_timings: prompt eval time =   15682.18 ms /   261 tokens (   60.08 ms per token,    16.64 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   18245.41 ms /   262 tokens

ggml_metal_free: deallocating
ggml_metal_free: deallocating

u/Chromix_

Ah, you're using an Apple laptop. Regular RAM is pretty fast on that one. GPU RAM might be shared though, which could drag down the speed. I'm not sure. Maybe someone else with a recent Apple device can share their stats.

Maybe you'll see an improvement when using the Q8 version of the 33B model instead (running at the same speed as the Q4 that you've used for the test).
