Here's a 100% legit tutorial for tuning llama.cpp MoE models. Very simple.
To start, pick your target context size and other fixed parameters (e.g. context size 64k, `-ctk q8_0`, etc.). Enable flash attention.
We'll focus on the two main tuning parameters: the number of MoE expert layers kept on the CPU (`--n-cpu-moe`, abbreviated `ncm` below) and the physical batch size `-ub`. Set the logical batch size `-b` to 4096 and leave it there.
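As a concrete starting point, a server invocation with these fixed settings might look like the sketch below. The model path and the `--n-cpu-moe` value are placeholders, and flag spellings (e.g. `-fa`) vary a bit between llama.cpp versions:

```shell
# Hypothetical baseline: 64k context, q8_0 K cache, flash attention on,
# logical batch pinned at 4096. -ub and --n-cpu-moe are the two knobs to tune.
llama-server -m model.gguf \
  -c 65536 -ctk q8_0 -fa \
  -ngl 99 -b 4096 \
  -ub 512 --n-cpu-moe 32
```
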
Now, here's simple initial tuning:
If targeting maximum tg128:
- Decrease `ncm` until you hit OOM (then back off one step), then increase `ub` in steps of 256 until you reach 4096 or OOM. This is most effective when `ncm = 0`, but much less so when `ncm` is high (see below).
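The `ub` sweep above can be sketched as a dry run that just prints the llama-bench invocations to try, one per step, at a fixed `ncm` found by the first pass. The model path and `ncm` value are hypothetical, and `--n-cpu-moe` support in llama-bench requires a recent build:

```shell
# Dry run: print one llama-bench command per -ub step (256..4096)
# at a fixed ncm value found by the "decrease until OOM" pass.
print_ub_sweep() {
  NCM=${1:-24}  # hypothetical ncm from the first pass
  for UB in $(seq 256 256 4096); do
    echo "llama-bench -m model.gguf -fa 1 -ctk q8_0 -b 4096 -ub $UB --n-cpu-moe $NCM -p 4096 -n 128 -d 8192"
  done
}
print_ub_sweep 24
```

Pipe the output to a file and run the lines one by one, stopping at the first OOM.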
Now, there's the option to do more comprehensive tuning. You can use the following table as a template and adjust it if necessary.
Here, MAX can refer either to your context size or to [your context size] - 4096. Choose either, but the former will leave a slight VRAM budget for other things.
| ub | ncm min | pp4096 @ d8192 | tg128 @ d8192 | pp4096 @ dMAX | tg128 @ dMAX |
|---|---|---|---|---|---|
| 512 | ... | ... | ... | ... | ... |
| 1024 | | | | | |
| ... | | | | | |
| 4096 (or OOM) | | | | | |
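llama-bench accepts comma-separated value lists, so one run per `ncm` candidate can cover every (ub, depth) cell of the table. This is a sketch: the model path is a placeholder, MAX is assumed to be 61440 (64k minus 4096), and the `-d` depth flag and `--n-cpu-moe` need a reasonably recent llama.cpp:

```shell
# One run per ncm candidate; sweeps all -ub values at both depths.
llama-bench -m model.gguf -fa 1 -ctk q8_0 -b 4096 \
  --n-cpu-moe 24 \
  -ub 512,1024,2048,4096 \
  -p 4096 -n 128 -d 8192,61440
```
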
Basically, you can run the d8192 tests first and see what suits you best (maximum tg or maximum pp). If you have low VRAM, chances are that just setting `ub` to the max won't change tg much, but pp performance can literally double. After that, you can test at depths from 0 to MAX in steps of 4k.
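Generating that depth list for llama-bench's comma-separated `-d` flag is a one-liner (assuming a MAX of 61440, i.e. 64k minus 4096):

```shell
# Comma-separated depths 0,4096,...,61440 in 4k steps.
DEPTHS=$(seq 0 4096 61440 | paste -sd, -)
echo "$DEPTHS"
```

You can then paste `$DEPTHS` straight into the `-d` argument of the sweep command.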
Note: be aware that low-context pp results such as pp512 aren't a substitute for pp4096 or higher.