Here's a 100% legit tutorial for tuning llama.cpp MoE models. Very simple.
To start, pick your target context size and other fixed parameters (e.g. context size 64k, `-ctk q8_0`, etc.). Enable flash attention.
We'll focus on the two main tuning parameters: the number of MoE expert layers kept on the CPU (`--n-cpu-moe`, abbreviated `ncm` below) and the physical batch size `-ub`. Set the logical batch size `-b` to 4096 and leave it there.
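As a concrete starting point, a server invocation with these fixed settings might look like the sketch below. The model path and the `--n-cpu-moe` value are placeholders, and flag spellings (e.g. `-fa`) vary a bit between llama.cpp versions:

```shell
# Hypothetical baseline: 64k context, q8_0 K cache, flash attention on,
# logical batch pinned at 4096. -ub and --n-cpu-moe are the two knobs to tune.
llama-server -m model.gguf \
  -c 65536 -ctk q8_0 -fa \
  -ngl 99 -b 4096 \
  -ub 512 --n-cpu-moe 32
```
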
Now, here's simple initial tuning:
If targeting maximum tg128:
- Decrease `ncm` until you hit OOM (then back off one step), then increase `ub` in steps of 256 until you reach 4096 or OOM. This is most effective when `ncm = 0`, but much less so when `ncm` is high (see below).
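The `ub` sweep above can be sketched as a dry run that just prints the llama-bench invocations to try, one per step, at a fixed `ncm` found by the first pass. The model path and `ncm` value are hypothetical, and `--n-cpu-moe` support in llama-bench requires a recent build:

```shell
# Dry run: print one llama-bench command per -ub step (256..4096)
# at a fixed ncm value found by the "decrease until OOM" pass.
print_ub_sweep() {
  NCM=${1:-24}  # hypothetical ncm from the first pass
  for UB in $(seq 256 256 4096); do
    echo "llama-bench -m model.gguf -fa 1 -ctk q8_0 -b 4096 -ub $UB --n-cpu-moe $NCM -p 4096 -n 128 -d 8192"
  done
}
print_ub_sweep 24
```

Pipe the output to a file and run the lines one by one, stopping at the first OOM.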
Now, there's the option to do more comprehensive tuning. You can use the following table as a template and adjust it if necessary.
Here, MAX can refer either to your context size or to [your context size] - 4096. Choose either, but the former will leave a slight VRAM budget for other things.
| ub | ncm min | pp4096 @ d8192 | tg128 @ d8192 | pp4096 @ dMAX | tg128 @ dMAX |
|---|---|---|---|---|---|
| 512 | ... | ... | ... | ... | ... |
| 1024 | | | | | |
| ... | | | | | |
| 4096 (or OOM) | | | | | |
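llama-bench accepts comma-separated value lists, so one run per `ncm` candidate can cover every (ub, depth) cell of the table. This is a sketch: the model path is a placeholder, MAX is assumed to be 61440 (64k minus 4096), and the `-d` depth flag and `--n-cpu-moe` need a reasonably recent llama.cpp:

```shell
# One run per ncm candidate; sweeps all -ub values at both depths.
llama-bench -m model.gguf -fa 1 -ctk q8_0 -b 4096 \
  --n-cpu-moe 24 \
  -ub 512,1024,2048,4096 \
  -p 4096 -n 128 -d 8192,61440
```
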
Basically, you can run the d8192 tests first and see what suits you best (maximum tg or maximum pp). If you have low VRAM, chances are that just setting `ub` to the max won't change tg much, but pp performance can literally double. After that, you can test at depths from 0 to MAX in steps of 4k.
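Generating that depth list for llama-bench's comma-separated `-d` flag is a one-liner (assuming a MAX of 61440, i.e. 64k minus 4096):

```shell
# Comma-separated depths 0,4096,...,61440 in 4k steps.
DEPTHS=$(seq 0 4096 61440 | paste -sd, -)
echo "$DEPTHS"
```

You can then paste `$DEPTHS` straight into the `-d` argument of the sweep command.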
Note: be aware that low-context pp results such as pp512 aren't a substitute for pp4096 or higher.