vulkan: Optimize mul_mat_vec p021 and nc shaders #12505
Conversation
These shaders are used in attention calculations, and when the KV cache grows large they start to dominate the run time. For the nc shader (which is called with a large 'k' dimension), use unrolling and vector loads. For the p021 shader (which is called with a large 'm' and small 'k' dimensions), take advantage of grouped query attention to reuse loads from the A matrix for the whole group, and reduce the number of workgroups (there is too much overhead from tiny dispatches). Using subgroupAdd in the p021 shader also helps, so use that conditionally.
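To illustrate the grouped-query-attention reuse described above: with GQA, several query heads share one KV head, so each value loaded from the A matrix can feed the dot products of the whole group instead of being re-fetched per head. This is a hypothetical C sketch of the idea, not the actual GLSL shader; `matvec_gqa_reuse` and its layout assumptions are invented for illustration.

```c
#include <stddef.h>
#include <assert.h>

/* Sketch only: one mat-vec per head, where gqa_ratio heads share the
 * same A data. Each element of A is loaded once and reused for every
 * head in the group, instead of once per head. */
void matvec_gqa_reuse(const float *A, const float *B, float *out,
                      size_t k, size_t gqa_ratio) {
    float acc[8] = {0};                 /* assume gqa_ratio <= 8 for the sketch */
    for (size_t i = 0; i < k; ++i) {
        float a = A[i];                 /* one load from A ... */
        for (size_t g = 0; g < gqa_ratio; ++g)
            acc[g] += a * B[g * k + i]; /* ... reused by the whole group */
    }
    for (size_t g = 0; g < gqa_ratio; ++g)
        out[g] = acc[g];
}
```

The real shader applies the same trick per invocation across a workgroup, which is also what allows it to dispatch fewer, larger workgroups.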
Wow, great work. That's a good improvement across all my tests.
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | matrix cores: NV_coopmat2
ggml_vulkan: 0 = AMD Radeon (TM) Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | matrix cores: none
ggml_vulkan: 0 = Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 65536 | matrix cores: none
The performance bump is very nice! However, some tests are now failing for me.
Note that the outputs still seem perfectly fine and coherent, so this isn't really a problem.
Oh no, not the proprietary driver again...
The tests were finally all passing on the proprietary driver with the commit just before this one 😢
Can you try disabling subgroup_add? Maybe that is what the driver doesn't like (even though it claims it supports it). You can set this to false.
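For context on what disabling subgroup_add changes: subgroupAdd sums one value from every invocation in a subgroup in a single operation, and without it a shader typically falls back to a log2-step tree reduction over shared memory. This is an illustrative C sketch of that fallback pattern on a plain array, not code from this PR.

```c
#include <assert.h>

/* Tree reduction: each step halves the number of active "lanes",
 * mirroring a shared-memory reduction a shader would use when
 * subgroupAdd is unavailable or disabled. n must be a power of two. */
float tree_reduce(float *vals, unsigned n) {
    for (unsigned stride = n / 2; stride > 0; stride /= 2)
        for (unsigned i = 0; i < stride; ++i)
            vals[i] += vals[i + stride];
    return vals[0];
}
```

Toggling between the two paths is a cheap way to rule the subgroup operation in or out as the source of a driver bug.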
Is the wave size of 32 expected? Maybe that's not interacting well with the REQUIRE_FULL_SUBGROUPS bit, and then that could somehow be breaking the subgroupAdd?
Tests are still failing when subgroupAdd is disabled.
I've also got f16 tests failing on my RX 470, but I've only got two of them and I'm on RADV. Welp, there goes my weekend!
This also fails with subgroupAdd turned off, and this card has worked fine with subgroup operations in the past.
Oftentimes I've seen that the outputs still look fine when the tests fail with a small NMSE, but if you run a perplexity test the results are noticeably worse.
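For reference, the error metric mentioned here, normalized mean squared error, compares the GPU output against a reference as sum of squared differences divided by the squared magnitude of the reference. A minimal C sketch (the exact formula and thresholds used by the test suite are assumed, not quoted):

```c
#include <stddef.h>
#include <assert.h>

/* NMSE = sum((y - r)^2) / sum(r^2), where y is the output under test
 * and r is the reference. Small values mean the outputs are close. */
double nmse(const double *y, const double *r, size_t n) {
    double num = 0.0, den = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double d = y[i] - r[i];
        num += d * d;
        den += r[i] * r[i];
    }
    return num / den;
}
```

A small NMSE can still mask per-element errors that compound over a long generation, which is why a perplexity run is a stricter check than eyeballing outputs.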
Maybe try turning off the disable-robustness flag? But I haven't been able to think of a reason there would be an out-of-bounds access.
I've got 4 failing tests because I have two AMD GPUs. It's 2 fails per GPU.
It doesn't look like this is it either. :/
I see. In that case I have 4 too, as my W8100 is also failing on the same test 😞. My laptop with an Intel integrated GPU passes.
Try forcing gqa_ratio = 1 at line 4646? |
If it happens on RADV too, then I probably missed it due to unrelated mesa bugs I'm dealing with on my Radeon VII. Maybe I need to revise my testing setup. |
Actually neither of the failing tests calls one of the two shaders that were optimized here, they are just part of the newly-added tests.
So it's either the copy or the regular mul_mat_vec shader.
I think I see what's going on, and I can reproduce it on NVIDIA with different K values. The odd K value should hit the OOB logic in the mul_mat_vec shader, but the logic for lastiter is wrong (the unrolled loops can potentially include the last iteration). I'll make a fix. Thanks for the help in pinpointing this.
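The bug class described here, an unrolled loop whose "last iteration" guard is evaluated per unrolled block rather than per element, can be sketched in plain C. This is a hypothetical reconstruction of the pattern, not the actual shader code; `dot_guarded` shows the corrected per-element bounds check that keeps an odd K from reading out of bounds.

```c
#include <assert.h>

#define UNROLL 2

/* Unrolled dot product over k elements. If the bounds check were done
 * once per UNROLL-sized block (e.g. a "lastiter" flag computed at the
 * top of the outer loop), the tail element of an odd k would be read
 * out of bounds. Checking idx against k per element avoids that. */
float dot_guarded(const float *a, const float *b, int k) {
    float sum = 0.0f;
    for (int i = 0; i < k; i += UNROLL) {
        for (int u = 0; u < UNROLL; ++u) {
            int idx = i + u;
            if (idx >= k)
                break;          /* fix: guard every unrolled element */
            sum += a[idx] * b[idx];
        }
    }
    return sum;
}
```

This matches the symptom in the thread: only K values that are not a multiple of the unroll factor exercise the broken guard.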
I've pushed a fix at #12529. |
* tests: add mul_mat perf/functional tests for p021/nc vulkan shaders
* vulkan: Optimize mul_mat_vec p021 and nc shaders
I added new directed perf tests based on the multiplies when KV is 16k:
llama-bench results with large KV cache: