PR: Refine ggml-hexagon backend (Qualcomm Hexagon NPU backend) for latest ggml, whisper.cpp, llama.cpp #12326
Conversation
why "mapping the entire ggml's computational graph to QNN graph"(the second technical approach in above post) is not practical in ggml-qnn backend
the key function in this complicated C++ source file shows that an ideal or expected QNN graph (a single QNN graph containing many graph nodes) is generated/composed in this function, while the code in QnnSampleMain.cpp is just routine skeleton code. in this case, we can clearly see that the single QNN graph was generated by Qualcomm's dedicated tool.
https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-100/introduction.html after tracking all the relevant code in the QNN SDK, we can clearly see that the core process of offloading inference to the NPU (HTP) backend is 90%-99% the same as the general approach of "utilize the Hexagon NPU maximally in the QNN NPU backend" in Qualcomm's QNN Sample. in this case, we clearly know the single QNN graph was generated by Qualcomm's dedicated tool.
we can clearly see that a customized model trained and provided by XiaoMi's AI team is used in this open source project as a customized binary model: they claimed a 10x performance gain with NPU inference. at the same time, after tracking the code carefully, we can clearly see that the main logic of this open source project is 90% the same as Qualcomm's QNN Sample, but we still don't know how that single QNN graph was generated. what should we think at the moment?
this open source project comes from a famous top Chinese university and can be considered a derived or highly-customized project of llama.cpp. one of the highlights in this derived project is that the R&D developers implemented a closed-source QNN backend. recently I found a highly related project on GitHub with help from a programmer unknown to me, @zhuipiaochen. after tracking the code carefully, we can clearly see that the approach of "utilize the Hexagon NPU maximally in the QNN NPU backend" in this interesting project is 90% the same as the approach in Qualcomm's Genie or in Qualcomm's QNN Sample:
the last 3 steps are very similar to offloading 2D/3D matrix multiplication to the QNN backend in this PR. the difference between these two scenarios is that there are only 2 QNN graph nodes in the QNN graph for 2D/3D mulmat on the QNN backend. in this case, we still don't know how the single QNN graph was generated. what should we think at the moment?
ok, let me do an interesting experiment with the ggml-qnn backend in this PR:
what can we see from the adb logcat logs? we can clearly see that there is no entire or complete GGML graph in this function: accordingly, the logic or inference procedure in this function is exactly the same as the original or general approach in all ggml backends. this is a limitation of the existing inference procedure or inference architecture in llama.cpp. conclusion:
[updated on 21:56, 03/12/2025] the conclusion here is incorrect because the analysis in case 5 is WRONG; the first tech approach in this PR is still meaningful (because all op functions can be used in the second tech approach after some minor adjustment) and the second tech approach should be finished in this PR or another similar PR, but the analysis in cases 1/2/3/4 is completely correct and the logic in this tech doc is correct: Qualcomm provides dedicated binary tools to do LLM model conversion, which is exactly the hard work (composing an ideal QNN graph according to the complete ggml cgraph, i.e. mapping the complete ggml cgraph to a single QNN graph) in the second tech approach of the ggml-qnn backend. the second tech approach could also be implemented in this PR, but I think I can't completely finish it because of my limited AI knowledge (there are hundreds of cgraph nodes and about 50+ ops), and real AI experts must be involved in the rest of ggml-qnn. so, good luck to other similar PRs. I made a wrong analysis in step 5 and a misunderstanding in #12342 which slaren already explained; the root cause of these two stupid mistakes is that I have very limited knowledge about real hard-core AI tech.
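to make the "general approach" mentioned above concrete: a backend's graph_compute callback receives only the (sub-)graph the scheduler assigned to it and dispatches it node by node. below is a minimal sketch of that per-node dispatch; the handler names mirror the functions mentioned later in this PR, but their signatures and the error handling here are simplified assumptions, not the actual ggml-qnn.cpp code:

```cpp
// minimal sketch of the "general approach": the ggml backend scheduler hands this backend
// a (partial) cgraph and the backend executes it one node at a time.
// ggmlqnn_compute_elementwise / ggml_qnn_mulmat are the per-op handlers from this PR,
// shown here with simplified (assumed) signatures.
#include "ggml.h"
#include "ggml-backend.h"

// assumed simplified signatures of the per-op handlers from this PR
void ggmlqnn_compute_elementwise(ggml_backend_t backend, struct ggml_tensor * node);
void ggml_qnn_mulmat            (ggml_backend_t backend, struct ggml_tensor * node);

static enum ggml_status ggmlqnn_graph_compute(ggml_backend_t backend, struct ggml_cgraph * cgraph) {
    const int n_nodes = ggml_graph_n_nodes(cgraph);
    for (int i = 0; i < n_nodes; i++) {
        struct ggml_tensor * node = ggml_graph_node(cgraph, i);
        switch (node->op) {
            case GGML_OP_ADD:
            case GGML_OP_MUL:
            case GGML_OP_SUB:
            case GGML_OP_DIV:
            case GGML_OP_LOG:
            case GGML_OP_SQRT:
                ggmlqnn_compute_elementwise(backend, node); // offload one elementwise op
                break;
            case GGML_OP_MUL_MAT:
                ggml_qnn_mulmat(backend, node);             // offload one 2D/3D matmul
                break;
            default:
                // ops this backend does not support are normally routed to other backends
                // by the scheduler via supports_op(); reaching here is an error in this sketch
                return GGML_STATUS_FAILED;
        }
    }
    return GGML_STATUS_SUCCESS;
}
```

the important point is that each call operates on a single node: the backend never sees the complete cgraph of the model as one unit, which is why composing one large QNN graph from it does not happen in this code path.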
Nice job. NPU support is huge for this project. Do you think it's also possible to make it work on Exynos 2200 and 2400 NPUs?
thanks for your kind comment.
Self-reported review complexity:

* [ ] Low
* [x] Medium
* [ ] High

* [x] test-backend-ops and llama-cli through QNN on a Qualcomm Snapdragon 8 Gen 3 equipped Android phone
* [x] llama-cli through cDSP on a Qualcomm Snapdragon 8 Gen 3 equipped Android phone

PR Description
this PR is a continuation of my original PR #6869 from 04/2024, focused on the final mission: how to utilize the Hexagon NPU maximally with the well-designed and highly compact ggml machine learning framework.
this is a concise ggml-hexagon (the previous name was ggml-qnn, but that wasn't accurate) implementation:
thanks to the huge changes in the software architecture of the latest llama.cpp (especially the maturation of the "backend scheduler" feature and of test-backend-ops),
this implementation puts the main logic in one single source file (ggml-qnn.cpp), because that helps other experienced programmers and experts get involved in further dev activity. another reason for this coding style is that I think it makes the developers' workflow easier:
Features
the data path between the QNN SDK and ggml/llama.cpp works well, achieved through reverse engineering of executorch (the QNN implementation in executorch comes from Qualcomm) in my first PR from 04/2024
a simple and effective QNN graph cache mechanism, already implemented in 04/2024
simple STL containers are used to manage QNN resources in this PR rather than complex C++ encapsulation, because the well-designed QNN SDK already manages its internal hardware and software resources very carefully
a simple skeleton in function ggmlqnn_compute_elementwise: offload GGML_OP_ADD, GGML_OP_MUL, GGML_OP_SUB, GGML_OP_DIV, GGML_OP_LOG and GGML_OP_SQRT to the QNN backend. this function is a very concise implementation rather than a complex C++ encapsulation that hides many tech details (a plain-CPU reference of what such a handler must compute is sketched after this list).
a complex skeleton in function ggml_qnn_mulmat: offload GGML_OP_MUL_MAT (2D & 3D mulmat) to the QNN backend. this skeleton can be used to illustrate the second technical approach of "how to utilize the Hexagon NPU maximally". this function is a concise implementation rather than a complex C++ encapsulation that hides many tech details.
a more complex skeleton in function ggml_qnn_mulmat_4d: offload 4D mulmat to the QNN backend. this skeleton can also be used to illustrate the second technical approach of "how to utilize the Hexagon NPU maximally". this function is a concise implementation rather than a complex C++ encapsulation that hides many tech details (UT passed, but there are some unknown bugs with test-backend-ops).
QNN NPU RPC feature, already implemented in 04/2024.
dynamic running parameter adjustment through ggml-qnn.cfg (this idea comes from @ngxson in his draft AI-dedicated PR; more parameters can be added in this configuration file).

offload quantized data types with 2D & 3D mulmat to the QNN backend.
provide the big picture of the ggml-hexagon backend in this PR for further or related dev activity in this great pure-tech community.
provide a very fast approach which is very similar to Intel's ggml-sycl, Qualcomm's ggml-opencl or Huawei's ggml-cann: offload ggml ops to the Hexagon cDSP directly.
the code is simple so everyone can understand it easily and quickly, without complex encapsulation or hidden tech details, because layered abstraction and loose coupling make code tracking and troubleshooting difficult.
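as a reference for the elementwise skeleton mentioned in the list above, here is the plain-CPU semantics of the GGML_OP_ADD case. this is a simplified sketch assuming contiguous f32 tensors without broadcasting; the real ggmlqnn_compute_elementwise builds a QNN graph node (or a cDSP call) instead of looping on the CPU:

```cpp
// reference semantics of the elementwise offload (GGML_OP_ADD case), written as plain
// CPU code for clarity. tensor layout handling is simplified here and assumes
// contiguous f32 tensors without broadcasting.
#include "ggml.h"

static void ref_elementwise_add(const struct ggml_tensor * src0,
                                const struct ggml_tensor * src1,
                                struct ggml_tensor * dst) {
    GGML_ASSERT(ggml_is_contiguous(src0) && ggml_is_contiguous(src1) && ggml_is_contiguous(dst));
    GGML_ASSERT(src0->type == GGML_TYPE_F32 && src1->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_F32);

    const int64_t n = ggml_nelements(dst);
    const float * a = (const float *) src0->data;
    const float * b = (const float *) src1->data;
    float       * c = (float *) dst->data;

    for (int64_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];   // GGML_OP_MUL/SUB/DIV/LOG/SQRT differ only in this expression
    }
}
```

whatever backend path is taken (QNN graph or cDSP kernel), the offloaded op has to produce exactly this result for test-backend-ops to pass.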
special clarification in this section:
Hexagon NPU Performance
the test phone is a Snapdragon 8 Gen 3 Android phone, and the test model is qwen1_5-1_8b-chat-q4_0.gguf.
verify mulmat on cDSP (set enable_mulmat_cdsp to 1 in scripts/ggml-qnn.cfg and then run):
verify mulmat on QNN-NPU (set hwaccel_approach to 0, i.e. the hwaccel approach through QNN, in scripts/ggml-qnn.cfg, and then run):
we can clearly see (from adb logcat | grep ggml-qnn) that the performance of mulmat on cDSP is much faster than mulmat on QNN-NPU for small matrices.




TBD
Hexagon NPU performance with qwen1_5-1_8b-chat-q4_0.gguf on Snapdragon 8 Gen3
03/11/2025: prompt eval 3.10 tokens/second, eval performance 11.47 tokens/second
03/12/2025: prompt eval 3.47 tokens/second, eval performance 22.03 tokens/second
03/13/2025: prompt eval 4.34 tokens/second, eval performance 23.72 tokens/second
03/24/2025: prompt eval 40-70+ tokens/second, eval performance 10+ tokens/second (only GGML_OP_ADD is offloaded to cDSP during LLM inference at the moment; the real performance once mulmat works on cDSP during LLM inference is unknown, so this impressive benchmark data has not much value at the moment.)
How to build ggml‐hexagon source code for Android and verify ggml-hexagon backend on Snapdragon based phone
Ubuntu 20.04 / 22.04 is validated and recommended as the host machine (other Linux distributions might also be ok). the dev activity in this PR can be done purely on the command line without any IDE:
use build-run-android.sh to download the Android NDK and Qualcomm QNN SDK automatically (please see the section below)
you will need an adb-connected Android smartphone running on one of the Qualcomm SoCs below:
SM8450 (Snapdragon 8 Gen 1+)
SM8550 (Snapdragon 8 Gen 2)
SM8650 (Snapdragon 8 Gen 3)
SM8750-AB (Snapdragon 8 Elite) (aka Snapdragon 8 Gen 4)
we can confirm that this backend works as expected from the log output of "adb logcat | grep ggml-qnn". for programmers, "adb logcat | grep ggml-hexagon" can be used to help with troubleshooting.
How to build ggml‐hexagon source code for Snapdragon based WoA(Windows on ARM) device
the good news for WoA port is:
Big picture of ggml-hexagon backend
there are three tech approaches to implement the ggml-hexagon backend for Qualcomm's Hexagon NPU:
the general approach through QNN SDK or Hexagon SDK can be seen in this PR. the special approach through QNN will be seen in another standalone PR, because:
[updated on 03/19/2025] the technical approach of "mapping the entire ggml computational graph to a QNN graph" will be seen in another standalone PR: a concise implementation (without complex/complicated encapsulation that hides tech details, for example around 4D mulmat) of the technical approach "mapping the entire ggml cgraph to a single QNN graph" (a conceptual sketch of this mapping is given after these update notes).
[updated on 03/20/2025]: I thought deeply for many hours after a senior staff technical expert from Qualcomm very helpfully told me on 03/18/2025 that "QNN is not the right solution here". today I think there is another tech approach to "utilize the Hexagon NPU maximally". I'll try to implement this third tech approach based on this PR (in other words, most of the code in this PR will be re-used in the third tech approach, and the efforts on the first and second tech approaches are also meaningful because these are all necessary exploration steps before completing the final mission) if my guess can be confirmed by the senior staff technical expert at Qualcomm: I think I know how to do that so-called third approach, and if my guess is confirmed I think I completely understand why there is so much performance difference between ggml-hexagon and Intel's ggml-sycl or Huawei's ggml-cann at the moment.
[updated on 03/22/2025]: the general approach through the Hexagon cDSP, which is very similar to Qualcomm's ggml-opencl or Intel's ggml-sycl, can be seen in this PR.
[updated on 03/23/2025]: I'm not an AI expert, so I'd like to port a tiny customized ggml-dsp to the Hexagon cDSP and then optimize this tiny ggml-dsp with Hexagon SIMD instructions.
[updated on 03/25/2025]: all code can be seen in this PR, and I have tried my best to refine it to make it clearer, more concise and bug-free. code review from AI experts and domain experts is greatly welcomed and appreciated; I hope this PR can be seen in the master branch of llama.cpp so other domain tech experts and AI experts can have a chance to participate in dev activities to improve the hexagon kernels (similar to the opencl kernels, cuda kernels, metal kernels...).
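for reference, here is a minimal conceptual sketch of the second tech approach ("mapping the entire ggml cgraph to a single QNN graph"): the whole cgraph is walked once to build one accelerator-side graph, which is then finalized and executed as a unit. the qnn_graph_* helpers below are hypothetical placeholders, not real QNN SDK functions, and a real implementation would also have to handle tensor conversion, quantized types, graph caching and fallback for unsupported ops:

```cpp
// conceptual sketch only: build ONE accelerator graph from the complete ggml cgraph,
// then finalize and execute that single graph. the qnn_graph_* helpers are hypothetical
// placeholders standing in for the real (much more involved) QNN SDK calls.
#include "ggml.h"

struct qnn_graph;  // opaque handle for the single accelerator-side graph (hypothetical)

// hypothetical helpers that would wrap the real QNN SDK graph-composition calls
qnn_graph * qnn_graph_create  (const char * name);
bool        qnn_graph_add_node(qnn_graph * g, const struct ggml_tensor * node);
bool        qnn_graph_finalize(qnn_graph * g);
bool        qnn_graph_execute (qnn_graph * g);

static bool map_cgraph_to_single_qnn_graph(struct ggml_cgraph * cgraph) {
    qnn_graph * g = qnn_graph_create("llm-cgraph");
    if (g == nullptr) {
        return false;
    }
    // every ggml node (there can be hundreds per token, covering 50+ ops) becomes one node
    // of the SAME graph, instead of being dispatched one by one as in the general approach
    const int n_nodes = ggml_graph_n_nodes(cgraph);
    for (int i = 0; i < n_nodes; i++) {
        const struct ggml_tensor * node = ggml_graph_node(cgraph, i);
        if (!qnn_graph_add_node(g, node)) {
            return false; // op not representable: fall back to the general approach / CPU
        }
    }
    // finalize once (expensive); the finalized graph can then be cached and executed per token
    return qnn_graph_finalize(g) && qnn_graph_execute(g);
}
```

this is exactly the hard work that Qualcomm's dedicated model-conversion tools perform offline, which is why doing it at runtime inside a ggml backend is so difficult.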
key-points about ggml-hexagon's performance in the general approach:
key-points about ggml-hexagon's performance in the special approach:
Acknowledgement
Conclusion
after spending a lot of effort on the ggml-hexagon backend, I personally think:
@max-krasnyansky, sorry to bother you, I understand your time is valuable, but could you take a look at this PR again? (the code in ggml-qnn.cpp is close to being stable and I have tried my best to make it clearer, more concise and bug-free; theoretically speaking, we only need to focus on the hexagon kernels from now (03/25/2025) on)