PR: Refine ggml-hexagon backend(Qualcomm Hexagon NPU backend) for latest ggml,whisper.cpp,llama.cpp #12326

zhouwg · 2025-03-11T06:32:03Z

I have read the contributing guidelines
Self-reported review complexity:
* [ ] Low
* [x] Medium
* [ ] High
Testing Done
* [x] test-backend-ops and llama-cli through QNN on Qualcomm Snapdragon 8Gen3 equipped Android phone
* [x] llama-cli through cDSP on Qualcomm Snapdragon 8Gen3 equipped Android phone

PR Description

this PR is a continued effort of my original PR #6869 on 04/2024, focus on the final mission:how to utilize the Hexagon NPU maximally with the highly well-designed and highly compact ggml machine learning framework.

this is a concise ggml-hexagon(the previous name was ggml-qnn but that wasn't accurate) implementation:

follow the principle of "simple is beautiful" which comes from the great Unix in the US, code is simple and easily/quickly understand, without complex encapsulation, it's a good reference implementation of ggml-hexagon, can be extended as needs .
follow the principle of "make it run, then make it right, then make it fast" (run and right already got at the moment).

thanks to the huge changes in the software architecture in the latest llama.cpp (especially the maturation of the "backend scheduler" feature and the maturation of test-backend-ops),

data path of ggml-hexagon backend works pretty good as expected.
the official command line tool "test-backend-ops" & "llama-cli" has verified on a Qualcomm Snapdragon 8 Gen3 equipped Android phone.
works pretty good with ASR inference via whisper.cpp and LLM inference via llama.cpp with a standard Android APP(which is a self-made Android APP) on a Qualcomm Snapdragon 8 Gen 3 equipped Android phone.

this implementation put main logic in one single source file(ggml-qnn.cpp) because it will helpful for other experienced programmers and experts be involved in further dev activity. other reason of this coding style is I think this will make the developers' workflow more easily:

this is a self-contained single source file(I can split to some well-organized small source files in less then 1 day if there is a strong need, I don't think this is the point at the moment: this self-contained single source file is similar to what ggerganov did in the very beginning of ggml.c/llama.cpp or what Intel did in the very beginning of ggml-sycl.cpp), or is exactly similar to Qualcomm's ggml-opencl.
try to overcome all relevant technical issues/difficulties with a specified op GGML_OP_ADD or GGML_OP_MUL_MAT
then expand other ggml ops accordingly with team-work from AI experts and programmers in this great pure-tech community

Features

data path works good between QNN SDK and ggml/llama.cpp through reverse engineering from executorch(the implementation through QNN in executorch comes from Qualcomm) in my first PR on 04/2024
a simple and effective QNN graph cache mechanism already implemented on 04/2024
use a simple STL to manage QNN resources in this PR rather than complex C++ encapsulation because the highly-well designed QNN SDK already manage it's internal hardware and software resource very carefully
a simple skeleton in function ggmlqnn_compute_elementwise:offload GGML_OP_ADD & GGML_OP_MUL & GGML_OP_SUB & GGML_OP_DIV & GGML_OP_LOG & GGML_OP_SQRT to QNN backend. we can see this function is a very concise implementation rather than complex C++ encapsulation with hide many tech details.
a complex skeleton in function ggml_qnn_mulmat: offload GGML_OP_MUL_MAT(2d&3d mulmat) to QNN backend, this skeleton can be used to illustrate the second technical approach of "how to utilize the Hexagon NPU maximally". we can see this function is a very concise implementation rather than complex C++ encapsulation with hide many tech details.
a more complex skeleton in function ggml_qnn_mulmat_4d: offload 4d mulmat to QNN backend, this skeleton can be used to illustrate the second technical approach of "how to utilize the Hexagon NPU maximally". we can see this function is a concise implementation rather than complex C++ encapsulation with hide many tech details.(UT passed but some unknown bugs with test-backend-ops).
QNN NPU RPC feature already implemented on 04/2024.
dynamic running parameter adjustment through ggml-qnn.cfg(this idea comes from @ngxson in his draft AI-dedicated PR and more parameters can be added in this configuration file).
offload quantized data type with 2d&3d mulmat to QNN backend.
provide big picture of ggm-hexagon backend in this PR for further or other relative dev activity in this great pure-tech community.
provide a very fast approach which exactly similar to Intel's ggml-sycl or Qualcomm's ggml-opencl or Huawei's ggml-cann: offload ggml op to Hexagon cDSP directly.
code is simple and everyone can understand code easily and quickly, without complex encapsulation and hide tech details, because layered abstraction and loose coupling will bring difficult with code tracking and troubleshooting.

special clarification in this section:

all original tech comes from Qualcomm, Qualcomm provide the fundamental mechanism and we programmer use it regardless of coding style or tech approach.

Hexagon NPU Performance

test phone is a Snapdragon 8 Gen3 Android phone, test model is qwen1_5-1_8b-chat-q4_0.gguf.

how to verify or validate NPU performance between general approach through QNN and general approach through cDSP

  git clone https://github.com/kantv-ai/ggml-hexagon
  cd ggml-hexagon
  git checkout pr_to_upstream
  ./scripts/build-run-android.sh build

verify mulmat on cDSP(modify enable_mulmat_cdsp to 1 in scripts/ggml-qnn.cfg and then run):

./scripts/build-run-android.sh run_testop MUL_MAT 2

verify mulmat on QNN-NPU(modify hwaccel_approach to 0 --- hwaccel approach through QNN--- in scripts/ggml-qnn.cfg and then run)

./scripts/build-run-android.sh run_testop MUL_MAT 2

we can/will clearly see(from adb logcat | grep ggml-qnn) that the performance of mulmat with cDSP is much much faster than mulmat with QNN-NPU for small matrix.

how to verify or validate NPU performance in real LLM inference between general approach through QNN and general approach through cDSP

TBD

Hexagon NPU performance with qwen1_5-1_8b-chat-q4_0.gguf on Snapdragon 8 Gen3

03/11/2025:prompt eval 3.10 tokens / second, eval performance 11.47 tokens / second
03/12/2025:prompt eval 3.47 tokens / second, eval performance 22.03 tokens / second
03/13/2025:prompt eval 4.34 tokens / second, eval performance 23.72 tokens / second
03/24/2025:prompt eval 40-70+ tokens / second, eval performance 10+ tokens / second (only offload GGML_OP_ADD to cDSP during LLM inference at the moment, not sure the real performance when mulmat can works fine on cDSP during LLM inference. so this impressive benchmark data has not much value at the moment.)

How to build ggml‐hexagon source code for Android and verify ggml-hexagon backend on Snapdragon based phone

Ubuntu 20.04,22.04 is validated and recommended as host machine(other Linux distributions might be also ok). the dev activity in this PR can be done in pure command line without any IDE:

utilize build-run-android.sh to download Android NDK and Qualcomm QNN SDK automatically(pls see below section)
you will need an Android smartphone with adb-connected running on one of below Qualcomm SoCs:

SM8450 (Snapdragon 8 Gen 1+)
SM8550 (Snapdragon 8 Gen 2)
SM8650 (Snapdragon 8 Gen 3)
SM8750-AB (Snapdragon 8 Elite) (aka Snapdragon 8 Gen 4)

  git clone https://github.com/kantv-ai/ggml-hexagon
  cd ggml-hexagon
  git checkout pr_to_upstream

 ./scripts/build-run-android.sh 
Usage:
  ./scripts/build-run-android.sh help
  ./scripts/build-run-android.sh print_oplist
  ./scripts/build-run-android.sh build
  ./scripts/build-run-android.sh updateqnnlib
  ./scripts/build-run-android.sh run_testops
  ./scripts/build-run-android.sh run_llamacli
  ./scripts/build-run-android.sh run_llamabench

we can find that this backend works fine as expected from the log output of "adb logcat | grep ggml-qnn". for programmers, we can use "adb logcat | grep ggml-hexagon" to help troubleshooting work.

How to build ggml‐hexagon source code for Snapdragon based WoA(Windows on ARM) device

the good news for WoA port is:

a Snapdragon 8gen2 or 8gen3 or 8 Elite(aka 8gen4) equipped Android phone can be seen or bought everywhere.
WoA port can be done in another standalone PR by a skilled Windows programmer because the highly-well designed Qualcomm QNN SDK and the source codes of ggml/llama.cpp are both highly portable.

Big picture of ggml-hexagon backend

there are three tech approaches to implement the ggml-hexagon backend for Qualcomm's Hexagon NPU:

general approach through Qualcomm QNN SDK:offload ggml op to QNN (then QNN's internal will transfer to Hexagon cDSP)
general approach through Qualcomm Hexagon SDK:offload ggml op to Hexagon cDSP directly, which exactly similar to Qualcomm's ggml-opencl or Intel's ggml-sycl, or the story here is similar to what DeepSeek did in their LLM model(use lowlevel NVIDIA API rather than the general highlevel NVIDIA CUDA API).
special approach through Qualcomm QNN SDK:mapping the entire ggml cgraph to a single QNN graph. the technical approach of "mapping the entire ggml computational graph to a single QNN graph" already discovered on 04/02024. the tech details of "the special approach through QNN" or "how to compose an ideal single QNN graph from a ggml cgraph" can be found at my forked llama.cpp project:how to compose a single QNN graph from a complete ggml cgraph kantv-ai/ggml-hexagon#24.

the general approach through QNN SDK or Hexagon SDK can be seen in this PR. the special approach through QNN will be seen in another standalone PR, because:

it's heavily not mature. the general approach through QNN or cDSP directly is a common approach in the all existing backends and it works fine in the all existing backends.
it's implementation contains about 1000+ LoC and it contains a navie graph algorithm, I think it will brings some unexpected troubles to this formal PR(there are already about 5000+ LoC in the existing PR). accordingly, it will be seen in another standalone PR after this PR can be approved.
we should/must implement all ggml ops through QNN API in this approach, this workload is very very big.
it's not the key-point at the moment(my personal opinion): the general approach through QNN is mature but the NPU performance is really bad; the general approach through cDSP is not enough mature(can't pass the test-backend-ops and lack of optimization with Hexagon SIMD instructions) at the moment but the NPU performance is really good and much faster then QNN solution.

[updated on 03/19/2025] the technical approach of "mapping the entire ggml computational graph to QNN graph" will be seen in another standalone PR: provide a concise( without complex/complicated encapsulation and hide tech details, for example, 4d mulmat) implementation of the technical approach "mapping the entire ggml cgraph to a single QNN graph" .

[updated on 03/20/2025]: I deeply thought many hours after a senior staff technical expert from Qualcomm told me on 03/18/2025 that "QNN is not the right solution here" very valuablely, today I think I know there is another tech approach of "utilize the Hexagon NPU maximally". I'll try to implement the third tech approach base on this PR(in other words, most of codes in this PR will be re-used in the third tech approach, AND the efforts on the first tech approach or the second tech approach is also meaningful because these are all necessary exploring steps before completing the final mission) if my guess can be confirmed by the senior staff technical expert at Qualcomm: I think I know how do that so-called third approach and I think I completely understand why there is so much performance difference between ggml-hexagon and Intel's ggml-sycl or Huawei's ggml-cann at the moment if my guess can be confirmed.

[updated on 03/22/2025]: the general approach through Hexagon cDSP which exactly similar to Qualcomm's ggml-opencl or Intel's ggml-sycl can be seen in this PR.

[updated on 03/23/2025]: I'm not AI expert so I'd like to port a tiny customized ggml-dsp to Hexagon cDSP and then optimize this tiny ggml-dsp with Hexagon SIMD instructions.
[updated on 03/25/2025]: all codes can be seen in this PR and I try my best to refine the codes to make it more clear and more concise and bug-free. code review from AI experts and domain experts is greatly welcomed and appreciated, hope this PR can be seen in the master branch of llama.cpp so other domain tech experts and AI experts can have a chance to participate in dev activities of improve the hexagon kernel (which similar to opencl kernel, cuda kernel, metal kernel......)

key-points about ggml-hexagon's performance in the general approach:

load/performance loss of data transfer between AP(arm cpu) and NPU(dsp). that's the performance loss caused by transferring data between the main CPU and the NPU. this part requires redesigning the data structure in the ggml-hexagon implementation(in other words, shared buffer or memory pool should-be used in the implementation of ggml-hexagon), placing all tensor data entirely in the DSP's device memory to minimize data copying or ideally achieve zero-copy.
relative tricks with Qualcomm's QNN SDK in the tech approach through QNN, this is a highly-well designed SDK at the same time I personally think it's usage is really not easy.
relative tricks with Qualcomm's Hexagon cDSP in the tech approach through Hexagon cDSP, this is a straight way and exactly similar to Intel's ggml-sycl.
we need to write some "hexagon kernels" in the general approach through Hexagon cDSP, which is similar to Qualcomm's ggml-opencl(opencl kernel) or other backend(cuda kernel), or which is exactly similar to TEE TA & CA: we can clearly see here hexagon kernels(libggmlop-skel.so) is similar to a specified TEE TA on TEE OS and ggml-hexagon is similar to a specified TEE CA on AP(arm-cpu), the difference is that TEE GP API is an international sw standard and here is Qualcomm's dedicated sw stack.
some ops that are generally critical to inference performance in ggml-hexagon, AI experts must be involved in the rest parts of ggml-hexagon, we only need to implement some performance-sensitive ggml ops in the general approach(through Hexagon cDSP).

key-points about ggml-hexagon's peformance in the special approach:

Qualcomm provides some binary dedicated tools to do LLM model conversion which is exactly hard work in this approach: compose an ideal QNN graph according to the complete ggml cgraph or mapping the complete ggml cgraph to a single QNN graph, this is the most important key-point in this approach.
we must implement ALL ggml ops in this approach and the general approach through QNN is the essential foundation of this approach.

Acknowledgement

the implementation through QNN is mainly porting/reverse engineering from executorch(the implementation through QNN in executorch comes from Qualcomm). all the original techs of this topic comes from Qualcomm.
I got breakthrough help from chiwwang@Qualcomm Technologies Inc on 04/2024.
I also got a meaningful help from XiaoMi-StableDiffusionOnDevice on 05/2024.
thanks for that I borrowed 7-8 functions from another QNN implementation which comes from a CN programmer chraac 's team. one of these functions is very helpful although it can be also directly used in the original PR(04/2024 or 05/2024 or before I disappeared in this great tech community last year) accordingly rather than complex encapsulation of QNN(because QNN SDK is already binary).
thanks for this post:Bug: MinGW build fails to load models with "error loading model: PrefetchVirtualMemory unavailable" #9311; the original tech of this customized/dedicated toolchain llvm-mingw-20250305-ggml-ucrt-x86_64.zip comes from https://github.com/mstorsjo/llvm-mingw/releases; the git.exe and cmake.exe and ninja.exe comes from MS's VS2022. the purpose of this customized toolchain is to make workflow easy for Linux programmer.
thanks for that I got a kind help during my difficult effort on WoA(Windows on ARM) build from @ejrydhfs and a senior&staff tech leader&engineer @AndreasKunar, at the same time, I also got a meaningful help from https://github.com/Windows-on-ARM-Experiments/mingw-woarm64-build which comes from some excellent MS's compiler&toolchain engineers.
thanks for the standout/excellent/great maintainers&original authors of llama.cpp,I learnt so much from their codes.
thanks for @zhuipiaochen because your unintentional/casual & kind help helped me to complete the final puzzle and now I has a more clear understanding of implementation through QNN for ggml-hexagon.
thanks so much for a senior staff technical expert @max-krasnyansky from Qualcomm HQ whom give an important/valuable/breakthrough guidance on direction on 03/18/2025:QNN is not the right solution here.
recently I tried AI-assisted programming for ggml-hexagon with the help from the powerful Grok 3,it really helped me a lot in this PR.

Conclusion

after spent too much efforts on ggml-hexagon backend, I personally think:

AI experts must be involved in the rest parts of ggml-hexagon regardless of coding style or tech approach: AI experts and other developers around the world can engage ggml-hexagon and improve for the hexagon-kernels on cDSP. a fully open-source implementation of ggml-hexagon backend might be a team-work between experienced programmers and AI experts even the professional technical help from Qualcomm regardless of coding style or tech approach.
the general approach through Hexagon cDSP is really fast, it's an impressive approach which pointed out by a senior staff tech expert and tech leader at Qualcomm @max-krasnyansky. I couldn't reach this point or milestone without the breakthrough reminder/guidance/help from Max. AI experts and domain tech expert can do many things on Hexagon cDSP directly .
Intel's ggml-sycl might-be fully utilized for Qualcomm's Hexagon NPU if Qualcomm's engineering team can do some adaption work(this is really difficult(I need to familiar with Intel's sycl sw stack firstly before PoC) so efforts on this PR and approval of this PR is still meaningful). we can cleary see that programming on Hexagon cDSP is exactly similar to what the Intel's sycl want to do: Intel's sycl provide an uniform software stack/framework for heterogeneous multi-core computing in embedded system or desktop system. this is another key-point I got after I finished the PoC and initial version of general approach of "how to utilize the Hexagon NPU maximally in ggml/llama.cpp" through Hexagon cDSP.
I think some design tricks from FFmpeg or GStreamer might-be/already used in GGML's backend subsystem: there are more than 1 backend implementation for the same hardware accelerator.
this PR's style is exactly similr/same to the original ggml/llama.cpp: code is clear & concise and without complex/complicated encapsulation in ggml/llama.cpp although the core maintainers are both genius programmers and modern C++ masters.

@max-krasnyansky, sorry to bother you, I understand your time is valuable, could you take a look at this PR again?(the code in ggml-qnn.cpp is close to being stable and I try my best to make it more clear and more concise and bug-free, theoretically speaking, we only need to focus on hexagon kernels from now(03/25/2025) on)

zhouwg · 2025-03-11T09:37:29Z

why "mapping the entire ggml's computational graph to QNN graph"(the second technical approach in above post) is not practical in ggml-qnn backend

general approach of "utilize the Hexagon NPU maximally in QNN NPU backend" in Qualcomm's QNN Sample
https://github.com/kantv-ai/kantv/blob/01a49bb2d3dc963d920ce7b6827524c8e6c0a70a/core/ggml/qnnsample/QnnSampleMain.cpp#L434-L484
we can clearly see that there is a prebuilt binary module file(xxxxx.so) which generated by Qualcomm's dedicated tool(they called it as "qnn-pytorch-converter" or "qnn-tensorflow-converter" or "qnn-tflite-converter"), this binary module file can be converted from a complicated C++ source file which is also generated by Qualcomm's dedicated tool. an example of this very complicated C++ source file as following:
https://github.com/kantv-ai/kantv/blob/kantv-poc-with-qnn/core/ggml/jni/Inception_v3.cpp

the key-point function in this complicated C++ source file is:
https://github.com/kantv-ai/kantv/blob/kantv-poc-with-qnn/core/ggml/jni/Inception_v3.cpp#L20634

we can clearly see that an ideal or expected QNN graph which has only single QNN graph with many many graph nodes would be generated/composed in this function. then we can understand that the codes in QnnSampleMain.cpp is just a route work or skeleton codes. in this case, we clearly know the single QNN graph was generated by Qualcomm's dedicated tool.

approach of "utilize the Hexagon NPU maximally in QNN NPU backend" in Qualcomm's Genie(Generative AI Inference Extensions) software stack from Qualcomm's latest QNN SDK(2.32.0.250228, as of 03/11/2025)

https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-100/introduction.html

we can clearly see that the core process of offload inference to NPU(HTP) backend is 90%-99% same to the general approach of "utilize the Hexagon NPU maximally in QNN NPU backend" in Qualcomm's QNN Sample after tracking all relative codes in QNN SDK. in this case, we clearly know the single QNN graph was generated by Qualcomm's dedicated tool.

approach of "utilize the Hexagon NPU maximally in QNN NPU backend" in XiaoMi StableDiffusionOnDevice

we can clearly see that a customized model which was trained and provided by XiaoMi's AI team and this customized binary model will be used in this open source project: they claimed they can got a 10x performance gain with NPU inference . at the same time, we can clearly see that the main logic of this open source project is 90% same to Qualcomm's QNN Sample after tracking codes carefully, but we still don't know how that single QNN graph was generated? what should we think at the moment???

approach of "utilize the Hexagon NPU maximally in QNN NPU backend" in PowerInfer

this open source project comes from a famous China top university and it can be considered a derived or highly-customized project of llama.cpp. one of the highlights in this derived project is that the R&D developers implemented a closed-source QNN backend. recently I found a highly related project on GitHub with help from an unknown(here means I don't know) programmer @zhuipiaochen. we can clearly see the approach of "utilize the Hexagon NPU maximally in QNN NPU backend" in this interesting project is 90% same to the approach in Qualcomm's Genie or 90% same to the approach in Qualcomm's QNN Sample after tracking codes carefully:

load model from a binary model file which probably generated by a dedicated tool
https://github.com/powerserve-project/PowerServe/blob/main/src/backend/qnn/qnn.cpp#L413
create a QNN context
https://github.com/powerserve-project/PowerServe/blob/main/src/backend/qnn/qnn.cpp#L448
fetch QNN graph info from this binary model file
https://github.com/powerserve-project/PowerServe/blob/main/src/backend/qnn/qnn.cpp#L467
fetch required infos of input tensors and output tensors
https://github.com/powerserve-project/PowerServe/blob/main/src/backend/qnn/qnn.cpp#L789
fetch QNN graph handle from the generated QNN context
https://github.com/powerserve-project/PowerServe/blob/main/src/backend/qnn/qnn.cpp#L802
exec inference on NPU
https://github.com/powerserve-project/PowerServe/blob/main/src/backend/qnn/qnn.cpp#L888

the last 3 steps are exactly similar to offload 2D/3D matrix mulipilication to QNN backend in this PR. the difference between these two scenarios is that there are only 2 QNN graph nodes in the QNN graph of 2D/3D mulmat on QNN backend. in this case, we still don't know how the single QNN graph was generated? what should we think at the moment?????

inference procedure in the existing implementation of llama.cpp

we can clearly see that the inference procedure in ggml-sycl is a typical skeleton of all existing ggml backends. accordingly, there is a similar code snippet in this PR(ggml-qnn backend):

ok, let me doing an interesting experiment with ggml-qnn backend in this PR:

uncomment line 3665 and line 3666 in function ggml_backend_qnn_graph_compute(ggml_backend_t backend, struct ggml_cgraph * cgraph)
modify configuration file <llama.cpp path>/scripts/ggml-qnn.cfg from

to
running the script ./scripts/build-run-android.sh run_llamacli 2 accordingly(the meaning of this command is to launch LLM inference on the QNN NPU backend)

what we can see from the logs of adb logcat?

we can clearly see that there is no such an entire or complete GGML graph in this function:

accordingly, the logic or inference procedure in this function is exactly same to the original approach or general approach in all ggml backends.this is the limitation of the existing implementation of inference procedure or inference architecture in llama.cpp.

conclusion:

there is NO second technical approach in ggml-qnn backend because of limitation of the existing implementation of llama.cpp.
tech approach in this PR is a general approach and step in all ggml backends regardless coding style.

[updated on 21:56, 03/12/2025], the conclusion here is incorrect because the analysis in case 5 is WRONG, the first tech approach in this PR is still meaningful(because all op functions can be used in the second tech approach after some minor adjustment) and the second tech approach should be finished in this PR or other similar PR, but the analysis in case 1/2/3/4 is completely correct and logic in this tech doc is correct:Qualcomm provides some binary dedicated tools to do LLM model conversion which is exactly hard work(compose an ideal QNN graph according to the complete ggml cgraph or mapping the complete ggml cgraph to a single QNN graph) in the second tech approach of ggml-qnn backend. the second tech approach can be also implemented in this PR but I think I can't completely finish it because of my limited AI knowledge(there are hundreds of cgraph nodes and there are about 50+ ops) and real AI experts must be involved in the rest parts of ggml-qnn. so, good lucky to other similar PR.

I made a wrong analysis in step 5 or misunderstanding in #12342 which already explained by slaren, the rootcause of these two-stupid mistakes is that I have very limited knowledge about real hard-core AI tech.

Dampfinchen · 2025-03-12T18:39:52Z

Nice job. NPU support is huge for this project. Do you think its also possible to make it work on Exynos 2200 and 2400 NPUs?

zhouwg · 2025-03-12T23:07:46Z

Nice job. NPU support is huge for this project. Do you think its also possible to make it work on Exynos 2200 and 2400 NPUs?

thanks for your kind comment.

Quacomm's Hexagon NPU support is really huge work for this project although now we clearly know the principle or know what, because Qualcomm provides some binary dedicated tools to do LLM model conversion in their dedicated AI sw stacks and some other closed-source implementation also use this similar approach exactly. so programmers must compose an ideal QNN graph according to the complete ggml cgraph manually in ggml-qnn backend if they use/chose the second tech approach in ggml-qnn backend("mapping the complete ggml cgraph to a single opcfg QNN graph"). there are 800+ cgraph nodes and 50+ ops in qwen1_5-1_8b-chat-q4_0.gguf, accordingly, "(Hexgon) NPU support is huge for this project", real AI experts must be involved in the rest parts of ggml-qnn.
I think I can make it(ggml-exynos or ggml-samsung) work on Exynos 2200 if I can get a necessary phone(I can try to buy it) and SDK&tech docs(this might-be not easy because of strict IPR policy in some big IT companys as my personal understanding at the moment) and follow the principle "make it run and then make it right and finally make it fast",this is one of my areas of expertise.

ggml/src/ggml-qnn/ggml-qnn.cpp

…ode more clearly

…lear

…mul_mat works fine on Hexagon cDSP at the first time

…ulmat performance on cDSP easily

…e clear

github-actions bot added build script ggml labels Mar 11, 2025

zhouwg force-pushed the pr_to_upstream branch from 9780963 to 1b37471 Compare March 11, 2025 06:45

github-actions bot added the testing label Mar 11, 2025

zhouwg force-pushed the pr_to_upstream branch 5 times, most recently from 3ef106e to db890cc Compare March 11, 2025 09:09

zhouwg force-pushed the pr_to_upstream branch 2 times, most recently from 2ceaaf5 to 3402e2c Compare March 11, 2025 14:22

zhouwg mentioned this pull request Mar 12, 2025

ggml: offload the entire cgraph to a specified backend #12342

Closed

zhouwg force-pushed the pr_to_upstream branch from 3402e2c to 134eb3c Compare March 12, 2025 14:48

This comment was marked as resolved.

Sign in to view

zhouwg force-pushed the pr_to_upstream branch 2 times, most recently from 0065122 to 1f702df Compare March 16, 2025 08:12

This comment was marked as resolved.

Sign in to view

zhouwg force-pushed the pr_to_upstream branch 2 times, most recently from 1e98561 to e4b0d8c Compare March 16, 2025 09:51

zhouwg mentioned this pull request Mar 17, 2025

Add Qualcomm mobile SoC native backend for GGML ggml-org/ggml#771

Closed

zhouwg force-pushed the pr_to_upstream branch 3 times, most recently from 967be44 to a26806a Compare March 18, 2025 03:34

zhouwg commented Mar 18, 2025

View reviewed changes

ggml/src/ggml-qnn/ggml-qnn.cpp Outdated Show resolved Hide resolved

This comment was marked as resolved.

Sign in to view

zhouwg force-pushed the pr_to_upstream branch from a26806a to b983565 Compare March 18, 2025 05:32

zhouwg added 26 commits March 26, 2025 22:18

ggml-qnn: remove no-needed comments

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.

GPG key ID: B5690EEEBB952194

Verified
Learn about vigilant mode

53a9c4f

ggml-qnn: Windows port --- step3

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.

GPG key ID: B5690EEEBB952194

Verified
Learn about vigilant mode

Loading
Loading status checks…

10022ab

ggml-qnn: remove un-needed function

Verified

This commit was signed with the committer’s verified signature.

stenzek Connor McLaughlin

SSH Key Fingerprint: xXDedadPmBGHkM3TfvrcBYZE1Z+CrsSSsC7TcN8ijsk
Verified
Learn about vigilant mode

24eec8c

ggml-qnn:rebase to upstream

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.

GPG key ID: B5690EEEBB952194

Verified
Learn about vigilant mode

f341e32

ggml-qnn: fix a minior issue during rebase to upstream

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.

GPG key ID: B5690EEEBB952194

Verified
Learn about vigilant mode

2e3c824

ggml-qnn: update script according to ggml-org#12155

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.

GPG key ID: B5690EEEBB952194

Verified
Learn about vigilant mode

c3e5d3c

ggml-qnn: fix a minior issue in ggmlqnn_create_general_tensor()

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.

GPG key ID: B5690EEEBB952194

Verified
Learn about vigilant mode

fc3b9a6

ggml-qnn: active member variable _device_id in class qnn_instance

626a9cb

ggml-qnn: refine ggml_qnn_general_node and ggml_qnn_mul_mat to make c…

8471577

…ode more clearly

ggml-qnn: Windows port --- step4

0af75f2

ggml-qnn: Windows port -- step5

3802663

ggml-qnn: WoA(Windows on ARM) -- step6

05a5854

ggml-qnn: rebase to upstream

5dbc27c

ggml-qnn: pr to upstream

79fb362

ggml-qnn: rebase to upstream

bc0342f

ggml-qnn: self code-review

34bb250

ggml-qnn: rebase upstream

f79812b

ggml-qnn: add approach through Hexagon cDSP

192d474

ggml-qnn: refine general approach through Hexagon cDSP

1489a4e

ggml-qnn: refine the entire ggml-qnn.cpp to make code more clear

6612cbc

ggml-qnn: refine the entire ggml-qnn.cpp to make code more clear

f81a739

ggml-qnn: add build script for libggmlop_skel.so

63f43c6

ggml-qnn: remove redundant functions in this PR and make codes more c…

d590374

…lear

ggml-qnn: original ggml_compute_forward_add and ggml_compute_forward_…

0703cbc

…mul_mat works fine on Hexagon cDSP at the first time

ggml-qnn: modify build-run-android.sh to verify mulmat and validate m…

9ab1ea5

…ulmat performance on cDSP easily

ggml-qnn: make host code(ggml-qnn.cpp) more clear and more stable

ae6402e

zhouwg force-pushed the pr_to_upstream branch 2 times, most recently from e0d751e to 979c379 Compare March 26, 2025 14:47

ggml-qnn: refine code according to self code-review and make code mor…

5134179

…e clear

zhouwg force-pushed the pr_to_upstream branch from 979c379 to 5134179 Compare March 26, 2025 14:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PR: Refine ggml-hexagon backend(Qualcomm Hexagon NPU backend) for latest ggml,whisper.cpp,llama.cpp #12326

PR: Refine ggml-hexagon backend(Qualcomm Hexagon NPU backend) for latest ggml,whisper.cpp,llama.cpp #12326

zhouwg commented Mar 11, 2025 •

edited

Loading

zhouwg commented Mar 11, 2025 •

edited

Loading

Dampfinchen commented Mar 12, 2025

zhouwg commented Mar 12, 2025 •

edited

Loading

This comment was marked as resolved.

This comment was marked as resolved.

This comment was marked as resolved.

PR: Refine ggml-hexagon backend(Qualcomm Hexagon NPU backend) for latest ggml,whisper.cpp,llama.cpp #12326

Are you sure you want to change the base?

PR: Refine ggml-hexagon backend(Qualcomm Hexagon NPU backend) for latest ggml,whisper.cpp,llama.cpp #12326

Conversation

zhouwg commented Mar 11, 2025 • edited Loading

PR Description

Features

Hexagon NPU Performance

Hexagon NPU performance with qwen1_5-1_8b-chat-q4_0.gguf on Snapdragon 8 Gen3

How to build ggml‐hexagon source code for Android and verify ggml-hexagon backend on Snapdragon based phone

How to build ggml‐hexagon source code for Snapdragon based WoA(Windows on ARM) device

Big picture of ggml-hexagon backend

Acknowledgement

Conclusion

zhouwg commented Mar 11, 2025 • edited Loading

Dampfinchen commented Mar 12, 2025

zhouwg commented Mar 12, 2025 • edited Loading

This comment was marked as resolved.

This comment was marked as resolved.

This comment was marked as resolved.

zhouwg commented Mar 11, 2025 •

edited

Loading

zhouwg commented Mar 11, 2025 •

edited

Loading

zhouwg commented Mar 12, 2025 •

edited

Loading