-
Notifications
You must be signed in to change notification settings - Fork 755
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hyperloglog ARM NEON SIMD optimization #1859
base: unstable
Are you sure you want to change the base?
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## unstable #1859 +/- ##
============================================
- Coverage 71.09% 71.03% -0.07%
============================================
Files 123 123
Lines 65671 65671
============================================
- Hits 46692 46649 -43
- Misses 18979 19022 +43
🚀 New features to boost your workflow:
|
d5cc649
to
b2c857e
Compare
- Implement two NEON optmized functions for converting between raw and dense representations in HyperLogLog: 1. hllMergeDenseNEON 2. hllDenseCompressNEON These functions process 16 registers in each iteration. - Utilize existing SIMD test in hyperloglog.tcl (previously added for AVX2 optimization) to validate NEON implementation Test: valkey-benchmark -n 1000000 --dbnum 9 -p 21111 PFMERGE z hll1{t} hll2{t} +-------------------+-----------+-----------+---------------+ | Metric | Before | After | Improvement % | +-------------------+-----------+-----------+---------------+ | Throughput (k rps)| 7.42 | 76.98 | 937.47% | +-------------------+-----------+-----------+---------------+ | Latency (msec) | | | | | avg | 6.686 | 0.595 | 91.10% | | min | 0.520 | 0.152 | 70.77% | | p50 | 7.799 | 0.599 | 92.32% | | p95 | 8.039 | 0.767 | 90.46% | | p99 | 8.111 | 0.807 | 90.05% | | max | 9.263 | 1.463 | 84.21% | +-------------------+-----------+-----------+---------------+ Hardware: CPU: Graviton 3 Architecture: aarch64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 64 On-line CPU(s) list: 0-63 NUMA: NUMA node(s): 1 NUMA node0 CPU(s): 0-63 Memory: 256 GB Signed-off-by: xbasel <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
10 times faster is a pretty good improvement. :)
I didn't read the NEON code carefully because I'm not familiar with it. Is the logic basically the same as the one for AVX2?
#ifdef HAVE_AVX2 | ||
simd_enabled = 1; | ||
#endif | ||
#ifdef __ARM_NEON | ||
simd_enabled = 1; | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#if defined(HAVE_AVX) || defined(__ARM_NEON)
or we can consider some macro like HLL_HAVE_SIMD
to avoid this repetition, also to avoid defining the variable twice.
It is similar. NEON vectors are 128 bit, AVX2 is 256 bit. The padding and lookup is a bit different in AVX2. |
* The last 4 bytes are ignored because (1) they do not form a complete number of registers, and do not fit | ||
* in the 16 bytes. The unprocessed 4 bytes are processed in the next iteration. | ||
*/ | ||
uint8x16_t r = vld1q_u8(dense_ptr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May crash on ARMv7-A if the address isn't 16-byte aligned.
TODO
Add ARM NEON optimization for HyperLogLog
Implement two NEON optmized functions for converting between raw and
dense representations in HyperLogLog:
These functions process 16 registers in each iteration.
Utilize existing SIMD test in hyperloglog.tcl (previously added for
AVX2 optimization) to validate NEON implementation
Test:
valkey-benchmark -n 1000000 --dbnum 9 -p 21111 PFMERGE z hll1{t} hll2{t}
Hardware:
Command stats:
Before:
After:
Improved by ~14.7x.
Functional testing command:
The SIMD test randomizes input and comapres scalar vs simd results.