Hyperloglog ARM NEON SIMD optimization #1859

xbasel · 2025-03-18T11:50:31Z

Add ARM NEON optimization for HyperLogLog

Implement two NEON-optimized functions for converting between the raw and
dense representations in HyperLogLog:
1. hllMergeDenseNEON
2. hllDenseCompressNEON

Both functions process 16 registers per iteration.

Reuse the existing SIMD test in hyperloglog.tcl (previously added for the
AVX2 optimization) to validate the NEON implementation.
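
To make "16 registers per iteration" concrete, here is a minimal sketch of one merge step. It is not the code from this PR: the helper names (`unpack16`, `max16`, `merge_dense_sketch`) are hypothetical, and it assumes the dense layout packs 6-bit registers least-significant-bit first, so 16 registers occupy 12 bytes. Only the element-wise maximum uses NEON intrinsics here, and only when `__ARM_NEON` is defined; the unpacking stays scalar for readability.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Unpack 16 six-bit registers (12 packed bytes, LSB-first) into 16 bytes. */
static void unpack16(const uint8_t *dense, uint8_t out[16]) {
    for (int i = 0; i < 16; i++) {
        unsigned bit = (unsigned)i * 6, byte = bit / 8, shift = bit % 8;
        unsigned v = dense[byte] >> shift;
        /* A register straddles two bytes when fewer than 6 bits remain. */
        if (shift > 2) v |= (unsigned)dense[byte + 1] << (8 - shift);
        out[i] = (uint8_t)(v & 63u);
    }
}

/* raw[i] = max(raw[i], vals[i]) for 16 one-byte registers. */
static void max16(uint8_t *raw, const uint8_t *vals) {
#if defined(__ARM_NEON)
    vst1q_u8(raw, vmaxq_u8(vld1q_u8(raw), vld1q_u8(vals)));
#else
    for (int i = 0; i < 16; i++)
        if (vals[i] > raw[i]) raw[i] = vals[i];
#endif
}

/* Merge nregs packed registers into raw, 16 registers per iteration. */
static void merge_dense_sketch(uint8_t *raw, const uint8_t *dense, size_t nregs) {
    assert(nregs % 16 == 0);
    for (size_t i = 0; i < nregs; i += 16) {
        uint8_t tmp[16];
        unpack16(dense + (i / 16) * 12, tmp);
        max16(raw + i, tmp);
    }
}
```

The scalar fallback keeps the sketch buildable on non-ARM hosts, which also gives a reference to compare a vectorized path against.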

Test:
valkey-benchmark -n 1000000 --dbnum 9 -p 21111 PFMERGE z hll1{t} hll2{t}

+-------------------+-----------+-----------+---------------+
|      Metric       |  Before   |   After   | Improvement % |
+-------------------+-----------+-----------+---------------+
| Throughput (k rps)|    7.42   |   76.98   |    937.47%    |
+-------------------+-----------+-----------+---------------+
| Latency (msec)    |           |           |               |
|   avg             |   6.686   |   0.595   |     91.10%    |
|   min             |   0.520   |   0.152   |     70.77%    |
|   p50             |   7.799   |   0.599   |     92.32%    |
|   p95             |   8.039   |   0.767   |     90.46%    |
|   p99             |   8.111   |   0.807   |     90.05%    |
|   max             |   9.263   |   1.463   |     84.21%    |
+-------------------+-----------+-----------+---------------+

Hardware:

CPU: Graviton 3
Architecture:           aarch64
  CPU op-mode(s):       32-bit, 64-bit
  Byte Order:           Little Endian
CPU(s):                 64
  On-line CPU(s) list:  0-63
NUMA:
  NUMA node(s):         1
  NUMA node0 CPU(s):    0-63
Memory: 256 GB

Command stats:
Before:

cmdstat_pfmerge:calls=1000002,usec=126327984,**usec_per_call=126.33**,rejected_calls=0,failed_calls=0

After:

cmdstat_pfmerge:calls=1000002,usec=8588205,**usec_per_call=8.59**,rejected_calls=0,failed_calls=0

usec_per_call improved by ~14.7x (126.33 / 8.59).
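
The figures above can be reproduced with a few lines of arithmetic. The helpers below are a hypothetical sketch, not part of the PR: `speedup` is the before/after ratio for a "lower is better" metric, while the two percentage helpers match how the "Improvement %" column appears to be computed for latency and throughput respectively.

```c
#include <assert.h>

/* How many times faster: before / after for a "lower is better" metric. */
static double speedup(double before, double after) {
    assert(after > 0.0);
    return before / after;
}

/* Percent reduction of a "lower is better" metric such as latency. */
static double pct_reduction(double before, double after) {
    assert(before > 0.0);
    return (before - after) / before * 100.0;
}

/* Percent increase of a "higher is better" metric such as throughput. */
static double pct_increase(double before, double after) {
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```

For example, `speedup(126327984.0, 8588205.0)` gives the command-level ratio from the cmdstat lines, and `pct_increase(7.42, 76.98)` gives the throughput row of the benchmark table.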

Functional testing command:

./runtest --single unit/hyperloglog --only "PFMERGE results with simd"  --loops 10000  --fastfail

The SIMD test randomizes its input and compares the scalar and SIMD results.
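
The real test lives in hyperloglog.tcl and compares the scalar and SIMD code paths through the server. As a rough illustration of the same randomized idea in C, the sketch below checks a different property: that packing 16 registers into the 6-bit dense layout and reading them back is lossless. All names (`pack16`, `get_reg`, `random_roundtrip`) are hypothetical, and the LSB-first 6-bit layout is an assumption of the sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Pack 16 one-byte registers into 12 bytes, 6 bits each, LSB-first. */
static void pack16(const uint8_t regs[16], uint8_t dense[12]) {
    memset(dense, 0, 12);
    for (int i = 0; i < 16; i++) {
        unsigned bit = (unsigned)i * 6, byte = bit / 8, shift = bit % 8;
        unsigned v = regs[i] & 63u;
        dense[byte] |= (uint8_t)(v << shift);
        /* Spill the high bits into the next byte when the register straddles. */
        if (shift > 2) dense[byte + 1] |= (uint8_t)(v >> (8 - shift));
    }
}

/* Read register idx (0..15) back out of the packed block. */
static uint8_t get_reg(const uint8_t dense[12], int idx) {
    unsigned bit = (unsigned)idx * 6, byte = bit / 8, shift = bit % 8;
    unsigned v = dense[byte] >> shift;
    if (shift > 2) v |= (unsigned)dense[byte + 1] << (8 - shift);
    return (uint8_t)(v & 63u);
}

/* Returns the number of mismatches over `rounds` random blocks (0 = pass). */
static int random_roundtrip(unsigned seed, int rounds) {
    int bad = 0;
    assert(rounds > 0);
    srand(seed);
    for (int r = 0; r < rounds; r++) {
        uint8_t regs[16], dense[12];
        for (int i = 0; i < 16; i++) regs[i] = (uint8_t)(rand() & 63);
        pack16(regs, dense);
        for (int i = 0; i < 16; i++) bad += get_reg(dense, i) != regs[i];
    }
    return bad;
}
```

A fixed seed keeps failures reproducible; the same harness shape could compare a scalar and a vectorized implementation instead of a round trip.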

codecov · 2025-03-18T12:06:25Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.03%. Comparing base (aa88453) to head (4c45315).

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #1859      +/-   ##
============================================
- Coverage     71.09%   71.03%   -0.07%     
============================================
  Files           123      123              
  Lines         65671    65671              
============================================
- Hits          46692    46649      -43     
- Misses        18979    19022      +43

Files with missing lines	Coverage Δ
src/hyperloglog.c	`92.23% <100.00%> (ø)`

... and 13 files with indirect coverage changes



zuiderkwast

10 times faster is a pretty good improvement. :)

I didn't read the NEON code carefully because I'm not familiar with it. Is the logic basically the same as the one for AVX2?

zuiderkwast · 2025-03-21T15:02:14Z

src/hyperloglog.c

 #ifdef HAVE_AVX2
            simd_enabled = 1;
+#endif
+#ifdef __ARM_NEON
+            simd_enabled = 1;
 #endif


#if defined(HAVE_AVX2) || defined(__ARM_NEON)

Or we could introduce a macro such as HLL_HAVE_SIMD to avoid this repetition and to avoid defining the variable twice.
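
A minimal sketch of what such a macro could look like. Only the name HLL_HAVE_SIMD comes from the comment above; the 0/1 definition and the `set_simd` helper are assumptions made for illustration, not the code that was merged.

```c
#include <assert.h>

/* One feature macro instead of repeating the #ifdef per instruction set. */
#if defined(HAVE_AVX2) || defined(__ARM_NEON)
#define HLL_HAVE_SIMD 1
#else
#define HLL_HAVE_SIMD 0
#endif

/* Defined once, regardless of which SIMD flavor was compiled in. */
static int simd_enabled = HLL_HAVE_SIMD;

/* Hypothetical toggle: SIMD can only be switched on if it was compiled in. */
static int set_simd(int on) {
    simd_enabled = (on != 0) && HLL_HAVE_SIMD;
    return simd_enabled;
}
```

With this shape, a debug switch that enables SIMD needs a single code path, and builds without any SIMD support simply keep the flag at 0.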

xbasel · 2025-03-21T15:14:12Z

> 10 times faster is a pretty good improvement. :)
>
> I didn't read the NEON code carefully because I'm not familiar with it. Is the logic basically the same as the one for AVX2?

It is similar. NEON vectors are 128 bits wide; AVX2 vectors are 256 bits. The padding and lookup are a bit different in AVX2.
The PFMERGE execution time is ~14.7x faster; end to end, it is ~10x faster.

xbasel · 2025-03-22T15:34:21Z

src/hyperloglog.c

+         * The last 4 bytes are ignored because (1) they do not form a complete number of registers, and do not fit
+         * in the 16 bytes. The unprocessed 4 bytes are processed in the next iteration.
+         */
+        uint8x16_t r = vld1q_u8(dense_ptr);


May crash on ARMv7-A if the address isn't 16-byte aligned.

TODO
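
Whether this load can actually fault depends on the target and is left open in this thread. One conservative option, sketched below, is to treat the vector load as safe only on AArch64 or when the pointer happens to be 16-byte aligned, and to fall back to the scalar path otherwise. The helper names are hypothetical, and the claim that unaligned `vld1q_u8` is acceptable on AArch64 is an assumption of this sketch that should be checked against the platforms being supported.

```c
#include <assert.h>
#include <stdint.h>

/* True when p sits on a 16-byte boundary. */
static int is_aligned16(const void *p) {
    return ((uintptr_t)p & 15u) == 0;
}

/* Decide per call whether a 16-byte vector load is considered safe here. */
static int vector_load_ok(const void *p) {
#if defined(__aarch64__)
    (void)p;
    return 1; /* assumption: unaligned vld1q_u8 is acceptable on AArch64 */
#else
    return is_aligned16(p); /* otherwise require alignment, else go scalar */
#endif
}
```

A caller would check `vector_load_ok(dense_ptr)` before entering the NEON loop; a compile-time alternative is to build the NEON path only when both `__ARM_NEON` and `__aarch64__` are defined.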

xbasel marked this pull request as draft March 18, 2025 11:50

xbasel mentioned this pull request Mar 18, 2025

[NEW] Implement ARM NEON and SVE2 optimizations for Hyperloglog #1860

Open

xbasel force-pushed the hll_neon branch 5 times, most recently from d5cc649 to b2c857e Compare March 18, 2025 16:41

xbasel self-assigned this Mar 18, 2025

xbasel force-pushed the hll_neon branch from b2c857e to 4c45315 Compare March 18, 2025 16:56

xbasel marked this pull request as ready for review March 18, 2025 16:56
