
10x performance improvements #178

Open: HoneyryderChuck wants to merge 35 commits into main from perf-improvs
Conversation

HoneyryderChuck (Collaborator)

This is a long batch of changes which I've been slowly patching and testing locally, aimed at improving the performance of the "happy path": a burst of multiple simultaneous streams on a single connection.

The first commit includes the baseline benchmark (using the singed gem to generate flamegraphs for analysis).

The full changeset will be hard to review. The suggestion is to review each commit separately, as they're all self-contained, and the commit message contains context about the change.

These are the measurements, using the benchmark library, for 5000 streams over a single connection:

BENCH=benchmark bundle exec ruby benchmarks/client_server.rb
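# (columns: user CPU, system CPU, total, and real elapsed time in parentheses)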
# before
6.682099  15.267209  21.949308 ( 22.052998)
# after
0.984433   0.113971   1.098404 (  1.100675)

HoneyryderChuck force-pushed the perf-improvs branch 3 times, most recently from ee412a4 to 91da0b1 on March 15, 2025 at 00:24
igrigorik (Owner)

👏🏻

mullermp (Collaborator)

I can do a review this week, but it would have been helpful to slice this up into chunks. Are all of the changes related to performance? I see rbs and other refactorings.

HoneyryderChuck (Collaborator, Author)

@mullermp I understand. I think it's more digestible if you go commit by commit, as each change is independent and the context is in the commit message. I don't think it'd help much to make each commit its own MR at this point, but I concede this may not fit everyone's preferred review workflow.

> Are all of the changes related to performance?

I went back to the commit list to keep me honest: I counted only one commit which isn't specifically about performance. The RBS changes are about method signature changes, plus inconsistencies found as a result of the changes.

mullermp (Collaborator) left a comment

Nice! I was able to sanity test this with the AWS SDK. You should also run a sanity test.


log { "build client..." }
CLIENT.on(:frame) do |bytes|
log { "(client) sending bytes: #{bytes.size}" }
mullermp (Collaborator)

You can use standard arrow semantics (-> or <-) for client/server responses if you want.
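For example (a sketch; SERVER is assumed by analogy with the CLIENT object in the diff above):

CLIENT.on(:frame) do |bytes|
  log { "-> (client) sending bytes: #{bytes.size}" }
end

SERVER.on(:frame) do |bytes|
  log { "<- (server) sending bytes: #{bytes.size}" }
end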

@@ -1,6 +1,11 @@
# frozen_string_literal: true

require "http/2/version"

module HTTP2
EMPTY = [].freeze
mullermp (Collaborator)

What is this for and why can't you check empty? in the right places?

HoneyryderChuck (Collaborator, Author)

FWIW this empty array was already there; I'm just reusing it in one more place. It's an optimization to avoid allocating throwaway empty arrays in places where nothing else is expected.
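A minimal sketch of the pattern (illustrative names, not the gem's exact code):

EMPTY = [].freeze

def continuations(frame)
  # the common, empty path returns the shared frozen array: zero allocations
  frame[:continuations] || EMPTY
end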

@@ -212,14 +212,17 @@ def receive(data)
end

while (frame = @framer.parse(@recv_buffer))
# @type var stream_id: Integer
stream_id, frame_type = frame.fetch_values(:stream, :type) {} # rubocop:disable Lint/EmptyBlock
mullermp (Collaborator)

I'd consider two fetches with nil default for readability. The empty block default is weird here.
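i.e. something like this (a sketch using the names from the diff):

stream_id  = frame.fetch(:stream, nil)
frame_type = frame.fetch(:type, nil)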

HoneyryderChuck (Collaborator, Author)

This change was motivated by the assumption that fetching multiple values at once should be faster than fetching them one at a time in Ruby, due to fewer Ruby-to-C boundary crossings.

This is however not the case, as this benchmark shows:

require "benchmark/ips"

FRAME = { type: :data, stream: 1, data: "bla" }.freeze

Benchmark.ips do |x|
  # Configure the number of seconds used during
  # the warmup phase (default 2) and calculation phase (default 5)
  x.config(warmup: 2, time: 5)

  x.report("[]") do
    FRAME[:type]
    FRAME[:stream]
    FRAME[:data]
  end

  x.report("fetch_values") do
    FRAME.fetch_values(:type, :stream, :data)
  end

  # Compare the iterations per second of the various reports!
  x.compare!
end

# ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +PRISM [x86_64-darwin23]
# Warming up --------------------------------------
#                  []     1.109M i/100ms
#         fetch_values   528.814k i/100ms
# Calculating -------------------------------------
#                   []     11.265M (± 2.8%) i/s   (88.77 ns/i) -     56.538M in   5.023007s
#         fetch_values      5.466M (± 8.6%) i/s  (182.96 ns/i) -     27.498M in   5.086584s
# 
# Comparison:
#                   []: 11264935.5 i/s
#         fetch_values:  5465822.2 i/s - 2.06x  slower

I'll revert the commit where this was introduced, and try to get some feedback on why my expectation is wrong.

@@ -425,35 +432,33 @@ def <<(data)
# @note all frames are currently delivered in FIFO order.
# @param frame [Hash]
def send(frame)
frame_type = frame[:type]
mullermp (Collaborator)

Assigning this is probably negligible in performance, right? I see in other methods you are not doing this.

HoneyryderChuck (Collaborator, Author)

Local variable access is more performant than a hash lookup. The rule of thumb I applied was: if the lookup is performed more than once, cache it in a local variable; otherwise leave it as is.

require "benchmark/ips"

FRAME = { type: :data, stream: 1, data: "bla" }.freeze

Benchmark.ips do |x|
  # Configure the number of seconds used during
  # the warmup phase (default 2) and calculation phase (default 5)
  x.config(warmup: 2, time: 5)

  x.report("local") do
    type = FRAME[:type]
    type == :data
    type == :headers
  end

  x.report("[]") do
    FRAME[:type] == :data
    FRAME[:type] == :headers
  end

  # Compare the iterations per second of the various reports!
  x.compare!
end

# ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +PRISM [x86_64-darwin23]
# Warming up --------------------------------------
#               local     1.343M i/100ms
#                   []     1.022M i/100ms
# Calculating -------------------------------------
#                local     13.278M (± 3.9%) i/s   (75.31 ns/i) -     67.146M in   5.065645s
#                   []     10.315M (± 2.6%) i/s   (96.95 ns/i) -     52.108M in   5.055306s
# 
# Comparison:
#                local: 13278073.7 i/s
#                   []: 10314810.6 i/s - 1.29x  slower

# hasn't expired yet (it's ordered).
if closed_since && (now - closed_since) > 15
# discards all streams which have closed for a while.
# TODO: use a drop_while! variant whenever there is one.
mullermp (Collaborator)

None of the enumerable methods have bang methods so this will probably never happen. This is a hash right? Why not just use reject!?

HoneyryderChuck (Collaborator, Author)

Enumerables don't, but Array and Hash do. I proposed this method, and even implemented it, but matz has rejected the proposal so far (while officially wanting feedback).

> This is a hash right? Why not just use reject!?

It used to be that way. It was changed here with a significant performance gain (hashes tend to grow with the number of concurrent streams, so you want to avoid constantly hitting O(n)), but at the cost of constantly generating a new hash (GC pressure). This patch fixes the "hot" case where streams are created close to each other, so it's unlikely there'll be anything expired to purge. To be fair, most of the performance gains from this MR come from this change.
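A sketch of the guarded purge (illustrative names; assumes the hash maps streams to their close timestamps, insertion-ordered and oldest first):

EXPIRY = 15 # seconds, as in the diff above

def purge_closed_streams(now)
  _, closed_since = @streams_recently_closed.first
  # the collection is ordered, so if the oldest entry hasn't expired, nothing has
  return unless closed_since && (now - closed_since) > EXPIRY

  # only pay for rebuilding the hash when there's actually something to drop
  @streams_recently_closed =
    @streams_recently_closed.drop_while { |_, since| (now - since) > EXPIRY }.to_h
end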

if buffer.frozen?
header = String.new("", encoding: Encoding::BINARY, capacity: buffer.bytesize + 9) # header length

pack([
mullermp (Collaborator)

A lot of this code looks identical to the code below.

HoneyryderChuck (Collaborator, Author)

you're right. rewrote the logic, lmk what you think.
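For reference, one hypothetical shape of such a dedup (pack_frame_header is an invented helper; this is a sketch, not necessarily the actual rewrite):

def write_frame(frame, buffer)
  if buffer.frozen?
    # can't prepend in place: allocate one right-sized binary string up front
    bytes = String.new("", encoding: Encoding::BINARY, capacity: buffer.bytesize + 9)
    pack_frame_header(frame, bytes) # appends the 9-byte header
    bytes << buffer
  else
    header = String.new("", encoding: Encoding::BINARY, capacity: 9)
    pack_frame_header(frame, header)
    buffer.prepend(header)
  end
end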

@@ -9,9 +9,12 @@ module HTTP2
# - http://tools.ietf.org/html/draft-ietf-httpbis-header-compression-10
module Header
# Huffman encoder/decoder
-    class Huffman
+    module Huffman
+      module_function
mullermp (Collaborator)

Instead of module_function, I think you should just define your methods with self. and make it static.

HoneyryderChuck (Collaborator, Author)

Not sure I understand the benefit. The module_function call is there exactly to avoid needlessly writing self.; the methods are static from then on.
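For reference, the two styles side by side (sketch with placeholder bodies):

module Huffman
  module_function

  def encode(str) # callable as Huffman.encode; no self. prefix needed
    str
  end
end

# versus the suggested style (same caller-facing behavior):

module HuffmanAlt
  def self.encode(str)
    str
  end
end

# one difference: module_function also keeps private instance-method copies,
# which matters if the module is ever mixed in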

@@ -8,7 +8,7 @@

module HTTP2
mullermp (Collaborator)

Not necessarily related to this PR, but I recall that the "generate huffman table" task appeared to be broken, or generated data that differs from what is shipped. We really ought to figure that out. I tried last year and was stumped.


private

def listeners(event)
mullermp (Collaborator)

What's wrong with defining listeners in the emitter? I think it makes a lot more sense.

HoneyryderChuck (Collaborator, Author)

This one is a bit of a tradeoff, but the main reason is to benefit from the object shapes optimization (hence the initialization of @listeners in the initialize method). This is unfortunately a case where a recent VM optimization gets in the way of a known community practice (mixins), and where, given the low code churn (this code hasn't changed in years), I preferred to privilege performance over consistency.
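A sketch of the shape-friendly version (illustrative names; on Ruby 3.2+, objects that define the same ivars in the same order share an object shape, which keeps ivar reads on the fast path):

class Connection
  def initialize
    # eager ivar definition: every instance goes through the same shape transition
    @listeners = Hash.new { |hash, event| hash[event] = [] }
  end

  private

  def listeners(event)
    @listeners[event]
  end
end

# versus a mixin that lazily does @listeners ||= ... on first emit,
# which defines the ivar at an unpredictable point in each object's lifetime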

…llocations

frame generation will reduce the number of allocations in most cases to 1 or 2 strings per frame (on modern rubies); it'll also try (via the capacity arg) to allocate the correct string size on initialization, to reduce the number of internal string reallocs; initial frame hash mutation is also reduced to a minimum
just deal with enums when necessary, and do not create intermediate arrays for splatting
… earlier in the stack)

the check for frame type is moved elsewhere, as it's pointless outside of the handshake phase
also improved sigs, and corrected a wrong name on a stream class method
… immediately after generation

this reduces the number of allocated arrays per emitted frame, and particularly in the case of continuation frames
they were measuring the same thing. by doing so, it became the obvious value to use for the GOAWAY last-stream-id field
the previous routine was taking a frame, splitting it into the partial remaining frame, and putting it back on top of the stack; this bypasses the array shift/unshift operation in that case, by building the partial frame to send out-of-buffer
…ollection didn't expire yet

the collection is ordered, which is also a reason why #drop_while is used; unfortunately a #drop_while! won't happen anytime soon, so foregoing the constant hash generation has a measurable impact
no way around it unfortunately, and this way an extra intermediate array is spared
method may be called multiple times, and instead of traversing the array every single time, better to update it when necessary and use the current value
this reduces the number of intermediate small strings when compressing integers, by utilizing the main headers buffer
no state management, only gc pressure. making it functional solves that
which were being joined into one at the end anyway; buffer to existing string instead.
this adds buffering support in lower-level APIs and huffman encoding module
…ow allows, or the frame is a final 0-length DATA frame
…hen-add

this avoids internal hash resizing / realloc calls (see the sketch after this commit list)
this avoids needless intermediate array creation for the indexed header case; it also avoids the needless duping of cmd, as it isn't used internally as such (only its values, which are assigned to local variables)
local variables are faster to access, more readable, less error prone, easier to type by LSPs
…tializer

this is an object shape optimization; it does defeat the purpose of mixins somewhat, which is part of the tradeoff
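A sketch of the hash resizing point from the commit list above (illustrative values):

type, flags, stream_id, length = :data, [:end_stream], 1, 1024

# building the hash with all pairs at once sizes its internal table once
frame = { type: type, flags: flags, stream: stream_id, length: length }

# whereas allocate-then-add may force the table to grow along the way
frame = {}
frame[:type]   = type
frame[:flags]  = flags
frame[:stream] = stream_id
frame[:length] = length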
HoneyryderChuck (Collaborator, Author)

@mullermp I did run a few sanity tests 👍
