
10x performance improvements #178

Open: HoneyryderChuck wants to merge 35 commits into main from perf-improvs
Conversation

HoneyryderChuck (Collaborator)

This is a long batch of changes which I've been slowly patching and testing locally, aimed at improving the performance of the "happy path": a burst of multiple simultaneous streams on a single connection.

The first commit includes the baseline benchmark (using the singed gem to generate flamegraphs for analysis).

The full changeset will be hard to review. The suggestion is to review each commit separately, as they're all self-contained, and the commit message contains context about the change.

These are the measurements, using the benchmark library, for 5000 streams over a single connection:

BENCH=benchmark bundle exec ruby benchmarks/client_server.rb
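# (columns: user CPU, system CPU, total, and real elapsed time in parentheses)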
# before
6.682099  15.267209  21.949308 ( 22.052998)
# after
0.984433   0.113971   1.098404 (  1.100675)

HoneyryderChuck force-pushed the perf-improvs branch 3 times, most recently from ee412a4 to 91da0b1 on March 15, 2025 at 00:24
igrigorik (Owner)

👏🏻

mullermp (Collaborator)

I can do a review this week, but it would have been helpful to slice this up into chunks. Are all of the changes related to performance? I see rbs and other refactorings.

HoneyryderChuck (Collaborator, Author)

@mullermp I understand. I think it's more digestible if you go commit by commit, as each change is independent and the context is in the commit message. I don't think it'd help much to make each commit its own MR at this point, but I concede this may not fit everyone's preferred review workflow.

> Are all of the changes related to performance?

I went back to the commit list to keep me honest: I counted only one commit which isn't specifically about performance. The RBS changes are about method signature changes, plus inconsistencies found as a result of the changes.

mullermp (Collaborator) left a comment

Nice! I was able to sanity test this with the AWS SDK. You should also run a sanity test.


log { "build client..." }
CLIENT.on(:frame) do |bytes|
log { "(client) sending bytes: #{bytes.size}" }
mullermp (Collaborator)

You can use standard arrow semantics (-> or <-) for client/server responses if you want.
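For example (a sketch; SERVER is assumed by analogy with the CLIENT object in the diff above):

CLIENT.on(:frame) do |bytes|
  log { "-> (client) sending bytes: #{bytes.size}" }
end

SERVER.on(:frame) do |bytes|
  log { "<- (server) sending bytes: #{bytes.size}" }
end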

@@ -1,6 +1,11 @@
# frozen_string_literal: true

require "http/2/version"

module HTTP2
EMPTY = [].freeze
mullermp (Collaborator)

What is this for and why can't you check empty? in the right places?

HoneyryderChuck (Collaborator, Author)

FWIW this empty array was already there; I'm just reusing it in one more place. It's an optimization to avoid allocating throwaway empty arrays in places where nothing else is expected.
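A minimal sketch of the pattern (illustrative names, not the gem's exact code):

EMPTY = [].freeze

def continuations(frame)
  # the common, empty path returns the shared frozen array: zero allocations
  frame[:continuations] || EMPTY
end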

@@ -212,14 +212,17 @@ def receive(data)
end

while (frame = @framer.parse(@recv_buffer))
# @type var stream_id: Integer
stream_id, frame_type = frame.fetch_values(:stream, :type) {} # rubocop:disable Lint/EmptyBlock
mullermp (Collaborator)

I'd consider two fetches with nil default for readability. The empty block default is weird here.
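i.e. something like this (a sketch using the names from the diff):

stream_id  = frame.fetch(:stream, nil)
frame_type = frame.fetch(:type, nil)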

HoneyryderChuck (Collaborator, Author)

This change was motivated by the assumption that fetching multiple values at once should be faster than fetching them one at a time in Ruby, due to fewer Ruby-to-C boundary crossings.

This is however not the case, as this benchmark shows:

require "benchmark/ips"

FRAME = { type: :data, stream: 1, data: "bla" }.freeze

Benchmark.ips do |x|
  # Configure the number of seconds used during
  # the warmup phase (default 2) and calculation phase (default 5)
  x.config(warmup: 2, time: 5)

  x.report("[]") do
    FRAME[:type]
    FRAME[:stream]
    FRAME[:data]
  end

  x.report("fetch_values") do
    FRAME.fetch_values(:type, :stream, :data)
  end

  # Compare the iterations per second of the various reports!
  x.compare!
end

# ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +PRISM [x86_64-darwin23]
# Warming up --------------------------------------
#                  []     1.109M i/100ms
#         fetch_values   528.814k i/100ms
# Calculating -------------------------------------
#                   []     11.265M (± 2.8%) i/s   (88.77 ns/i) -     56.538M in   5.023007s
#         fetch_values      5.466M (± 8.6%) i/s  (182.96 ns/i) -     27.498M in   5.086584s
# 
# Comparison:
#                   []: 11264935.5 i/s
#         fetch_values:  5465822.2 i/s - 2.06x  slower

I'll revert the commit where this was introduced, and try to get some feedback on why my expectation is wrong.

@@ -425,35 +432,33 @@ def <<(data)
# @note all frames are currently delivered in FIFO order.
# @param frame [Hash]
def send(frame)
frame_type = frame[:type]
mullermp (Collaborator)

Assigning this is probably negligible in performance, right? I see in other methods you are not doing this.

HoneyryderChuck (Collaborator, Author)

Local variable access is more performant than a hash lookup. The rule of thumb I applied was: if the lookup is performed more than once, cache it in a local variable; otherwise leave it as is.

require "benchmark/ips"

FRAME = { type: :data, stream: 1, data: "bla" }.freeze

Benchmark.ips do |x|
  # Configure the number of seconds used during
  # the warmup phase (default 2) and calculation phase (default 5)
  x.config(warmup: 2, time: 5)

  x.report("local") do
    type = FRAME[:type]
    type == :data
    type == :headers
  end

  x.report("[]") do
    FRAME[:type] == :data
    FRAME[:type] == :headers
  end

  # Compare the iterations per second of the various reports!
  x.compare!
end

# ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +PRISM [x86_64-darwin23]
# Warming up --------------------------------------
#               local     1.343M i/100ms
#                   []     1.022M i/100ms
# Calculating -------------------------------------
#                local     13.278M (± 3.9%) i/s   (75.31 ns/i) -     67.146M in   5.065645s
#                   []     10.315M (± 2.6%) i/s   (96.95 ns/i) -     52.108M in   5.055306s
# 
# Comparison:
#                local: 13278073.7 i/s
#                   []: 10314810.6 i/s - 1.29x  slower

# hasn't expired yet (it's ordered).
if closed_since && (now - closed_since) > 15
# discards all streams which have closed for a while.
# TODO: use a drop_while! variant whenever there is one.
mullermp (Collaborator)

None of the enumerable methods have bang methods so this will probably never happen. This is a hash right? Why not just use reject!?

HoneyryderChuck (Collaborator, Author)

Enumerables don't, but Array and Hash do. I proposed this method, and even implemented it, but matz has rejected the proposal so far (while officially wanting feedback).

> This is a hash right? Why not just use reject!?

It used to be that way. It was changed here with a significant performance gain (hashes tend to grow with the number of concurrent streams, so you want to avoid constantly hitting O(n)), but at the cost of constantly generating a new hash (GC pressure). This patch fixes the "hot" case where streams are created close to each other, so it's unlikely there'll be anything expired to purge. To be fair, most of the performance gains from this MR come from this change.
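A sketch of the guarded purge (illustrative names; assumes the hash maps streams to their close timestamps, insertion-ordered and oldest first):

EXPIRY = 15 # seconds, as in the diff above

def purge_closed_streams(now)
  _, closed_since = @streams_recently_closed.first
  # the collection is ordered, so if the oldest entry hasn't expired, nothing has
  return unless closed_since && (now - closed_since) > EXPIRY

  # only pay for rebuilding the hash when there's actually something to drop
  @streams_recently_closed =
    @streams_recently_closed.drop_while { |_, since| (now - since) > EXPIRY }.to_h
end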

if buffer.frozen?
header = String.new("", encoding: Encoding::BINARY, capacity: buffer.bytesize + 9) # header length

pack([
mullermp (Collaborator)

A lot of this code looks identical to the code below.

HoneyryderChuck (Collaborator, Author)

you're right. rewrote the logic, lmk what you think.
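For reference, one hypothetical shape of such a dedup (pack_frame_header is an invented helper; this is a sketch, not necessarily the actual rewrite):

def write_frame(frame, buffer)
  if buffer.frozen?
    # can't prepend in place: allocate one right-sized binary string up front
    bytes = String.new("", encoding: Encoding::BINARY, capacity: buffer.bytesize + 9)
    pack_frame_header(frame, bytes) # appends the 9-byte header
    bytes << buffer
  else
    header = String.new("", encoding: Encoding::BINARY, capacity: 9)
    pack_frame_header(frame, header)
    buffer.prepend(header)
  end
end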

@@ -9,9 +9,12 @@ module HTTP2
# - http://tools.ietf.org/html/draft-ietf-httpbis-header-compression-10
module Header
# Huffman encoder/decoder
-    class Huffman
+    module Huffman
+      module_function
mullermp (Collaborator)

Instead of module_function, I think you should just define your methods with self. and make it static.

HoneyryderChuck (Collaborator, Author)

Not sure I understand the benefit. The module_function call is there exactly to avoid needlessly writing self.; the methods are static from then on.
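For reference, the two styles side by side (sketch with placeholder bodies):

module Huffman
  module_function

  def encode(str) # callable as Huffman.encode; no self. prefix needed
    str
  end
end

# versus the suggested style (same caller-facing behavior):

module HuffmanAlt
  def self.encode(str)
    str
  end
end

# one difference: module_function also keeps private instance-method copies,
# which matters if the module is ever mixed in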

@@ -8,7 +8,7 @@

module HTTP2
mullermp (Collaborator)

Not necessarily related to this PR, but I recall that the "generate huffman table" task appeared to be broken, or generated data that differs from what is shipped. We really ought to figure that out. I tried last year and was stumped.


private

def listeners(event)
mullermp (Collaborator)

What's wrong with defining listeners in the emitter? I think it makes a lot more sense.

HoneyryderChuck (Collaborator, Author)

This one is a bit of a tradeoff, but the main reason is to benefit from the object shapes optimization (hence the initialization of @listeners in the initialize method). This is unfortunately a case where a recent VM optimization gets in the way of a known community practice (mixins), and where, given the low code churn (this code hasn't changed in years), I preferred to privilege performance over consistency.
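A sketch of the shape-friendly version (illustrative names; on Ruby 3.2+, objects that define the same ivars in the same order share an object shape, which keeps ivar reads on the fast path):

class Connection
  def initialize
    # eager ivar definition: every instance goes through the same shape transition
    @listeners = Hash.new { |hash, event| hash[event] = [] }
  end

  private

  def listeners(event)
    @listeners[event]
  end
end

# versus a mixin that lazily does @listeners ||= ... on first emit,
# which defines the ivar at an unpredictable point in each object's lifetime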

…llocations

frame generation will reduce the number of allocations in most cases to 1 or 2 strings per frame (on modern rubies); it'll also try (via the capacity arg) to allocate the correct string size on initialization, to reduce the number of internal string reallocs; initial frame hash mutation is also reduced to a minimum
just deal with enums when necessary, and do not create intermediate arrays for splatting
… earlier in the stack)

the check for frame type is moved elsewhere, as it's pointless outside of the handshake phase
also improved sigs, and corrected a wrong name on a stream class method
… immediately after generation

this reduces the number of allocated arrays per emitted frame, and particularly in the case of continuation frames
they were measuring the same thing. by doing so, it became the obvious value to use for the GOAWAY last-stream-id field
the previous routine was taking a frame, splitting it into the partial remaining frame, and putting it back on top of the stack; this bypasses the array shift/unshift operation in that case, by building the partial frame to send out-of-buffer
…ollection didn't expire yet

the collection is ordered, which is also a reason why #drop_while is used; unfortunately a #drop_while! won't happen anytime soon, so foregoing the constant hash generation has a measurable impact
no way around it unfortunately, and this way an extra intermediate array is spared
method may be called multiple times, and instead of traversing the array every single time, better to update it when necessary and use the current value
this reduces the number of intermediate small strings when compressing integers, by utilizing the main headers buffer
no state management, only gc pressure. making it functional solves that
which were being joined into one at the end anyway; buffer to existing string instead.
this adds buffering support in lower-level APIs and huffman encoding module
…ow allows, or the frame is a final 0-length DATA frame
…hen-add

this avoids internal hash resizing / realloc calls (see the sketch after this commit list)
this avoids needless intermediate array creation for the indexed header case; it also avoids the needless duping of cmd, as it isn't used internally as such (only its values, which are assigned to local variables)
local variables are faster to access, more readable, less error prone, easier to type by LSPs
…tializer

this is an object shape optimization; it does defeat the purpose of mixins somewhat, which is part of the tradeoff
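A sketch of the hash resizing point from the commit list above (illustrative values):

type, flags, stream_id, length = :data, [:end_stream], 1, 1024

# building the hash with all pairs at once sizes its internal table once
frame = { type: type, flags: flags, stream: stream_id, length: length }

# whereas allocate-then-add may force the table to grow along the way
frame = {}
frame[:type]   = type
frame[:flags]  = flags
frame[:stream] = stream_id
frame[:length] = length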
HoneyryderChuck (Collaborator, Author)

@mullermp I did run a few sanity tests 👍
