10x performance improvements #178
base: main
Conversation
- uses singed for profiling, memory_profiler to track object allocations, among other tools
- …ions and static variables
- fallback to << for older rubies
- useful in a few places to reduce the number of needless array allocations
(force-pushed from ee412a4 to 91da0b1)
👏🏻
I can do a review this week, but it would have been helpful to slice this up into chunks. Are all of the changes related to performance? I see RBS and other refactorings.
@mullermp I understand. I think it's more digestible if you go commit by commit, as each change is independent and the context is in the commit message. I don't think it'd help much to make each commit its own MR at this point, but I concede that this may not fit everyone's preferred review workflow.
I went back to the commit list to keep me honest. I counted only one commit which isn't specifically about performance. The RBS changes cover method signature changes, plus inconsistencies found as a result of the changes.
Nice! I was able to sanity test this with the AWS SDK. You should also run a sanity test.
```ruby
log { "build client..." }
CLIENT.on(:frame) do |bytes|
  log { "(client) sending bytes: #{bytes.size}" }
```
You can use standard arrow semantics (`->` or `<-`) for client/server responses if you want.
```
@@ -1,6 +1,11 @@
# frozen_string_literal: true

require "http/2/version"

module HTTP2
  EMPTY = [].freeze
```
What is this for, and why can't you check `empty?` in the right places?
FWIW this empty array was already there; I'm just reusing it in an additional place. It's an optimization to avoid creating useless empty arrays where nothing else is expected.
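The idea can be sketched as follows (illustrative code, not the library's; `headers_for` is a hypothetical accessor): returning one shared frozen array avoids allocating a fresh `[]` on every call for the common "nothing to return" case.

```ruby
# Shared, frozen sentinel: safe to hand out because it can't be mutated.
EMPTY = [].freeze

# Hypothetical accessor: returns the frame's headers, or the shared
# empty array when none are present (instead of allocating a new []).
def headers_for(frame)
  frame[:headers] || EMPTY
end

a = headers_for({})
b = headers_for({})
a.equal?(b) # both calls return the very same object; no per-call allocation
```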
```
@@ -212,14 +212,17 @@ def receive(data)
end

while (frame = @framer.parse(@recv_buffer))
  # @type var stream_id: Integer
  stream_id, frame_type = frame.fetch_values(:stream, :type) {} # rubocop:disable Lint/EmptyBlock
```
I'd consider two fetches with a nil default, for readability. The empty-block default is weird here.
This change was motivated by the assumption that fetching multiple values at once should be faster than fetching one at a time in Ruby, due to fewer Ruby/C-land context switches.
This is however not the case, as this benchmark shows:
```ruby
require "benchmark/ips"

FRAME = { type: :data, stream: 1, data: "bla" }.freeze

Benchmark.ips do |x|
  # Configure the number of seconds used during
  # the warmup phase (default 2) and calculation phase (default 5)
  x.config(warmup: 2, time: 5)

  x.report("[]") do
    FRAME[:type]
    FRAME[:stream]
    FRAME[:data]
  end

  x.report("fetch_values") do
    FRAME.fetch_values(:type, :stream, :data)
  end

  # Compare the iterations per second of the various reports!
  x.compare!
end

# ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +PRISM [x86_64-darwin23]
# Warming up --------------------------------------
#                 []      1.109M i/100ms
#       fetch_values    528.814k i/100ms
# Calculating -------------------------------------
#                 []     11.265M (± 2.8%) i/s  (88.77 ns/i) - 56.538M in 5.023007s
#       fetch_values      5.466M (± 8.6%) i/s (182.96 ns/i) - 27.498M in 5.086584s
#
# Comparison:
#                 []: 11264935.5 i/s
#       fetch_values:  5465822.2 i/s - 2.06x slower
```
I'll revert the commit where this was introduced, and try to get some feedback on why my expectation is wrong.
```
@@ -425,35 +432,33 @@ def <<(data)
# @note all frames are currently delivered in FIFO order.
# @param frame [Hash]
def send(frame)
  frame_type = frame[:type]
```
Assigning this is probably negligible in performance, right? I see that in other methods you are not doing this.
Local variable access is more performant than a hash lookup. The rule of thumb I applied: if the lookup is performed more than once, cache it in a variable; otherwise, leave it as is.
```ruby
require "benchmark/ips"

FRAME = { type: :data, stream: 1, data: "bla" }.freeze

Benchmark.ips do |x|
  # Configure the number of seconds used during
  # the warmup phase (default 2) and calculation phase (default 5)
  x.config(warmup: 2, time: 5)

  x.report("local") do
    type = FRAME[:type]
    type == :data
    type == :headers
  end

  x.report("[]") do
    FRAME[:type] == :data
    FRAME[:type] == :headers
  end

  # Compare the iterations per second of the various reports!
  x.compare!
end

# ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +PRISM [x86_64-darwin23]
# Warming up --------------------------------------
#              local      1.343M i/100ms
#                 []      1.022M i/100ms
# Calculating -------------------------------------
#              local     13.278M (± 3.9%) i/s  (75.31 ns/i) - 67.146M in 5.065645s
#                 []     10.315M (± 2.6%) i/s  (96.95 ns/i) - 52.108M in 5.055306s
#
# Comparison:
#              local: 13278073.7 i/s
#                 []: 10314810.6 i/s - 1.29x slower
```
```ruby
# hasn't expired yet (it's ordered).
if closed_since && (now - closed_since) > 15
  # discards all streams which have closed for a while.
  # TODO: use a drop_while! variant whenever there is one.
```
None of the Enumerable methods have bang variants, so this will probably never happen. This is a hash, right? Why not just use `reject!`?
Enumerables don't, but Array and Hash do. I proposed this method, and even implemented it, but matz has rejected the proposal so far (although officially wanting feedback).

> This is a hash right? Why not just use reject!?

It used to be that way. It was changed here with a significant performance gain (hashes tend to grow with the number of concurrent streams, so you want to avoid constantly hitting O(n)), but at the cost of constantly generating a hash (GC pressure). This patch fixes the "hot" case where streams are created close to each other, so it's unlikely there'll be anything expired to purge. To be fair, most of the performance gains from this MR come from this change.
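A hedged sketch of the idea (names and numbers illustrative, not the library's actual code): Ruby hashes preserve insertion order, so with streams stored in close order, expired entries form a prefix; in the hot case the first entry hasn't expired and we bail out without touching the hash at all, instead of scanning every entry with `reject!`.

```ruby
now = 100
streams = { 1 => 80, 3 => 83, 5 => 99 } # stream id => closed_since timestamp

# Hot-path check: only the first (oldest) entry needs inspecting.
first_closed = streams.first&.last
if first_closed && (now - first_closed) > 15
  # Expired entries are a prefix of the ordered hash, so drop_while
  # stops at the first live stream; rebuilding only happens when
  # something actually expired.
  streams = streams.drop_while { |_, closed| (now - closed) > 15 }.to_h
end

streams # streams 1 and 3 expired (>15s ago); stream 5 survives
```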
lib/http/2/framer.rb (outdated)
```ruby
if buffer.frozen?
  header = String.new("", encoding: Encoding::BINARY, capacity: buffer.bytesize + 9) # header length

  pack([
```
A lot of this code looks identical from below.
You're right. I rewrote the logic, let me know what you think.
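The preallocation trick in the snippet above can be illustrated standalone (the payload and field values here are made up, not the library's): creating the string with an explicit `capacity` of payload size plus the 9-byte HTTP/2 frame header avoids internal reallocations as bytes are appended.

```ruby
payload = "hello".b

# Preallocate one binary buffer big enough for header + payload.
header = String.new("", encoding: Encoding::BINARY, capacity: payload.bytesize + 9)

# 9-byte frame header: 24-bit length, 1-byte type, 1-byte flags,
# 4-byte stream id (values here are illustrative).
header << [payload.bytesize >> 16, payload.bytesize & 0xFFFF, 0x0, 0x0, 0x1].pack("CnCCN")
header << payload

header.bytesize # 9-byte header + 5-byte payload
```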
```diff
@@ -9,9 +9,12 @@ module HTTP2
 # - http://tools.ietf.org/html/draft-ietf-httpbis-header-compression-10
 module Header
   # Huffman encoder/decoder
-  class Huffman
+  module Huffman
+    module_function
```
Instead of `module_function`, I think you should just define your methods with `self.` and make them static.
Not sure I understand the benefit. The `module_function` call is exactly to avoid needlessly prefixing `self.`; the methods are static from then on.
```
@@ -8,7 +8,7 @@

module HTTP2
```
Not necessarily related to this PR, but I recall the "generate huffman table" task and what not appeared to be broken or to generate data that differs from what is shipped. We really ought to figure that out. I tried last year and was stumped.
```ruby
private

def listeners(event)
```
What's wrong with defining listeners in the emitter? I think it makes a lot more sense.
This one is a bit conflicting, but the main reason is to benefit from the object shapes optimization (hence the initialization of `@listeners` in the `initialize` method). This is unfortunately a case where a recent VM optimization gets in the way of a known community practice (mixins), and where, due to the low code churn (this code hasn't changed in years), I preferred to privilege performance over consistency.
- …llocations: frame generation will reduce the number of allocations in most cases to 1 or 2 strings per frame (on modern rubies); it'll also try (via the capacity arg) to allocate the correct string size on initialization to reduce the number of internal string reallocs; initial frame hash mutation is also reduced to a minimum
- just deal with enums when necessary, and do not create intermediate arrays for splatting
- …earlier in the stack: the check for frame type is moved elsewhere, as it's pointless outside of the handshake phase
- improved sigs as well, and corrected the wrong name on a stream class method
- …immediately after generation: this reduces the number of allocated arrays per emitted frame, particularly in the case of continuation frames
- they were measuring the same thing; by doing so, it became the obvious value to use for the GOAWAY last stream id flag
- the previous routine was taking a frame, splitting it into the partial remaining frame, and putting it on top of the stack; this bypasses the array shift/unshift operation in that case, by building the partial frame to send out-of-buffer
- …ollection didn't expire yet: the collection is ordered, which is also a reason why #drop_while is used; unfortunately a #drop_while! won't happen anytime soon, so foregoing the constant hash generation has a measurable impact
- …e? (usual mem vs speed tradeoff)
- …ield from #encode
- no way around it unfortunately, and this way an extra intermediate array is spared
- the method may be called multiple times; instead of traversing the array every single time, better to update it when necessary and use the current value
- this reduces the number of intermediate small strings when compressing integers, by utilizing the main headers buffer
- no state management, only GC pressure; making it functional solves that
- which were being joined into one at the end anyway; buffer to existing string instead
- this adds buffering support in lower-level APIs and the huffman encoding module
- …ow allows, or the frame is a final 0-length DATA frame
- …hen-add: this avoids internal hash recapacitation / realloc calls
- this avoids a needless intermediate array creation for the indexed header case; it also avoids the needless duping of cmd, as it isn't used internally as such (only its values, which are assigned to local variables)
- …ype of header command
- …ate machine lookup
- …the string in 1 go at the end
- local variables are faster to access, more readable, less error-prone, and easier to type by LSPs
- …tializer: this is an object shape optimization; it does defeat the purpose of mixins somewhat, which is part of the tradeoff
(force-pushed from 91da0b1 to 7352e02)
@mullermp I did run a few sanity tests 👍
This is a long batch of changes which I've been slowly patching and testing locally towards improving the performance of the "happy path" of a burst of multiple simultaneous streams on a single connection.
The first commit includes the baseline benchmark (using the singed gem to generate flamegraphs for analysis).
The full changeset will be hard to review. The suggestion is to review each commit separately, as they're all self-contained, and the commit message contains context about the change.
These are the measurements using the `benchmark` library for 5000 streams over a connection:
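A minimal sketch of how such a measurement could be taken with the stdlib `benchmark` library; `stream_count` and the work inside the block are placeholders, not the PR's actual harness.

```ruby
require "benchmark"

stream_count = 5000

# Wall-clock time for driving all streams over one connection.
elapsed = Benchmark.realtime do
  stream_count.times do
    # placeholder: open a stream, send headers/data, await the response
  end
end

puts format("%d streams in %.3fs", stream_count, elapsed)
```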