Kernel launch latency has regressed significantly over time #3619
Comments
See #3503
@bertmaher @jlebar Okay, I've dug into it and I think that I can explain large parts of the regression. Consider this nearly-doubling of wall-clock time from a pair of commits:
The first one of those commits (adding 78 microseconds) seems innocuous until you notice that it causes every argument to go through [...]. The second commit (adding 97 microseconds) reruns [...]. Both of these issues -- amongst other things, such as the expensive [...]. I'll see if there's anything egregious in the other significant latency-adding commits.
This improves kernel launch latency by 2.2x (from 108us to 49us using @bertmaher's benchmarking script in issue #3619). Thanks also to @liboyue's analysis and suggestions. See the discussion in the third-party PR #3503.
As of #3648 the kernel launch latency is now 6x faster than it was two days ago. I'm marking this issue as closed because it's now faster than it's ever been before (although there may still be some minor opportunities for further improvement).
@apgoucher awesome! Thank you for the fast optimizations!
I've heard some user complaints about launch latency (when not using cudagraphs, of course), so I wrote a simple launch latency benchmark (https://gist.github.com/bertmaher/e8869ebf5297dfc77e26d51037d21f80) and backfilled data over the last several months. It appears that indeed latency has gone from 70us to 350us in that time period.
Full data for this benchmark is here: https://gist.github.com/bertmaher/7912b1735cf5b7c6427ef62cad4f515c
The exact numbers depend on the number and types of arguments passed into the kernel. For the benchmark I chose arg types from a kernel that a user suggested.
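For readers who want to reproduce this kind of measurement without opening the gist, here is a minimal sketch of a host-side launch latency harness. It is not the script linked above: the function name `measure_launch_latency_us`, its parameters, and the stand-in `fake_launch` are all hypothetical, and a real run would pass a callable that launches an actual kernel with the argument list of interest (plus a device synchronization function as `sync`).

```python
import statistics
import time


def measure_launch_latency_us(launch, n_warmup=100, n_iters=1000, sync=None):
    """Return the median host-side time per call of `launch`, in microseconds.

    `launch` is any zero-argument callable (hypothetical stand-in for a
    kernel launch). `sync`, if given, is called after warmup and after the
    timed loop so that queued work does not leak across phases.
    """
    # Warm up so one-time costs (compilation, caching) are not measured.
    for _ in range(n_warmup):
        launch()
    if sync is not None:
        sync()

    samples = []
    for _ in range(n_iters):
        t0 = time.perf_counter()
        launch()  # only the host-side cost of issuing the call is timed
        t1 = time.perf_counter()
        samples.append((t1 - t0) * 1e6)

    if sync is not None:
        sync()
    # The median is less sensitive to occasional scheduler hiccups than the mean.
    return statistics.median(samples)


def fake_launch():
    """Hypothetical placeholder; replace with a real kernel launch."""
    pass


if __name__ == "__main__":
    print(f"{measure_launch_latency_us(fake_launch):.2f} us per launch")
```

Because the number and types of arguments affect the result, a comparison across commits is only meaningful when `launch` is bound to the same kernel and the same argument list each time.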