-
-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Faster headers implementation #544
Conversation
👍 |
I for one think this is unacceptable. We're all aware of the specification, but preserving the data as it is received is important and something I'm certain I've observed people relying upon. That said, if there's a way to preserve it, and improve peformance, I'm 100% for it. These improvements are good. What's most surprising to me is how drastic it is for an empty initialization (although I'm uncertain exactly how common that really is). |
Further, in its current state, this PR isn't compatible with Python 3 (according to Travis CI) and doesn't achieve 100% test coverage. Those are 2 things that are going to block this more significantly in the near term than anything else. |
Just extended the tests for full coverage. Have some issues with the full test run locally, so I haven't uploaded it right now. In case the suggested version would otherwise be fine, I'll fix it for Python3, too. |
@ml31415 I'd rather you fix it and push for Python 3 so I can pull it and verify your benchmarks. |
Fixing the case preservation? I suppose that would kill most of the speed gains. |
@ml31415 Fixing the code on Python 3: https://travis-ci.org/shazow/urllib3/jobs/49604628 |
The last failing test, test_httplib_headers_case_insensitive, seems to be the exactly the point in question, whether or not case preservation is required or not. |
I modified the changes, so that the original case information is preserved. After some tuning, the speed improvements are similar as the previous version.
|
# Only one item so far, need to convert the tuple to list | ||
_dict_setitem(self, key_lower, [vals[0], vals[1], val]) | ||
|
||
def update_add(*args, **kwds): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, update_add
is a weird name. Maybe load
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's go with extend
. :)
Looks good overall. Otherwise if everyone is onboard, I'm happy to merge. :) |
I named it like that on purpose, in order not to mess with the previous update implementation. Previous update replaces items when hitting double occurances, update_add extends entries when required. It's meant for replacing for-loops containing headerobj.add(key, value), like seen in the request module, or as a general import function of arbitrary other header objects. I also added the from_httplib constructor for that reason. I also don't care about renaming it, though. In case you're talking about the .add implementation itself, well, it's just what benchmarks say it's the fastest. I played around with it for quite a while to achieve similar timings as before, when not preserving case information. This is the only solution which came close. The idea is, assume the general case i.e. no such item present, and optimize that as good as possible. I also had tried out another implementation using lists all over instead of tuples for single headers, but thatturned out to be too slow, about a factor of 1.5 slower than now. |
Yup, I got all that. :) I just mean that the name "update_add" is confusing. I understand why it exists, and I agree it's necessary. Another possible name would be "extend"? (Its behaviour is similar to list.extend in a way...) |
I suppose I'm quite uncreative with namings :) Better suggestions cordially welcome! |
@Lukasa @sigmavirus24 Thoughts? |
|
||
# For some reason, the test above doesn't run into the __eq__ function, | ||
# lacking test coverage for line 165, so comparing directly here | ||
self.assertFalse(self.d == 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean
self.assertNotEqual(len(self.d), 2)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I referred to line 197 in the comment. Without line 201, the testrun reported missing coverage for _collections
, line 163 now. I had expected, this not equal comparison the line above would also call __eq__
and reach that line, but for some reason it didn't.
I think extend works for me. |
What remained a bit undefined is, when to return case-preserved headers and when not. E.g. |
I had been trying to speed up urllib3 by monkeypatching with geventhttpclient. Unfortunately there is not too much gain, as lots of copying of response and header objects is done, which could largely be skipped, when the header and response objects would be sufficiently compatible and directly recognized and accepted as compatible, instead of being converted by lots of copying around.
So for the sake of better compatibility and speedups, I would have liked to directly create a header class - the standard urllib3.HTTPHeaderDict - by geventhttpclients c-parser when monkeypatching, which won't need any further processing from urllib3s response object. When I compared both header container implementations, turned out that the current urllib3 version isn't that fast as it could be. Below some timings comparing the old vs the new version, HTTPHeaderDict_ being the old version:
For my machine, I get the following timings:
The new version passes all current tests, though it comes with one downside, which is worth discussing: It's not storing the original case of the header items. While the current version stores that, and therefore can restore the case if desired e.g. when iterating over the keys, this implementation is consequently using lower case, except when pretty printing the object. The standards require case insensitivity, so not preserving the case seems to me like an acceptable, maybe even desireable behaviour. Furthermore, also the current implementation has to drop the case information, when joining same header items together.
In terms of compatibility, I hope this should work with all python versions, though I mainly tested it on 2.7.