-
-
Notifications
You must be signed in to change notification settings - Fork 9.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
idna bypass #3695
idna bypass #3695
Conversation
@@ -825,3 +825,19 @@ def rewind_body(prepared_request): | |||
"body for redirect.") | |||
else: | |||
raise UnrewindableBodyError("Unable to rewind request body for redirect.") | |||
|
|||
def is_ascii(data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also rework this as:
if isinstance(data, str):
coding_func = getattr(data, 'encode')
elif isinstance(data, bytes):
coding_func = getattr(data, 'decode')
else:
return False
# Try to encode/decode data into ascii
try:
coding_func('ascii')
return True
except (UnicodeError):
return False
Either way it's kind of an ugly function but pulls that out of the method and hopefully can provide value elsewhere later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'd like to condense the code in this form.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a really good start! I have a few miscellaneous notes, but nothing much.
host = idna.encode(host, uts46=True).decode('utf-8') | ||
except (UnicodeError, idna.IDNAError): | ||
raise InvalidURL('URL has an invalid label.') | ||
if not host.startswith('xn--') or not is_ascii(host): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need a comment to justify why we're leaving some URLs alone: specifically, because we think they've already been IDNA-encoded and don't want to stomp on that logic.
We think this if they begin with the IDNA tag (xn--
) and if all the bytes within them are ASCII.
@@ -825,3 +825,19 @@ def rewind_body(prepared_request): | |||
"body for redirect.") | |||
else: | |||
raise UnrewindableBodyError("Unable to rewind request body for redirect.") | |||
|
|||
def is_ascii(data): | |||
"""Determine is data can be encoded/decoded as ascii""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: if
.
@@ -825,3 +825,19 @@ def rewind_body(prepared_request): | |||
"body for redirect.") | |||
else: | |||
raise UnrewindableBodyError("Unable to rewind request body for redirect.") | |||
|
|||
def is_ascii(data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'd like to condense the code in this form.
u'http://xn--n3h.net/'.encode('utf-8'), | ||
u'http://xn--n3h.net/' | ||
), | ||
('http://xn--n3h.net/', 'http://xn--n3h.net/') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's also confirm this works for things explicitly marked as byte strings.
e998ed1
to
126d3d2
Compare
Alright, all systems are nominal. 🚀 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like a change to a comment, please. 😄
except (UnicodeError, idna.IDNAError): | ||
raise InvalidURL('URL has an invalid label.') | ||
# If the host doesn't start with 'xn--' or contains non-ascii | ||
# characters, it may not be idna encoded, so we'll do it here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this comment needs to be reframed: almost all URIs aren't IDNA encoded, so we're just aiming to catch the edge case.
Comment updated, hopefully that's a bit closer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much closer!
# their work. We determine something *looks* IDNA encoded by checking | ||
# for the IDNA-encoding tag (xn--) at the start of the hostname, and | ||
# confirming that the hostname is entirely ASCII characters. | ||
if not host.startswith('xn--') or not is_ascii(host): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's try to avoid problems by setting this to u'xn--'
. On top of this, given that we know that the string will always be unicode here, we can safely rewrite is_ascii
to unicode_is_ascii
which will just work for unicode strings for now.
Save ourselves the unneeded complexity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Lukasa, sorry I deleted this the first time, thought I'd made a mistake but I've arrived back at the same conclusion. I'm onboard with u'xn--'
.
While unicode_is_ascii
is definitely doable/simpler, I do have a couple outstanding questions:
- The new function will only be the single try/except block, so does it really warrant being a util function at that point?
- While we are now explicitly stating that this is only for unicode, if you supply the function a bytestring without the current checks, it will evaluate to
True
in Python 2 andFalse
in Python 3. If this ever gets reused in a place were people are pretty sure it will always be unicode but may not, that's a tricky place for errors to crop up.
The repro is simply b'test'.encode('ascii')
. Python 2 will let you do this all day, Python 3 raises an AttributeError
. We could alternatively wrap the try/except in a conditional block to check if it's a string, but that has the inverse problem of string in the format 'my string'
being False
in Python 2 and True
in Python 3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A single try...except block is still a totally valuable thing to have in one function, particularly when we're trying to use it in a conditional. So yeah, I have no problem with that.
Ultimately, I'm not concerned about the failure modes. If you're really worried, add assert isinstance(argument, unicode)
where unicode
is imported from compat.py
. That should avoid the problem with unicode by adding a check that can be optimised out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So unicode
won't work here because it's not defined for Python 3 in compat. We bind str
to unicode
for Python 2. However, strings that are declared like 'my string'
, as I noted above, are not unicode
and will evaluate to False
in isinstance(string, str)
for Python 2.
I can implement the standalone try/except block, I just wanted to make sure the subtle failure points were understood before making the changes. Bytestrings (including Python 2's default str) will inconsistently evaluate between versions, the conditional only toggles which value is returned respectively.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just another passing thought, perhaps this should be moved to _internal_utils due to the caveats? It just seems like this has enough nuance to not be a utility people are actively encouraged to use externally.
Alright, ASCII check updated 😃 |
08893ed
to
1fd85b8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, point noted: let's make most of the changes you suggested.
:param str u_string: unicode string to check. Must be unicode | ||
and not Python 2 `str`. | ||
:rtype: bool | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, let's move this to _internal_utils
. Also, I apologise about the confusion of unicode
: let's import str
from compat.py
and use it here for an assertion.
OK, we're all moved over to _internal_utils. |
(comment) Do we want to hold here until a decision is made on this so we're not having to back things out? |
It's certainly worth us holding off until we can resolve that ambiguity. |
That method sounds good to me. I'm not as familiar with the idna standard though, so while it sounds perfectly practical, I don't know if it fits in with the rest of the work around the standard. |
Ok then, let's go ahead with that and see how it goes. |
@@ -2127,6 +2133,7 @@ def test_preparing_url(self, url, expected): | |||
b"http://*", | |||
u"http://*.google.com", | |||
u"http://*", | |||
u"http://☃.net/" | |||
) | |||
) | |||
def test_preparing_bad_url(self, url): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrmm, so I'm assuming the urls in this test_preparing_bad_url
are to show we prevent wildcards? If we fallback to simply checking for ASCII on failed IDNA encoding, these all pass now. Is there a discrete subset of ASCII URLs that we still want/need to catch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is about preventing wildcards.
To find this kind of thing we really may need to add a custom validator that pulls the hostname apart and checks for wildcards. It's probably sufficient to confirm the URI doesn't start with u'*.'
.
I guess we should probably clarify what we're attempting to do here. The above comment was in reference to the the inability to use IPv6 with Requests 2.12+. Personally, I think that definitely needs to be addressed, the other complaints I have no opinion on. The fallback approach though essentially gives carte blanche to anything that can be ASCII encoded, which seems to defeat some benefits of doing IDNA encoding. They've special cased From a cursory reading of RFC5891, it seems like IDNA is only intended for domain names. That would suggest to me we shouldn't be passing IP addresses into this function to begin with. If we're ONLY attempting to handle IPv6, we can check to skip Edit: Sorry @Lukasa I just realized I essentially reiterated your entire thought process from the other thread. Vendoring in a dependency is likely not what we want to do. |
Yup, that's why I landed on the bypass: it just seems like the simplest approach that is also in line with what most browsers do (i.e. try to do the right thing but allow the wrong thing if it's unambiguous). In this case I'd like to filter out certain obviously bad URLs that IDNA was saving us from (those with leading wildcards, for example), but otherwise if they're the kind of garden-variety bad that issue #3683 is discussing (or if they aren't hostnames at all!) then we can probably just pass them through and let the rest of the codebase handle them. |
8460d7f
to
35d2add
Compare
Ok, I removed the initial check for This should work for our given test cases, as well as the other concerns about unix socket files and hostnames containing underscores, provided they're both only ASCII. Let me know if I'm leaving anything else out. |
35d2add
to
624f14b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok cool, this looks really good! I have a few really minor notes, but generally speaking this patch seems to be in really good shape.
(u'test', True), | ||
(u'ジェーピーニック', False), | ||
) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to see a test here that uses something that's present in Latin1, as well, just to assuage concerns.
( | ||
'http://[1200:0000:ab00:1234:0000:2552:7777:1313]:12345/', | ||
'http://[1200:0000:ab00:1234:0000:2552:7777:1313]:12345/' | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we test this in unicode as well please?
try: | ||
u_string.encode('ascii') | ||
return True | ||
except UnicodeError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should only be possible to see a UnicodeEncodeError
here now, as we're sure that the instance is str
. Let's tighten the exception handling so we don't accidentally mask weird bugs.
624f14b
to
eafb47e
Compare
Alright, tests extended and exception tightened. |
eafb47e
to
2c9a3e6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few more minor test changes.
@@ -2113,6 +2114,18 @@ class TestPreparingURLs(object): | |||
u'http://Königsgäßchen.de/straße'.encode('utf-8'), | |||
u'http://xn--knigsgchen-b4a3dun.de/stra%C3%9Fe' | |||
), | |||
( | |||
u'http://xn--n3h.net/'.encode('utf-8'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can just be a literal byte string (as in b''
), no need for a .encode
.
), | ||
( | ||
b'http://[1200:0000:ab00:1234:0000:2552:7777:1313]:12345/', | ||
'http://[1200:0000:ab00:1234:0000:2552:7777:1313]:12345/' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should presumably be a unicode string.
2c9a3e6
to
a83685c
Compare
Yep, those were oversights. Should be updated now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I think this is ready to go. @sigmavirus24, do you want me to wait for you to have time to take a look at this, or would you like me just to go ahead and merge?
Alright, I'm going to merge this now. @sigmavirus24 feel free to give feedback regardless, we can always revisit this. =) Thanks @nateprewitt! |
Guys, do you have any idea of when these changes could be released? |
I am tentatively considering next week for a possible release. No promises though. |
Thanks for prompt response! |
This addresses #3687 allowing URLs that are already idna encoded to pass through while everything else is still properly encoded.