Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

idna bypass #3695

Merged
merged 3 commits into from
Nov 25, 2016
Merged

idna bypass #3695

merged 3 commits into from
Nov 25, 2016

Conversation

nateprewitt
Copy link
Member

This addresses #3687 allowing URLs that are already idna encoded to pass through while everything else is still properly encoded.

@@ -825,3 +825,19 @@ def rewind_body(prepared_request):
"body for redirect.")
else:
raise UnrewindableBodyError("Unable to rewind request body for redirect.")

def is_ascii(data):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also rework this as:

    if isinstance(data, str):
        coding_func = getattr(data, 'encode')
    elif isinstance(data, bytes):
        coding_func = getattr(data, 'decode')
    else:
        return False

    # Try to encode/decode data into ascii
    try:
        coding_func('ascii')
        return True
    except (UnicodeError):
        return False

Either way it's kind of an ugly function but pulls that out of the method and hopefully can provide value elsewhere later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd like to condense the code in this form.

Copy link
Member

@Lukasa Lukasa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a really good start! I have a few miscellaneous notes, but nothing much.

host = idna.encode(host, uts46=True).decode('utf-8')
except (UnicodeError, idna.IDNAError):
raise InvalidURL('URL has an invalid label.')
if not host.startswith('xn--') or not is_ascii(host):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a comment to justify why we're leaving some URLs alone: specifically, because we think they've already been IDNA-encoded and don't want to stomp on that logic.

We think this if they begin with the IDNA tag (xn--) and if all the bytes within them are ASCII.

@@ -825,3 +825,19 @@ def rewind_body(prepared_request):
"body for redirect.")
else:
raise UnrewindableBodyError("Unable to rewind request body for redirect.")

def is_ascii(data):
"""Determine is data can be encoded/decoded as ascii"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: if.

@@ -825,3 +825,19 @@ def rewind_body(prepared_request):
"body for redirect.")
else:
raise UnrewindableBodyError("Unable to rewind request body for redirect.")

def is_ascii(data):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd like to condense the code in this form.

u'http://xn--n3h.net/'.encode('utf-8'),
u'http://xn--n3h.net/'
),
('http://xn--n3h.net/', 'http://xn--n3h.net/')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also confirm this works for things explicitly marked as byte strings.

@nateprewitt nateprewitt force-pushed the idna_bypass branch 2 times, most recently from e998ed1 to 126d3d2 Compare November 16, 2016 14:05
@nateprewitt
Copy link
Member Author

Alright, all systems are nominal. 🚀

Copy link
Member

@Lukasa Lukasa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like a change to a comment, please. 😄

except (UnicodeError, idna.IDNAError):
raise InvalidURL('URL has an invalid label.')
# If the host doesn't start with 'xn--' or contains non-ascii
# characters, it may not be idna encoded, so we'll do it here.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment needs to be reframed: almost all URIs aren't IDNA encoded, so we're just aiming to catch the edge case.

@nateprewitt
Copy link
Member Author

Comment updated, hopefully that's a bit closer.

Copy link
Member

@Lukasa Lukasa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much closer!

# their work. We determine something *looks* IDNA encoded by checking
# for the IDNA-encoding tag (xn--) at the start of the hostname, and
# confirming that the hostname is entirely ASCII characters.
if not host.startswith('xn--') or not is_ascii(host):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try to avoid problems by setting this to u'xn--'. On top of this, given that we know that the string will always be unicode here, we can safely rewrite is_ascii to unicode_is_ascii which will just work for unicode strings for now.

Save ourselves the unneeded complexity.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Lukasa, sorry I deleted this the first time, thought I'd made a mistake but I've arrived back at the same conclusion. I'm onboard with u'xn--'.

While unicode_is_ascii is definitely doable/simpler, I do have a couple outstanding questions:

  1. The new function will only be the single try/except block, so does it really warrant being a util function at that point?
  2. While we are now explicitly stating that this is only for unicode, if you supply the function a bytestring without the current checks, it will evaluate to True in Python 2 and False in Python 3. If this ever gets reused in a place were people are pretty sure it will always be unicode but may not, that's a tricky place for errors to crop up.

The repro is simply b'test'.encode('ascii'). Python 2 will let you do this all day, Python 3 raises an AttributeError. We could alternatively wrap the try/except in a conditional block to check if it's a string, but that has the inverse problem of string in the format 'my string' being False in Python 2 and True in Python 3.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A single try...except block is still a totally valuable thing to have in one function, particularly when we're trying to use it in a conditional. So yeah, I have no problem with that.

Ultimately, I'm not concerned about the failure modes. If you're really worried, add assert isinstance(argument, unicode) where unicode is imported from compat.py. That should avoid the problem with unicode by adding a check that can be optimised out.

Copy link
Member Author

@nateprewitt nateprewitt Nov 16, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So unicode won't work here because it's not defined for Python 3 in compat. We bind str to unicode for Python 2. However, strings that are declared like 'my string', as I noted above, are not unicode and will evaluate to False in isinstance(string, str) for Python 2.

I can implement the standalone try/except block, I just wanted to make sure the subtle failure points were understood before making the changes. Bytestrings (including Python 2's default str) will inconsistently evaluate between versions, the conditional only toggles which value is returned respectively.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just another passing thought, perhaps this should be moved to _internal_utils due to the caveats? It just seems like this has enough nuance to not be a utility people are actively encouraged to use externally.

@nateprewitt
Copy link
Member Author

Alright, ASCII check updated 😃

Copy link
Member

@Lukasa Lukasa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, point noted: let's make most of the changes you suggested.

:param str u_string: unicode string to check. Must be unicode
and not Python 2 `str`.
:rtype: bool
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, let's move this to _internal_utils. Also, I apologise about the confusion of unicode: let's import str from compat.py and use it here for an assertion.

@nateprewitt
Copy link
Member Author

OK, we're all moved over to _internal_utils.

@nateprewitt
Copy link
Member Author

nateprewitt commented Nov 17, 2016

An alternative solution to this is to say that if the idna module fails to encode we'll just go ahead and try to encode as ASCII. If that works then we shrug our shoulders and say everything is probably ok, and if it fails we catch that and throw InvalidURL. @sigmavirus24, how does that sound?

(comment)

Do we want to hold here until a decision is made on this so we're not having to back things out?

@Lukasa
Copy link
Member

Lukasa commented Nov 17, 2016

It's certainly worth us holding off until we can resolve that ambiguity.

@sigmavirus24
Copy link
Contributor

That method sounds good to me. I'm not as familiar with the idna standard though, so while it sounds perfectly practical, I don't know if it fits in with the rest of the work around the standard.

@Lukasa
Copy link
Member

Lukasa commented Nov 19, 2016

Ok then, let's go ahead with that and see how it goes.

@@ -2127,6 +2133,7 @@ def test_preparing_url(self, url, expected):
b"http://*",
u"http://*.google.com",
u"http://*",
u"http://☃.net/"
)
)
def test_preparing_bad_url(self, url):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrmm, so I'm assuming the urls in this test_preparing_bad_url are to show we prevent wildcards? If we fallback to simply checking for ASCII on failed IDNA encoding, these all pass now. Is there a discrete subset of ASCII URLs that we still want/need to catch?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is about preventing wildcards.

To find this kind of thing we really may need to add a custom validator that pulls the hostname apart and checks for wildcards. It's probably sufficient to confirm the URI doesn't start with u'*.'.

@nateprewitt
Copy link
Member Author

nateprewitt commented Nov 19, 2016

I guess we should probably clarify what we're attempting to do here. The above comment was in reference to the the inability to use IPv6 with Requests 2.12+. Personally, I think that definitely needs to be addressed, the other complaints I have no opinion on.

The fallback approach though essentially gives carte blanche to anything that can be ASCII encoded, which seems to defeat some benefits of doing IDNA encoding. They've special cased . delimiters in the IDNA library, I'm assuming for things like subdomains, which allows IPv4 addresses to pass through (unintentionally?). However, they don't appear to have taken into account IPv6 style addresses with : delimiters or wrapping brackets.

From a cursory reading of RFC5891, it seems like IDNA is only intended for domain names. That would suggest to me we shouldn't be passing IP addresses into this function to begin with.

If we're ONLY attempting to handle IPv6, we can check to skip host values that are entirely numeric or delimiters. Otherwise, if we're entertaining allowing things like underscores then maybe the passthrough is the only viable option to avoid constantly programming around use cases.

Edit: Sorry @Lukasa I just realized I essentially reiterated your entire thought process from the other thread. Vendoring in a dependency is likely not what we want to do. any([c.isalpha() for c in host]) is probably kludgy enough to keep out of the main path for Requests. So I guess that leaves the bypass.

@Lukasa
Copy link
Member

Lukasa commented Nov 20, 2016

Yup, that's why I landed on the bypass: it just seems like the simplest approach that is also in line with what most browsers do (i.e. try to do the right thing but allow the wrong thing if it's unambiguous). In this case I'd like to filter out certain obviously bad URLs that IDNA was saving us from (those with leading wildcards, for example), but otherwise if they're the kind of garden-variety bad that issue #3683 is discussing (or if they aren't hostnames at all!) then we can probably just pass them through and let the rest of the codebase handle them.

@nateprewitt
Copy link
Member Author

Ok, I removed the initial check for xn-- because we end up performing fairly redundant logic both before the try...except block and again in the except block itself.

This should work for our given test cases, as well as the other concerns about unix socket files and hostnames containing underscores, provided they're both only ASCII. Let me know if I'm leaving anything else out.

Copy link
Member

@Lukasa Lukasa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok cool, this looks really good! I have a few really minor notes, but generally speaking this patch seems to be in really good shape.

(u'test', True),
(u'ジェーピーニック', False),
)
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to see a test here that uses something that's present in Latin1, as well, just to assuage concerns.

(
'http://[1200:0000:ab00:1234:0000:2552:7777:1313]:12345/',
'http://[1200:0000:ab00:1234:0000:2552:7777:1313]:12345/'
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we test this in unicode as well please?

try:
u_string.encode('ascii')
return True
except UnicodeError:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should only be possible to see a UnicodeEncodeError here now, as we're sure that the instance is str. Let's tighten the exception handling so we don't accidentally mask weird bugs.

@nateprewitt
Copy link
Member Author

Alright, tests extended and exception tightened.

Copy link
Member

@Lukasa Lukasa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few more minor test changes.

@@ -2113,6 +2114,18 @@ class TestPreparingURLs(object):
u'http://Königsgäßchen.de/straße'.encode('utf-8'),
u'http://xn--knigsgchen-b4a3dun.de/stra%C3%9Fe'
),
(
u'http://xn--n3h.net/'.encode('utf-8'),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can just be a literal byte string (as in b''), no need for a .encode.

),
(
b'http://[1200:0000:ab00:1234:0000:2552:7777:1313]:12345/',
'http://[1200:0000:ab00:1234:0000:2552:7777:1313]:12345/'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should presumably be a unicode string.

@nateprewitt
Copy link
Member Author

Yep, those were oversights. Should be updated now.

Copy link
Member

@Lukasa Lukasa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think this is ready to go. @sigmavirus24, do you want me to wait for you to have time to take a look at this, or would you like me just to go ahead and merge?

@Lukasa
Copy link
Member

Lukasa commented Nov 25, 2016

Alright, I'm going to merge this now. @sigmavirus24 feel free to give feedback regardless, we can always revisit this. =) Thanks @nateprewitt!

@Lukasa Lukasa merged commit 5c45494 into psf:master Nov 25, 2016
@nateprewitt nateprewitt deleted the idna_bypass branch November 25, 2016 14:39
@olgert
Copy link

olgert commented Nov 25, 2016

Guys, do you have any idea of when these changes could be released?

@Lukasa
Copy link
Member

Lukasa commented Nov 25, 2016

I am tentatively considering next week for a possible release. No promises though.

@olgert
Copy link

olgert commented Nov 25, 2016

Thanks for prompt response!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants