idna bypass #3695

nateprewitt · 2016-11-16T13:24:32Z

This addresses #3687 by allowing URLs that are already IDNA-encoded to pass through, while everything else is still properly encoded.

nateprewitt · 2016-11-16T13:27:58Z

requests/utils.py

@@ -825,3 +825,19 @@ def rewind_body(prepared_request):
                                        "body for redirect.")
    else:
        raise UnrewindableBodyError("Unable to rewind request body for redirect.")
+
+def is_ascii(data):


We could also rework this as:

    if isinstance(data, str):
        coding_func = getattr(data, 'encode')
    elif isinstance(data, bytes):
        coding_func = getattr(data, 'decode')
    else:
        return False

    # Try to encode/decode data into ascii
    try:
        coding_func('ascii')
        return True
    except (UnicodeError):
        return False

Either way it's kind of an ugly function but pulls that out of the method and hopefully can provide value elsewhere later.

I think I'd like to condense the code in this form.

Lukasa

This is a really good start! I have a few miscellaneous notes, but nothing much.

Lukasa · 2016-11-16T13:35:11Z

requests/models.py

-            host = idna.encode(host, uts46=True).decode('utf-8')
-        except (UnicodeError, idna.IDNAError):
-            raise InvalidURL('URL has an invalid label.')
+        if not host.startswith('xn--') or not is_ascii(host):


We need a comment to justify why we're leaving some URLs alone: specifically, because we think they've already been IDNA-encoded and don't want to stomp on that logic.

We think this if they begin with the IDNA tag (xn--) and if all the bytes within them are ASCII.
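The heuristic described here can be sketched roughly as follows. This is a minimal illustration that assumes the host is already a unicode string; `looks_idna_encoded` is a hypothetical name, not a function in Requests.

```python
def looks_idna_encoded(host):
    """Return True if host starts with the IDNA tag and is pure ASCII."""
    try:
        host.encode('ascii')
    except UnicodeEncodeError:
        # Non-ASCII characters mean the host cannot already be IDNA-encoded
        return False
    return host.startswith('xn--')


print(looks_idna_encoded('xn--n3h.net'))  # looks encoded, so leave it alone
print(looks_idna_encoded('☃.net'))        # still needs encoding
```

The condition in the diff above is the negation of this check: the host is encoded only when it does not look IDNA-encoded already.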

Lukasa · 2016-11-16T13:35:54Z

requests/utils.py

@@ -825,3 +825,19 @@ def rewind_body(prepared_request):
                                        "body for redirect.")
    else:
        raise UnrewindableBodyError("Unable to rewind request body for redirect.")
+
+def is_ascii(data):
+    """Determine is data can be encoded/decoded as ascii"""


Lukasa · 2016-11-16T13:36:42Z

requests/utils.py

@@ -825,3 +825,19 @@ def rewind_body(prepared_request):
                                        "body for redirect.")
    else:
        raise UnrewindableBodyError("Unable to rewind request body for redirect.")
+
+def is_ascii(data):


I think I'd like to condense the code in this form.

Lukasa · 2016-11-16T13:37:10Z

tests/test_requests.py

+                u'http://xn--n3h.net/'.encode('utf-8'),
+                u'http://xn--n3h.net/'
+            ),
+            ('http://xn--n3h.net/', 'http://xn--n3h.net/')


Let's also confirm this works for things explicitly marked as byte strings.

nateprewitt · 2016-11-16T14:53:24Z

Alright, all systems are nominal. 🚀

Lukasa

I'd like a change to a comment, please. 😄

Lukasa · 2016-11-16T14:58:50Z

requests/models.py

-        except (UnicodeError, idna.IDNAError):
-            raise InvalidURL('URL has an invalid label.')
+        # If the host doesn't start with 'xn--' or contains non-ascii
+        # characters, it may not be idna encoded, so we'll do it here.


I think this comment needs to be reframed: almost all URIs aren't IDNA encoded, so we're just aiming to catch the edge case.

nateprewitt · 2016-11-16T15:28:47Z

Comment updated, hopefully that's a bit closer.

Lukasa

Much closer!

Lukasa · 2016-11-16T15:32:35Z

requests/models.py

+        # their work. We determine something *looks* IDNA encoded by checking
+        # for the IDNA-encoding tag (xn--) at the start of the hostname, and
+        # confirming that the hostname is entirely ASCII characters.
+        if not host.startswith('xn--') or not is_ascii(host):


Let's try to avoid problems by setting this to u'xn--'. On top of this, given that we know that the string will always be unicode here, we can safely rewrite is_ascii to unicode_is_ascii which will just work for unicode strings for now.

Save ourselves the unneeded complexity.

@Lukasa, sorry I deleted this the first time, thought I'd made a mistake but I've arrived back at the same conclusion. I'm onboard with u'xn--'.

While unicode_is_ascii is definitely doable/simpler, I do have a couple outstanding questions:

The new function will only be the single try/except block, so does it really warrant being a util function at that point?

While we are now explicitly stating that this is only for unicode, if you supply the function a bytestring without the current checks, it will evaluate to True in Python 2 and False in Python 3. If this ever gets reused in a place where people are pretty sure it will always be unicode but may not, that's a tricky place for errors to crop up.

The repro is simply b'test'.encode('ascii'). Python 2 will let you do this all day, Python 3 raises an AttributeError. We could alternatively wrap the try/except in a conditional block to check if it's a string, but that has the inverse problem: a string written as 'my string' would be False in Python 2 and True in Python 3.

A single try...except block is still a totally valuable thing to have in one function, particularly when we're trying to use it in a conditional. So yeah, I have no problem with that.

Ultimately, I'm not concerned about the failure modes. If you're really worried, add assert isinstance(argument, unicode) where unicode is imported from compat.py. That should avoid the problem with unicode by adding a check that can be optimised out.

So unicode won't work here because it's not defined for Python 3 in compat. We bind str to unicode for Python 2. However, strings that are declared like 'my string', as I noted above, are not unicode and will evaluate to False in isinstance(string, str) for Python 2.

I can implement the standalone try/except block, I just wanted to make sure the subtle failure points were understood before making the changes. Bytestrings (including Python 2's default str) will inconsistently evaluate between versions, the conditional only toggles which value is returned respectively.

Just another passing thought, perhaps this should be moved to _internal_utils due to the caveats? It just seems like this has enough nuance to not be a utility people are actively encouraged to use externally.
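The failure mode under discussion can be reproduced on Python 3 with a short sketch. `naive_is_ascii` is a hypothetical name for the unguarded single try/except version; the Python 2 behaviour described in the thread is not exercised here.

```python
def naive_is_ascii(value):
    """Single try/except check with no type guard (illustrative only)."""
    try:
        value.encode('ascii')
        return True
    except UnicodeError:
        return False


print(naive_is_ascii('test'))      # ASCII-only unicode text
print(naive_is_ascii('ジェーピーニック'))  # non-ASCII unicode text

try:
    naive_is_ascii(b'test')
except AttributeError as exc:
    # On Python 3, bytes has no .encode(), so the call fails before
    # the UnicodeError handler is ever relevant
    print('bytes input:', exc)
```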

nateprewitt · 2016-11-16T18:15:37Z

Alright, ASCII check updated 😃

Lukasa

Ok, point noted: let's make most of the changes you suggested.

Lukasa · 2016-11-17T14:01:27Z

requests/utils.py

+    :param str u_string: unicode string to check. Must be unicode
+        and not Python 2 `str`.
+    :rtype: bool
+    """


Ok, let's move this to _internal_utils. Also, I apologise about the confusion of unicode: let's import str from compat.py and use it here for an assertion.

nateprewitt · 2016-11-17T14:59:15Z

OK, we're all moved over to _internal_utils.

nateprewitt · 2016-11-17T15:37:08Z

An alternative solution to this is to say that if the idna module fails to encode we'll just go ahead and try to encode as ASCII. If that works then we shrug our shoulders and say everything is probably ok, and if it fails we catch that and throw InvalidURL. @sigmavirus24, how does that sound?
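A rough sketch of that fallback, under stated assumptions: the IDNA encoder is passed in as a callable so the example stays self-contained, `always_failing_encoder` is a hypothetical stub that simulates the encoder rejecting a host, and `InvalidURL` here is a local stand-in for the Requests exception of the same name.

```python
class InvalidURL(ValueError):
    """Local stand-in for the Requests InvalidURL exception."""


def always_failing_encoder(host):
    """Hypothetical stub: simulates an IDNA encoder rejecting the host."""
    raise UnicodeError('label rejected')


def encode_host(host, idna_encode):
    """Try IDNA encoding first; fall back to accepting plain ASCII."""
    try:
        return idna_encode(host)
    except UnicodeError:
        try:
            # If the host is pure ASCII, pass it through unchanged
            host.encode('ascii')
        except UnicodeEncodeError:
            raise InvalidURL('URL has an invalid label.')
        return host


print(encode_host('my_host.example', always_failing_encoder))
print(encode_host('[::1]', always_failing_encoder))
```

With this shape, a host that fails IDNA encoding and also contains non-ASCII characters still ends in `InvalidURL`.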

(comment)

Do we want to hold here until a decision is made on this so we're not having to back things out?

Lukasa · 2016-11-17T15:51:20Z

It's certainly worth us holding off until we can resolve that ambiguity.

sigmavirus24 · 2016-11-18T22:48:20Z

That method sounds good to me. I'm not as familiar with the idna standard though, so while it sounds perfectly practical, I don't know if it fits in with the rest of the work around the standard.

Lukasa · 2016-11-19T12:26:15Z

Ok then, let's go ahead with that and see how it goes.

nateprewitt · 2016-11-19T15:38:16Z

tests/test_requests.py

@@ -2127,6 +2133,7 @@ def test_preparing_url(self, url, expected):
            b"http://*",
            u"http://*.google.com",
            u"http://*",
+            u"http://☃.net/"
        )
    )
    def test_preparing_bad_url(self, url):


Hrmm, so I'm assuming the urls in this test_preparing_bad_url are to show we prevent wildcards? If we fall back to simply checking for ASCII on failed IDNA encoding, these all pass now. Is there a discrete subset of ASCII URLs that we still want/need to catch?

Yeah, this is about preventing wildcards.

To find this kind of thing we really may need to add a custom validator that pulls the hostname apart and checks for wildcards. It's probably sufficient to confirm the URI doesn't start with u'*.'.
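One possible shape for such a validator, as a sketch only: `has_leading_wildcard` is a hypothetical helper, and it checks for a leading `*` rather than `*.` so that the bare `http://*` case from the test list is covered too, which is an assumption beyond what is suggested above.

```python
from urllib.parse import urlsplit


def has_leading_wildcard(url):
    """Return True if the URL's hostname begins with a wildcard."""
    # hostname is None when the URL has no network location
    host = urlsplit(url).hostname or ''
    return host.startswith('*')


for url in ('http://*.google.com', 'http://*', 'http://google.com'):
    print(url, has_leading_wildcard(url))
```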

nateprewitt · 2016-11-19T20:13:09Z

I guess we should probably clarify what we're attempting to do here. The above comment was in reference to the inability to use IPv6 with Requests 2.12+. Personally, I think that definitely needs to be addressed; I have no opinion on the other complaints.

The fallback approach though essentially gives carte blanche to anything that can be ASCII encoded, which seems to defeat some benefits of doing IDNA encoding. They've special cased . delimiters in the IDNA library, I'm assuming for things like subdomains, which allows IPv4 addresses to pass through (unintentionally?). However, they don't appear to have taken into account IPv6 style addresses with : delimiters or wrapping brackets.

From a cursory reading of RFC5891, it seems like IDNA is only intended for domain names. That would suggest to me we shouldn't be passing IP addresses into this function to begin with.

If we're ONLY attempting to handle IPv6, we can check to skip host values that are entirely numeric or delimiters. Otherwise, if we're entertaining allowing things like underscores then maybe the passthrough is the only viable option to avoid constantly programming around use cases.

Edit: Sorry @Lukasa I just realized I essentially reiterated your entire thought process from the other thread. Vendoring in a dependency is likely not what we want to do. any([c.isalpha() for c in host]) is probably kludgy enough to keep out of the main path for Requests. So I guess that leaves the bypass.
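For illustration, this is what the letter check mentioned above evaluates to on a few hosts. `contains_letters` is a hypothetical wrapper name. Note that an IPv6 address with hex digits a-f contains letters, so the check by itself does not separate such addresses from domain names.

```python
def contains_letters(host):
    """Return True if any character in host is alphabetic."""
    return any(c.isalpha() for c in host)


hosts = [
    '127.0.0.1',
    '[::1]',
    '[1200:0000:ab00:1234:0000:2552:7777:1313]',
    'example.com',
]
for host in hosts:
    print(host, contains_letters(host))
```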

Lukasa · 2016-11-20T09:52:31Z

Yup, that's why I landed on the bypass: it just seems like the simplest approach that is also in line with what most browsers do (i.e. try to do the right thing but allow the wrong thing if it's unambiguous). In this case I'd like to filter out certain obviously bad URLs that IDNA was saving us from (those with leading wildcards, for example), but otherwise if they're the kind of garden-variety bad that issue #3683 is discussing (or if they aren't hostnames at all!) then we can probably just pass them through and let the rest of the codebase handle them.

nateprewitt · 2016-11-20T22:17:52Z

Ok, I removed the initial check for xn-- because we end up performing fairly redundant logic both before the try...except block and again in the except block itself.

This should work for our given test cases, as well as the other concerns about unix socket files and hostnames containing underscores, provided they're both only ASCII. Let me know if I'm leaving anything else out.

Lukasa

Ok cool, this looks really good! I have a few really minor notes, but generally speaking this patch seems to be in really good shape.

Lukasa · 2016-11-21T09:43:46Z

tests/test_utils.py

+                (u'test', True),
+                (u'ジェーピーニック', False),
+    )
+)


I'd like to see a test here that uses something that's present in Latin1, as well, just to assuage concerns.
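The point of a Latin-1 case can be shown in a few lines: such text encodes cleanly as Latin-1 but must still be rejected as ASCII. The sample string is an arbitrary choice for illustration, not necessarily the value used in the PR's tests.

```python
latin1_text = 'æíöû'

# Every character here exists in Latin-1, so this encodes without error
latin1_bytes = latin1_text.encode('latin-1')

# The same text is not ASCII, so an ASCII check has to report False
try:
    latin1_text.encode('ascii')
    is_ascii = True
except UnicodeEncodeError:
    is_ascii = False

print(len(latin1_bytes), is_ascii)
```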

Lukasa · 2016-11-21T09:44:04Z

tests/test_requests.py

+            (
+                'http://[1200:0000:ab00:1234:0000:2552:7777:1313]:12345/',
+                'http://[1200:0000:ab00:1234:0000:2552:7777:1313]:12345/'
+            )


Can we test this in unicode as well please?

Lukasa · 2016-11-21T09:44:57Z

requests/_internal_utils.py

+    try:
+        u_string.encode('ascii')
+        return True
+    except UnicodeError:


It should only be possible to see a UnicodeEncodeError here now, as we're sure that the instance is str. Let's tighten the exception handling so we don't accidentally mask weird bugs.
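A sketch of the helper with both suggestions applied, the type assertion and the narrower exception. It assumes Python 3, where the built-in `str` is the unicode type (the PR imports `str` from compat.py for the same purpose), and it is not necessarily the code as merged.

```python
def unicode_is_ascii(u_string):
    """Determine whether a unicode string contains only ASCII characters."""
    # Guard against bytes slipping in; the assert can be optimised out
    assert isinstance(u_string, str)
    try:
        u_string.encode('ascii')
        return True
    except UnicodeEncodeError:
        # The only error expected from encoding a str as ASCII
        return False


print(unicode_is_ascii('test'))
print(unicode_is_ascii('ジェーピーニック'))
```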

nateprewitt · 2016-11-21T14:44:59Z

Alright, tests extended and exception tightened.

Lukasa

Few more minor test changes.

Lukasa · 2016-11-21T15:06:39Z

tests/test_requests.py

@@ -2113,6 +2114,18 @@ class TestPreparingURLs(object):
                u'http://Königsgäßchen.de/straße'.encode('utf-8'),
                u'http://xn--knigsgchen-b4a3dun.de/stra%C3%9Fe'
            ),
+            (
+                u'http://xn--n3h.net/'.encode('utf-8'),


This can just be a literal byte string (as in b''), no need for a .encode.
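For an ASCII-only URL the two spellings produce the same bytes, which a quick check confirms:

```python
encoded = u'http://xn--n3h.net/'.encode('utf-8')
literal = b'http://xn--n3h.net/'

# Both are bytes objects with identical contents
print(encoded == literal)
```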

Lukasa · 2016-11-21T15:07:11Z

tests/test_requests.py

+            ),
+            (
+                b'http://[1200:0000:ab00:1234:0000:2552:7777:1313]:12345/',
+                'http://[1200:0000:ab00:1234:0000:2552:7777:1313]:12345/'


This should presumably be a unicode string.

nateprewitt · 2016-11-21T15:23:41Z

Yep, those were oversights. Should be updated now.

Lukasa

Ok, I think this is ready to go. @sigmavirus24, do you want me to wait for you to have time to take a look at this, or would you like me just to go ahead and merge?

Lukasa · 2016-11-25T13:17:57Z

Alright, I'm going to merge this now. @sigmavirus24 feel free to give feedback regardless, we can always revisit this. =) Thanks @nateprewitt!

olgert · 2016-11-25T16:02:40Z

Guys, do you have any idea of when these changes could be released?

Lukasa · 2016-11-25T16:29:16Z

I am tentatively considering next week for a possible release. No promises though.

olgert · 2016-11-25T19:31:09Z

Thanks for prompt response!

nateprewitt commented Nov 16, 2016

Lukasa requested changes Nov 16, 2016

nateprewitt force-pushed the idna_bypass branch 2 times, most recently from e998ed1 to 126d3d2 Compare November 16, 2016 14:05

Lukasa requested changes Nov 16, 2016

nateprewitt force-pushed the idna_bypass branch from 08893ed to 1fd85b8 Compare November 16, 2016 18:39

sigmavirus24 mentioned this pull request Nov 17, 2016

URL has an invalid label. #3683

Closed

Lukasa requested changes Nov 17, 2016

nateprewitt commented Nov 19, 2016

nateprewitt force-pushed the idna_bypass branch from 8460d7f to 35d2add Compare November 20, 2016 21:56

nateprewitt force-pushed the idna_bypass branch from 35d2add to 624f14b Compare November 21, 2016 04:42

Lukasa requested changes Nov 21, 2016

pvizeli mentioned this pull request Nov 21, 2016

Underscore is no longer supported in URL host name! home-assistant/core#4493

Closed

nateprewitt force-pushed the idna_bypass branch from 624f14b to eafb47e Compare November 21, 2016 14:41

nateprewitt force-pushed the idna_bypass branch from eafb47e to 2c9a3e6 Compare November 21, 2016 14:51

Lukasa requested changes Nov 21, 2016

updated tests with IDNA encoded and IPv6 urls

d52e9b8

nateprewitt added 2 commits November 21, 2016 08:22

adding unicode_is_ascii utility function

264f5bd

modifying IDNA encoding check to allow fallback

a83685c

nateprewitt force-pushed the idna_bypass branch from 2c9a3e6 to a83685c Compare November 21, 2016 15:22

mmetak mentioned this pull request Nov 21, 2016

[Youtube] URL has an invalid label streamlink/streamlink#169

Closed

Lukasa approved these changes Nov 21, 2016

nateprewitt mentioned this pull request Nov 22, 2016

because of idna2008 enforcement some real urls that work in the browser are now broken #3687

Closed

kjd mentioned this pull request Nov 23, 2016

Codepoint U+005B not allowed at position1 in '[::1]' kjd/idna#29

Closed

Lukasa merged commit 5c45494 into psf:master Nov 25, 2016

nateprewitt deleted the idna_bypass branch November 25, 2016 14:39

nateprewitt mentioned this pull request Nov 30, 2016

docker-py breaks with some configurations of Requests v2.12.2 #3734

Closed

njsmith mentioned this pull request May 27, 2017

trio.ssl handling for Unicode (IDNA) domains is deeply broken python-trio/trio#11

Open

github-actions bot locked as resolved and limited conversation to collaborators Sep 8, 2021
