http: disallow two-byte characters in URL path #16237

bennofs · 2017-10-16T15:08:09Z

This commit changes node's handling of two-byte characters in the path component
of an http URL. Previously, node would just strip the higher byte when
generating the request. So this code:

http.request({host: "example.com", port: "80", "/\uFF2e"})

would request http://example.com/. (. is the character for the byte 0x2e).

This is not useful and can in some cases allow filter evasion. With this
change, the code generates ERR_UNESCAPED_CHARACTERS, just like space and
control characters already did.

Checklist

make -j4 test (UNIX) passes
tests and/or benchmarks are included
commit message follows commit guidelines

Affected core subsystem(s)

http

lpinca · 2017-10-16T15:27:54Z

test/parallel/test-http-client-invalid-path.js

@@ -0,0 +1,33 @@
+// Copyright Joyent, Inc. and other Node contributors.


Nit: copyright is not needed for new tests.

joyeecheung · 2017-10-16T16:32:24Z

lib/_http_client.js

@@ -50,20 +50,20 @@ const errors = require('internal/errors');
 // checks can greatly outperform the equivalent regexp (tested in V8 5.4).
 function isInvalidPath(s) {


The comment above this function needs an update as well

apapirovski · 2017-10-16T16:48:40Z

Thanks for the PR @bennofs! I'll abstain from commenting on the correctness of this but there are some performance considerations here.

With this change, isInvalidPath is strictly slower (even for 1 or 2 chars) in v8 versions 6.1 & 6.2 than the RegExp version.
Even if it doesn't or the isInvalidPath perf can be improved, the threshold for which the RegExp is faster has moved down to about 10 characters from the previous 39.

bennofs · 2017-10-16T16:53:40Z

@apapirovski hmm ok. what did you use to test the performance of isInvalidPath? So I can play around with it and see if there's perhaps some way to optimize it (although I am not very familar with V8 performance).

It is probably impossible not to pay at least a small performance penalty, because we need to check more things than we currently do.

bennofs · 2017-10-16T16:58:24Z

For the motivation why this is important, see https://www.blackhat.com/docs/us-17/thursday/us-17-Tsai-A-New-Era-Of-SSRF-Exploiting-URL-Parser-In-Trending-Programming-Languages.pdf (NodeJS unicode failure).

tl;dr:

var base = "http://orange.tw/sandbox/";
var path = req.query.path;
if(path.indexOf("..") ==-1) {
  http.get(base + path, callback);
}

Such a filter can easily be bypassed with \uFF2e\uFF2e to get SSRF if this is not fixed.

mscdex · 2017-10-16T17:16:57Z

lib/_http_client.js

 //
 // This function is used in the case of small paths, where manual character code
 // checks can greatly outperform the equivalent regexp (tested in V8 5.4).
 function isInvalidPath(s) {
  var i = 0;
-  if (s.charCodeAt(0) <= 32) return true;
+  if (s.charCodeAt(0) <= 32 || s.charCodeAt(0) > 0xFF) return true;


Shouldn't this be > 127 instead of > 255 ? UTF-8 starts using multiple bytes after the last ASCII character (127). Or am I misunderstanding the PR description?

V8 strings are not UTF-8, but UTF-16 (UCS2). If it was UTF-8, this would not be necessary, as each byte in an UTF-8 multibyte sequence is > 127 so it would be impossible to get ascii characters that way, even if some bytes were stripped.

What happens is that the string ' \uff2e' is represented in node as (uint16_t*){0xff2e} (where {} is used here to denote an array). With the latin1 encoding (that's what gets used for headers), node internally calls WriteOneByte on that string to write it to the socket and WriteOneByte will strip the higher byte of the two-byte character, so we end up with 0x2e. It works fine for all characters that can be represented with a single byte (and in UCS2, that means all characters below 0x100) (whether any non-ascii characters make sense in this context is questionable, but it also doesn't break anything so I opted not to filter that).

With the latin1 encoding (that's what gets used for headers)

Unfortunately, it's not that simple. See #13296 (comment) for an explanation. The two word summary is "it depends" (on the encoding of the body, but that doesn't fit in two words.)

Aside: calling .charCodeAt() twice for the same index is somewhat wasteful, it would be better to cache the result in a variable. (V8's TurboFan compiler can inline the calls but I don't believe it will fold them.)

Oh, thanks for pointing me to that issue. I thought a bit about something like that while writing the patch, but I couldn't find out where the encoding is actually set.

At least as long as we aren't using chunked encoding (no idea what that is :))

Sorry for the delay, missed your replies. I don't know if your exact proposal would work but take a look at the _storeHeader() and _send() methods in lib/_http_outgoing.js, that's where the headers are put on the wire and their encoding is determined.

You may also want to look at the _storeHeader() calls in lib/_http_client.js, they compose the status line.

This raises the question if we actually want to support non-latin1 encoded request path? Maybe utf-8 makes sense in some contexts? I am not sure.

I spent some time browsing through open issues yesterday and there's already a bug logged against this behaviour. (Would need to go back to find it again.) IMO what mscdex said above makes sense, just cap it at 127 instead. The current behaviour is completely unpredictable and broken.

Here's the issue I was mentioning #13296 — there's some good discussion there too.

I would prefer what I said above, at least for 9.x (maybe different solution for LTS), but I'll defer to @bnoordhuis and others with more experience with the http module.

This raises the question if we actually want to support non-latin1 encoded request path? Maybe utf-8 makes sense in some contexts?

It's in fairly widespread use. Rejecting it outright would almost certainly result in quite a bit of fallout and lots of frustrated users. Rejecting malformed UTF-8 should be acceptable and appropriate, though.

(But as discussed, what to accept and what to reject depends on the encoding.)

apapirovski · 2017-10-16T17:33:30Z

@bennofs Totally understand re: the need for it, just don't want to comment since it's not my area of expertise.

Re: performance, I think the simplest solution here is to just completely remove isInvalidPath if this change is going to stick and then just inline the regExp.test() statement into the if condition and you could also pull out the construction of the actual RegExp outside of the ClientRequest so it doesn't run each time.

In terms of running benchmarks, here's the guide for it: https://github.com/nodejs/node/blob/master/doc/guides/writing-and-running-benchmarks.md — the relevant test here would be http/create-clientrequest.js but if you want to run it locally, I would tweak the parameters of the benchmark to something like:

len: [1, 8, 16],
n: [1e7]

(You don't need the large len tests and n = 1e6 is quite noisy.)

Let me know if any of that didn't make sense.

mcollina · 2017-10-17T14:50:55Z

Can you please check if this degrades our "simple" http benchmark?

apapirovski · 2017-10-17T15:12:39Z

@mcollina This just affects ClientRequest so I don't think it would, would it? The best benchmark for this should be create-clientrequest which does get degraded as per above.

addaleax

The PR seems fine to me, but: semver-major?

apapirovski · 2017-10-19T16:24:22Z

@addaleax (or anyone else that knows) what's the policy around semver-major things that have a security impact (which this does)?

addaleax · 2017-10-19T16:30:17Z

@apapirovski What we can do is landing a semver-major patch in an existing release line if we think it’s justified for security reasons, and provide users a way to opt out of the patch via a runtime flag.

I think this can be seen as a semver-patch, I just want to make sure we’re explicit about that :)

(Also, I’d take it as a good sign that #8923 was landed in LTS and nobody yelled at us about it.)

jasnell · 2017-10-20T16:16:21Z

It makes me quite sad to see yet another example of where our loose conformance to the spec is causing yet another issue. Running the URL through the WHATWG URL parser produces a reasonable result... e.g.

> new url.URL('http://example.org/\uFF2e')
URL {
  href: 'http://example.org/%EF%BC%AE',
  origin: 'http://example.org',
  protocol: 'http:',
  username: '',
  password: '',
  host: 'example.org',
  hostname: 'example.org',
  port: '',
  pathname: '/%EF%BC%AE',
  search: '',
  searchParams: URLSearchParams {},
  hash: '' }

I'm +1 on making this change as semver-patch in all release lines, but I would argue even stronger for deprecating the use of the legacy and buggy url.parse() based approach in favor of using the new WHATWG URL parser for everything http related in core.

This commit changes node's handling of two-byte characters in the path component of an http URL. Previously, node would just strip the higher byte when generating the request. So this code: ``` http.request({host: "example.com", port: "80", "/\uFF2e"}) ``` would request `http://example.com/.` (`.` is the character for the byte `0x2e`). This is not useful and can in some cases lead to filter evasion. With this change, the code generates `ERR_UNESCAPED_CHARACTERS`, just like space and control characters already did.

Using the "optimized" version was not significantly faster and even slower for larger n.

bennofs · 2017-11-22T17:45:37Z

Is there anything I would need to do to move this PR forward?

jasnell · 2017-11-22T17:47:40Z

CI: https://ci.nodejs.org/job/node-test-pull-request/11638/

Just needs CI and signoff. Ping @nodejs/tsc ... PTAL

addaleax · 2017-11-22T18:16:37Z

I think this is ready? semver-major?

apapirovski · 2017-11-22T18:17:19Z

This has performance implications in a hot-path code and I did request changes, albeit not with the red checkmark. I would really like to see this benchmarked and feedback addressed.

apapirovski · 2017-11-22T18:19:04Z

Also, the upper threshold is currently somewhat arbitrary given the way we handle encoding of headers.

bennofs · 2017-11-22T21:45:39Z

@apapirovski here is the benchmark run (with the pure regex based implementation):

Rscript benchmark/compare.R  < old-new.csv
                                                improvement confidence      p.value
 http/create-clientrequest.js n=10000000 len=1      -3.36 %         ** 4.738245e-03
 http/create-clientrequest.js n=10000000 len=16      4.91 %        *** 7.801409e-05
 http/create-clientrequest.js n=10000000 len=8      -2.03 %            1.028356e-01

apapirovski · 2017-11-22T21:46:54Z

IMO we should just switch to the pure RegExp implementation given those numbers (and confidence levels). It removes some pretty ugly code and really simplifies things.

bennofs · 2017-11-22T22:19:39Z

@apapirovski yes I already pushed the pure regex implementation

jasnell · 2017-11-22T22:21:01Z

@mscdex ... PTAL

apapirovski · 2017-11-22T22:21:16Z

@bennofs Sorry, that's my bad. For some reason github kept showing me the outdated review comments as unaddressed instead of hiding as it does normally. 😞

jasnell · 2017-11-22T22:23:39Z

I'd like @mscdex to take one final look at this before it lands :-)

BridgeAR · 2017-11-23T11:17:07Z

CI https://ci.nodejs.org/job/node-test-pull-request/11665/

bnoordhuis · 2017-11-23T11:48:47Z

People who signed off on this: you do all realize this PR rejects everything that isn't Latin-1? Are you prepared to deal with the fallout?

Let's at least run citgm: ~~https://ci.nodejs.org/view/Node.js-citgm/job/citgm-smoker/1103/~~ https://ci.nodejs.org/view/Node.js-citgm/job/citgm-smoker/1104/

jasnell · 2017-11-23T15:22:37Z

Tbh, yes and no. Yes because strict checking has proven time and again to be the better route from a security POV. No because of the breakage that may occur. The code change looks good, but I'm still on the fence about changing this particular bit of code.

bennofs · 2017-11-23T15:33:11Z

The reason I suggested this change is that it never worked correctly as far as I can tell. And the incorrect behaviour is so broken IMO that it's not useful (except maybe if you want to transmit raw 2-bytes over URLs? seems likely...)

addaleax · 2017-12-01T20:18:56Z

I would like to land this in the near future if there are no objections.

apapirovski · 2017-12-02T17:02:32Z

ping @bnoordhuis — do you want to make your concerns explicit via the red X?

bnoordhuis · 2017-12-03T10:56:33Z

No, I'll just be an interested onlooker in this one.

TimothyGu · 2017-12-07T22:11:34Z

lib/_http_client.js

-    if (s.charCodeAt(i) <= 32) return true;
-  return false;
-}
+const INVALID_PATH_REGEX = /[\u0000-\u0020\u0100-\uffff]/;


I'd simplify this with /[^\x21-\xff]/ or /[^\u0021-\u00ff]/

BridgeAR · 2017-12-12T14:32:02Z

Landed in b961d9f.
I changed the RegExp to the suggested one by @TimothyGu.

This commit changes node's handling of two-byte characters in the path component of an http URL. Previously, node would just strip the higher byte when generating the request. So this code: ``` http.request({host: "example.com", port: "80", "/Ｎ"}) ``` would request `http://example.com/.` (`.` is the character for the byte `0x2e`). This is not useful and can in some cases lead to filter evasion. With this change, the code generates `ERR_UNESCAPED_CHARACTERS`, just like space and control characters already did. PR-URL: #16237 Reviewed-By: James M Snell <[email protected]> Reviewed-By: Anna Henningsen <[email protected]> Reviewed-By: Anatoli Papirovski <[email protected]> Reviewed-By: Ruben Bridgewater <[email protected]> Reviewed-By: Timothy Gu <[email protected]>

nodejs-github-bot added the http Issues or PRs related to the http subsystem. label Oct 16, 2017

lpinca reviewed Oct 16, 2017

View reviewed changes

joyeecheung reviewed Oct 16, 2017

View reviewed changes

jasnell requested review from mscdex and bnoordhuis October 16, 2017 17:03

mscdex reviewed Oct 16, 2017

View reviewed changes

addaleax reviewed Oct 19, 2017

View reviewed changes

bennofs added 4 commits November 22, 2017 18:40

fixup! http: disallow two-byte characters in URL path

eab523c

fixup! http: disallow two-byte characters in URL path

58ac26f

http: always use regex for invalid path check

60b3290

Using the "optimized" version was not significantly faster and even slower for larger n.

jasnell approved these changes Nov 22, 2017

View reviewed changes

addaleax approved these changes Nov 22, 2017

View reviewed changes

bennofs force-pushed the master branch from 31e71d1 to 60b3290 Compare November 22, 2017 21:43

apapirovski approved these changes Nov 22, 2017

View reviewed changes

BridgeAR approved these changes Nov 23, 2017

View reviewed changes

BridgeAR added the author ready PRs that have at least one approval, no pending requests for changes, and a CI started. label Nov 23, 2017

Mickael-van-der-Beek mentioned this pull request Nov 29, 2017

Node.js / HTTP-Parser not handling UTF-8 encoded HTTP header values #17390

Closed

apapirovski mentioned this pull request Dec 7, 2017

http: simplify isInvalidPath #17525

Closed

3 tasks

TimothyGu approved these changes Dec 7, 2017

View reviewed changes

MylesBorins force-pushed the master branch from b7405ab to 7f086dd Compare December 8, 2017 16:37

addaleax added the semver-major PRs that contain breaking changes and should be released in the next major version. label Dec 10, 2017

BridgeAR closed this Dec 12, 2017

addaleax removed the author ready PRs that have at least one approval, no pending requests for changes, and a CI started. label Dec 13, 2017

ChALkeR added the security Issues and PRs related to security. label Jan 28, 2018

		@@ -0,0 +1,33 @@
		// Copyright Joyent, Inc. and other Node contributors.

		@@ -50,20 +50,20 @@ const errors = require('internal/errors');
		// checks can greatly outperform the equivalent regexp (tested in V8 5.4).
		function isInvalidPath(s) {

http: disallow two-byte characters in URL path #16237

http: disallow two-byte characters in URL path #16237

Conversation

bennofs commented Oct 16, 2017

Checklist

Affected core subsystem(s)

Choose a reason for hiding this comment

Choose a reason for hiding this comment

apapirovski commented Oct 16, 2017 • edited Loading

bennofs commented Oct 16, 2017

bennofs commented Oct 16, 2017 • edited Loading

mscdex Oct 16, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

apapirovski commented Oct 16, 2017

mcollina commented Oct 17, 2017

apapirovski commented Oct 17, 2017 • edited Loading

addaleax left a comment

Choose a reason for hiding this comment

apapirovski commented Oct 19, 2017

addaleax commented Oct 19, 2017

jasnell commented Oct 20, 2017

bennofs commented Nov 22, 2017

jasnell commented Nov 22, 2017

addaleax commented Nov 22, 2017

apapirovski commented Nov 22, 2017

apapirovski commented Nov 22, 2017

bennofs commented Nov 22, 2017

apapirovski commented Nov 22, 2017 • edited Loading

bennofs commented Nov 22, 2017

jasnell commented Nov 22, 2017

apapirovski commented Nov 22, 2017

jasnell commented Nov 22, 2017

BridgeAR commented Nov 23, 2017

bnoordhuis commented Nov 23, 2017 • edited Loading

jasnell commented Nov 23, 2017

bennofs commented Nov 23, 2017

addaleax commented Dec 1, 2017

apapirovski commented Dec 2, 2017

bnoordhuis commented Dec 3, 2017

Choose a reason for hiding this comment

BridgeAR commented Dec 12, 2017

apapirovski commented Oct 16, 2017 •

edited

Loading

bennofs commented Oct 16, 2017 •

edited

Loading

mscdex Oct 16, 2017 •

edited

Loading

apapirovski commented Oct 17, 2017 •

edited

Loading

apapirovski commented Nov 22, 2017 •

edited

Loading

bnoordhuis commented Nov 23, 2017 •

edited

Loading