Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

http: disallow two-byte characters in URL path #16237

Closed
wants to merge 4 commits into from

Conversation

bennofs
Copy link
Contributor

@bennofs bennofs commented Oct 16, 2017

This commit changes node's handling of two-byte characters in the path component
of an http URL. Previously, node would just strip the higher byte when
generating the request. So this code:

http.request({host: "example.com", port: "80", "/\uFF2e"})

would request http://example.com/. (. is the character for the byte 0x2e).

This is not useful and can in some cases allow filter evasion. With this
change, the code generates ERR_UNESCAPED_CHARACTERS, just like space and
control characters already did.

Checklist
  • make -j4 test (UNIX) passes
  • tests and/or benchmarks are included
  • commit message follows commit guidelines
Affected core subsystem(s)

http

@nodejs-github-bot nodejs-github-bot added the http Issues or PRs related to the http subsystem. label Oct 16, 2017
@@ -0,0 +1,33 @@
// Copyright Joyent, Inc. and other Node contributors.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: copyright is not needed for new tests.

@@ -50,20 +50,20 @@ const errors = require('internal/errors');
// checks can greatly outperform the equivalent regexp (tested in V8 5.4).
function isInvalidPath(s) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment above this function needs an update as well

@apapirovski
Copy link
Member

apapirovski commented Oct 16, 2017

Thanks for the PR @bennofs! I'll abstain from commenting on the correctness of this but there are some performance considerations here.

  1. With this change, isInvalidPath is strictly slower (even for 1 or 2 chars) in v8 versions 6.1 & 6.2 than the RegExp version.

  2. Even if it doesn't or the isInvalidPath perf can be improved, the threshold for which the RegExp is faster has moved down to about 10 characters from the previous 39.

@bennofs
Copy link
Contributor Author

bennofs commented Oct 16, 2017

@apapirovski hmm ok. what did you use to test the performance of isInvalidPath? So I can play around with it and see if there's perhaps some way to optimize it (although I am not very familar with V8 performance).

It is probably impossible not to pay at least a small performance penalty, because we need to check more things than we currently do.

@bennofs
Copy link
Contributor Author

bennofs commented Oct 16, 2017

For the motivation why this is important, see https://www.blackhat.com/docs/us-17/thursday/us-17-Tsai-A-New-Era-Of-SSRF-Exploiting-URL-Parser-In-Trending-Programming-Languages.pdf (NodeJS unicode failure).

tl;dr:

var base = "http://orange.tw/sandbox/";
var path = req.query.path;
if(path.indexOf("..") ==-1) {
  http.get(base + path, callback);
}

Such a filter can easily be bypassed with \uFF2e\uFF2e to get SSRF if this is not fixed.

//
// This function is used in the case of small paths, where manual character code
// checks can greatly outperform the equivalent regexp (tested in V8 5.4).
function isInvalidPath(s) {
var i = 0;
if (s.charCodeAt(0) <= 32) return true;
if (s.charCodeAt(0) <= 32 || s.charCodeAt(0) > 0xFF) return true;
Copy link
Contributor

@mscdex mscdex Oct 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be > 127 instead of > 255 ? UTF-8 starts using multiple bytes after the last ASCII character (127). Or am I misunderstanding the PR description?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

V8 strings are not UTF-8, but UTF-16 (UCS2). If it was UTF-8, this would not be necessary, as each byte in an UTF-8 multibyte sequence is > 127 so it would be impossible to get ascii characters that way, even if some bytes were stripped.

What happens is that the string ' \uff2e' is represented in node as (uint16_t*){0xff2e} (where {} is used here to denote an array). With the latin1 encoding (that's what gets used for headers), node internally calls WriteOneByte on that string to write it to the socket and WriteOneByte will strip the higher byte of the two-byte character, so we end up with 0x2e. It works fine for all characters that can be represented with a single byte (and in UCS2, that means all characters below 0x100) (whether any non-ascii characters make sense in this context is questionable, but it also doesn't break anything so I opted not to filter that).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the latin1 encoding (that's what gets used for headers)

Unfortunately, it's not that simple. See #13296 (comment) for an explanation. The two word summary is "it depends" (on the encoding of the body, but that doesn't fit in two words.)

Aside: calling .charCodeAt() twice for the same index is somewhat wasteful, it would be better to cache the result in a variable. (V8's TurboFan compiler can inline the calls but I don't believe it will fold them.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, thanks for pointing me to that issue. I thought a bit about something like that while writing the patch, but I couldn't find out where the encoding is actually set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least as long as we aren't using chunked encoding (no idea what that is :))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay, missed your replies. I don't know if your exact proposal would work but take a look at the _storeHeader() and _send() methods in lib/_http_outgoing.js, that's where the headers are put on the wire and their encoding is determined.

You may also want to look at the _storeHeader() calls in lib/_http_client.js, they compose the status line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This raises the question if we actually want to support non-latin1 encoded request path? Maybe utf-8 makes sense in some contexts? I am not sure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spent some time browsing through open issues yesterday and there's already a bug logged against this behaviour. (Would need to go back to find it again.) IMO what mscdex said above makes sense, just cap it at 127 instead. The current behaviour is completely unpredictable and broken.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the issue I was mentioning #13296 — there's some good discussion there too.

I would prefer what I said above, at least for 9.x (maybe different solution for LTS), but I'll defer to @bnoordhuis and others with more experience with the http module.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This raises the question if we actually want to support non-latin1 encoded request path? Maybe utf-8 makes sense in some contexts?

It's in fairly widespread use. Rejecting it outright would almost certainly result in quite a bit of fallout and lots of frustrated users. Rejecting malformed UTF-8 should be acceptable and appropriate, though.

(But as discussed, what to accept and what to reject depends on the encoding.)

@apapirovski
Copy link
Member

@bennofs Totally understand re: the need for it, just don't want to comment since it's not my area of expertise.

Re: performance, I think the simplest solution here is to just completely remove isInvalidPath if this change is going to stick and then just inline the regExp.test() statement into the if condition and you could also pull out the construction of the actual RegExp outside of the ClientRequest so it doesn't run each time.

In terms of running benchmarks, here's the guide for it: https://github.com/nodejs/node/blob/master/doc/guides/writing-and-running-benchmarks.md — the relevant test here would be http/create-clientrequest.js but if you want to run it locally, I would tweak the parameters of the benchmark to something like:

len: [1, 8, 16],
n: [1e7]

(You don't need the large len tests and n = 1e6 is quite noisy.)

Let me know if any of that didn't make sense.

@mcollina
Copy link
Member

Can you please check if this degrades our "simple" http benchmark?

@apapirovski
Copy link
Member

apapirovski commented Oct 17, 2017

@mcollina This just affects ClientRequest so I don't think it would, would it? The best benchmark for this should be create-clientrequest which does get degraded as per above.

Copy link
Member

@addaleax addaleax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR seems fine to me, but: semver-major?

@apapirovski
Copy link
Member

@addaleax (or anyone else that knows) what's the policy around semver-major things that have a security impact (which this does)?

@addaleax
Copy link
Member

@apapirovski What we can do is landing a semver-major patch in an existing release line if we think it’s justified for security reasons, and provide users a way to opt out of the patch via a runtime flag.

I think this can be seen as a semver-patch, I just want to make sure we’re explicit about that :)

(Also, I’d take it as a good sign that #8923 was landed in LTS and nobody yelled at us about it.)

@jasnell
Copy link
Member

jasnell commented Oct 20, 2017

It makes me quite sad to see yet another example of where our loose conformance to the spec is causing yet another issue. Running the URL through the WHATWG URL parser produces a reasonable result... e.g.

> new url.URL('http://example.org/\uFF2e')
URL {
  href: 'http://example.org/%EF%BC%AE',
  origin: 'http://example.org',
  protocol: 'http:',
  username: '',
  password: '',
  host: 'example.org',
  hostname: 'example.org',
  port: '',
  pathname: '/%EF%BC%AE',
  search: '',
  searchParams: URLSearchParams {},
  hash: '' }

I'm +1 on making this change as semver-patch in all release lines, but I would argue even stronger for deprecating the use of the legacy and buggy url.parse() based approach in favor of using the new WHATWG URL parser for everything http related in core.

This commit changes node's handling of two-byte characters in the path component
of an http URL. Previously, node would just strip the higher byte when
generating the request. So this code:

```
http.request({host: "example.com", port: "80", "/\uFF2e"})
```

would request `http://example.com/.` (`.` is the character for the byte `0x2e`).

This is not useful and can in some cases lead to filter evasion. With this
change, the code generates `ERR_UNESCAPED_CHARACTERS`, just like space and
control characters already did.
Using the "optimized" version was not significantly faster and even
slower for larger n.
@bennofs
Copy link
Contributor Author

bennofs commented Nov 22, 2017

Is there anything I would need to do to move this PR forward?

@jasnell
Copy link
Member

jasnell commented Nov 22, 2017

CI: https://ci.nodejs.org/job/node-test-pull-request/11638/

Just needs CI and signoff. Ping @nodejs/tsc ... PTAL

@addaleax
Copy link
Member

I think this is ready? semver-major?

@apapirovski
Copy link
Member

This has performance implications in a hot-path code and I did request changes, albeit not with the red checkmark. I would really like to see this benchmarked and feedback addressed.

@apapirovski
Copy link
Member

Also, the upper threshold is currently somewhat arbitrary given the way we handle encoding of headers.

@bennofs
Copy link
Contributor Author

bennofs commented Nov 22, 2017

@apapirovski here is the benchmark run (with the pure regex based implementation):

Rscript benchmark/compare.R  < old-new.csv
                                                improvement confidence      p.value
 http/create-clientrequest.js n=10000000 len=1      -3.36 %         ** 4.738245e-03
 http/create-clientrequest.js n=10000000 len=16      4.91 %        *** 7.801409e-05
 http/create-clientrequest.js n=10000000 len=8      -2.03 %            1.028356e-01

@apapirovski
Copy link
Member

apapirovski commented Nov 22, 2017

IMO we should just switch to the pure RegExp implementation given those numbers (and confidence levels). It removes some pretty ugly code and really simplifies things.

@bennofs
Copy link
Contributor Author

bennofs commented Nov 22, 2017

@apapirovski yes I already pushed the pure regex implementation

@jasnell
Copy link
Member

jasnell commented Nov 22, 2017

@mscdex ... PTAL

@apapirovski
Copy link
Member

@bennofs Sorry, that's my bad. For some reason github kept showing me the outdated review comments as unaddressed instead of hiding as it does normally. 😞

@jasnell
Copy link
Member

jasnell commented Nov 22, 2017

I'd like @mscdex to take one final look at this before it lands :-)

@BridgeAR
Copy link
Member

@BridgeAR BridgeAR added the author ready PRs that have at least one approval, no pending requests for changes, and a CI started. label Nov 23, 2017
@bnoordhuis
Copy link
Member

bnoordhuis commented Nov 23, 2017

People who signed off on this: you do all realize this PR rejects everything that isn't Latin-1? Are you prepared to deal with the fallout?

Let's at least run citgm: https://ci.nodejs.org/view/Node.js-citgm/job/citgm-smoker/1103/ https://ci.nodejs.org/view/Node.js-citgm/job/citgm-smoker/1104/

@jasnell
Copy link
Member

jasnell commented Nov 23, 2017

Tbh, yes and no. Yes because strict checking has proven time and again to be the better route from a security POV. No because of the breakage that may occur. The code change looks good, but I'm still on the fence about changing this particular bit of code.

@bennofs
Copy link
Contributor Author

bennofs commented Nov 23, 2017

The reason I suggested this change is that it never worked correctly as far as I can tell. And the incorrect behaviour is so broken IMO that it's not useful (except maybe if you want to transmit raw 2-bytes over URLs? seems likely...)

@addaleax
Copy link
Member

addaleax commented Dec 1, 2017

I would like to land this in the near future if there are no objections.

@apapirovski
Copy link
Member

ping @bnoordhuis — do you want to make your concerns explicit via the red X?

@bnoordhuis
Copy link
Member

No, I'll just be an interested onlooker in this one.

@apapirovski apapirovski mentioned this pull request Dec 7, 2017
3 tasks
if (s.charCodeAt(i) <= 32) return true;
return false;
}
const INVALID_PATH_REGEX = /[\u0000-\u0020\u0100-\uffff]/;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd simplify this with /[^\x21-\xff]/ or /[^\u0021-\u00ff]/

@addaleax addaleax added the semver-major PRs that contain breaking changes and should be released in the next major version. label Dec 10, 2017
@BridgeAR
Copy link
Member

Landed in b961d9f.
I changed the RegExp to the suggested one by @TimothyGu.

@BridgeAR BridgeAR closed this Dec 12, 2017
BridgeAR pushed a commit that referenced this pull request Dec 12, 2017
This commit changes node's handling of two-byte characters in
the path component of an http URL. Previously, node would just
strip the higher byte when generating the request. So this code:

```
http.request({host: "example.com", port: "80", "/N"})
```

would request `http://example.com/.`
(`.` is the character for the byte `0x2e`).

This is not useful and can in some cases lead to filter evasion.
With this change, the code generates `ERR_UNESCAPED_CHARACTERS`,
just like space and control characters already did.

PR-URL: #16237
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Anatoli Papirovski <[email protected]>
Reviewed-By: Ruben Bridgewater <[email protected]>
Reviewed-By: Timothy Gu <[email protected]>
@addaleax addaleax removed the author ready PRs that have at least one approval, no pending requests for changes, and a CI started. label Dec 13, 2017
@ChALkeR ChALkeR added the security Issues and PRs related to security. label Jan 28, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
http Issues or PRs related to the http subsystem. security Issues and PRs related to security. semver-major PRs that contain breaking changes and should be released in the next major version.
Projects
None yet
Development

Successfully merging this pull request may close these issues.