http: disallow two-byte characters in URL path #16237
@@ -0,0 +1,33 @@
+// Copyright Joyent, Inc. and other Node contributors.
Nit: copyright is not needed for new tests.
lib/_http_client.js
Outdated
@@ -50,20 +50,20 @@ const errors = require('internal/errors');
 // checks can greatly outperform the equivalent regexp (tested in V8 5.4).
 function isInvalidPath(s) {
The comment above this function needs an update as well
Thanks for the PR @bennofs! I'll abstain from commenting on the correctness of this but there are some performance considerations here.
@apapirovski hmm ok. what did you use to test the performance of It is probably impossible not to pay at least a small performance penalty, because we need to check more things than we currently do. |
For the motivation why this is important, see https://www.blackhat.com/docs/us-17/thursday/us-17-Tsai-A-New-Era-Of-SSRF-Exploiting-URL-Parser-In-Trending-Programming-Languages.pdf (NodeJS unicode failure). tl;dr:
var base = "http://orange.tw/sandbox/";
var path = req.query.path;
if (path.indexOf("..") == -1) {
  http.get(base + path, callback);
}
Such a filter can easily be bypassed with |
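A minimal sketch of the kind of bypass being described, assuming a payload of two U+FF2E characters (the character used elsewhere in this PR). `looksSafe` and `payload` are hypothetical names, and `Buffer.from(..., 'latin1')` stands in for the one-byte write that node performed before this change:

```javascript
// Hypothetical filter mirroring the snippet above: reject paths containing "..".
function looksSafe(path) {
  return path.indexOf('..') === -1;
}

// Two FULLWIDTH LATIN CAPITAL LETTER N characters (U+FF2E), not two dots.
const payload = '\uFF2E\uFF2E/secret';

// Keep only the low byte of every UTF-16 code unit, as a one-byte write would.
const onTheWire = Buffer.from(payload, 'latin1').toString('latin1');

console.log(looksSafe(payload)); // the filter sees no ".."
console.log(onTheWire);          // the bytes that would reach the socket
```

The filter inspects the JavaScript string, while the server sees the truncated bytes, so the two disagree about whether the path contains `..`.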
lib/_http_client.js
Outdated
 //
 // This function is used in the case of small paths, where manual character code
 // checks can greatly outperform the equivalent regexp (tested in V8 5.4).
 function isInvalidPath(s) {
   var i = 0;
-  if (s.charCodeAt(0) <= 32) return true;
+  if (s.charCodeAt(0) <= 32 || s.charCodeAt(0) > 0xFF) return true;
Shouldn't this be `> 127` instead of `> 255`? UTF-8 starts using multiple bytes after the last ASCII character (127). Or am I misunderstanding the PR description?
V8 strings are not UTF-8, but UTF-16 (UCS2). If they were UTF-8, this would not be necessary, as each byte in a UTF-8 multibyte sequence is `> 127`, so it would be impossible to get ASCII characters that way, even if some bytes were stripped.
What happens is that the string `'\uff2e'` is represented in node as `(uint16_t*){0xff2e}` (where `{}` is used here to denote an array). With the latin1 encoding (that's what gets used for headers), node internally calls `WriteOneByte` on that string to write it to the socket, and `WriteOneByte` will strip the higher byte of the two-byte character, so we end up with `0x2e`. It works fine for all characters that can be represented with a single byte (and in UCS2, that means all characters below `0x100`). Whether any non-ASCII characters make sense in this context is questionable, but it also doesn't break anything, so I opted not to filter that.
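The difference between the two encodings can be checked from JavaScript. This is only an illustration: it uses `Buffer.from` with the `'latin1'` and `'utf8'` encodings rather than the internal `WriteOneByte` call, on the assumption that the latin1 path truncates in the same way.

```javascript
const s = '\uff2e';

// One UTF-16 code unit: 0xff2e.
const codeUnit = s.charCodeAt(0);

// Latin-1 keeps one byte per code unit, so the high byte 0xff is dropped.
const latin1Bytes = [...Buffer.from(s, 'latin1')];

// UTF-8 needs three bytes here, and every one of them is above 127.
const utf8Bytes = [...Buffer.from(s, 'utf8')];

console.log(codeUnit.toString(16));
console.log(latin1Bytes.map((b) => b.toString(16)));
console.log(utf8Bytes.map((b) => b.toString(16)));
```

Because no byte of a UTF-8 multibyte sequence falls in the ASCII range, dropping bytes from UTF-8 output could not produce a `.`; dropping the high byte of a UTF-16 code unit can.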
With the latin1 encoding (that's what gets used for headers)
Unfortunately, it's not that simple. See #13296 (comment) for an explanation. The two word summary is "it depends" (on the encoding of the body, but that doesn't fit in two words.)
Aside: calling `.charCodeAt()` twice for the same index is somewhat wasteful, it would be better to cache the result in a variable. (V8's TurboFan compiler can inline the calls but I don't believe it will fold them.)
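A sketch of what caching the result could look like. `isInvalidPathSketch` is a hypothetical name and a plain loop, not the unrolled function from the diff and not the code that eventually landed:

```javascript
// Hypothetical variant: read each code unit once, then compare it twice.
function isInvalidPathSketch(s) {
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    // Reject space and control characters as well as two-byte characters.
    if (c <= 32 || c > 0xff) return true;
  }
  return false;
}

console.log(isInvalidPathSketch('/index.html'));
console.log(isInvalidPathSketch('/\uff2e'));
```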
Oh, thanks for pointing me to that issue. I thought a bit about something like that while writing the patch, but I couldn't find out where the encoding is actually set.
At least as long as we aren't using chunked encoding (no idea what that is :))
Sorry for the delay, missed your replies. I don't know if your exact proposal would work but take a look at the `_storeHeader()` and `_send()` methods in `lib/_http_outgoing.js`, that's where the headers are put on the wire and their encoding is determined.
You may also want to look at the `_storeHeader()` calls in `lib/_http_client.js`, they compose the status line.
This raises the question of whether we actually want to support non-latin1 encoded request paths. Maybe `utf-8` makes sense in some contexts? I am not sure.
I spent some time browsing through open issues yesterday and there's already a bug logged against this behaviour. (Would need to go back to find it again.) IMO what mscdex said above makes sense, just cap it at 127 instead. The current behaviour is completely unpredictable and broken.
Here's the issue I was mentioning #13296 — there's some good discussion there too.
I would prefer what I said above, at least for 9.x (maybe different solution for LTS), but I'll defer to @bnoordhuis and others with more experience with the `http` module.
This raises the question if we actually want to support non-latin1 encoded request path? Maybe utf-8 makes sense in some contexts?
It's in fairly widespread use. Rejecting it outright would almost certainly result in quite a bit of fallout and lots of frustrated users. Rejecting malformed UTF-8 should be acceptable and appropriate, though.
(But as discussed, what to accept and what to reject depends on the encoding.)
@bennofs Totally understand re: the need for it, just don't want to comment since it's not my area of expertise. Re: performance, I think the simplest solution here is to just completely remove In terms of running benchmarks, here's the guide for it: https://github.com/nodejs/node/blob/master/doc/guides/writing-and-running-benchmarks.md — the relevant test here would be len: [1, 8, 16],
n: [1e7] (You don't need the large Let me know if any of that didn't make sense. |
Can you please check if this degrades our "simple" http benchmark? |
@mcollina This just affects |
The PR seems fine to me, but: semver-major?
@addaleax (or anyone else that knows) what's the policy around semver-major things that have a security impact (which this does)? |
@apapirovski What we can do is land a semver-major patch in an existing release line if we think it’s justified for security reasons, and provide users a way to opt out of the patch via a runtime flag. I think this can be seen as a semver-patch, I just want to make sure we’re explicit about that :) (Also, I’d take it as a good sign that #8923 was landed in LTS and nobody yelled at us about it.) |
It makes me quite sad to see yet another example of where our loose conformance to the spec is causing yet another issue. Running the URL through the WHATWG URL parser produces a reasonable result... e.g.
> new url.URL('http://example.org/\uFF2e')
URL {
  href: 'http://example.org/%EF%BC%AE',
  origin: 'http://example.org',
  protocol: 'http:',
  username: '',
  password: '',
  host: 'example.org',
  hostname: 'example.org',
  port: '',
  pathname: '/%EF%BC%AE',
  search: '',
  searchParams: URLSearchParams {},
  hash: '' }
I'm +1 on making this change as semver-patch in all release lines, but I would argue even stronger for deprecating the use of the legacy and buggy |
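A sketch of how a caller might take the path from the WHATWG parser before building a request. The `options` object is a hypothetical illustration of caller-side code, not something proposed in this PR:

```javascript
// URL is a global in current Node releases; it is also exported from 'url'.
const parsed = new URL('http://example.org/\uFF2e');

// Hypothetical request options built from the parsed, percent-encoded URL.
const options = {
  host: parsed.hostname,
  path: parsed.pathname + parsed.search,
};

console.log(options.path);
```

The percent-encoded path contains only ASCII, so it is unaffected by the one-byte truncation discussed above.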
This commit changes node's handling of two-byte characters in the path component of an http URL. Previously, node would just strip the higher byte when generating the request. So this code: ``` http.request({host: "example.com", port: "80", path: "/\uFF2e"}) ``` would request `http://example.com/.` (`.` is the character for the byte `0x2e`). This is not useful and can in some cases lead to filter evasion. With this change, the code generates `ERR_UNESCAPED_CHARACTERS`, just like space and control characters already did.
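A minimal way to observe the new behaviour, assuming a Node release that includes this change. The path check runs synchronously when the request object is constructed, so the sketch does not open a connection:

```javascript
const http = require('http');

let code = null;
try {
  // The path contains U+FF2E, which no longer passes the path check.
  http.request({ host: 'example.com', port: 80, path: '/\uFF2e' });
} catch (err) {
  code = err.code;
}

console.log(code);
```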
Using the "optimized" version was not significantly faster and even slower for larger n.
Is there anything I would need to do to move this PR forward? |
CI: https://ci.nodejs.org/job/node-test-pull-request/11638/ Just needs CI and signoff. Ping @nodejs/tsc ... PTAL |
I think this is ready? |
This has performance implications in a hot-path code and I did request changes, albeit not with the red checkmark. I would really like to see this benchmarked and feedback addressed. |
Also, the upper threshold is currently somewhat arbitrary given the way we handle encoding of headers. |
@apapirovski here is the benchmark run (with the pure regex based implementation):
|
IMO we should just switch to the pure RegExp implementation given those numbers (and confidence levels). It removes some pretty ugly code and really simplifies things. |
@apapirovski yes I already pushed the pure regex implementation |
@mscdex ... PTAL |
@bennofs Sorry, that's my bad. For some reason github kept showing me the outdated review comments as unaddressed instead of hiding as it does normally. 😞 |
I'd like @mscdex to take one final look at this before it lands :-) |
People who signed off on this: you do all realize this PR rejects everything that isn't Latin-1? Are you prepared to deal with the fallout? Let's at least run citgm: |
Tbh, yes and no. Yes because strict checking has proven time and again to be the better route from a security POV. No because of the breakage that may occur. The code change looks good, but I'm still on the fence about changing this particular bit of code. |
The reason I suggested this change is that it never worked correctly as far as I can tell. And the incorrect behaviour is so broken IMO that it's not useful (except maybe if you want to transmit raw 2-bytes over URLs? seems likely...) |
I would like to land this in the near future if there are no objections. |
ping @bnoordhuis — do you want to make your concerns explicit via the red X? |
No, I'll just be an interested onlooker in this one. |
-    if (s.charCodeAt(i) <= 32) return true;
-  return false;
-}
+const INVALID_PATH_REGEX = /[\u0000-\u0020\u0100-\uffff]/;
I'd simplify this with `/[^\x21-\xff]/` or `/[^\u0021-\u00ff]/`
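A quick check that the simplified form accepts and rejects the same single code units as the original. The variable names are illustrative only:

```javascript
const original = /[\u0000-\u0020\u0100-\uffff]/;
const simplified = /[^\x21-\xff]/;

// Compare both patterns on every possible UTF-16 code unit.
let mismatches = 0;
for (let code = 0; code <= 0xffff; code++) {
  const ch = String.fromCharCode(code);
  if (original.test(ch) !== simplified.test(ch)) mismatches++;
}

console.log(mismatches);
```

Neither pattern uses the `u` flag, so both operate on individual code units and agree on every input string.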
Landed in b961d9f. |
This commit changes node's handling of two-byte characters in the path component of an http URL. Previously, node would just strip the higher byte when generating the request. So this code: ``` http.request({host: "example.com", port: "80", path: "/\uFF2e"}) ``` would request `http://example.com/.` (`.` is the character for the byte `0x2e`). This is not useful and can in some cases lead to filter evasion. With this change, the code generates `ERR_UNESCAPED_CHARACTERS`, just like space and control characters already did.
PR-URL: #16237
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Anatoli Papirovski <[email protected]>
Reviewed-By: Ruben Bridgewater <[email protected]>
Reviewed-By: Timothy Gu <[email protected]>
This commit changes node's handling of two-byte characters in the path component of an http URL. Previously, node would just strip the higher byte when generating the request. So this code: `http.request({host: "example.com", port: "80", path: "/\uFF2e"})` would request `http://example.com/.` (`.` is the character for the byte `0x2e`). This is not useful and can in some cases allow filter evasion. With this change, the code generates `ERR_UNESCAPED_CHARACTERS`, just like space and control characters already did.
Checklist
`make -j4 test` (UNIX) passes
Affected core subsystem(s)
http