Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

URI.encode gives a false sense of security; proper escaping is missing #10603

Closed
oprypin opened this issue Apr 5, 2021 · 14 comments · Fixed by #11248
Closed

URI.encode gives a false sense of security; proper escaping is missing #10603

oprypin opened this issue Apr 5, 2021 · 14 comments · Fixed by #11248
Assignees
Labels
kind:bug A bug in the code. Does not apply to documentation, specs, etc. security topic:stdlib:networking

Comments

@oprypin
Copy link
Member

oprypin commented Apr 5, 2021

Expanding from #7997 (comment)

Consider this usecase:

Someone is making a variable part in their URL/path,
such as /items/#{name},
and, naturally, they form the URL by writing
"/items/#{URI.encode(name)}"

Then there appears an item named "item #9".
What they expect:
"http://example.org/items/item%20%239"
What they get:
"http://example.org/items/item%20#9"

So basically the escaping doesn't save you from anything, you still get a path with arbitrary "control" characters such as # and ?. It's kinda worse than writing nothing, because writing nothing would trigger a comment in code review "why didn't you escape it?".
In the big table below you can see the data behind my claim -- just how few kinds of characters actually get escaped by URI.encode.


Crystal used to have URI.escape which did exactly the right thing, and that behavior is also what most programming languages offer. E.g. Python:

urllib.parse.quote("item #9") == "item%20%239"

But #7997 removed it.

The straightforward options available now both produce the wrong result for this context:

URI.encode("item #9") == "item%20#9"
URI.encode_www_form("item #9") == "item+%239"

And these two lengthy expressions are the options that you have now to do the correct thing in the typical usecase that I've shown:

URI.encode_www_form("item #9", space_to_plus: false) == "item%20%239"
String.build { |io| URI.encode("item #9", io, &->URI.unreserved?(UInt8)) } == "item%20%239"

And honestly, I prefer the latter, because I'm not encoding a WWW form.

Speaking of which, URI.encode_www_form is very good for WWW forms (and ?query parameters).

But as for URI.encode, I claim that:

  1. there either isn't a usecase for it in its current shape, or the usecase is super specific and rare;
  2. it's dangerous to have it featured the most prominently, because people use it without realizing that it gives them no safety.

#3515 (comment) shows "passing the whole URL to it" as a usecase, and lists some programming languages that have this function. But I think it misses the fact that many programming languages actually don't have this function in any shape. And that Ruby actually obsoleted it as dangerous
(the comment pointing that out somehow wasn't fully considered)

To expand on (1):

I think passing the whole URL to any escaping function is usually wrong, as different parts of it are subject to different escaping requirements. E.g. the separators / control chars mustn't be escaped, the path needs to be fully escaped, the query params need to be form-escaped. Avoiding to escape some classes of characters is only a workaround to make this use case kinda work.

Not that it's an invalid use case. Of course, for Web browsers escaping the whole URL is an everyday task. But this probably needs to be done by parsing it and escaping different parts differently, as the same comment suggests. And that's not what happens in URI.encode.
I also think that maybe Web browsers already deal with kinda pre-escaped URLs, so for them maybe there isn't a dangerous operation such as (what people may assume to be safe but will definitely never work) URI.escape("http://example.org/items/#{name}").

Regardless whether this usecase is satisfied (maybe it still is, I'm not sure), the main issue (2) still stands: this function is not appropriate for almost all other use cases, but people assume that it is.


Finally, we get to the table, comparing these functions in 4 programming languages:

code ╲ input ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~ q й
cr URI.encode %20 ! %22 # $ %25 & ' ( ) * + , - . / : ; %3C = %3E ? @ [ %5C ] %5E _ %60 %7B %7C %7D ~ q %D0%B9
cr URI.escape (gone) %20 %21 %22 %23 %24 %25 %26 %27 %28 %29 %2A %2B %2C - . %2F %3A %3B %3C %3D %3E %3F %40 %5B %5C %5D %5E _ %60 %7B %7C %7D ~ q %D0%B9
cr URI.encode_www_form + %21 %22 %23 %24 %25 %26 %27 %28 %29 %2A %2B %2C - . %2F %3A %3B %3C %3D %3E %3F %40 %5B %5C %5D %5E _ %60 %7B %7C %7D ~ q %D0%B9
py urllib.parse.quote %20 %21 %22 %23 %24 %25 %26 %27 %28 %29 %2A %2B %2C - . / %3A %3B %3C %3D %3E %3F %40 %5B %5C %5D %5E _ %60 %7B %7C %7D ~ q %D0%B9
py urllib.parse.quote('') %20 %21 %22 %23 %24 %25 %26 %27 %28 %29 %2A %2B %2C - . %2F %3A %3B %3C %3D %3E %3F %40 %5B %5C %5D %5E _ %60 %7B %7C %7D ~ q %D0%B9
py urllib.parse.quote_plus + %21 %22 %23 %24 %25 %26 %27 %28 %29 %2A %2B %2C - . %2F %3A %3B %3C %3D %3E %3F %40 %5B %5C %5D %5E _ %60 %7B %7C %7D ~ q %D0%B9
rb URI.encode (gone) %20 ! %22 %23 $ %25 & ' ( ) * + , - . / : ; %3C = %3E ? @ [ %5C ] %5E _ %60 %7B %7C %7D ~ q %D0%B9
rb URI.encode_www_form + %21 %22 %23 %24 %25 %26 %27 %28 %29 * %2B %2C - . %2F %3A %3B %3C %3D %3E %3F %40 %5B %5C %5D %5E _ %60 %7B %7C %7D %7E q %D0%B9
rb CGI.escape + %21 %22 %23 %24 %25 %26 %27 %28 %29 %2A %2B %2C - . %2F %3A %3B %3C %3D %3E %3F %40 %5B %5C %5D %5E _ %60 %7B %7C %7D ~ q %D0%B9
js encodeURI %20 ! %22 # $ %25 & ' ( ) * + , - . / : ; %3C = %3E ? @ %5B %5C %5D %5E _ %60 %7B %7C %7D ~ q %D0%B9
js encodeURIComponent %20 ! %22 %23 %24 %25 %26 ' ( ) * %2B %2C - . %2F %3A %3B %3C %3D %3E %3F %40 %5B %5C %5D %5E _ %60 %7B %7C %7D ~ q %D0%B9
go url.PathEscape %20 %21 %22 %23 $ %25 & %27 %28 %29 %2A + %2C - . %2F : %3B %3C = %3E %3F @ %5B %5C %5D %5E _ %60 %7B %7C %7D ~ q %D0%B9`
go url.QueryExcape + %21 %22 %23 %24 %25 %26 %27 %28 %29 %2A %2B %2C - . %2F %3A %3B %3C %3D %3E %3F %40 %5B %5C %5D %5E _ %60 %7B %7C %7D ~ q %D0%B9
The program that made this
def row(*items)
  puts "| " + items.to_a.flatten.map(&.gsub("|", "\\|").as(String)).join(" | ") + " |"
end

def process_lines(command, *args, input)
  inp = IO::Memory.new(input.join("\n"))
  outp = IO::Memory.new
  Process.run(command, args, input: inp, output: outp)
  outp.to_s.lines
end

chars = (32.chr...127.chr).reject { |c| 'a' <= c <= 'z' || 'A' <= c <= 'Z' || '0' <= c <= '9' } + ["q", "й"]

row "code ╲ input", chars.map { |c| "#{c}" }
row "---", chars.map { ":-:" }

require "uri"
row
{
  "`URI.encode`"            => ->(c : String) { URI.encode(c) },
  "`URI.escape` **(gone)**" => ->(c : String) { URI.encode_www_form(c, space_to_plus: false) },
  "`URI.encode_www_form`"   => ->(c : String) { URI.encode_www_form(c) },
}.each do |(name, expr)|
  row "cr #{name}", chars.map { |c| "#{expr.call(c.to_s)}" }
end

row
{
  "`urllib.parse.quote`"      => "urllib.parse.quote(c)",
  "`urllib.parse.quote('')`"  => "urllib.parse.quote(c, safe='')",
  "`urllib.parse.quote_plus`" => "urllib.parse.quote_plus(c)",
}.each do |(name, expr)|
  script = %(import sys, urllib.parse; print('\\n'.join(#{expr} for c in sys.stdin.read().splitlines())))
  lines = process_lines("python3", "-c", script, input: chars)
  row "py #{name}", lines.map { |c| "#{c}" }
end

row
{
  "`URI.encode` **(gone)**" => "URI.encode(c)",
  "`URI.encode_www_form`"   => "URI.encode_www_form_component(c)",
  "`CGI.escape`"            => "CGI.escape(c)",
}.each do |(name, expr)|
  script = %(require 'uri'; require 'cgi'; ARGF.read.split("\\n").each { |c| puts #{expr} })
  lines = process_lines("ruby", "-e", script, input: chars)
  row "rb #{name}", lines.map { |c| "#{c}" }
end

row
{
  "`encodeURI`"          => "encodeURI(c)",
  "`encodeURIComponent`" => "encodeURIComponent(c)",
}.each do |(name, expr)|
  script = %(require('fs').readFileSync(0, 'utf-8').split('\\n').forEach(c => console.log(#{expr})))
  lines = process_lines("node", "-e", script, input: chars)
  row "js #{name}", lines.map { |c| "#{c}" }
end

Elixir (not in this table) is in the same bad situation as Crystal (actually even seems less customizable), and people are arriving to the same workaround. Unfortunately, Elixir's behavior was used as an argument to apply this behavior.

Anyway, as you can see from the table, Python and Ruby don't have a function that doesn't escape # ? & (well, in Ruby it's only obsoleted w/ a warning).


Now let's expand on my claim (2) - with particular examples that it is causing danger in the wild:

For all of these cases, the payload/value can contain &foo=bar& at any point to specify any other arbitrary parameter.

But these simply should've been URI.encode_www_form.
This one, however, is the usecase that I'm talking about:

Here's a full list (GitHub search, annotated by me) of what I looked through. I wasn't very attentive for the whole list though, maybe some comments are wrong.


More interesting fallout that should've been predicted in #7997: many of these vulnerabilities (this one and this one at least) were introduced by simply replacing the correct URI.escape with the wrong URI.encode, as the deprecation message told.


I think the only path forward is to deprecate URI.encode, so people can default to safe alternatives. Even if its behavior is wanted for the rare occasion, it should be reintroduced under some long name that doesn't let you run into it accidentally.

@oprypin oprypin added the kind:bug A bug in the code. Does not apply to documentation, specs, etc. label Apr 5, 2021
@straight-shoota
Copy link
Member

The use case for URI.encode is pretty generic actually: It transforms URIs to a safe encoded representation, where all non-ASCII or otherwise invalid characters are represented as escape sequences. This is useful for example when a URI contains Unicode characters, which are used for user presentation, but network protocols use the encoded form.

Agreed, this might not be the most common use case, even in web typical applications. And users seem to assume the method encodes restricted characters as well, which is obviously bad.

So yes, claim 2) unfortunately looks very reasonable.

I don't understand the example in OpenAPITools/openapi-generator. What is it about? Why does it replace %2F in the encoded string with /?

I think the only path forward is to deprecate URI.encode, so people can default to safe alternatives. Even if its behavior is wanted for the rare occasion, it should be reintroduced under some long name that doesn't let you run into it accidentally.

I suppose there's no other way to help users avoid shooting themselves in the foot, then.

@oprypin
Copy link
Member Author

oprypin commented Apr 5, 2021

I don't understand the example in OpenAPITools/openapi-generator. What is it about? Why does it replace %2F in the encoded string with /?

I guess it wanted to escape everything except / back when it was using URI.escape, but then it was blindly changed to URI.encode which doesn't escape that in the first place.

And the purpose of not escaping / might be to allow inserting arbitrary paths as part of that path, not just 1 path component. Which seems wrong in that context anyway -- why should a username be a path?

Side note: Python escapes everything except / by default, but that's easy to change (refer back to the table).

@straight-shoota
Copy link
Member

So, to move this forward, I think we need to do the following:

  1. Add a new set of methods that encode/decode all but unreserved characters. Maybe with an optional parameter to configure encoding of /.
  2. Move implementation of non-yielding URI.encode/URI.decode methods to a more obstructive name.
  3. Deprecate non-yielding URI.encode/URI.decode methods.

The implementation is trivial, we just need names.

I think URI.escape could be a viable option for 1. For that basic use case that's a fitting name. An alternative would be URI.quote akin to the Python method.

As an obstructive name, maybe URI.encode_fully/URI.decode_fully could do?

@straight-shoota
Copy link
Member

straight-shoota commented Aug 30, 2021

In Go, there are two methods called PathEscape and QueryEscape. I like that these methods very explicitly convey semantics by describing the use case.

PathEscape escapes a string so that it can safely be placed in a path segment. It makes sure to escape all characters that could have a special meaning in a URI path segment according to RFC 3986 §3.3 (i.e. reservered characters /;,?). All other reserverd and unreservered characters are passed through without escaping.

QueryEscape works just like our URI.encode_www_form, escaping everything but unreserved characters.

The internal implementation also provides escaping modes for hosts/zones, paths, user/password components, and fragments (https://cs.opensource.google/go/go/+/refs/tags/go1.17:src/net/url/url.go;l=100). It seems these modes are only used by URL functions directly and not exposed publicly.

I could see our API going in a similar direction, adding URI.escape_path and maybe renaming URI.encode_www_form to URI.escape_query.

A difficulty with URI.escape_path is that it could apply to both a full path or just a path segment (the latter escapes / for example). I think there are use cases for both. A value extrapolated into a path could be enforced to add only a single path component, but could also be allowed to add multiple ones (think about the reverse of a splat URL parameter).
We could have a parameter to toggle segment mode (with the default being more restrictive) or have two separate methods URI.escape_path_segment and URI.escape_path_full (I don't really like those names).

@oprypin
Copy link
Member Author

oprypin commented Sep 15, 2021

URI.encode_www_form is totally fine. Renaming it to something slightly better, in my view, is not worth it.

https://docs.python.org/3/library/urllib.parse.html#urllib.parse.quote_plus

as required for quoting HTML form values when building up a query string to go into a URL

-- even sounds similar to the name.


For escaping path [components], Python happens to hit good usability.

https://docs.python.org/3/library/urllib.parse.html#urllib.parse.quote

The calls are urllib.parse.quote(x) (implied safe='/') and urllib.parse.quote(x, safe='')

If the method were to be named encode_path, it could exactly mimick this (and yes, I am suggesting accepting a string rather than a block)


Whether the current encode remains available I don't have an opinion, but it could be renamed encode_uri.

The name encode_fully I don't like because the selection of characters it touches is actually the farthest from full.


Yielding encode seems fine to keep as is.

@straight-shoota
Copy link
Member

straight-shoota commented Sep 20, 2021

That seems fine by me.

encode_www_form and yielding encode should stay as they are.

encode must be deprecated. Whether it stays available under a different name, I'm not sure either. But let's push that back for a moment.

The main problem is to design the API for encoding URI paths and path components.

So you're suggesting def URI.encode_path(string, *, safe = "/"). And you would call it as URI.encode(path) for a path and URI.encode(path, safe: "") for a path component.

I like about the safe parameter, that it is technically more expressive than a flag that toggles between path and path segment encoding. But it's used in the wrong way. The acceptance of / is implicit and when you configure to encode it via safe = "", it's not clear what this parameter actually does. It would make more sense if the default was the empty string and you would use safe = "/" for encoding paths with multiple segments.

Another issue: The method name encode_path expresses a very specific intent, which has just two variants (whether to encode / or not). There's no need for more options. A flexible argument like safe doesn't make much sense there. We already have the yielding encode for flexible use cases.
That's different in Python, where urllib.parse.quote is more generic and combines the features of multiple Crystal methods.

So I believe encode_path with a flag parameter to configure encoding of / would be superior.
Alternatively, there could be two separate methods. That might be even better. URI.encode_path and URI.encode_path_segment.
Or just a single method to encode a path component.

Maybe it's best if we start with a method URI.encode_path(string : String) to encode a single path component.
Support for paths consisting of multiple components can be added later as an enhancement.

@beta-ziliani
Copy link
Member

I agree with @straight-shoota that it doesn't make sense to have a parameter unless we're up for a generic method. Then, whether we ought to have a flag or two different methods, I don't have strong feelings, but for consistency, I'm slightly more in favor of a flag, since we have the space_to_plus flag already.

BTW, your report on this matter is gold, thanks a lot @oprypin 🙇 !

@oprypin
Copy link
Member Author

oprypin commented Sep 20, 2021

@straight-shoota
It seems that we're arriving at the option where escaping paths is done by two separate methods.

Because if it's one method, let me explain why my suggestion looked like that and the issue I have with your suggestion: encode_path should mean actually "encode path", but if you're saying that "/" is unsafe by default, then it's encoding a path component by default. At that point, rather than having

encode_path()  # component
encode_path(actually_a_path: true)

I'd prefer

encode_path_component()
encode_path()

You could indeed still suggest something with escaping a path by default, like

encode_path(component: true)
encode_path()

but I don't like these as much, and naming the parameter could be difficult.


And, in the absence of opinions, I'd say let's keep the current encode under a different name, for a clearer migration path. The name I already suggested: encode_uri.

@straight-shoota straight-shoota self-assigned this Sep 20, 2021
@straight-shoota
Copy link
Member

Yes, I'm definitely favouring two separate methods now.

@beta-ziliani space_to_plus is a flag already in use, but it's not so much about defining the semantics of the encoded value (is it a path or a path segment?), but rather allows opting for the behaviour of an older version of the URI standard.

@oprypin
Copy link
Member Author

oprypin commented Sep 20, 2021

@straight-shoota I see you self-assigned the task :)

I just had a particular idea about the documentation: each of these methods should have an example "what if I want to do what this method does but with one particular character (say, ~) escaped / not escaped" - which would actually in every case just be a code snippet using URI.encode {}

@straight-shoota
Copy link
Member

Yeah, I've got a fix. It's just waiting on #11124 for facilitating proper specs.

I'm not sure I entirely understand what you want to put in the doc comment. It's probably best if you show an example (can do that in the PR).

@straight-shoota
Copy link
Member

I've looked a more closely at the implementation of Go's PathEscape and QueryEscape. The latter seems really equivalent to URI.encode_www_form and urllib.parse.quote_plus.

PathEscape leaves some restricted characters unencoded: $&+:=@. I don't think that not encoding + is a good idea because decoding could be ambiguous.
But the others appear to be safe. They don't have any reserved meaning in a path and are and allowed (see implementation comment at https://cs.opensource.google/go/go/+/refs/tags/go1.17.1:src/net/url/url.go;l=138). Per RFC, valid characters in a path segment are - besides unreserved characters - all sub-delims (!$&'()*+,;=), : and @.
I don't know why for example ! and * are still encoded though 🤷

code ╲ input ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~ q й
py urllib.parse.quote('') %20 %21 %22 %23 %24 %25 %26 %27 %28 %29 %2A %2B %2C - . %2F %3A %3B %3C %3D %3E %3F %40 %5B %5C %5D %5E _ %60 %7B %7C %7D ~ q %D0%B9
go url.PathEscape %20 %21 %22 %23 $ %25 & %27 %28 %29 %2A + %2C - . %2F : %3B %3C = %3E %3F @ %5B %5C %5D %5E _ %60 %7B %7C %7D ~ q %D0%B9`
go url.QueryExcape + %21 %22 %23 %24 %25 %26 %27 %28 %29 %2A %2B %2C - . %2F %3A %3B %3C %3D %3E %3F %40 %5B %5C %5D %5E _ %60 %7B %7C %7D ~ q %D0%B9
cr URI.encode_www_form + %21 %22 %23 %24 %25 %26 %27 %28 %29 %2A %2B %2C - . %2F %3A %3B %3C %3D %3E %3F %40 %5B %5C %5D %5E _ %60 %7B %7C %7D ~ q %D0%B9

(I have also added these two methods to the table in the OP).

Should we consider the escape rules from Go's PathEscape for our implementation?

In this context, there's also the question of non-escaping , and ; (alongside /) when encoding multiple path segments. Go's multisegment implementation (not exposed as a public function) recognizes /,; as segment separators. , and ; are mentioned in the RFC as being used a separators in non-hierarchical paths.

@oprypin
Copy link
Member Author

oprypin commented Sep 24, 2021

I honestly have no interest in trying to follow what Go does if it's different from all the other examples :/

@oprypin
Copy link
Member Author

oprypin commented Sep 24, 2021

Or let's put it a different way: do you foresee more user requests saying "why did it escape this character?" or "why didn't it escape this character?" Particularly consider the latter if it escapes less than Ruby. I think it's much better to just go with the majority from these examples.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind:bug A bug in the code. Does not apply to documentation, specs, etc. security topic:stdlib:networking
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants