Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature add ability to turn off canonicalization in System.Uri #52628

Closed
normj opened this issue May 12, 2021 · 24 comments · Fixed by #59173
Closed

Feature add ability to turn off canonicalization in System.Uri #52628

normj opened this issue May 12, 2021 · 24 comments · Fixed by #59173
Assignees
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Net
Milestone

Comments

@normj
Copy link

normj commented May 12, 2021

Background and Motivation

This request is coming for the AWS SDK for .NET to support a use case for Amazon S3 the AWS object storage service.

S3 stores objects with an object key name. The key name supports any UTF-8 character and the key name is used as part of the resource path. Some users have a scenario where they are storing objects with "./" or "../" in the object key. The other AWS SDKs handle this behavior fine. In .NET we can't handle the scenario because the System.Uri class always canonicalizes the resource path.

For example if a user is trying to get an object at "http://mybucket/s3.amazonaws.com/foo/../bar" the Uri class transforms this to "http://mybucket/s3.amazonaws.com/bar". This is now changing the object key from "foo/../bar" to "bar". That of course causes the user to attempt to access a different object then intended. Also since our requests are signed they get a signature error because the request changed after signing happened.

I'm not saying Uri canonicalizing is wrong but what I would like is a way to specify to the Uri class to turn off canonicalizing. I see Uri class does have the Canonicalize method that ideally we could have subclassed and skip the code in Canonicalize but that method is deprecated and is not called.

Proposed API

For flexiblity sake and avoiding yet another boolean to the constructor I propose adding another enum called UriParserOptions that can be passed into the constructor. That would provide room in the future for other possible options that need to be set.

Usage Examples

[Flags]
enum UriParserOptions { DisableCanonicalization = 1 };

Uri s3Uri = new Uri("http://mybucket/s3.amazonaws.com/foo/../bar", UriParserOptions.DisableCanonicalization);

Alternative Designs

Thought about adding constructors with extra an boolean. I think this is infeasible because there is already the constructor that takes in a boolean for dontEscape. Adding another boolean will likely get confused with the dontEscape boolean and you would have to extend the dontEscape constructors to avoid collisions with the boolean but those constructors are obsolete. Also the boolean approach is too single purpose and doesn't leave any room for handling any other future scenarios.

Another alternative design I decided against was adding a property on the Uri class to disable canonicalization but that would violate the design of keeping Uri immutable.

Risks

Adds new constructors to the new URI class. URI class being so low level this might break somebodies reflection code that expect the class to never change. Seems unlikely but I recognize this is some of the really old code in the .NET codebase.

@normj normj added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label May 12, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.Net untriaged New issue has not been triaged by the area owner labels May 12, 2021
@ghost
Copy link

ghost commented May 12, 2021

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and Motivation

This request is coming for the AWS SDK for .NET to support a use case for Amazon S3 the AWS object storage service.

S3 stores objects with an object key name. The key name supports any UTF-8 character and the key name is used as part of the resource path. Some users have a scenario where they are storing objects with "./" or "../" in the object key. The other AWS SDKs handle this behavior fine. In .NET we can't handle the scenario because the System.Uri class always canonicalizes the resource path.

For example if a user is trying to get an object at "http://mybucket/s3.amazonaws.com/foo/../bar" the Uri class transforms this to "http://mybucket/s3.amazonaws.com/bar". This is now changing the object key from "foo/../bar" to "bar". That of course causes the user to attempt to access a different object then intended. Also since our requests are signed they get a signature error because the request changed after signing happened.

I'm not saying Uri canonicalizing is wrong but what I would like is a way to specify to the Uri class to turn off canonicalizing. I see Uri class does have the Canonicalize method that ideally we could have subclassed and skip the code in Canonicalize but that method is deprecated and is not called.

Proposed API

For flexiblity sake and avoiding yet another boolean to the constructor I propose adding another enum called UriParserOptions that can be passed into the constructor. That would provide room in the future for other possible options that need to be set.

Usage Examples

[Flags]
enum UriParserOptions { DisableCanonicalization = 1 };

Uri s3Uri = new Uri("http://mybucket/s3.amazonaws.com/foo/../bar", UriParserOptions.DisableCanonicalization);

Alternative Designs

Thought about adding constructors with extra an boolean. I think this is infeasible because there is already the constructor that takes in a boolean for dontEscape. Adding another boolean will likely get confused with the dontEscape boolean and you would have to extend the dontEscape constructors to avoid collisions with the boolean but those constructors are obsolete. Also the boolean approach is too single purpose and doesn't leave any room for handling any other future scenarios.

Another alternative design I decided against was adding a property on the Uri class to disable canonicalization but that would violate the design of keeping Uri immutable.

Risks

Adds new constructors to the new URI class. URI class being so low level this might break somebodies reflection code that expect the class to never change. Seems unlikely but I recognize this is some of the really old code in the .NET codebase.

Author: normj
Assignees: -
Labels:

api-suggestion, area-System.Net, untriaged

Milestone: -

@MihaZupan
Copy link
Member

MihaZupan commented May 12, 2021

Have you considered encoding the key into the query instead?

"http://mybucket/s3.amazonaws.com/?key=" + Uri.EscapeDataString("foo/../bar")
// http://mybucket/s3.amazonaws.com/?key=foo%2F..%2Fbar

What is stopping users from deciding to use other problematic characters as part of the key?

There are lots of ways to change/make the Uri invalid when just appending random data.
For example the casing of percent-encoded data might change. Some things might get escaped, some unescaped ...

There are other problems with such Uris - they will fall over when hitting proxies, used in a browser etc.

Using properly escaped queries on the other hand, you can store arbitrary bytes safely and still be confident that the shape will be preserved through different systems.

If you are only looking at the validation part of Uri, you can customize the parsing behavior for custom schemes somewhat:

const string AwsScheme = "aws-object";
UriParser.Register(new GenericUriParser(GenericUriParserOptions.DontCompressPath), AwsScheme, 443);

string uriString = $"{AwsScheme}://mybucket/s3.amazonaws.com/foo/../bar";

Console.WriteLine(new Uri(uriString).AbsoluteUri);
// aws-object://mybucket/s3.amazonaws.com/foo/../bar

HttpClient will not let you send such a request though.

@karelz
Copy link
Member

karelz commented May 12, 2021

What about using Uri.OriginalString to avoid canonicalization and any other changes to the Uri we may do? Would that work for you?
Or are you relying on some changes to the Uri (except the relative path removal)?

@normj
Copy link
Author

normj commented May 14, 2021

@MihaZupan I can't change S3 to use the query string for object key. It is too built into the nature of S3 at this point. Escaping problematic characters is generally not a problem because that doesn't really change the request since the resource path gets unescaped on the S3 side.

@karelz Ideally Uri.OriginalString would work but we aren't the ones consuming the Uri class. HttpClient is consuming the Uri class and I don't know of a way to tell it to use OriginalString.

@karelz
Copy link
Member

karelz commented May 14, 2021

I think we need to first understand this better.

  1. I am curious what URI RFCs say about '..' and '.' -- @MihaZupan do you have links to that part? Is canonicalization optional?
  2. I am very surprised that "a/../b" is not the same as "b". What is the story behind that? What does ".." and "." mean in S3 context? Why are those keys not equivalent? Was it designed on purpose or is it accidental side-effect of implementation (which can't be fixed due to app compat)?
  3. How common is the pattern where S3 customers use ".." and "."? Does everyone hit it? Some specific subservices? Or is it a corner case?
  4. I am surprised that other AWS SDKs don't have the problem -- which languages are those? Is that normal behavior of those languages to NOT canonicalize Uri? Or do they expose similar opt-ins you proposed above?
  5. Why is ".." and "." treated differently than escaping characters, etc.? What is so special on them?
  6. Do we expect other non-S3 servers/services in the world to need this "weird" pattern?

Overall this feels kind of ugly and hacky to me. I am afraid that it may not be the only thing needed to make .NET AWS SDK work well with S3.
That said, S3 is large part of web, so I can imagine we would accommodate if there are reasonable answers to the questions above ...

@geoffkizer any additional thoughts here?

@normj
Copy link
Author

normj commented May 19, 2021

Answering what I can

  1. S3 treats the resource path as basically an opaque string. It puts no special logic on any of the characters in the resource path. So "." and ".." means nothing to S3. In an overly simplistic way of thinking the resource path is taken as a whole string for a dictionary key.

  2. We have gotten a few reports of customers hitting this blocking issue with .NET over the years. I don't think it is common but the fact there is no workaround can be disaster for a prod application that hits this problem.

  3. I don't know the full list but I know Javascript and Python are not affected by this behavior

  4. From S3 point of view ".." and "." are not special. But the URI class canonicalizes even if the periods are escaped. For example if I run the following.

var uri = new Uri("http://s3.amazonaws.com/foo/bar/%2E%2E/text.txt");
Console.WriteLine(uri.ToString());

I get the following

http://s3.amazonaws.com/foo/text.txt

So even though I escaped the periods the canalization happen removing the parent folder "bar" from the resource path.

@KalleOlaviNiemitalo
Copy link

From looking at the source, I suspect IriHelper unescapes the %2E when it shouldn't. Can the IRI parsing be disabled?

@karelz
Copy link
Member

karelz commented May 19, 2021

Decoding %2E and interpreting it as ./.. is surprising to me. @MihaZupan what do you think? Is that intentional or a bug?
Assuming it is a bug, if we fix that part, would that satisfy S3 needs @normj?

@normj
Copy link
Author

normj commented May 19, 2021

If you fix that behavior I'm fine making the SDK encode periods.

@MihaZupan
Copy link
Member

It is not a bug. See RFC3986 Section 2.3. Unreserved Characters (emphasis mine)

Characters that are allowed in a URI but do not have a reserved
purpose are called unreserved. These include uppercase and lowercase
letters, decimal digits, hyphen, period, underscore, and tilde.

unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"

URIs that differ in the replacement of an unreserved character with
its corresponding percent-encoded US-ASCII octet are equivalent: they
identify the same resource
. However, URI comparison implementations
do not always perform normalization prior to comparison (see Section
6). For consistency, percent-encoded octets in the ranges of ALPHA
(%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E),
underscore (%5F), or tilde (%7E) should not be created by URI
producers
and, when found in a URI, should be decoded to their
corresponding unreserved characters
by URI normalizers.

Further see c# Uri docs - remarks

Escaped characters (also known as percent-encoded octets) that don't have a reserved purpose are decoded (also known as being unescaped). These unreserved characters include uppercase and lowercase letters (%41-%5A and %61-%7A), decimal digits (%30-%39), hyphen (%2D), period (%2E), underscore (%5F), and tilde (%7E).

Canonicalizes the path for hierarchical URIs by compacting sequences such as /./, /../, and // (whether or not the sequence is escaped).

As part of canonicalization in the constructor for some schemes, dot-segments and empty segments (/./, /../, and //) are compacted (in other words, they are removed). The schemes for which Uri compacts segments include http, https, ...

Regarding the overall discussion, I would also point out RFC3986 Section 6.2.2.3. Path Segment Normalization

The complete path segments "." and ".." are intended only for use
within relative references
(Section 4.1) and are removed as part of
the reference resolution process
(Section 5.2). However, some
deployed implementations incorrectly assume that reference resolution
is not necessary when the reference is already a URI and thus fail to
remove dot-segments when they occur in non-relative paths. URI
normalizers should remove dot-segments
by applying the
remove_dot_segments algorithm to the path, as described in
Section 5.2.4.

It is critical from a security perspective (path traversal) that Uri decodes sequences like %2E and normalizes (compresses) the path.

@MihaZupan I can't change S3 to use the query string for object key. It is too built into the nature of S3 at this point.

I was not implying to replace all usages.
The AWS service could start accepting the query string at the edge, rewriting it to a path (or vice-versa).

Can the IRI parsing be disabled?

It can be for custom schemes, but not for internally-recognized ones like http.

@KalleOlaviNiemitalo
Copy link

If you percent-encode the slashes instead, will that preserve the dots and keep S3 happy?

@MihaZupan
Copy link
Member

MihaZupan commented May 20, 2021

I am still worried that this is just focusing on one specific symptom (case of ..) of the general problem (blindly stuffing bytes into Uri).

Gathering from the documentation it seems that other problematic characters are also allowed. What happens if the key contains:

  • % - how do you distinguish that from escaped characters?
  • ? - This may mess up query parsers
  • # - Uri will treat everything after it as a fragment so anything following it will be dropped (fragments are not sent over the wire, so if you concat other query parameters after it they will all be dropped)

@karelz
Copy link
Member

karelz commented May 20, 2021

@normj looking at RFC snippets from @MihaZupan above in #52628 (comment), it seems that removing canonicalization from Uri would go against RFC and may be just one of the problems S3 may hit (as hinted in #52628 (comment)).

I am hesitant to create an API option to make Uri class intentionally RFC non-compliant.

Can you help us understand what are the scenarios where the customers hit these problems in production? Did S3 force them to use ./.., or was it customers' choice which "just works" in other languages, but does not work in C#?
I am trying to understand the end-to-end scenario and decide if S3 is forcing the world into violating RFC, or if it is just implementation side-effect, but not recommended by S3 which just happens to work (kept in S3 for backward compatibility).
The fact that other languages do not adhere to the RFC either is somewhat surprising to me ... it would be nice if someone can validate that claim on a small HelloWorld-style repro (without S3).

@KalleOlaviNiemitalo
Copy link

using System;

namespace ConsoleApp1
{
    class Program
    {
        static void Main(string[] args)
        {
            foreach (string uriString in args)
            {
                var uri = new Uri(uriString);
                Console.WriteLine($"Input: {uriString}");
                Console.WriteLine($"OriginalString: {uri.OriginalString}");
                Console.WriteLine($"AbsoluteUri: {uri.AbsoluteUri}");
                Console.WriteLine($"AbsolutePath: {uri.AbsolutePath}");
                Console.WriteLine($"LocalPath: {uri.LocalPath}");
                Console.WriteLine();
            }
        }
    }
}
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <startup>
    <supportedRuntime version="v4.0" sku=".NETFramework,Version=v4.8"/>
  </startup>
  <uri>
    <iriParsing enabled="false"/>
  </uri>
</configuration>
Input: http://example/abc%2F..%2Fdef
OriginalString: http://example/abc%2F..%2Fdef
AbsoluteUri: http://example/abc%2F..%2Fdef
AbsolutePath: /abc%2F..%2Fdef
LocalPath: /abc/../def

Input: http://example/abc%2F%2E%2E%2Fdef
OriginalString: http://example/abc%2F%2E%2E%2Fdef
AbsoluteUri: http://example/abc%2F..%2Fdef
AbsolutePath: /abc%2F..%2Fdef
LocalPath: /abc/../def

Input: http://example/abc/%2E%2E/def
OriginalString: http://example/abc/%2E%2E/def
AbsoluteUri: http://example/def
AbsolutePath: /def
LocalPath: /def

I get the same results with each of the following:

  • .NET Framework 4.8, <iriParsing enabled="true"/>
  • .NET Framework 4.8, <iriParsing enabled="false"/>
  • .NET 5.0.6

So the IRI parsing is not relevant after all, and the behavior is not a regression.

AbsoluteUri does preserve "%2F", which may then be viable as a workaround, if S3 treats it as equivalent to an unencoded slash.

@scalablecory
Copy link
Contributor

HTTP demands URIs, so this is technically also asking for HttpClient (or rather SocketsHttpHandler) to implement non-standard behavior as well.

This is the kind of thing non-validating LLHTTP would be perfect for.

@KalleOlaviNiemitalo
Copy link

I guess LLHTTP means #525.

@normj
Copy link
Author

normj commented May 27, 2021

@karelz Yes, I recognize this is going against the RFC. Which is why I definitely don't want to change any existing behavior just add a back door for use cases that are not RFC compatible.

S3 is not forcing anybody to use ./ and ../ in their scenarios. Some customers are just choosing to have object keys with those characters in there. Since S3 considers the object key an opaque string at the service level ./ and ../ mean nothing. I wish S3 would have had more restrictions on the characters of an object key other then saying UTF-8 but we are 15 years too late on changing that decisions.

I talked to our Go team and Go allows you to set a RawQuery field to skip any language specific encodings and allow the Go developer handle the encoding themselves.
https://github.com/aws/aws-sdk-go/blob/e2d6cb448883e4f4fcc5246650f89bde349041ec/private/protocol/rest/build.go#L129

@MihaZupan
Copy link
Member

@normj could you please confirm whether other cases pointed out in #52628 (comment) are also affecting you

@normj
Copy link
Author

normj commented May 29, 2021

We have no issue using the characters % ? # with our SDK. The following code successfully puts an object in S3, confirms we get the key back in the exact same form and can get the object's content back.

using System;
using System.IO;
using System.Linq;

using Amazon;
using Amazon.S3;
using Amazon.S3.Model;

using var s3Client = new AmazonS3Client(RegionEndpoint.USEast1);

var bucketName = "normj-east1";
var objectKey = "%?#";

await s3Client.PutObjectAsync(new PutObjectRequest
{
    BucketName = bucketName,
    Key = objectKey,
    ContentBody = "hello"
});

var response = await s3Client.ListObjectsAsync(bucketName);
Console.WriteLine($"Found Key: {response.S3Objects.Any(x => string.Equals(objectKey, x.Key))}");

using var getResponse = await s3Client.GetObjectAsync(bucketName, objectKey);
Console.WriteLine($"Content: {new StreamReader(getResponse.ResponseStream).ReadToEnd()}");

The output:

Found Key: True
Content: hello

In this case the special characters are percent encoded so the previous SDK calls go to the following URI

https://normj-east1.s3.amazonaws.com/%25%3F%23

But with ../ those periods are removed during canonicalization before our SDK has a chance to encode the problematic characters.

@MihaZupan
Copy link
Member

Does S3 differentiate between %2F and /?
If not, you can encode the slashes and the segment will not be removed:

new Uri("http://foo/one%2Ftwo%2F..%2F%three").AbsoluteUri
// http://foo/one%2Ftwo%2F..%2F%25three

@karelz karelz added this to the Future milestone Jun 3, 2021
@karelz karelz removed the untriaged New issue has not been triaged by the area owner label Jun 3, 2021
@normj
Copy link
Author

normj commented Jun 11, 2021

@MihaZupan That solution comes close to working but the it doesn't work when the S3 object starts with periods like ./bar or ../foo because the URI becomes http://normj-east1.s3.amazonaws.com/..%2Ffoo. We can't escape the first slash and so the URI class will convert this to http://normj-east1.s3.amazonaws.com/foo.

@KalleOlaviNiemitalo
Copy link

KalleOlaviNiemitalo commented Jun 11, 2021

PowerShell 7.1.3 using .NET 5.0.4:

PS C:\> [System.Uri]::new("http://normj-east1.s3.amazonaws.com/..%2Ffoo").AbsoluteUri
http://normj-east1.s3.amazonaws.com/..%2Ffoo
PS C:\> [System.Uri]::new("http://normj-east1.s3.amazonaws.com/..%2Ffoo").Segments
/
..%2Ffoo

i.e. it did not decode to http://normj-east1.s3.amazonaws.com/../foo and normalize to http://normj-east1.s3.amazonaws.com/foo.

@MihaZupan
Copy link
Member

@normj I can confirm the behavior Kalle described: for a Uri like http://normj-east1.s3.amazonaws.com/..%2Ffoo, the segment will not be removed.

Can you please confirm that those are the right examples?

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Sep 17, 2021
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Sep 17, 2021
@karelz karelz modified the milestones: Future, 6.0.0 Sep 20, 2021
@karelz
Copy link
Member

karelz commented Sep 20, 2021

Part of larger Uri API addition tracked in #59099. This particular part should be addressed now:

Fixed in 6.0 RC2 in PR #59274.
Fixed in 7.0 (main) in PR #59173.

@karelz karelz closed this as completed Sep 20, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Nov 3, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Net
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants