Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net/http: RawPath shows inconsistent, case-sensitive behaviour with percentage encoded Unicode URI strings #33596

Open
knadh opened this issue Aug 12, 2019 · 21 comments · May be fixed by #53848
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@knadh
Copy link

knadh commented Aug 12, 2019

What version of Go are you using (go version)?

$ go version
go version go1.12.7 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

Any environment

go env Output
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/user/.cache/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/user/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build606111517=/tmp/go-build -gno-record-gcc-switches"

What did you do?

package main

import (
	"fmt"
	"html/template"
	"net/http"
	"net/url"
)

type content struct {
	TplEncoded      string
	ManuallyEncoded template.URL

	ShowPaths bool
	RawPath   string
	Path      string
}

func main() {
	tpl, _ := template.New("test").Parse(`<!doctype html>
	<head>
		<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
		<meta charset="utf-8" />
	</head>
	<body>
		{{ if .ShowPaths }}
			<p>RawPath = {{ .RawPath }}</p>
			<p>Path = {{ .Path }}</p>
		{{ else }}
			<a href="/link/{{ .TplEncoded }}">Template encoded link</a><br />
			<a href="/link/{{ .ManuallyEncoded }}">Manually encoded link</a>
			<br />
			<p>View this page's source to see the (lower/upper)case difference
			in the links</p>
		{{ end }}
	</body>
	</html>`)

	// Renders the root with good and bad links.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		s := "😋" // Unicode emoji.
		tpl.Execute(w, content{
			// html/template encodes into lowercase characters.
			TplEncoded: s,

			// url.PathEscape encodes into uppercase characters.
			ManuallyEncoded: template.URL(url.PathEscape(s)),
		})
	})

	// This handler produces inconsistent RawPath based on (upper/lower)case encoding in the URI.
	http.HandleFunc("/link/", func(w http.ResponseWriter, r *http.Request) {
		tpl.Execute(w, content{
			ShowPaths: true,
			RawPath:   r.URL.RawPath,
			Path:      r.URL.Path,
		})
	})

	fmt.Println("Go to http://127.0.0.1:8080")
	http.ListenAndServe(":8080", nil)
}

What did you expect to see?

url.PathEscape("😋") => %F0%9F%98%8B

/link/%F0%9F%98%8B (A) and /link/%f0%9f%98%8b (B) (upper and lower case respectively) are equivalent as per RFC 3986. An http.HandlerFunc() handling either of the URLs is expected to show consistent behaviour.

What did you see instead?

An http handler that processes the identical URIs A and B behaves differently. B, which has uppercase characters, produces an empty http.Request.URL.RawPath where as A that has lowercase characters produces an http.Request.URL.RawPath with unescaped characters. This breaks Unicode URL handling in popular HTTP routers like chi and httprouter.

Discovered this inconsistency when using html/template that encodes Unicode strings in <a> to have lowercase characters as opposed to url.PathEscape that produces uppercase characters.

@andybons andybons added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Aug 12, 2019
@andybons andybons added this to the Unplanned milestone Aug 12, 2019
@knadh
Copy link
Author

knadh commented Aug 13, 2019

go/src/net/url/url.go

Lines 673 to 686 in 61bb56a

func (u *URL) setPath(p string) error {
path, err := unescape(p, encodePath)
if err != nil {
return err
}
u.Path = path
if escp := escape(path, encodePath); p == escp {
// Default encoding is fine.
u.RawPath = ""
} else {
u.RawPath = p
}
return nil
}

I think this is happening in net/url.URL.setPath(). Line 674 unescapes lowercased encoding successfully, and line 679 escapes the unescaped value again which now produces uppercased encoding. Thus, the original lowercase encoded value, and the re-encoded uppercase value, although equivalent semantically, don't match (p == escp), resulting in RawPath being set.

@aldas
Copy link

aldas commented Apr 30, 2022

This is testcase to show case-sensitive behavior. When unescaped character is uppercase in raw form RawPath will not be filled as escape will provide same value as p at if escp := escape(path, encodePath); p == escp {

func TestURL_ParseEscapingHexValues(t *testing.T) {
	var testCases = []struct {
		name          string
		when          string
		expectPath    string
		expectRawPath string
	}{
		{
			name:          "uppercase F in %3f",
			when:          "/ab%3Fde",
			expectPath:    "/ab?de",
			expectRawPath: `/ab%3Fde`, // actual RawPath is empty and test fails
		},
		{
			name:          "lowercase f in %3f",
			when:          "/ab%3fde",
			expectPath:    "/ab?de",
			expectRawPath: `/ab%3fde`,
		},
	}

	for _, tc := range testCases {
		t.Run(tc.name, func(t *testing.T) {
			u, _ := url.Parse(tc.when)

			if u.Path != tc.expectPath {
				t.Errorf("path `%s` is not `%s`", u.Path, tc.expectPath)
			}
			if u.RawPath != tc.expectRawPath {
				t.Errorf("RawPath `%s` is not `%s`", u.RawPath, tc.expectRawPath)
			}
		})
	}
}

subnut added a commit to subnut/go that referenced this issue Jul 13, 2022
subnut added a commit to subnut/go that referenced this issue Jul 13, 2022
@subnut subnut linked a pull request Jul 13, 2022 that will close this issue
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/417395 mentions this issue: net/url: normalize hex values before comparision

@neild
Copy link
Contributor

neild commented Jul 15, 2022

Use URL.EscapedPath to get the escaped path.

Parse only sets RawPath when the default escaped form of u.Path is different from the original path. When the two are the same, it leaves RawPath unset, since EscapedPath will reconstruct the original encoding.

See, for example: https://go.dev/play/p/TCNnJEl77Zb

@aldas
Copy link

aldas commented Jul 15, 2022

This is not that useful when you compare performance difference between using RawPath and calling EscapedPath. If code for setting path knows how to deal with hex values it is perfectly normal to expect it to work with upper-case and lower-case hex values.

@subnut
Copy link

subnut commented Jul 16, 2022

Parse only sets RawPath when the default escaped form of u.Path is different from the original path

I agree that "?" and "%3f" are different strings. But how are "%3f" and "%3F" different strings?

RFC 3986, Section 2.1 states that -

  The uppercase hexadecimal digits 'A' through 'F' are equivalent to
  the lowercase digits 'a' through 'f', respectively.  If two URIs
  differ only in the case of hexadecimal digits used in percent-encoded
  octets, they are equivalent.

So, according to RFC3986, "%3f" and "%3F" are equivalent.

@neild
Copy link
Contributor

neild commented Jul 16, 2022

RawPath is only set when the default escaped form of u.Path is different from the original path.

The way to get the escaped path of a url.URL is the EscapedPath method, which will consistently return the original path prior to parsing. See https://go.dev/play/p/TCNnJEl77Zb for an example: EscapedPath preserves the original path of both %3f and %3F.

@subnut
Copy link

subnut commented Jul 16, 2022

RFC3986 §2.1

For consistency, URI producers and normalizers should use
uppercase hexadecimal digits for all percent-encodings.

IMO, the fact that u.EscapedPath() is returning a percent-encoded string that has lowercase hexadecimal digits is itself a bug.

EscapedPath preserves the original path of both %3f and %3F

IMO, it shouldn't. It should return %3F in both cases.

@subnut
Copy link

subnut commented Jul 16, 2022

Or maybe we should keep EscapedPath unchanged (to not break applications that rely on this behaviour), and introduce a new function to compare two URLs?

@gazerro
Copy link
Contributor

gazerro commented Jul 16, 2022

@subnut

URI producers and normalizers should use
uppercase hexadecimal digits for all percent-encodings.

From its documentation, EscapedPath returns u.RawPath when it is a valid escaping of u.Path. Otherwise it computes an escaped form on its own. Only in this latter case it can be considered an URI producer.

But note that the RFC uses the word should and not must, and in a RFC this difference is relevant. So I think this behavior shouldn't be considered a bug anyway.

@neild
Copy link
Contributor

neild commented Jul 16, 2022

IMO, the fact that u.EscapedPath() is returning a percent-encoded string that has lowercase hexadecimal digits is itself a bug.

For a URL constructed by url.Parse, EscapedPath returns the original path, preserving the exact escaping used in that original path.

For a URL constructed manually, such as url.URL{Path: "a?b"}, EscapedPath uses upper case hexadecimal digits to escape the path as recommended by RFC 3986.

@aldas
Copy link

aldas commented Jul 16, 2022

Guys, you do understand that we are dealing with things that clients send over the (inter)net.

RawPath value being different if path contains %3f or %3F is unexpected even so that both Path are unescaped fine but somehow RawPath result is empty/filled for one case.

https://go.dev/play/p/566KcdtC18-

func main() {
	lowerCase, _ := url.Parse("/ab%3fde")
	fmt.Printf("RawPath for /ab%%3fde is \"%s\" and Path is \"%s\"\n", lowerCase.RawPath, lowerCase.Path)

	upperCase, _ := url.Parse("/ab%3Fde")
	fmt.Printf("RawPath for /ab%%3Fde is \"%s\" and Path is \"%s\"\n", upperCase.RawPath, upperCase.Path)
}

creates output:

RawPath for /ab%3fde is "/ab%3fde" and Path is "/ab?de"
RawPath for /ab%3Fde is "" and Path is "/ab?de"

how is this consistent behavior?

@gazerro
Copy link
Contributor

gazerro commented Jul 16, 2022

@aldas I think you are giving RawPath a meaning it doesn't have. It is not meant to represent the encoded path. It exists only and exclusively as a support to the EscapedPath method.

The documentation says encoded path hint (see EscapedPath method) and the commit message 874a605 that added RawPath is more explicit: Clients should use EscapedPath instead of reading RawPath directly.

@neild Would it be better if the RawPath documentation was rewritten as encoded path hint (use EscapedPath instead of reading it directly)?

@aldas
Copy link

aldas commented Jul 16, 2022

From docs I see and agree that RawPath is to differentiate if you had / (unescaped) or %2f (escaped character) but I really do not find anything about it being case sensitive for hex chars in escaped characters.

go/src/net/url/url.go

Lines 662 to 669 in 2aa473c

// setPath sets the Path and RawPath fields of the URL based on the provided
// escaped path p. It maintains the invariant that RawPath is only specified
// when it differs from the default encoding of the path.
// For example:
// - setPath("/foo/bar") will set Path="/foo/bar" and RawPath=""
// - setPath("/foo%2fbar") will set Path="/foo/bar" and RawPath="/foo%2fbar"
// setPath will return an error only if the provided path contains an invalid
// escaping.

even docs have lower case hex value and no mentions that %2f %2F would be different in any way

So it should be?

 // - setPath("/foo/bar")   will set Path="/foo/bar" and RawPath="" 
 // - setPath("/foo%2Fbar") will set Path="/foo/bar" and RawPath="" 
 // - setPath("/foo%2fbar") will set Path="/foo/bar" and RawPath="/foo%2fbar" 

this %2f %2F differentiation looks so bizarre.

@gazerro
Copy link
Contributor

gazerro commented Jul 16, 2022

The documentation of the URL type says

... RawPath, an optional field which only gets set if the default encoding is different from Path.

where with default we should mean the encoding used by the PathEscape function.

This documentation is wrong, it should be

... RawPath, an optional field which only gets set if the default encoding of Path is different from the original path.

that is consistent with the setPath documentation.

@aldas In light of this, your previous example

RawPath for /ab%3fde is "/ab%3fde" and Path is "/ab?de"
RawPath for /ab%3Fde is "" and Path is "/ab?de"

behaves correctly and explains the "bizarre" behavior.

Also this example is correct

RawPath for /ab%2fde is "/ab%2fde" and Path is "/ab/de"
RawPath for /ab%2Fde is "/ab%2Fde" and Path is "/ab/de"

If the original path is /ab%2Fde, Path is /ab/de and its default encoding is /ab/de. Because /ab/de is different from
/ab%2Fde, RawPath is set to /ab%2Fde and not to an empty string.

@aldas
Copy link

aldas commented Jul 17, 2022

Could someone explain me if I develop HTTP server related code - request router or "middleware" or http handler and I need to get actual path part that client sent, where can I get it without doing extra work? Real world example - I want to serve file downloads where part of path is the file name and therefore I can not use Path because it is being unescaped automatically.

func handler(w http.ResponseWriter, r *http.Request) {
	io.WriteString(w, fmt.Sprintf("raw path part: %s\n", r.URL.RawPath))
}

func main() {
	http.HandleFunc("/", handler)
	if err := http.ListenAndServe(":8080", nil); err != http.ErrServerClosed {
		log.Print(err)
	}
}

"most" of the time RawPath works but when we are dealing with request containing escaped values somehow the case of escaped hex value matter.

so for curl http://localhost:8080/file/report%3fpart1.csv I can get hold of "raw"/actual part what was sent
but for curl http://localhost:8080/file/report%3Fpart1.csv I can not - unless I do the extra work (that was already done in setPath) and call EscapedPath().

Why do we need to call EscapedPath() for uppercase curl http://localhost:8080/file/report%3Fpart1.csv and do escape again inside EscapedPath(), the same work twice as setPath called also escape on same path? This just baffles me.

I am not native English speaker - so to summarize:
Why do upper-case hex value %2F requires twice the amount of work to get HTTP request "raw path part" that client actually sent?

@gazerro
Copy link
Contributor

gazerro commented Jul 17, 2022

Could someone explain me if I develop HTTP server related code - request router or "middleware" or http handler and I need to get actual path part that client sent, where can I get it without doing extra work?

http.Request.RequestURI

@aldas
Copy link

aldas commented Jul 17, 2022

This is going nowhere. http.Request.RequestURI contains query and fragments parts and forces you to parse it to get path part. Now you are still doing extra work like you would do with EscapePath but atleast you can now if you are dealing with upper/lower case hex value.


I assume %2f %2F are so completely different things and case matters for when comes to URL that is sourced from internetz.

So at least docs should explictly state in public API that these %2f %2F are different.

For example setPath method has useful comment about RawPath behaviour but setPath is private method so you probably will not see that

go/src/net/url/url.go

Lines 662 to 669 in 2aa473c

// setPath sets the Path and RawPath fields of the URL based on the provided
// escaped path p. It maintains the invariant that RawPath is only specified
// when it differs from the default encoding of the path.
// For example:
// - setPath("/foo/bar") will set Path="/foo/bar" and RawPath=""
// - setPath("/foo%2fbar") will set Path="/foo/bar" and RawPath="/foo%2fbar"
// setPath will return an error only if the provided path contains an invalid
// escaping.

It would be useful if URL comment would explicitly state that case matters for escaped values. Currently this is not obvious from

go/src/net/url/url.go

Lines 351 to 358 in 2aa473c

// Note that the Path field is stored in decoded form: /%47%6f%2f becomes /Go/.
// A consequence is that it is impossible to tell which slashes in the Path were
// slashes in the raw URL and which were %2f. This distinction is rarely important,
// but when it is, the code should use RawPath, an optional field which only gets
// set if the default encoding is different from Path.
//
// URL's String method uses the EscapedPath method to obtain the path. See the
// EscapedPath method for more details.

// This distinction is rarely important,
// but when it is, the code should use RawPath,

this part does not take into consideration that for %2F RawPath is empty and you have no way knowing if URL was escaped or not. It would be useful if there would be example for 3 cases like that:

 // - "/foo/bar"   will set Path="/foo/bar" and RawPath="" 
 // - "/foo%2Fbar" will set Path="/foo/bar" and RawPath="" 
 // - "/foo%2fbar" will set Path="/foo/bar" and RawPath="/foo%2fbar" 

RawPath comment is not that useful as it circular reference with EscapePath and does not help you with hidden cotchas. I personally would think that it would be helpful if there would be displaimer that Raw does not mean "pure" in this field context and you can not assume part of original string that was parsed into URL when it comes to hex digits.

RawPath string // encoded path hint (see EscapedPath method)

and EscapePath() comment

go/src/net/url/url.go

Lines 685 to 693 in 2aa473c

// EscapedPath returns the escaped form of u.Path.
// In general there are multiple possible escaped forms of any path.
// EscapedPath returns u.RawPath when it is a valid escaping of u.Path.
// Otherwise EscapedPath ignores u.RawPath and computes an escaped
// form on its own.
// The String and RequestURI methods use EscapedPath to construct
// their results.
// In general, code should call EscapedPath instead of
// reading u.RawPath directly.

says that

EscapedPath returns u.RawPath when it is a valid escaping of u.Path

does this sentence means that path that contains %2F was not valid?


Now take these strings for example and see what happens when you send escaped value

https://go.dev/play/p/LNKg1upv9ZP

func main() {
	for _, p := range []string{
		"report.csv%3fpart1?a=1", // lowercase ?
		"report.csv%3Fpart1?a=1", // uppercase ? <-- RawPath does not exists
		"%5e",                    // lowercase ^
		"%5E",                    // uppercase ^ <-- RawPath does not exists
		"%7ebashrc",              // lowercase ~
		"%7Ebashrc",              // uppercase ~ <-- RawPath exists
	} {
		u, err := url.Parse(p)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("original: %s\t, Path: %s, EscapedPath(): %s\t,RawPath: %s\n", p, u.Path, u.EscapedPath(), u.RawPath)
	}
}

Output

original: report.csv%3fpart1?a=1	, Path: report.csv?part1, EscapedPath(): report.csv%3fpart1	,RawPath: report.csv%3fpart1
original: report.csv%3Fpart1?a=1	, Path: report.csv?part1, EscapedPath(): report.csv%3Fpart1	,RawPath: 
original: %5e	, Path: ^, EscapedPath(): %5e	,RawPath: %5e
original: %5E	, Path: ^, EscapedPath(): %5E	,RawPath: 
original: %7ebashrc	, Path: ~bashrc, EscapedPath(): %7ebashrc	,RawPath: %7ebashrc
original: %7Ebashrc	, Path: ~bashrc, EscapedPath(): %7Ebashrc	,RawPath: %7Ebashrc

somehow now RawPath does not care about hexadecimal digits case for ~ and is filled.

This is not intuitive at all.

@aldas
Copy link

aldas commented Jul 17, 2022

This is commit that added RawPath 874a605

In that commit message @rsc wrote

Split the difference by treating u.RawPath as a hint: the observation
is that there are many valid encodings of u.Path. If u.RawPath is one
of them, use it.

Does "many valid encodings" mean that 3f and 3F are both valid or not? By my understanding of escaping and https://datatracker.ietf.org/doc/html/rfc3986#section-2.1 stating

The uppercase hexadecimal digits 'A' through 'F' are equivalent to
the lowercase digits 'a' through 'f', respectively.

would it mean that RawPath should get filled for both cases? To hint that original path was encoded.

At the moment cases like %3F or %5E this knowledge is lost but RawPath exists to hint that and for performance reasons this is very useful. Because if URL was not encoded you can/could opt out of calling EscapedPath or doing extra work to parse actual raw/original path from http.Request.RequestURI


and this question still remains:

Why do upper-case hex values like %3F %5E requires twice the amount of work to get HTTP request "raw path part" that client actually sent? setPath calls escape once and not when you are forced to use EscapedPath you are calling escape twice.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/418035 mentions this issue: net/url: clarify RawPath documentation

@neild
Copy link
Contributor

neild commented Jul 18, 2022

The original implementation of URL did not preserve the original unescaped form of a URL. After parsing a URL string, there was no way to distinguish between a/b, a%2fb, and a%2Fb, since these all escape to the same string. The escaped form of Path was accessible via the EscapedPath method.

Go 1.15 changed the EscapedPath method to return the original path. It did this by adding the RawPath field, which is set only when the escaped path is different from the default escaping of Path.

I think there's an argument that RawPath should have been an unexported field, to avoid confusion, but it's far too late to worry about that now. The behavior of RawPath is fairly straightforward, however:

  • Code which parses raw URLs will set RawPath if the escaped path is different from Path.
  • EscapedPath will return RawPath if it contains a valid escaping of Path.

The RawPath field is not present for efficiency purposes. It exists to allow EscapedPath to return the original encoded form of a URL. Most code should use EscapedPath rather than accessing RawPath directly.

gopherbot pushed a commit that referenced this issue Aug 9, 2022
Consistently recommend using EscapedPath rather than RawPath directly.

For #33596.

Change-Id: Ibe5c2dfa7fe6b1fbc540efed6db1291fc6532726
Reviewed-on: https://go-review.googlesource.com/c/go/+/418035
Run-TryBot: Damien Neil <[email protected]>
Reviewed-by: Brad Fitzpatrick <[email protected]>
Reviewed-by: Than McIntosh <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
jproberts pushed a commit to jproberts/go that referenced this issue Aug 10, 2022
Consistently recommend using EscapedPath rather than RawPath directly.

For golang#33596.

Change-Id: Ibe5c2dfa7fe6b1fbc540efed6db1291fc6532726
Reviewed-on: https://go-review.googlesource.com/c/go/+/418035
Run-TryBot: Damien Neil <[email protected]>
Reviewed-by: Brad Fitzpatrick <[email protected]>
Reviewed-by: Than McIntosh <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants