As @fonsp pointed out in #39, URIs.jl does not technically handle Unicode characters correctly, at least according to RFC 3986. IETF RFC 3986 Sec. 1.2.1 implies that URIs should only contain characters from the US-ASCII charset and should percent-encode additional characters (RFC 3987 makes this a little more explicit). URIs.jl, however, will accept and work with any string as its input regardless of the underlying character set.
After diving into it for a bit, there seems to be a split in how the standard / canonical library for URI handling works in many other languages. In JavaScript, Go, and Rust, passing in a URI that uses Unicode will either force the URI to be percent-encoded or raise an error:
Rust's http crate will actually panic if you try to feed it a Unicode URI at all, e.g.:
use http::Uri;

fn main() {
    let uri = Uri::from_static("https://a/🌟/e");
    println!("{}", uri.path());
}
$ cargo run
Finished dev [unoptimized + debuginfo] target(s) in 0.01s
Running `target/debug/uri`
thread 'main' panicked at 'static str is not valid URI: invalid uri character', /home/kernelmethod/.cargo/registry/src/github.com-1ecc6299db9ec823/http-0.2.7/src/uri/mod.rs:365:23
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
But this isn't universally the case: in Python and Java, the Unicode encoding is preserved:
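As a minimal illustration of the Python side (a sketch written for this discussion, reusing the same example URI as the Rust snippet above), `urllib.parse.urlparse` returns the path with the non-ASCII character left untouched rather than percent-encoding it or raising an error:

```python
from urllib.parse import urlparse

# Same example URI as in the Rust snippet; the path contains a non-ASCII character.
parsed = urlparse("https://a/🌟/e")

# The path comes back as-is: no percent-encoding and no exception.
print(parsed.path)
```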
One potential difference between these languages is that Java's java.net.URI tries to comply with RFC 2396, whereas Python's urllib.parse.urlparse seems to try to comply with a mix of standards.
In any case, there's a bit of a dilemma here -- this library doesn't quite implement the RFC as specified, which is also an issue that has cropped up in other places, e.g. in the implementation of normpath (#20) and joinpath (related issue: #18). As far as this issue is concerned, it seems like there are three ways URIs.jl could go:
1. Percent-encode strings when we generate a URI to ensure compliance to the spec;
2. Implement RFC 3987 under the hood, which does permit Unicode characters; or
3. Keep the library's current behavior and try to specify which parts of URIs.jl comply with which RFCs, similar to what Python does for its urllib.parse module.
I would think that option (1) is the most preferable of all of these -- this library says that it implements URIs according to RFC 3986, so it should comply with that RFC.
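To make option (1) concrete, here is a rough sketch in Python of what percent-encoding on construction could look like. It is not URIs.jl code: `percent_encode_uri` and the two `SAFE_*` constants are hypothetical names introduced for illustration, and the sketch assumes that non-ASCII characters are encoded as UTF-8 bytes, that existing `%XX` escapes are left alone, and that the host is out of scope (a non-ASCII host would need IDNA handling instead).

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Hypothetical constants: characters left unescaped in each component.
# "%" is kept so that already-encoded input is not encoded a second time.
SAFE_PATH = "/%:@!$&'()*+,;="
SAFE_QUERY = "/?%:@!$&'()*+,;="


def percent_encode_uri(uri: str) -> str:
    """Percent-encode non-ASCII characters in the path, query, and fragment.

    Illustrative only: scheme and host are passed through unchanged.
    """
    parts = urlsplit(uri)
    path = quote(parts.path, safe=SAFE_PATH)          # UTF-8 bytes -> %XX
    query = quote(parts.query, safe=SAFE_QUERY)
    fragment = quote(parts.fragment, safe=SAFE_QUERY)
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))


encoded = percent_encode_uri("https://a/🌟/e")
print(encoded)
```

Under these assumptions an all-ASCII URI passes through unchanged, so the encoding step would only affect inputs that RFC 3986 does not permit in the first place.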