"return the normalized absolute URL" for invalid URLs? #9
I think the parsers shouldn't (or don't need to) validate URLs. From https://en.wikipedia.org/wiki/URL_normalization#Normalizations_that_preserve_semantics, I think converting the scheme and domain to lower case could be a nice-to-have. I don't think the other normalizations in that section are necessary, though. Thoughts?
In the Python implementation, the stdlib functions used to resolve relative URLs throw an exception when they encounter an invalid URL. It does not happen due to any extra normalization attempt (although it might be another issue to clarify what "normalization" means, or whether it should be removed). I'd side with the quoted advice: invalid URLs cannot receive any URL-specific handling and thus will be returned as-is. This could be added to the spec as something like (please suggest better wording!)
In mf2py, this would then be implemented by catching any such exceptions and returning the original value instead.
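A minimal sketch of that approach (the helper name is made up here; this is not mf2py's actual code):

```python
from urllib.parse import urljoin

def resolve_or_passthrough(base_url, value):
    """Try normal relative-URL resolution; if the stdlib rejects the value,
    return it unchanged instead of letting the exception escape."""
    try:
        return urljoin(base_url, value)
    except ValueError:
        # e.g. a stray "]" in the host makes the stdlib raise "Invalid IPv6 URL"
        return value
```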
Can you provide an actual markup example that demonstrates the question?

For absolute URL values, the parser should just pass the value through, no processing or validation required.

For relative URL values, presumably "document's language's rules for resolving relative URLs" already handles making sure the base URL is valid. Thus resolving a relative URL value could at most require URL escaping, which we could consider adding.

However I'd rather first see an actual example markup that demonstrates how you could pass an "invalid [relative] URL" to the mf2 parser before adding another processing requirement (URL escaping) to the spec. Otherwise I'd prefer to close this issue with no changes needed.

(Originally published at: http://tantek.com/2018/107/t1/)
Copy-paste from the page mentioned, which crashed (and actually still does) the mf2 parser:

`<a href="http://www.southside.de]" rel="nofollow">http://www.southside.de]</a>`

or if you prefer it in an mf2 thing:

`<a href="http://www.southside.de]" class="h-card">http://www.southside.de]</a>`

Normalization without further specification is not just applicable to relative URLs, but it is also not clear what it means for things that are not valid URLs.
After reminding myself what this issue was about, there actually seem to be two things mentioned in this one issue. These may need to be addressed separately:
My first thought is to drop normalisation references in the spec. Normalisation means something specific in the RFC for URLs, and I do not think mf2 parsers need to get into that. The same goes for non-URLs: rather than putting the onus on the mf2 parsers to do all sorts of URL parsing, let's just assume that an unrecognised URL is valid for whatever the author is intending. Thus I would propose something like:
With this, parsers only need to recognise when a gotten value is something they can resolve, which sounds like a much clearer expectation to me. (The term “relative-URL string” is being borrowed from the WHATWG URL spec here.)
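Roughly, the decision flow could look like this (a sketch under the assumptions above, not the exact proposal text):

```python
from urllib.parse import urljoin, urlsplit

def resolve_u_value(base_url, value):
    """Only resolve what the parser recognises; everything else is passed
    through exactly as the author wrote it."""
    try:
        if urlsplit(value).scheme:
            return value  # already absolute: pass through, no validation
        # otherwise treat it as a relative-URL string and resolve against the base
        return urljoin(base_url, value)
    except ValueError:
        return value  # not something we can resolve: keep the gotten value as-is
```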
(fwiw mf2py no longer crashes on invalid URLs like these.)
This issue came up in chat (via @gRegorLove) and is related to microformats/tests#112:
@Zegnat noted in a reply:
Links, Prior Art, etc.

Converting an empty path to a "/" path:

In the IndieAuth spec, Section 3.3 URL Canonicalization addresses normalization/canonicalization (emphasis added):

JavaScript's native `URL`:

```js
new URL('http://example.com').pathname
// returns "/"
```

Ruby's native `URI`:

```ruby
irb(main)> URI.parse('http://example.com').normalize
=> #<URI::HTTP http://example.com/>
```

The popular Addressable Ruby gem also behaves similarly:

```ruby
irb(main)> Addressable::URI.parse('http://example.com').normalize
=> #<Addressable::URI URI:http://example.com/>
```

Those are the two languages I'm most familiar with, so I'm curious to learn what other languages (Python, Go, etc.) do.

Questions

On the last question, my vote is to normalize empty paths to "/".

Thanks for reading and considering this proposal!
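(For comparison, and not part of the original comment: Python's standard library does not appear to add the trailing slash on its own, so a Python parser would have to do it explicitly if the spec asked for it.)

```python
>>> from urllib.parse import urlsplit, urljoin
>>> urlsplit('http://example.com').path
''
>>> urljoin('http://example.com', '')
'http://example.com'
```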
An updated proposal:
+1 for @gRegorLove’s proposal!
I know I am always overly sensitive to spec language, but I would like to define what we mean by “path” if we are going to refer to parts of a URL. Too many different specifications have come and gone renaming parts of URLs. Probably best illustrated by @tantek back in 2011: URL parts as (re)named over the years.

I would also like to clarify that we will be foregoing any and all forms of normalisation with this proposal? To get around issues like the original reason for this to have been filed (an errant `]` in the URL).

My over-the-top-specificity counter proposal, which would cover failure states as well as give us some nice-to-haves like lowercased schemes & hostnames, and non-empty paths:
I would love to fall somewhere between what @gRegorLove and I are proposing if we can iterate on something that is a) easy to understand for implementers, and b) leads to a useful output for consumers. (Edited 2020-05-02 to add lowercasing of hostnames, noticed this when testing the Live URL Viewer.)
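A rough sketch of what such a middle ground might look like (the function name and exact rules here are illustrative, not the counter proposal itself):

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

def normalize_u_value(base_url, value):
    """Lowercase the scheme and host, give empty paths a "/", and treat any
    parse failure as "keep the author's value as-is"."""
    try:
        parts = urlsplit(urljoin(base_url, value))
    except ValueError:
        return value  # failure state: pass the gotten value through untouched
    # Simplification: lowercasing the whole netloc also lowercases any userinfo,
    # which a more careful implementation would leave alone.
    netloc = parts.netloc.lower()
    path = parts.path or ("/" if netloc else "")
    return urlunsplit((parts.scheme.lower(), netloc, path, parts.query, parts.fragment))
```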
Per microformats/tests#112 (review): should we split out normalisation to a different ticket and focus only on invalid URLs here? It sounds like everyone’s preferred behaviour would be to pass on the value as-is if the parser for any reason fails to find a “normalized absolute URL of the gotten value”. So the following change to the specification would clarify this:
This change is kept minimal as it is not meant to clarify what is meant by “normalized absolute URL”; it only clarifies that if a parser cannot obtain such a value (as the Python parser previously displayed when crashing on `http://www.southside.de]`), the gotten value should be passed through as-is.
I think opening a new issue is reasonable given the discussion happening over at microformats/tests#112. "What is a normalized URL" is a distinct issue from handling invalid URLs. Apologies for conflating the two here.
+1 approval for your specific change regarding handling of invalid URLs.
A separate issue sounds good. The primary reason for my proposal was that microformats/tests#112 appeared likely to be merged soon and I wanted the spec to be updated beforehand. I'm definitely in favor of an incremental spec change to that end. I don't have strong opinions about referencing WHATWG URL parsing. That might make sense in the larger normalization conversation.
Going to bump this as I'm helping implement a Rust parser, and this URL normalization is something I can work around but would like to know if I have to 😉
For context, the Rust parser will add a trailing path (as per WHATWG conventions).
I realized we never split off the separate issue for normalization, so I've done that with #58. For this issue of invalid URLs, I think @Zegnat's suggestion in #9 (comment) is solid. Repeating it here with a small fix:
I'm +1 for this and confirmed php-mf2 does this currently. Can we revisit this, get some votes and any objections/feedback?
Thanks! 🙌🏻
For clarity's sake, can we get one or two examples of input and expected output? I think the proposed verbiage changes make sense, but I want to make sure we have some examples we can quickly spin into tests for the tests repo.
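For instance, using the markup quoted earlier in the thread, something along these lines (sketched with mf2py here; the expected output is based on the pass-through behaviour proposed above, not copied from the test suite):

```python
import mf2py

html = '<a href="http://www.southside.de]" class="h-card">http://www.southside.de]</a>'
parsed = mf2py.parse(doc=html, url="http://example.com/")

# Expected under the proposed wording: no crash, and the invalid href is passed
# through as-is, i.e. the h-card's "url" property contains
# "http://www.southside.de]" rather than a normalized or dropped value.
```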
http://microformats.org/wiki/microformats2-parsing#parsing_a_u-_property
I looked at this in the context of microformats/mf2py#79, which is a crash due to the attempt to normalize the invalid URL `http://www.southside.de]`. Obviously, crashing the parser is not good behavior.

Feedback in IRC can be summarized as "if it is not a valid URL, just pass the raw value through". Given that further steps in the parsing allow for arbitrary values to be returned, and the consumer thus has to be prepared to handle any of them anyway, this seems acceptable, but I'd still like to see it clarified in the parsing documentation. (An alternative would be dropping the value entirely, but I'm not sure that wouldn't be more surprising, and as far as I know it isn't done in any other case of mf2 parsing.)

(Some background reading regarding URL parsing and normalization: RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax and the WHATWG URL Standard both clearly describe the URL as invalid. The WHATWG spec explicitly describes parsing to "return failure" for invalid URLs.)
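(For illustration, and roughly where the mf2py crash came from, Python's own URL splitting rejects this value outright:)

```python
>>> from urllib.parse import urlsplit
>>> urlsplit("http://www.southside.de]")
Traceback (most recent call last):
  ...
ValueError: Invalid IPv6 URL
```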