Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Impossible DID URL ABNF using eager evaluation without backtracking? #344

Closed
wyc opened this issue Jul 12, 2020 · 14 comments
Closed

Impossible DID URL ABNF using eager evaluation without backtracking? #344

wyc opened this issue Jul 12, 2020 · 14 comments
Assignees
Labels
pending close Issue will be closed shortly if no objections pr exists There is an open PR to address this issue

Comments

@wyc
Copy link
Contributor

wyc commented Jul 12, 2020

Hi, I've tried to implement the ABNF DID URL found in section 3.2, but couldn't get it to work. However, DID Scheme is working fine. I am using the Rust pest library due to its rather direct representation of ABNF. One of the following is true: my implementation is incorrect, the ABNF notation is wrong, or the ABNF notation must be evaluated with some form of laziness. I have provided steps to reproduce this error, an explanation as to why it's possibly an error, an analysis of what could be wrong, and also how we might address this.

Steps to reproduce

  1. Visit https://pest.rs/#editor and enter the following into the Grammar section:
// DID Scheme https://w3c.github.io/did-core/#did-syntax
// (accessed 20200711)
//
// did                = "did:" method-name ":" method-specific-id
// method-name        = 1*method-char
// method-char        = %x61-7A / DIGIT
// method-specific-id = *( *idchar ":" ) 1*idchar
// idchar             = ALPHA / DIGIT / "." / "-" / "_"
did_scheme = { "did:" ~ method_name ~ ":" ~ method_specific_id }
method_name = { method_char+ }
method_char = _{ ASCII_ALPHA_LOWER | ASCII_DIGIT }
method_specific_id = { ( idchar* ~ ":" )* ~ idchar* }
idchar = _{ ASCII_ALPHANUMERIC | "." | "-" | "_" }

// DID URL https://w3c.github.io/did-core/#did-url-syntax
// (accessed 20200711)
// NOTE: The spec defines path-abempty, fragment, and pchar as those from
// RFC3986 (URI: Generic Syntax) so we have included the relevant rules
// in the section below.
//
// did-url            = did path-abempty [ "?" did-query ]
//                      [ "#" fragment ]
// did-query          = param *( "&" param )
// param              = param-name "=" param-value
// param-name         = 1*pchar
// param-value        = *pchar
did_url = { did_scheme ~ path_abempty ~ ( "?" ~ did_query )? ~ ( "#" ~ fragment )? }
did_query = { param ~ ( "&" ~ param )* }
param = { param_name ~ "=" ~ param_value }
param_name = { pchar+ }
param_value = { pchar* }

// RFC3986 https://tools.ietf.org/html/rfc3986
// (accessed 20200711, relevant rules only)
//
// path-abempty  = *( "/" segment )
// segment       = *pchar
// fragment      = *( pchar / "/" / "?" )
// pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
// pct-encoded   = "%" HEXDIG HEXDIG
// unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
// reserved      = gen-delims / sub-delims
// gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
// sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
//               / "*" / "+" / "," / ";" / "="
path_abempty = { ( "/" ~ segment )* }
segment = { pchar* }
fragment = { ( pchar | "/" | "?" )* }
pchar = _{ unreserved | pct_encoded | sub_delims | ":" | "@" }
pct_encoded = _{ "%" ~ ASCII_HEX_DIGIT ~ ASCII_HEX_DIGIT }
unreserved = _{ ASCII_ALPHA | ASCII_DIGIT | "-" | "." | "_" | "~" }
reserved = _{ gen_delims | sub_delims }
gen_delims = _{ ":" | "/" | "?" | "#" | "[" | "]" | "@" }
sub_delims = _{ "!" | "$" | "&" | "'" | "(" | ")"
              | "*" | "+" | "," | ";" | "=" }
  1. Enter the following into the Input section:
did:deadbeef:cafe/sub/path/?p1=v1&p2=v2
  1. Select "did_url" next to the Input box.

  2. Notice how under did_url, did_scheme and path_abempty are correctly populated, yet did_query is nowhere to be found. It did not fully parse.

Why it's possibly an error

It's possibly an error because the parser did not recognize ?p1=v1&p2=v2 as a valid DID query string. If you would like to see it explicitly fail, you can replace the line

did_url = { did_scheme ~ path_abempty ~ ( "?" ~ did_query )? ~ ( "#" ~ fragment )? }

with

did_url = { did_scheme ~ path_abempty ~ ( "?" ~ did_query ) ~ ( "#" ~ fragment )? }

, thereby requiring the ("?" ~ did_query) rule instead of allowing it to be optional.

What could be wrong

I think what could be going wrong here is related to the rule of

param_name = { pchar+ }

. Specifically, pchar from RFC3986 contains the sub_delims rule, which includes "=" as a valid atom. This may be crossing wires with the assumption of param, which is defined as

param = { param_name ~ "=" ~ param_value }

. With pest, the required "=" in param never has a chance to be parsed because it is already eaten by param_name. Therefore, the rule of param is impossible with the current ABNF notation if we assume eager evaluation, which is the default evaluation mode I've seen across BNF-like parsers including pest and ANTLR.

Ways forward

To remedy this (or not), we could consider one of the following:

  1. Decide that the common eager evaluation mode is not supported, and all parsers must evaluate somewhat lazily.
  2. Borrow the RFC3986 definition for query for did_query: query = *( pchar / "/" / "?" ). This means that we will lose the param, param_name, and param_value rules, which would be nice to have during token interpretation.
  3. Figure out a restricted version pchar excluding "=" and possibly other atoms to use to avoid the "="-eaten-by-param_name problem.

If I have some time soon, I will try to implement a parser combinator for this with lazy evaluation in Rust or Haskell to see if it works.

Can someone confirm that this might be a problem? If so, what do folks think is the best course of action here?

@wyc
Copy link
Contributor Author

wyc commented Jul 12, 2020

BTW, this is could be in the same class of "parser-ate-my-homework" behavior found in #333.

@wyc
Copy link
Contributor Author

wyc commented Jul 13, 2020

I got my grammar to parse did_query correctly with the following new/modified rules:

param_name = { pchar_noeq+ }
param_value = { pchar_noamp* }
pchar_noeq = _{ unreserved | pct_encoded | sub_delims_noeq | ":" | "@" }
sub_delims_noeq = _{ "!" | "$" | "&" | "'" | "(" | ")"
                   | "*" | "+" | "," | ";" }
pchar_noamp = _{ unreserved | pct_encoded | sub_delims_noamp | ":" | "@" }
sub_delims_noamp = _{ "!" | "$" | "'" | "(" | ")"
                    | "*" | "+" | "," | ";" | "=" }

I am ensuring that param_name cannot contain a literal "=" and param_value cannot contain a literal "&". However, these characters can still be percent encoded if they are needed. These definitions feel clumsy, and I'm wondering if there's a better rule than pchar from which to derive param_name and param_value.

@peacekeeper peacekeeper self-assigned this Jul 13, 2020
@peacekeeper
Copy link
Contributor

@wyc I would like to thank you for this very detailed bug report!

I am using parse2, and I have come across the same problem. The current ABNF doesn't seem to be "completely wrong", but average parsers that don't support backtracking have problems with it. I think I also agree with your proposed third way forward, i.e. figure out a restricted version of pchar.

What do you think about doing something like this:

param-name = 1*param-char
param-value = *param-char
param-char = unreserved / pct-encoded

?

@wyc
Copy link
Contributor Author

wyc commented Jul 13, 2020

You're right that it's not a matter of eagerness of evaluation but backtracking support! Thanks, I'll update the title.

I think we should strive as much as possible to maintain interoperability with query parameters able to be used within HTTP URLs, and the proposed reduced set of characters would preclude being able to copy just any HTTP URL query parameters directly after the DID path. For example, the following direct transposition would be invalid without converting to percent encoding:

https://example.com/path?@title=my+first+blog+post
did:web:example.com/path?@title=my+first+blog+post

The + character is found in sub_delims, which is excluded. I think this direct interoperability is an important consideration in the interests of providing low friction onramps to DID-based resources and services.

I think the reason we're encountering this issue is that we're trying to tokenize the parameter names and values at the grammar layer, whereas HTTP URI implementations use the URL BNF only for query string validation, but handle the parsing of the query string with a different more sophisticated algorithm:
https://www.w3.org/TR/2014/REC-html5-20141028/forms.html#url-encoded-form-data (okay so this is for form data, but I couldn't find one specifically for query strings)

If we look at net/url.parseQuery from Go's stdlib, we can see that it follows the broad steps of that algorithm (but also it includes support for the now-invalid ";" separator). So my recommendation is now to (1) find the cleanest way to support all the characters supported in the HTTP URL implementations, (2) leave this logic to the implementors and add a reference to an appropriate algorithm (or define one), or (3) both 1 and 2. 1 is most important to me, and I value this interoperability more than having param_name and param_value tokens to interpret. People will probably use existing URL parsers for these parameters and fragments, so we should try our best to ensure that those existing implementations do not accept DID URLs that are "technically invalid".

@wyc wyc changed the title Impossible DID URL ABNF Notation with Eager Evaluation? Impossible DID URL ABNF using Eager Evaluation Without Backtracking? Jul 13, 2020
@wyc wyc changed the title Impossible DID URL ABNF using Eager Evaluation Without Backtracking? Impossible DID URL ABNF using eager evaluation without backtracking? Jul 13, 2020
@wyc
Copy link
Contributor Author

wyc commented Jul 15, 2020

@peacekeeper what do you think of this?

// abnf
param-name = 1*param-name-char
param-value = *param-value-char
param-name-char = param-base-char / "&"
param-value-char = param-base-char / "="
param-base-char = unreserved / pct-encoded / ":" / "@"
                / "!" / "$" / "'" / "(" / ")"
                / "*" / "+" / "," / ";"

// pest
param_name = { param_name_char+ }
param_value = { param_value_char* }
param_name_char = { param_base_char | "&" }
param_value_char = { param_base_char | "=" }
param_base_char = _{ unreserved | pct_encoded | ":" | "@"
                   | "!" | "$" | "'" | "(" | ")"
                   | "*" | "+" | "," | ";" }

This should be logically compatible with all URL query strings while also giving us the param_name/param_value tokens. It is more complicated, unfortunately. We can also consider disallowing ";" in param_base_char to support legacy URLs that use this as the separator instead of '"&"`...

@jonnycrunch
Copy link
Contributor

I replicated this in the CDDL syntax and thus may need to revisit this for synergy.

@peacekeeper
Copy link
Contributor

@wyc sorry for the late response. I think your latest version isn't 100% semantically correct, since the & and = characters aren't actually part of the name and the value? Or are you intentionally saying that & can be part of a name, and = can be part of a value?

BTW in agree in principle with preserving the name and value tokens in the ABNF!

@wyc
Copy link
Contributor Author

wyc commented Jul 21, 2020

@peacekeeper Yes, to be as permissive as possible, I did mean to say that & can be part of a name and = can be part of a value, as in:

did:meth:spid/path?unique&verified=true&b64encoded-name=ZnJhbms=

We can disallow & and = from both param name and values, but need to see if this is used in any URLs today and how existing parsers handle them. If there is broad impl support, then perhaps we allow them, and if not, then disallow.

If we disallowed, it would simplify it to something like your proposal with extra chars:

// abnf
param-name = 1*param-char
param-value = *param-char
param-char = unreserved / pct-encoded / ":" / "@"
           / "!" / "$" / "'" / "(" / ")"
           / "*" / "+" / "," / ";"

Do you think this evaluation of existing impls and usage is a good approach?

@peacekeeper
Copy link
Contributor

@wyc I agree that maximum compatibility with query parameters in other URI schemes is desirable.

After thinking this through a bit, I think you are right that either 1. we get rid of the param-name / param-value tokens, just use RFC3986's query rule, and leave parsing of names/values to an algorithm rather than the grammar, or 2. we have a relatively complex ABNF like the ones you proposed.

I'm a bit surprised there doesn't seem to be an "official" reliable ABNF grammar anywhere for query parameters that are part of a URL, at least I couldn't find anything.

What would your recommended solution be at this point?

@wyc
Copy link
Contributor Author

wyc commented Jul 27, 2020

@peacekeeper It is with a heavy heart that I recommend we go with option 1 and remove the param-name/param-value tokens, erring on the side of simplicity. While I will be sad to see them go, I think we'll find solace in our improved alignment with RFC3986 and widespread URI parsing implementations today.

If you are agreed, we should ask for any objections in the next DID-WG call.

@peacekeeper
Copy link
Contributor

I think I agree. Do you want to create a PR to make this change and then we see if there are any objections?

@wyc
Copy link
Contributor Author

wyc commented Jul 28, 2020

Sure I'll do this today

wyc added a commit to wyc/did-core that referenced this issue Jul 28, 2020
- Use `query` from RFC3986 (URI) instead of parsing param names and values
- Correct RFC5324 (MIB for Fibre-Channel Security) to RFC5234 (ABNF)

See discussion here for reasoning:
w3c#344
wyc added a commit to wyc/did-core that referenced this issue Jul 28, 2020
- Use `query` from RFC3986 (URI) instead of parsing param names and values
- Correct RFC5324 (MIB for Fibre-Channel Security) to RFC5234 (ABNF)

See discussion here for reasoning:
w3c#344
wyc added a commit to wyc/did-core that referenced this issue Jul 28, 2020
- Use `query` from RFC3986 (URI) instead of parsing param names and values
- Correct RFC5324 (MIB for Fibre-Channel Security) to RFC5234 (ABNF)

See discussion here for reasoning:
w3c#344
wyc added a commit to wyc/did-core that referenced this issue Jul 28, 2020
- Use `query` from RFC3986 (URI) instead of parsing param names and values
- Correct RFC5324 (MIB for Fibre-Channel Security) to RFC5234 (ABNF)

See discussion here for reasoning:
w3c#344
wyc added a commit to wyc/did-core that referenced this issue Jul 28, 2020
- Use `query` from RFC3986 (URI) instead of parsing param names and values
- Correct RFC5324 (MIB for Fibre-Channel Security) to RFC5234 (ABNF)

See discussion here for reasoning:
w3c#344
@peacekeeper peacekeeper added the pr exists There is an open PR to address this issue label Jul 30, 2020
msporny pushed a commit that referenced this issue Aug 6, 2020
- Use `query` from RFC3986 (URI) instead of parsing param names and values
- Correct RFC5324 (MIB for Fibre-Channel Security) to RFC5234 (ABNF)

See discussion here for reasoning:
#344
@peacekeeper
Copy link
Contributor

@wyc I think you could close this issue now that #360 has been merged, do you agree? Thanks again for your work on this!

@brentzundel brentzundel added the pending close Issue will be closed shortly if no objections label Aug 26, 2020
@wyc
Copy link
Contributor Author

wyc commented Aug 27, 2020

Great, thanks for all the input.

@wyc wyc closed this as completed Aug 27, 2020
msporny pushed a commit to w3c/controller-document that referenced this issue Jun 2, 2024
- Use `query` from RFC3986 (URI) instead of parsing param names and values
- Correct RFC5324 (MIB for Fibre-Channel Security) to RFC5234 (ABNF)

See discussion here for reasoning:
w3c/did-core#344
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pending close Issue will be closed shortly if no objections pr exists There is an open PR to address this issue
Projects
None yet
Development

No branches or pull requests

4 participants