Impossible DID URL ABNF using eager evaluation without backtracking? #344

wyc · 2020-07-12T05:00:49Z

Hi, I've tried to implement the ABNF DID URL found in section 3.2, but couldn't get it to work. However, DID Scheme is working fine. I am using the Rust pest library due to its rather direct representation of ABNF. One of the following is true: my implementation is incorrect, the ABNF notation is wrong, or the ABNF notation must be evaluated with some form of laziness. I have provided steps to reproduce this error, an explanation as to why it's possibly an error, an analysis of what could be wrong, and also how we might address this.

Steps to reproduce

Visit https://pest.rs/#editor and enter the following into the Grammar section:

// DID Scheme https://w3c.github.io/did-core/#did-syntax
// (accessed 20200711)
//
// did                = "did:" method-name ":" method-specific-id
// method-name        = 1*method-char
// method-char        = %x61-7A / DIGIT
// method-specific-id = *( *idchar ":" ) 1*idchar
// idchar             = ALPHA / DIGIT / "." / "-" / "_"
did_scheme = { "did:" ~ method_name ~ ":" ~ method_specific_id }
method_name = { method_char+ }
method_char = _{ ASCII_ALPHA_LOWER | ASCII_DIGIT }
method_specific_id = { ( idchar* ~ ":" )* ~ idchar* }
idchar = _{ ASCII_ALPHANUMERIC | "." | "-" | "_" }

// DID URL https://w3c.github.io/did-core/#did-url-syntax
// (accessed 20200711)
// NOTE: The spec defines path-abempty, fragment, and pchar as those from
// RFC3986 (URI: Generic Syntax) so we have included the relevant rules
// in the section below.
//
// did-url            = did path-abempty [ "?" did-query ]
//                      [ "#" fragment ]
// did-query          = param *( "&" param )
// param              = param-name "=" param-value
// param-name         = 1*pchar
// param-value        = *pchar
did_url = { did_scheme ~ path_abempty ~ ( "?" ~ did_query )? ~ ( "#" ~ fragment )? }
did_query = { param ~ ( "&" ~ param )* }
param = { param_name ~ "=" ~ param_value }
param_name = { pchar+ }
param_value = { pchar* }

// RFC3986 https://tools.ietf.org/html/rfc3986
// (accessed 20200711, relevant rules only)
//
// path-abempty  = *( "/" segment )
// segment       = *pchar
// fragment      = *( pchar / "/" / "?" )
// pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
// pct-encoded   = "%" HEXDIG HEXDIG
// unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
// reserved      = gen-delims / sub-delims
// gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
// sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
//               / "*" / "+" / "," / ";" / "="
path_abempty = { ( "/" ~ segment )* }
segment = { pchar* }
fragment = { ( pchar | "/" | "?" )* }
pchar = _{ unreserved | pct_encoded | sub_delims | ":" | "@" }
pct_encoded = _{ "%" ~ ASCII_HEX_DIGIT ~ ASCII_HEX_DIGIT }
unreserved = _{ ASCII_ALPHA | ASCII_DIGIT | "-" | "." | "_" | "~" }
reserved = _{ gen_delims | sub_delims }
gen_delims = _{ ":" | "/" | "?" | "#" | "[" | "]" | "@" }
sub_delims = _{ "!" | "$" | "&" | "'" | "(" | ")"
              | "*" | "+" | "," | ";" | "=" }

Enter the following into the Input section:

did:deadbeef:cafe/sub/path/?p1=v1&p2=v2

Select "did_url" next to the Input box.
Notice how under did_url, did_scheme and path_abempty are correctly populated, yet did_query is nowhere to be found. It did not fully parse.

Why it's possibly an error

It's possibly an error because the parser did not recognize ?p1=v1&p2=v2 as a valid DID query string. If you would like to see it explicitly fail, you can replace the line

did_url = { did_scheme ~ path_abempty ~ ( "?" ~ did_query )? ~ ( "#" ~ fragment )? }

with

did_url = { did_scheme ~ path_abempty ~ ( "?" ~ did_query ) ~ ( "#" ~ fragment )? }

, thereby requiring the ("?" ~ did_query) rule instead of allowing it to be optional.

What could be wrong

I think what could be going wrong here is related to the rule of

param_name = { pchar+ }

. Specifically, pchar from RFC3986 contains the sub_delims rule, which includes "=" as a valid atom. This may be crossing wires with the assumption of param, which is defined as

param = { param_name ~ "=" ~ param_value }

. With pest, the required "=" in param never has a chance to be parsed because it is already eaten by param_name. Therefore, the rule of param is impossible with the current ABNF notation if we assume eager evaluation, which is the default evaluation mode I've seen across BNF-like parsers including pest and ANTLR.

Ways forward

To remedy this (or not), we could consider one of the following:

Decide that the common eager evaluation mode is not supported, and all parsers must evaluate somewhat lazily.
Borrow the RFC3986 definition for query for did_query: query = *( pchar / "/" / "?" ). This means that we will lose the param, param_name, and param_value rules, which would be nice to have during token interpretation.
Figure out a restricted version pchar excluding "=" and possibly other atoms to use to avoid the "="-eaten-by-param_name problem.

If I have some time soon, I will try to implement a parser combinator for this with lazy evaluation in Rust or Haskell to see if it works.

Can someone confirm that this might be a problem? If so, what do folks think is the best course of action here?

The text was updated successfully, but these errors were encountered:

wyc · 2020-07-12T05:08:47Z

BTW, this is could be in the same class of "parser-ate-my-homework" behavior found in #333.

wyc · 2020-07-13T14:54:27Z

I got my grammar to parse did_query correctly with the following new/modified rules:

param_name = { pchar_noeq+ }
param_value = { pchar_noamp* }
pchar_noeq = _{ unreserved | pct_encoded | sub_delims_noeq | ":" | "@" }
sub_delims_noeq = _{ "!" | "$" | "&" | "'" | "(" | ")"
                   | "*" | "+" | "," | ";" }
pchar_noamp = _{ unreserved | pct_encoded | sub_delims_noamp | ":" | "@" }
sub_delims_noamp = _{ "!" | "$" | "'" | "(" | ")"
                    | "*" | "+" | "," | ";" | "=" }

I am ensuring that param_name cannot contain a literal "=" and param_value cannot contain a literal "&". However, these characters can still be percent encoded if they are needed. These definitions feel clumsy, and I'm wondering if there's a better rule than pchar from which to derive param_name and param_value.

peacekeeper · 2020-07-13T21:28:35Z

@wyc I would like to thank you for this very detailed bug report!

I am using parse2, and I have come across the same problem. The current ABNF doesn't seem to be "completely wrong", but average parsers that don't support backtracking have problems with it. I think I also agree with your proposed third way forward, i.e. figure out a restricted version of pchar.

What do you think about doing something like this:

param-name = 1*param-char
param-value = *param-char
param-char = unreserved / pct-encoded

?

wyc · 2020-07-13T21:54:25Z

You're right that it's not a matter of eagerness of evaluation but backtracking support! Thanks, I'll update the title.

I think we should strive as much as possible to maintain interoperability with query parameters able to be used within HTTP URLs, and the proposed reduced set of characters would preclude being able to copy just any HTTP URL query parameters directly after the DID path. For example, the following direct transposition would be invalid without converting to percent encoding:

https://example.com/path?@title=my+first+blog+post
did:web:example.com/path?@title=my+first+blog+post

The + character is found in sub_delims, which is excluded. I think this direct interoperability is an important consideration in the interests of providing low friction onramps to DID-based resources and services.

I think the reason we're encountering this issue is that we're trying to tokenize the parameter names and values at the grammar layer, whereas HTTP URI implementations use the URL BNF only for query string validation, but handle the parsing of the query string with a different more sophisticated algorithm:
https://www.w3.org/TR/2014/REC-html5-20141028/forms.html#url-encoded-form-data (okay so this is for form data, but I couldn't find one specifically for query strings)

If we look at net/url.parseQuery from Go's stdlib, we can see that it follows the broad steps of that algorithm (but also it includes support for the now-invalid ";" separator). So my recommendation is now to (1) find the cleanest way to support all the characters supported in the HTTP URL implementations, (2) leave this logic to the implementors and add a reference to an appropriate algorithm (or define one), or (3) both 1 and 2. 1 is most important to me, and I value this interoperability more than having param_name and param_value tokens to interpret. People will probably use existing URL parsers for these parameters and fragments, so we should try our best to ensure that those existing implementations do not accept DID URLs that are "technically invalid".

wyc · 2020-07-15T13:25:46Z

@peacekeeper what do you think of this?

// abnf
param-name = 1*param-name-char
param-value = *param-value-char
param-name-char = param-base-char / "&"
param-value-char = param-base-char / "="
param-base-char = unreserved / pct-encoded / ":" / "@"
                / "!" / "$" / "'" / "(" / ")"
                / "*" / "+" / "," / ";"

// pest
param_name = { param_name_char+ }
param_value = { param_value_char* }
param_name_char = { param_base_char | "&" }
param_value_char = { param_base_char | "=" }
param_base_char = _{ unreserved | pct_encoded | ":" | "@"
                   | "!" | "$" | "'" | "(" | ")"
                   | "*" | "+" | "," | ";" }

This should be logically compatible with all URL query strings while also giving us the param_name/param_value tokens. It is more complicated, unfortunately. We can also consider disallowing ";" in param_base_char to support legacy URLs that use this as the separator instead of '"&"`...

jonnycrunch · 2020-07-21T15:51:11Z

I replicated this in the CDDL syntax and thus may need to revisit this for synergy.

peacekeeper · 2020-07-21T15:59:34Z

@wyc sorry for the late response. I think your latest version isn't 100% semantically correct, since the & and = characters aren't actually part of the name and the value? Or are you intentionally saying that & can be part of a name, and = can be part of a value?

BTW in agree in principle with preserving the name and value tokens in the ABNF!

wyc · 2020-07-21T16:45:00Z

@peacekeeper Yes, to be as permissive as possible, I did mean to say that & can be part of a name and = can be part of a value, as in:

did:meth:spid/path?unique&verified=true&b64encoded-name=ZnJhbms=

We can disallow & and = from both param name and values, but need to see if this is used in any URLs today and how existing parsers handle them. If there is broad impl support, then perhaps we allow them, and if not, then disallow.

If we disallowed, it would simplify it to something like your proposal with extra chars:

// abnf
param-name = 1*param-char
param-value = *param-char
param-char = unreserved / pct-encoded / ":" / "@"
           / "!" / "$" / "'" / "(" / ")"
           / "*" / "+" / "," / ";"

Do you think this evaluation of existing impls and usage is a good approach?

peacekeeper · 2020-07-27T11:07:14Z

@wyc I agree that maximum compatibility with query parameters in other URI schemes is desirable.

After thinking this through a bit, I think you are right that either 1. we get rid of the param-name / param-value tokens, just use RFC3986's query rule, and leave parsing of names/values to an algorithm rather than the grammar, or 2. we have a relatively complex ABNF like the ones you proposed.

I'm a bit surprised there doesn't seem to be an "official" reliable ABNF grammar anywhere for query parameters that are part of a URL, at least I couldn't find anything.

What would your recommended solution be at this point?

wyc · 2020-07-27T12:42:45Z

@peacekeeper It is with a heavy heart that I recommend we go with option 1 and remove the param-name/param-value tokens, erring on the side of simplicity. While I will be sad to see them go, I think we'll find solace in our improved alignment with RFC3986 and widespread URI parsing implementations today.

If you are agreed, we should ask for any objections in the next DID-WG call.

peacekeeper · 2020-07-28T10:01:05Z

I think I agree. Do you want to create a PR to make this change and then we see if there are any objections?

wyc · 2020-07-28T12:03:24Z

Sure I'll do this today

- Use `query` from RFC3986 (URI) instead of parsing param names and values - Correct RFC5324 (MIB for Fibre-Channel Security) to RFC5234 (ABNF) See discussion here for reasoning: w3c#344

- Use `query` from RFC3986 (URI) instead of parsing param names and values - Correct RFC5324 (MIB for Fibre-Channel Security) to RFC5234 (ABNF) See discussion here for reasoning: #344

peacekeeper · 2020-08-26T13:28:02Z

@wyc I think you could close this issue now that #360 has been merged, do you agree? Thanks again for your work on this!

wyc · 2020-08-27T21:35:27Z

Great, thanks for all the input.

- Use `query` from RFC3986 (URI) instead of parsing param names and values - Correct RFC5324 (MIB for Fibre-Channel Security) to RFC5234 (ABNF) See discussion here for reasoning: w3c/did-core#344

peacekeeper self-assigned this Jul 13, 2020

wyc changed the title ~~Impossible DID URL ABNF Notation with Eager Evaluation?~~ Impossible DID URL ABNF using Eager Evaluation Without Backtracking? Jul 13, 2020

wyc changed the title ~~Impossible DID URL ABNF using Eager Evaluation Without Backtracking?~~ Impossible DID URL ABNF using eager evaluation without backtracking? Jul 13, 2020

wyc added a commit to wyc/did-core that referenced this issue Jul 28, 2020

Simplify DID URL ABNF

7d4585c

- Use `query` from RFC3986 (URI) instead of parsing param names and values - Correct RFC5324 (MIB for Fibre-Channel Security) to RFC5234 (ABNF) See discussion here for reasoning: w3c#344

wyc mentioned this issue Jul 28, 2020

Simplify DID URL ABNF #360

Merged

wyc added a commit to wyc/did-core that referenced this issue Jul 28, 2020

Simplify DID URL ABNF

9239dfd

- Use `query` from RFC3986 (URI) instead of parsing param names and values - Correct RFC5324 (MIB for Fibre-Channel Security) to RFC5234 (ABNF) See discussion here for reasoning: w3c#344

wyc added a commit to wyc/did-core that referenced this issue Jul 28, 2020

Simplify DID URL ABNF

a0fd56e

- Use `query` from RFC3986 (URI) instead of parsing param names and values - Correct RFC5324 (MIB for Fibre-Channel Security) to RFC5234 (ABNF) See discussion here for reasoning: w3c#344

wyc added a commit to wyc/did-core that referenced this issue Jul 28, 2020

Simplify DID URL ABNF

67b550c

- Use `query` from RFC3986 (URI) instead of parsing param names and values - Correct RFC5324 (MIB for Fibre-Channel Security) to RFC5234 (ABNF) See discussion here for reasoning: w3c#344

wyc added a commit to wyc/did-core that referenced this issue Jul 28, 2020

Simplify DID URL ABNF

2353f5d

- Use `query` from RFC3986 (URI) instead of parsing param names and values - Correct RFC5324 (MIB for Fibre-Channel Security) to RFC5234 (ABNF) See discussion here for reasoning: w3c#344

peacekeeper added the pr exists There is an open PR to address this issue label Jul 30, 2020

msporny pushed a commit that referenced this issue Aug 6, 2020

Simplify DID URL ABNF

3cc2641

- Use `query` from RFC3986 (URI) instead of parsing param names and values - Correct RFC5324 (MIB for Fibre-Channel Security) to RFC5234 (ABNF) See discussion here for reasoning: #344

brentzundel added the pending close Issue will be closed shortly if no objections label Aug 26, 2020

wyc closed this as completed Aug 27, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Impossible DID URL ABNF using eager evaluation without backtracking? #344

Impossible DID URL ABNF using eager evaluation without backtracking? #344

wyc commented Jul 12, 2020 •

edited

Loading

wyc commented Jul 12, 2020

wyc commented Jul 13, 2020

peacekeeper commented Jul 13, 2020

wyc commented Jul 13, 2020 •

edited

Loading

wyc commented Jul 15, 2020 •

edited

Loading

jonnycrunch commented Jul 21, 2020

peacekeeper commented Jul 21, 2020

wyc commented Jul 21, 2020 •

edited

Loading

peacekeeper commented Jul 27, 2020

wyc commented Jul 27, 2020 •

edited

Loading

peacekeeper commented Jul 28, 2020

wyc commented Jul 28, 2020

peacekeeper commented Aug 26, 2020

wyc commented Aug 27, 2020

Impossible DID URL ABNF using eager evaluation without backtracking? #344

Impossible DID URL ABNF using eager evaluation without backtracking? #344

Comments

wyc commented Jul 12, 2020 • edited Loading

Steps to reproduce

Why it's possibly an error

What could be wrong

Ways forward

wyc commented Jul 12, 2020

wyc commented Jul 13, 2020

peacekeeper commented Jul 13, 2020

wyc commented Jul 13, 2020 • edited Loading

wyc commented Jul 15, 2020 • edited Loading

jonnycrunch commented Jul 21, 2020

peacekeeper commented Jul 21, 2020

wyc commented Jul 21, 2020 • edited Loading

peacekeeper commented Jul 27, 2020

wyc commented Jul 27, 2020 • edited Loading

peacekeeper commented Jul 28, 2020

wyc commented Jul 28, 2020

peacekeeper commented Aug 26, 2020

wyc commented Aug 27, 2020

wyc commented Jul 12, 2020 •

edited

Loading

wyc commented Jul 13, 2020 •

edited

Loading

wyc commented Jul 15, 2020 •

edited

Loading

wyc commented Jul 21, 2020 •

edited

Loading

wyc commented Jul 27, 2020 •

edited

Loading