Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add unique ids to each parse error #1339

Open
gsnedders opened this issue May 28, 2016 · 31 comments
Open

Add unique ids to each parse error #1339

gsnedders opened this issue May 28, 2016 · 31 comments
Labels
clarification Standard could be clearer topic: parser

Comments

@gsnedders
Copy link
Member

gsnedders commented May 28, 2016

It would be convenient for the sake of testing implementations to have unique ids on each parse-error.

I know @Hixie opposed this because the spec only requires parse errors to emitted at specific points, and some implementations merge multiple parse errors in the spec, and hence cannot distinguish them.

I don't think this is a good argument against it, as implementations can always have a mapping of their parse error x corresponds to a and b in the spec.

At the moment html5lib-tests essentially has position and some meaningless message.

Both html5lib and html5ever have unique messages for each parse error. It would be nice to make sure we have the right errors being raised.

We probably want ids something like the keys in the dictionary E in https://github.com/html5lib/html5lib-python/blob/master/html5lib/constants.py#L7.

@gsnedders
Copy link
Member Author

cc @zcorpan @nox @hsivonen

@annevk annevk added clarification Standard could be clearer topic: parser labels May 29, 2016
@zcorpan
Copy link
Member

zcorpan commented May 29, 2016

Sounds like a good idea. (I also think the parser could use more ids in general.)

@hsivonen
Copy link
Member

hsivonen commented Jun 3, 2016

I'm OK with this.

@zcorpan zcorpan added the good first issue Ideal for someone new to a WHATWG standard or software project label Jun 7, 2016
@haroldfredshort
Copy link

haroldfredshort commented Jun 17, 2016

I've gone through the microsyntax parse errors first. There are not very many but these are the hardest to properly identify. With one exception they can all fit the 'x-in-y' pattern. What do you think of these so far?

parse-error-comma-in-img-srcset
parse-error-trailing-commas-in-img-srcset-url
parse-error-invalid-value-in-img-descriptor
parse-error-img-descriptor (this is an else-case parse error for img descriptors, not sure what to call it)
parse-error-empty-size-in-sizes-attribute
parse-error-invalid-size-in-sizes-attribute

Example:
<span data-x="concept-microsyntax-parse-error" id="parse-error-comma-in-img-srcset">parse error</span>

Next I'll see about listing the generalized forms for the remaining parse errors.

@sideshowbarker
Copy link
Member

I think we should consider actually uniquely naming each parse error in the body of the spec; like:

<code><dfn>CommaInImgSrcset</dfn></code><span
    data-x="concept-microsyntax-parse-error">parse error</span>

@domenic
Copy link
Member

domenic commented Jun 17, 2016

Gah, we keep giving poor @haroldfredshort conflicting information :(...

@domenic
Copy link
Member

domenic commented Jun 17, 2016

Maybe people would all agree on something like

... that is a <span data-x="concept-microsyntax-parse-error">parse error</span>
(<dfn data-x="parse-error-comma-in-img-srcset">comma in img srcset</dfn>).

???

@sideshowbarker
Copy link
Member

Gah, we keep giving poor @haroldfredshort conflicting information :(...

oofs, yeah, sorry for not re-reading the comments in the PR more carefully

Maybe people would all agree on something like

... that is a <span data-x="concept-microsyntax-parse-error">parse error</span>
(<dfn data-x="parse-error-comma-in-img-srcset">comma in img srcset</dfn>).

Yeah, that would be fine by me.

@zcorpan
Copy link
Member

zcorpan commented Jun 17, 2016

I believe this bug is about parse errors in the "Parsing HTML documents" section -- not about errors in srcset.

@haroldfredshort
Copy link

Regarding the comment from @zcorpan, should I not add identifiers to the seven srcset parse errors, or would it be okay to identify these differently? Like this:

that is a <span data-x="concept-microsyntax-parse-error">parse error</span>
(<dfn data-x="microsyntax-parse-error-comma-in-img-srcset">comma in img srcset</dfn>).

The parse errors in the "Parsing HTML documents" section are simpler, but before I go further I am including an example here to make sure we are all agreed that this is how we want it to be:

...
<dt>EOF</dt><dd><span>Parse error</span>
(<dfn data-x="parse-error-eof-in-cdata-section-state">EOF in CDATA section state</dfn>)
...

@sideshowbarker
Copy link
Member

should I not add identifiers to the seven srcset parse errors

The srcset parse errors are not errors that get implemented by HTML parsers. They instead are targeted just to the separate specific code that browsers implement in their srcset implementations.

So it would be good to have IDs for those as well, but those should be in a separate PR, because this issue was raised specifically just for the parse errors for HTML parsers as defined in the HTML parsing algorithm, and so the target is just HTML parser developers/implementations.

@haroldfredshort
Copy link

@sideshowbarker Okay, thank you for the clarification.

@haroldfredshort
Copy link

I think each of the HTML parse errors can be made to fit into one of the following generalized error forms:
foo-in-bar
foo-not-in-bar
expected-foo-got-bar
unexpected-foo
invalid-foo
As discussed previously, each error id will begin with parse-error-.

Please let me know what you think.

@gsnedders
Copy link
Member Author

I think a few of the entity parse errors might not quite fit into that, but am open to suggestions there (also, it's only three or four, so that's not really wasted effort if we decide to change them after you've made a PR!).

I'm happy to go with @domenic's suggestion as to the format of the dfn, so I think we're agreed to it.

@gsnedders
Copy link
Member Author

(Sorry for that having made this bug sufficiently clear before, by the way!)

@inikulin
Copy link
Member

Is anyone working on it? If not, I would like to champion it if no one would mind.

@haroldfredshort
Copy link

haroldfredshort commented Mar 11, 2017 via email

@hsivonen
Copy link
Member

Unique id per error in the HTML parsing section seems like a good idea. When a human-facing message would be parametrized with the particular element names at hand, those variables varying shouldn't cause the error identity to be multiplied.

@inikulin
Copy link
Member

inikulin commented Mar 13, 2017

OK, so here is the plan: I'll stick with error format proposed by @domenic in #1339 (comment) with - as a word separator for error ID. Along the way, I'll update html5lib-tests with expected error data, thereby we'll have functionality covered with tests. I will use parse5 as a reference implementation. I'm planning to split job into two big chunks which will be issued as separate PRs: one for tokenization errors and one for tree construction errors.

  • Implement error reporting and testing infrastructure in parse5
  • Implement errors for tokenization
  • Implement errors for tree construction

@caridy and @diervo have expressed their desire to participate: feel free to join once I'll sort out bootstrapping-related things (first point in the TODO-list) and we'll coordinate our work on this.

Does anyone has any concerns before I start?

@gsnedders
Copy link
Member Author

gsnedders commented Mar 13, 2017

@inikulin FWIW, there was various bits of debate about parse error format in html5lib-tests, especially giving positions (some people wanted tests for start/end positions of the token with the error, IIRC, so whatever format we have there should make sure it's extensible for that). On the whole I'm in favour of something like (line,col): fragment.

@inikulin
Copy link
Member

@gsnedders I'm not sure we'll always be able associate error with the token correctly. However, from the top of my mind following should work in most cases:

  • For tokenizer error let (line, col) be position of consumed character that initiated an error;
  • For tree construction let (line, col) be position of first character of token that initiated and error.

What do you mean by fragment though? Error ID or source code representation of the token?

@gsnedders
Copy link
Member Author

@inikulin currently I think some have for tree construction (line, col) being the end of the token; I meant the error ID by fragment.

@inikulin
Copy link
Member

@gsnedders yeah, makes more sense.

@diervo
Copy link

diervo commented Mar 13, 2017

@inikulin I spend the weekend reading all the threads and docs about the topic, so I'm ready to help with the implementation. One you have some idea/design draft with the first integration and implementation details, I will happily help with anything you need.

This is something we really need soon for our project at Salesforce so I would gladly co-champion it with someone that has more background with the spec related stuff.

@inikulin
Copy link
Member

OK, I've finished preparation steps and and finally got my hands on the spec. Here is how it looks like:
image

I'm a bit hesitating if it's clear enough that what given in the parentheses is a error code. Maybe prefix form suggested in #1339 (comment) will work better? Or we could just clarify this somehow in the parse error definition?

@sideshowbarker
Copy link
Member

I'm a bit hesitating if it's clear enough that what given in the parentheses is a error code.

Seems like another option is to put the error name in front of the words parse error, like this:

Any character that is a not a Unicode character, i.e. any isolated surrogate, is a non-unicode-character-in-input-stream parse error.

@domenic
Copy link
Member

domenic commented Mar 25, 2017

Yeah, I personally like the prefix form, with @sideshowbarker's most recent variation seeming like an even nicer variant than the earlier comment.

Very excited about this!

We should probably also add something about the purpose of these identifiers in the parse error definition section.

@inikulin
Copy link
Member

I wonder if we should enforce usage of specification-provided error codes, i.e. should we use "should" or "may" RFC 2119 keywords in:

Some parse errors have dedicated error codes that should/may be used by conformance checkers...

@sideshowbarker
Copy link
Member

I wonder if we should enforce usage of specification-provided error codes, i.e. should we use "should" or "may" RFC 2119 keywords in:

Some parse errors have dedicated error codes that should/may be used by conformance checkers...

I’m supportive of using a normative “should” to state that requirement for conformance checkers.

domenic pushed a commit that referenced this issue May 31, 2017
This gives every parse error that occurs during tokenization a unique
ID, and adds non-normative text explaining and exemplifying when they
occur in an overview table.

Part of #1339; tree construction parse errors remain before that issue
is finished.
@zcorpan
Copy link
Member

zcorpan commented Jun 1, 2017

When doing the next round to add codes to the tree builder, please also add a comma and space after "e.g." here:
d6764a5#diff-36cd38f49b9afa08222c0dc9ebfe35ebR100416

Edit: Did this in #2728

@inikulin
Copy link
Member

inikulin commented Jun 1, 2017

@zcorpan Thank you!

@domenic domenic removed the good first issue Ideal for someone new to a WHATWG standard or software project label Jun 23, 2017
alice pushed a commit to alice/html that referenced this issue Jan 8, 2019
This gives every parse error that occurs during tokenization a unique
ID, and adds non-normative text explaining and exemplifying when they
occur in an overview table.

Part of whatwg#1339; tree construction parse errors remain before that issue
is finished.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clarification Standard could be clearer topic: parser
Development

No branches or pull requests

9 participants