Add unique ids to each parse error #1339

gsnedders · 2016-05-28T20:11:19Z

It would be convenient for the sake of testing implementations to have unique ids on each parse-error.

I know @Hixie opposed this because the spec only requires parse errors to emitted at specific points, and some implementations merge multiple parse errors in the spec, and hence cannot distinguish them.

I don't think this is a good argument against it, as implementations can always have a mapping of their parse error x corresponds to a and b in the spec.

At the moment html5lib-tests essentially has position and some meaningless message.

Both html5lib and html5ever have unique messages for each parse error. It would be nice to make sure we have the right errors being raised.

We probably want ids something like the keys in the dictionary E in https://github.com/html5lib/html5lib-python/blob/master/html5lib/constants.py#L7.

The text was updated successfully, but these errors were encountered:

gsnedders · 2016-05-28T20:12:06Z

cc @zcorpan @nox @hsivonen

zcorpan · 2016-05-29T09:41:26Z

Sounds like a good idea. (I also think the parser could use more ids in general.)

hsivonen · 2016-06-03T10:40:59Z

I'm OK with this.

haroldfredshort · 2016-06-17T08:17:35Z

I've gone through the microsyntax parse errors first. There are not very many but these are the hardest to properly identify. With one exception they can all fit the 'x-in-y' pattern. What do you think of these so far?

parse-error-comma-in-img-srcset
parse-error-trailing-commas-in-img-srcset-url
parse-error-invalid-value-in-img-descriptor
parse-error-img-descriptor (this is an else-case parse error for img descriptors, not sure what to call it)
parse-error-empty-size-in-sizes-attribute
parse-error-invalid-size-in-sizes-attribute

Example:
<span data-x="concept-microsyntax-parse-error" id="parse-error-comma-in-img-srcset">parse error</span>

Next I'll see about listing the generalized forms for the remaining parse errors.

sideshowbarker · 2016-06-17T08:30:41Z

I think we should consider actually uniquely naming each parse error in the body of the spec; like:

<code><dfn>CommaInImgSrcset</dfn></code><span
    data-x="concept-microsyntax-parse-error">parse error</span>

domenic · 2016-06-17T08:32:56Z

Gah, we keep giving poor @haroldfredshort conflicting information :(...

domenic · 2016-06-17T08:37:26Z

Maybe people would all agree on something like

... that is a <span data-x="concept-microsyntax-parse-error">parse error</span>
(<dfn data-x="parse-error-comma-in-img-srcset">comma in img srcset</dfn>).

???

sideshowbarker · 2016-06-17T09:00:00Z

Gah, we keep giving poor @haroldfredshort conflicting information :(...

oofs, yeah, sorry for not re-reading the comments in the PR more carefully

Maybe people would all agree on something like

... that is a <span data-x="concept-microsyntax-parse-error">parse error</span>
(<dfn data-x="parse-error-comma-in-img-srcset">comma in img srcset</dfn>).

Yeah, that would be fine by me.

zcorpan · 2016-06-17T10:35:35Z

I believe this bug is about parse errors in the "Parsing HTML documents" section -- not about errors in srcset.

haroldfredshort · 2016-06-18T00:29:32Z

Regarding the comment from @zcorpan, should I not add identifiers to the seven srcset parse errors, or would it be okay to identify these differently? Like this:

that is a <span data-x="concept-microsyntax-parse-error">parse error</span>
(<dfn data-x="microsyntax-parse-error-comma-in-img-srcset">comma in img srcset</dfn>).

The parse errors in the "Parsing HTML documents" section are simpler, but before I go further I am including an example here to make sure we are all agreed that this is how we want it to be:

...
<dt>EOF</dt><dd><span>Parse error</span>
(<dfn data-x="parse-error-eof-in-cdata-section-state">EOF in CDATA section state</dfn>)
...

sideshowbarker · 2016-06-18T02:00:09Z

should I not add identifiers to the seven srcset parse errors

The srcset parse errors are not errors that get implemented by HTML parsers. They instead are targeted just to the separate specific code that browsers implement in their srcset implementations.

So it would be good to have IDs for those as well, but those should be in a separate PR, because this issue was raised specifically just for the parse errors for HTML parsers as defined in the HTML parsing algorithm, and so the target is just HTML parser developers/implementations.

haroldfredshort · 2016-06-18T05:45:57Z

@sideshowbarker Okay, thank you for the clarification.

haroldfredshort · 2016-06-18T09:00:33Z

I think each of the HTML parse errors can be made to fit into one of the following generalized error forms:
foo-in-bar
foo-not-in-bar
expected-foo-got-bar
unexpected-foo
invalid-foo
As discussed previously, each error id will begin with parse-error-.

Please let me know what you think.

gsnedders · 2016-06-18T22:11:45Z

I think a few of the entity parse errors might not quite fit into that, but am open to suggestions there (also, it's only three or four, so that's not really wasted effort if we decide to change them after you've made a PR!).

I'm happy to go with @domenic's suggestion as to the format of the dfn, so I think we're agreed to it.

gsnedders · 2016-06-18T22:12:08Z

(Sorry for that having made this bug sufficiently clear before, by the way!)

inikulin · 2017-03-11T19:11:02Z

Is anyone working on it? If not, I would like to champion it if no one would mind.

haroldfredshort · 2017-03-11T20:49:55Z

I've given up, go for it!

…

On Sat, Mar 11, 2017, 11:11 Ivan Nikulin ***@***.***> wrote: Is anyone working on it? If not, I would like to champion it if no one would mind. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1339 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAa9GODVj5UhOLj_2mdr2RM4ZOMQHaU2ks5rkvHKgaJpZM4IpJlq> .

hsivonen · 2017-03-13T10:45:38Z

Unique id per error in the HTML parsing section seems like a good idea. When a human-facing message would be parametrized with the particular element names at hand, those variables varying shouldn't cause the error identity to be multiplied.

inikulin · 2017-03-13T11:15:58Z

OK, so here is the plan: I'll stick with error format proposed by @domenic in #1339 (comment) with - as a word separator for error ID. Along the way, I'll update html5lib-tests with expected error data, thereby we'll have functionality covered with tests. I will use parse5 as a reference implementation. I'm planning to split job into two big chunks which will be issued as separate PRs: one for tokenization errors and one for tree construction errors.

Implement error reporting and testing infrastructure in parse5
Implement errors for tokenization
Implement errors for tree construction

@caridy and @diervo have expressed their desire to participate: feel free to join once I'll sort out bootstrapping-related things (first point in the TODO-list) and we'll coordinate our work on this.

Does anyone has any concerns before I start?

gsnedders · 2017-03-13T13:53:29Z

@inikulin FWIW, there was various bits of debate about parse error format in html5lib-tests, especially giving positions (some people wanted tests for start/end positions of the token with the error, IIRC, so whatever format we have there should make sure it's extensible for that). On the whole I'm in favour of something like (line,col): fragment.

inikulin · 2017-03-13T14:23:02Z

@gsnedders I'm not sure we'll always be able associate error with the token correctly. However, from the top of my mind following should work in most cases:

For tokenizer error let (line, col) be position of consumed character that initiated an error;
For tree construction let (line, col) be position of first character of token that initiated and error.

What do you mean by fragment though? Error ID or source code representation of the token?

gsnedders · 2017-03-13T14:50:29Z

@inikulin currently I think some have for tree construction (line, col) being the end of the token; I meant the error ID by fragment.

inikulin · 2017-03-13T15:24:56Z

@gsnedders yeah, makes more sense.

diervo · 2017-03-13T19:00:17Z

@inikulin I spend the weekend reading all the threads and docs about the topic, so I'm ready to help with the implementation. One you have some idea/design draft with the first integration and implementation details, I will happily help with anything you need.

This is something we really need soon for our project at Salesforce so I would gladly co-champion it with someone that has more background with the spec related stuff.

inikulin · 2017-03-24T22:13:10Z

OK, I've finished preparation steps and and finally got my hands on the spec. Here is how it looks like:

I'm a bit hesitating if it's clear enough that what given in the parentheses is a error code. Maybe prefix form suggested in #1339 (comment) will work better? Or we could just clarify this somehow in the parse error definition?

sideshowbarker · 2017-03-25T01:32:48Z

I'm a bit hesitating if it's clear enough that what given in the parentheses is a error code.

Seems like another option is to put the error name in front of the words parse error, like this:

Any character that is a not a Unicode character, i.e. any isolated surrogate, is a non-unicode-character-in-input-stream parse error.

domenic · 2017-03-25T01:54:10Z

Yeah, I personally like the prefix form, with @sideshowbarker's most recent variation seeming like an even nicer variant than the earlier comment.

Very excited about this!

We should probably also add something about the purpose of these identifiers in the parse error definition section.

inikulin · 2017-05-11T12:45:01Z

I wonder if we should enforce usage of specification-provided error codes, i.e. should we use "should" or "may" RFC 2119 keywords in:

Some parse errors have dedicated error codes that should/may be used by conformance checkers...

sideshowbarker · 2017-05-11T12:52:53Z

I wonder if we should enforce usage of specification-provided error codes, i.e. should we use "should" or "may" RFC 2119 keywords in:

Some parse errors have dedicated error codes that should/may be used by conformance checkers...

I’m supportive of using a normative “should” to state that requirement for conformance checkers.

This gives every parse error that occurs during tokenization a unique ID, and adds non-normative text explaining and exemplifying when they occur in an overview table. Part of #1339; tree construction parse errors remain before that issue is finished.

zcorpan · 2017-06-01T07:00:39Z

When doing the next round to add codes to the tree builder, please also add a comma and space after "e.g." here:
d6764a5#diff-36cd38f49b9afa08222c0dc9ebfe35ebR100416

Edit: Did this in #2728

inikulin · 2017-06-01T09:56:12Z

@zcorpan Thank you!

This gives every parse error that occurs during tokenization a unique ID, and adds non-normative text explaining and exemplifying when they occur in an overview table. Part of whatwg#1339; tree construction parse errors remain before that issue is finished.

annevk added clarification Standard could be clearer topic: parser labels May 29, 2016

inikulin mentioned this issue Jun 6, 2016

html validator inikulin/parse5#55

Closed

zcorpan added the good first issue Ideal for someone new to a WHATWG standard or software project label Jun 7, 2016

haroldfredshort mentioned this issue Jun 15, 2016

Add unique ids to each parse error #1339 #1428

Closed

inikulin mentioned this issue Feb 10, 2017

HTML validation - Feature PR inikulin/parse5#180

Open

domenic removed the good first issue Ideal for someone new to a WHATWG standard or software project label Jun 23, 2017

domenic mentioned this issue Jun 6, 2024

Error codes for parse errors during tree construction #10397

Open

untitaker mentioned this issue Nov 1, 2024

Update tests for relaxed <select> parser html5lib/html5lib-tests#178

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add unique ids to each parse error #1339

Add unique ids to each parse error #1339

gsnedders commented May 28, 2016 •

edited

Loading

gsnedders commented May 28, 2016

zcorpan commented May 29, 2016

hsivonen commented Jun 3, 2016

haroldfredshort commented Jun 17, 2016 •

edited

Loading

sideshowbarker commented Jun 17, 2016

domenic commented Jun 17, 2016

domenic commented Jun 17, 2016

sideshowbarker commented Jun 17, 2016

zcorpan commented Jun 17, 2016

haroldfredshort commented Jun 18, 2016

sideshowbarker commented Jun 18, 2016

haroldfredshort commented Jun 18, 2016

haroldfredshort commented Jun 18, 2016

gsnedders commented Jun 18, 2016

gsnedders commented Jun 18, 2016

inikulin commented Mar 11, 2017

haroldfredshort commented Mar 11, 2017 via email

hsivonen commented Mar 13, 2017

inikulin commented Mar 13, 2017 •

edited

Loading

gsnedders commented Mar 13, 2017 •

edited

Loading

inikulin commented Mar 13, 2017

gsnedders commented Mar 13, 2017

inikulin commented Mar 13, 2017

diervo commented Mar 13, 2017

inikulin commented Mar 24, 2017

sideshowbarker commented Mar 25, 2017

domenic commented Mar 25, 2017

inikulin commented May 11, 2017

sideshowbarker commented May 11, 2017

zcorpan commented Jun 1, 2017 •

edited

Loading

inikulin commented Jun 1, 2017

Add unique ids to each parse error #1339

Add unique ids to each parse error #1339

Comments

gsnedders commented May 28, 2016 • edited Loading

gsnedders commented May 28, 2016

zcorpan commented May 29, 2016

hsivonen commented Jun 3, 2016

haroldfredshort commented Jun 17, 2016 • edited Loading

sideshowbarker commented Jun 17, 2016

domenic commented Jun 17, 2016

domenic commented Jun 17, 2016

sideshowbarker commented Jun 17, 2016

zcorpan commented Jun 17, 2016

haroldfredshort commented Jun 18, 2016

sideshowbarker commented Jun 18, 2016

haroldfredshort commented Jun 18, 2016

haroldfredshort commented Jun 18, 2016

gsnedders commented Jun 18, 2016

gsnedders commented Jun 18, 2016

inikulin commented Mar 11, 2017

haroldfredshort commented Mar 11, 2017 via email

hsivonen commented Mar 13, 2017

inikulin commented Mar 13, 2017 • edited Loading

gsnedders commented Mar 13, 2017 • edited Loading

inikulin commented Mar 13, 2017

gsnedders commented Mar 13, 2017

inikulin commented Mar 13, 2017

diervo commented Mar 13, 2017

inikulin commented Mar 24, 2017

sideshowbarker commented Mar 25, 2017

domenic commented Mar 25, 2017

inikulin commented May 11, 2017

sideshowbarker commented May 11, 2017

zcorpan commented Jun 1, 2017 • edited Loading

inikulin commented Jun 1, 2017

gsnedders commented May 28, 2016 •

edited

Loading

haroldfredshort commented Jun 17, 2016 •

edited

Loading

inikulin commented Mar 13, 2017 •

edited

Loading

gsnedders commented Mar 13, 2017 •

edited

Loading

zcorpan commented Jun 1, 2017 •

edited

Loading