Add a "modern" parsing API #2993
cc @jakearchibald @whatwg/html-parser
I'd like this API to support progressive rendering, so I guess my preference is "as soon as possible".

```js
const streamingFragment = document.createStreamingFragment();
const response = await fetch(url);

response.body
  .pipeThrough(new TextDecoderStream())
  .pipeTo(streamingFragment.writable);

document.body.append(streamingFragment);
```

I'd like the above to progressively render. The parsing would follow the "in template" insertion mode, although we may want options to handle other cases, like SVG.
What kinds of errors?
There are a few libraries that use tagged template literals to build HTML; I think their code would be simpler if they knew what state the parser was in at a given point. This might be an opportunity. Eg:

```js
const fragment = whatever`
  <p>${someContent}</p>
  <img src=${someImgSrc}>
`;
```

These libraries allow this kind of interpolation. I wonder if something like:

```js
const streamingFragment = document.createStreamingFragment();
const writer = streamingFragment.writable.getWriter();

await writer.write('<p>');
let parserState = await streamingFragment.getParserState();
parserState.currentNode; // paragraph

await writer.write('</p><img src=');
parserState = await streamingFragment.getParserState();
```

…I guess this last bit is more complicated, but ideally it should know it's in the "before attribute value" state for "src" within the "img" tag. Ideally there should be a way to get the resulting attribute & element as a promise.
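A toy version of this state inspection can be sketched without any DOM or the proposed API at all. Everything below (the `classify` helper and its deliberately naive one-character state machine) is hypothetical, just to show the shape of the idea:

```js
// Toy sketch: classify each interpolation point of a tagged template
// as "data" (text content) or "attr-value", by scanning the static
// chunks with a tiny, deliberately naive state machine.
function classify(statics) {
  const states = [];
  let inTag = false;
  for (let i = 0; i < statics.length - 1; i++) {
    for (const ch of statics[i]) {
      if (ch === '<') inTag = true;
      else if (ch === '>') inTag = false;
    }
    // If the chunk ends inside a tag right after '=', the next
    // interpolation sits in attribute-value position; otherwise
    // treat it as text ("data") position.
    states.push(
      inTag && statics[i].trimEnd().endsWith('=') ? 'attr-value' : 'data'
    );
  }
  return states;
}

// A tag function that just reports the state at each interpolation point.
function inspect(strings, ...values) {
  return classify(strings);
}

const result = inspect`<p>${'someContent'}</p><img src=${'someImgSrc'}>`;
// result: ['data', 'attr-value']
```

A real parser has many more tokeniser states than this, which is exactly why having the browser expose them would be valuable.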
@dominiccooney HTML can have conformance errors, but there are recovery mechanisms for all of them, and user agents don't bail out on errors. So any input can be consumed by the HTML parser without a problem. I like @jakearchibald's API. However, I wonder whether we need to support a full-document streaming parser and what the API would look like for it. Also, with the streaming fragment approach, will it be possible to perform consecutive writes to a fragment (e.g. pipe one response to a fragment and afterwards another one)? If so, how will it behave: overwrite the content of the fragment, or insert at the end of the fragment?
What do you mean by state here? Parser insertion mode, tokeniser state or something else?
Hmm yeah. I'm not sure what the best pattern is to use for that.
Yeah, you can do this with streams, either with individual writes, or piping with `preventClose`. As in, if the parser eats:

```html
<p>Hello
```

…then you:

```js
document.querySelector('p').append(', how are you today?');
```

…you get:

```html
<p>Hello, how are you today?
```

…if the parser then receives ` everyone`, you get:

```html
<p>Hello everyone, how are you today?
```

…as the parser has a pointer to the first text node of the paragraph.
@jakearchibald There is a problem with this approach. Consider we have two streams: one writes
Thanks @jakearchibald for thinking of us. I can speak for my 6+ months on the template literals vs DOM pattern, so that maybe you can have as much info as possible about implementations/proposals/APIs etc. I'll try to split this post into topics.

**Not just a UID**

I am not using just a UID, I'm using a comment that contains some UID.

```js
// dumb example
function tag(statics, ...interpolations) {
  const out = [statics[0]];
  for (let i = 1; i < statics.length; i++)
    out.push('<!-- MY UID -->', statics[i]);
  return out.join('');
}
tag`<p>a ${'b'} c</p>`;
```

This gives me the ability to let the HTML parser split text content into chunks for me. The reason I'm using comments, beside letting the browser do the splitting job for me, is that browsers that don't support `<template>` in core break with content like the following:

```js
var brokenWorkAround = document.createElement('div');
brokenWorkAround.innerHTML = '<td>goodbye TD</td>';
brokenWorkAround.childNodes; // [#text]
brokenWorkAround.outerHTML;
// <div>goodbye TD</div>
```

You can read about this issue in all the polyfills from the webcomponents issues. As a summary, if every browser were natively compatible with the `<template>` element, this wouldn't be a problem. Right now we all need to traverse the whole tree after creating it, in search of special placeholders. This is fast enough as a one-off operation, and thank gosh template literals are unique so it's easy to perform the traversing only once, but it wouldn't scale on huge documents, especially now that I've learned that for browsers, due to legacy, even simple checks can be costly.

**Attributes are "doomed"**

Now that I've explained the basics for the content, let's talk about attributes. If you inject a comment as an attribute value and there are no quotes around it, the layout is destroyed.

```html
<nope nopity=<!-- nope -->>nayh</nope>
```

So, for attributes, having a similar mechanism to define a unique entity/value to be notified about would be ACE!!!! Right now the content is injected sanitized upfront. It works darn well but it's not ideal as a solution.

**Moreover, on attributes**

If you put a placeholder in attributes you have the following possible issues: HTML is very forgiving in many parts; attributes are quite the opposite in various scenarios. As a summary, if whatever mechanism would tell the browser that any attribute with such special content should be ignored, all these problems would disappear.

**Backward compatibility**

As much as I'd love to have help from the platform itself regarding the template literals pattern, I'm afraid it won't ever land in production until all browsers out there support it (or there is a reliable polyfill for it). That means that exposing the internal HTML parser through a new API can surely benefit projects of the future, but it would rarely land for all browsers within 5+ years.

This last point is just my consideration about the effort/results ratio. Thanks again for helping out regardless.
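The "injected sanitized upfront" approach for attributes can be sketched roughly like this; `sanitizeAttributes` and the marker are hypothetical illustrations, not hyperHTML's actual code:

```js
// Hypothetical sketch of "sanitizing upfront": quote bare attribute
// placeholders before the string ever reaches the HTML parser, so a
// comment-style marker can't destroy the layout.
const MARKER = '--uid-1234--'; // made-up unique marker

function sanitizeAttributes(html) {
  // Quote bare `attr=MARKER` occurrences; already-quoted ones
  // (`attr="MARKER"`) don't match because the `=` must be directly
  // followed by the marker.
  return html.replace(
    new RegExp(`([\\w-]+)=(${MARKER})`, 'g'),
    (_, name) => `${name}="${MARKER}"`
  );
}

const out = sanitizeAttributes(`<p class=${MARKER}>hello</p>`);
// out: '<p class="--uid-1234--">hello</p>'
```

This is the kind of one-pass regex pre-processing the comment above calls "blazing fast"; a platform-level mechanism would make it unnecessary.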
I don't think it's a problem. If you use `preventClose`:

```js
await textStream1.pipeTo(streamingFragment.writable, { preventClose: true });
await textStream2.pipeTo(streamingFragment.writable);
```

The streaming fragment would consume the streams as if they were a single concatenated stream.

```js
await textStream3.pipeTo(streamingFragment.writable);
```

The above would fail, as the writable has now closed.
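This concatenation behaviour can be tried today with plain web streams (Node 18+ and modern browsers expose `ReadableStream`/`WritableStream` globally); the sink below is a stand-in for the hypothetical streaming fragment:

```js
// Demonstrates { preventClose: true }: two sources piped sequentially
// into one sink behave like a single concatenated stream.
function streamOf(...chunks) {
  return new ReadableStream({
    start(controller) {
      for (const c of chunks) controller.enqueue(c);
      controller.close();
    }
  });
}

const received = [];
const sink = new WritableStream({
  write(chunk) { received.push(chunk); }
});

async function main() {
  // preventClose keeps the sink open after the first source finishes...
  await streamOf('<p>Hello').pipeTo(sink, { preventClose: true });
  // ...so a second source can continue writing; this pipe closes the sink.
  await streamOf(' world</p>').pipeTo(sink);

  // A third pipe rejects: the writable has already closed.
  let failed = false;
  try {
    await streamOf('<p>again</p>').pipeTo(sink);
  } catch {
    failed = true;
  }
  return failed;
}
```

The second `pipeTo` (without `preventClose`) closes the sink, so any later pipe rejects, matching the failure mode described above.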
P.S. just in case my wishes come true ... what both me and (most likely) Justin would love to have natively exposed is a way to query raw content by UID:

```html
<html lang=UID>
<body> Hello <!--UID-->! <p class=UID></p></body>
```

The JS counterpart would be:

```js
const result = document.queryRawContent(UID);
// [
//   the html lang attribute,
//   the comment childNodes[1] of the body,
//   the p class attribute
// ]
```

Now that, in core, would make my parser a no-brainer (beside the issue with comments and attributes, but RegExps upfront are very good at that and blazing fast). [edit] even while streaming it would work; actually it'd be even better, so it's one pass for the browser.
Also, since I know that for many, code is better than a thousand words, this is the TL;DR version of what hyperHTML does.

```js
function tag(statics, ...interpolations) {
  if (this.statics !== statics) {
    this.statics = statics;
    this.updates = parse.call(this, statics, '<!--WUT-->');
  }
  this.updates(interpolations);
}

function parse(statics, lookFor) {
  const updates = [];
  this.innerHTML = statics.join(lookFor);
  traverse(this, updates, lookFor);
  const update = (value, i) => updates[i](value);
  return interpolations => interpolations.forEach(update);
}

function traverse(node, updates, lookFor) {
  switch (node.nodeType) {
    case Node.ELEMENT_NODE:
      updates.forEach.call(node.attributes, attr => {
        if (attr.value === lookFor)
          updates.push(v => attr.value = v);
      });
      updates.forEach.call(node.childNodes,
        node => traverse(node, updates, lookFor));
      break;
    case Node.COMMENT_NODE:
      if (`<!--${node.textContent}-->` === lookFor) {
        const text = node.ownerDocument.createTextNode('');
        node.parentNode.replaceChild(text, node);
        updates.push(value => text.textContent = value);
      }
  }
}

const body = tag.bind(document.body);

setInterval(() => {
  body`
    <div class="${'my-class'}">
      <p> It's ${(new Date).toLocaleTimeString()} </p>
    </div>`;
}, 1000);
```

The slow path is the one-off `parse`; everything afterwards reuses the cached updates. OK, I'll let you discuss the rest now 😄
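The "parse once, update many times" split at the heart of this can be shown without any DOM; the `render`/`greet` helpers below are a hypothetical sketch that uses plain string concatenation in place of real DOM updates:

```js
// DOM-free sketch of the "parse once, update many" pattern: the
// template is analysed once per unique literal (keyed by the frozen
// statics array) and later calls only run a cheap update function.
const cache = new WeakMap();
let slowPaths = 0; // counts how often the expensive analysis runs

function render(statics, ...interpolations) {
  let update = cache.get(statics);
  if (!update) {
    slowPaths++; // "slow path": runs once per template literal site
    update = values =>
      statics.reduce((out, s, i) => out + (i ? values[i - 1] : '') + s, '');
    cache.set(statics, update);
  }
  return update(interpolations);
}

function greet(name) {
  return render`<p>${name}</p>`;
}

const a = greet('one'); // analyses and caches
const b = greet('two'); // cache hit: only the update function runs
// a === '<p>one</p>', b === '<p>two</p>', slowPaths === 1
```

This works because the same tagged-template call site always passes the same frozen statics array, which is exactly the property hyperHTML relies on.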
I think the UID scanner you're talking about might not be necessary. Consider:

```js
const fragment = whatever`
  <p>${someContent}</p>
  <img src=${someImgSrc}>
`;
```

Where `whatever` is something like:

```js
async function whatever(strings, ...values) {
  const streamingFragment = document.createStreamingFragment();
  const writer = streamingFragment.writable.getWriter();

  for (const str of strings) {
    // str is:
    // <p>
    // </p> <img src=
    // >
    // (with extra whitespace of course)
    await writer.write(str);
    let parserState = streamingFragment.getParserState();

    if (parserState.tokenState == 'data') {
      // This is the case for <p>, and >
      await writer.write('<!-- -->');
      parserState.currentTarget.lastChild; // this is the comment you just created.
      // Swap it out for the interpolated value.
    } else if (parserState.tokenState.includes('attr-value')) {
      // Await the creation of this attr node.
      parserState.attrNode.then(attr => {
        // Add the interpolated value, or remove it and add an event listener instead, etc.
      });
    }
  }
}
```
Yes, that might work. As long as these scenarios are allowed:

```js
const fragment = whatever`
  <ul>${...}</ul>
  ${...}
  <p data-a=${....} onclick=${....}>also ${...} and</p>
  <img a=${...} b=${...} src=${someImgSrc}>
  <table><tr>${...}</tr></table>
`;
```

which looks like it'd be the case.
@WebReflection Interpolation should be allowed anywhere.

```js
whatever`
  <${'img'} src="hi">
`;
```

In the above case
@jakearchibald Do you expect
@inikulin yeah, that's what I was hoping to expose, or something equivalent. Why can't we expose it?
what about the following?

```js
whatever`
  <${'button'} ${'disabled'}>
`;
```

I actually don't mind having that possible, because boolean attributes need boolean values. Anyway, having my example covered would already be awesome.
@WebReflection The tokeniser calls that the "Before attribute name state", so if we could expose that, it'd be possible.
Not sure if this is just extra noise or something valuable, but if it can simplify anything: viperHTML uses a similar mechanism to parse once on the Node.js side. The parser is the pretty awesome htmlparser2. Probably inspiring as an API? I use the comment trick there too, though, since there is a
@jakearchibald These states are part of the parser's intrinsic mechanism and are subject to change; we've even removed/introduced a few recently just to fix a conformance-error-related bug in the parser. So, exposing them to end users would require us to freeze the current list of states, which would significantly complicate further development of the parser spec. Moreover, I believe some of them would be quite confusing for end users, e.g.
@inikulin would a subset be reasonable? As an example,
I'm keen on exposing some parser state to help libraries, but I'm happy for us to add it later rather than block streaming parsing on it.
@WebReflection Yes, that could be a solution. But I have some use cases in mind that can be confusing for the end user. Consider
@inikulin if someone writes broken HTML, I don't expect anything different than throwing errors and breaking everything right away (when using a new parser API). Template literals are static; there's no way one of them would suddenly start failing the parser ... it either works or fails forever, since these are also frozen arrays. Accordingly, I understand this API is not necessarily for template literals only, but if the streamer goes bananas due to wrong output, it's the developer's fault. Today it's the developer's fault regardless, but she'll never notice due to the silent failure.
You would be surprised looking at the real-world markup around the web. Also, there is no such thing as "broken markup" anymore. There is non-conforming markup, but a modern HTML parser can swallow anything. So, to conclude: you suggest bailing out with an error in case of invalid markup in this new streaming API?
you missed the edit: "when using a new parser API"

If the alternative is not to have it, yes please. I'm tired of missed opportunities due to lazy developers that need to be cuddled by standards for their mistakes.
I'm not keen on this approach, to be honest; it brings us back to the times of XHTML. One of the advantages of HTML5 was its flexibility regarding parse errors and, hence, document authoring.
this API's goal is different, and developers want to know if they wrote a broken template. Not knowing hurts them, and since there is no HTML highlighting by default inside strings, it's also a safety belt for them. So throw like any failed asynchronous operation would throw, and let them decide if they want to fall back to innerHTML, or fix that template literal instead, and forever.
Fair enough. It'd be good to expose these states at some point, but it doesn't need to be v1.
@WebReflection I agree having events for separate pieces of the HTML as it goes through would be quite nice, but I'd say it's already a bit more advanced than "as small as possible", more like version 2. For version 1, it would be nice at least to be able to insert streaming content into the DOM, even without hooks for separate parts of it.
events are just attributes ... what I've written intercepts/pauses at DOM chunks and/or attributes, no matter which attribute it is or what it does ... attributes 😄
@WebReflection Sure, but as I said, it's a bit more advanced because it requires providing hooks from inside the parser. I want to start with something that will definitely be possible for vendors to implement with pretty much no changes or hooks that aren't already there, and then iterate on top of that.
These are really full-size docs. Anywhere between 10K and 200K. I don't know what the averages are, tbh.
#2142 – previous issue where a streaming parsing API was discussed
Another important question: do we want it to behave like a streaming
@inikulin The fragment could buffer text until it's appended, at which point it knows its context. Although I guess it's a bit weird that you wouldn't be able to look at stuff in the fragment. The API could take an option that would give it context ahead of time, so nodes could be created before insertion.
@jakearchibald What if we modify the API a bit? We'll introduce a new entity, let's call it `StreamingParser`:

```js
// If we provide a context element, then content is streamed directly into it.
let parser = new StreamingParser(contextElement);
let response = await fetch(url);

response.body
  .pipeTo(parser.stream);

// You can examine parsed content at any moment using the `parser.fragment`
// property, which is a fragment mapped to the parsed content in the context element.
console.log(parser.fragment.childNodes.length);

// If a context element is not provided, we don't stream content anywhere;
// however, you can still use `parser.fragment` to examine content or attach it to some node.
parser = new StreamingParser();
// ...
```
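A rough, hypothetical sketch of that surface, with naive string buffering standing in for real HTML parsing (only the `StreamingParser`, `stream`, and `fragment` names come from the comment above; everything else is assumption):

```js
// Toy sketch of the proposed StreamingParser surface. No real HTML
// parsing happens: chunks are just buffered, to show how `stream`
// (a WritableStream) and an inspectable `fragment` could relate.
class StreamingParser {
  constructor(contextElement = null) {
    this.contextElement = contextElement;
    // Stand-in for a DocumentFragment mapped to the parsed content.
    this.fragment = { childNodes: [] };
    this.stream = new WritableStream({
      write: chunk => {
        // A real implementation would tokenise here; we fake one
        // "node" per chunk so the fragment can be examined mid-stream.
        this.fragment.childNodes.push(chunk);
        if (this.contextElement) this.contextElement.push(chunk);
      }
    });
  }
}

async function demo() {
  const target = []; // stand-in for a context element
  const parser = new StreamingParser(target);

  const source = new ReadableStream({
    start(c) {
      c.enqueue('<p>Hello');
      c.enqueue(' world</p>');
      c.close();
    }
  });

  await source.pipeTo(parser.stream);
  return { nodes: parser.fragment.childNodes.length, target };
}
```

The point of the shape is that `parser.fragment` stays inspectable while the pipe is in flight, which is the property being debated in the following comments.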
If you don't provide the context element, how is the content parsed?
In that case
Is that a valid context for a parser? |
As in, if I push
It'd still be nice to have the nodes created before the target. A "context" option could do this. The option could take a
How will it behave if we pass a
@jakearchibald Seems like I got it: in case of
@inikulin whoa, I really thought I'd replied to this, sorry.
@jakearchibald Thanks for the clarification. We've just discussed possible behaviours with @RReverser, and we were wondering whether parsing should affect the context element's ambient context: e.g. in case we stream inside
Hmm, that's a tough one. It'd be difficult to do what the parser does while giving access to the nodes before they're inserted. As in:

```js
const streamingFragment = document.createStreamingFragment({context: 'table'});
const writer = streamingFragment.writable.getWriter();

await writer.write('hello');
// Is 'hello' anywhere in streamingFragment.childNodes?
```

In cases where the node would be moved outside of the context, we could do the same. I'd want to avoid as many of the special cases as possible.
Another concern we discussed with @inikulin (also related to the discussion in the last few comments) is that content being parsed might contain closing tags and so leave the parent context. In that regard, the behaviour of
In an offline discussion, @sebmarkbage brought up the helpful point that if we added
@domenic Hmm, I'm not sure how it would help with streaming parsing? It seems to mostly help with streaming generation of content?
@RReverser The parsing would also be done in a streaming fashion, just like it is currently done for iframes loaded from network-derived lowercase-"r" responses.
What I mean is, I don't see how this helps with actually parsing HTML from the JS side (and getting tokens etc.); it rather seems to help with generating and delivering HTML to the renderer.
Actually, never mind; I realised that half of this old thread was already about the "delivery to the renderer" problem and not actual parsing. That is useful too, but it seems confusing to mix both in the same discussion.
So I have an idea: how about something like this on

As for why a generic readable stream: such an API could be immensely useful not just for things like displaying Markdown documents from the server, but also for things like displaying large CI logs and large files, where in more advanced cases a developer might choose to use the scroll position plus the current outer height to determine a threshold at which to render more items, simply buffering the logs until they're ready to render them. (I could totally see a browser doing this for displayed text files whose sizes are over 100MB; they might even choose to buffer the rest to disk to save memory and just read from there after they've received everything from the network, only pre-loading things that are remotely close to the viewport.)

I'm aware of how old this bug is. I still want to resurrect it.
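The buffer-until-threshold idea can be illustrated with a generic readable stream; `renderWithThreshold` below is a hypothetical sketch, not any proposed API:

```js
// Hypothetical sketch: consume a ReadableStream of text but only
// "render" up to a character threshold, buffering the remainder for
// later (e.g. until the user scrolls near the end of what's shown).
async function renderWithThreshold(stream, threshold, renderChunk) {
  const reader = stream.getReader();
  const buffered = [];
  let rendered = 0;
  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;
    if (rendered < threshold) {
      renderChunk(value);
      rendered += value.length;
    } else {
      buffered.push(value); // keep for later, off the render path
    }
  }
  return buffered;
}

async function demo() {
  const chunks = ['line 1\n', 'line 2\n', 'line 3\n'];
  const stream = new ReadableStream({
    start(c) { chunks.forEach(ch => c.enqueue(ch)); c.close(); }
  });

  const shown = [];
  const rest = await renderWithThreshold(stream, 10, c => shown.push(c));
  return { shown, rest };
}
```

With a 10-character threshold, the first two chunks are "rendered" and the third stays buffered, mirroring the CI-log scenario described above.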
Let's close this issue. This is probably best started in https://wicg.io/ or a personal repository before it reaches a point where it can be more seriously considered.
TL;DR HTML should provide an API for parsing. Why? "Textual" HTML is a widely used syntax. HTML parsing is complex enough to want to use the browser's parser, plus browsers can do implementation tricks with how they create elements, etc.
Unfortunately, the way the HTML parser is exposed in the web platform is a hodge-podge. Streaming parsing is only available for main document loads; other things rely on strings, which puts pressure on memory. innerHTML is synchronous and could cause jank for large documents (although I would like to see data on this, because it is pretty fast).
Here are some strawman requirements:
Commentary:
One big question is when this API exposes the tree it is operating on. Main document parsing does expose the tree and handles mutations to it pretty happily; innerHTML parsing does not until the nodes are adopted into the target node's document (which may start running custom element stuff.)
One minor question is what to do with errors.
Being asynchronous has implications for documents and/or custom elements. If you allow creating stuff in the main document, then you have to run the custom element constructors sometime, so to make it not jank you probably can't run them together. This is probably a feature worth addressing.
See also:
Issue 2827