Add a "modern" parsing API #2993
cc @jakearchibald @whatwg/html-parser
I'd like this API to support progressive rendering, so I guess my preference is "as soon as possible".

```js
const streamingFragment = document.createStreamingFragment();
const response = await fetch(url);

response.body
  .pipeThrough(new TextDecoderStream())
  .pipeTo(streamingFragment.writable);

document.body.append(streamingFragment);
```

I'd like the above to progressively render. The parsing would follow the "in template" insertion mode, although we may want options to handle other cases, like SVG.
What kinds of errors?
There are a few libraries that use tagged template literals to build HTML; I think their code would be simpler if they knew what state the parser was in at a given point. This might be an opportunity. Eg:

```js
const fragment = whatever`
  <p>${someContent}</p>
  <img src=${someImgSrc}>
`;
```

These libraries allow this kind of interpolation. I wonder if something like:

```js
const streamingFragment = document.createStreamingFragment();
const writer = streamingFragment.writable.getWriter();

await writer.write('<p>');
let parserState = await streamingFragment.getParserState();
parserState.currentNode; // paragraph

await writer.write('</p><img src=');
parserState = await streamingFragment.getParserState();
```

…I guess this last bit is more complicated, but ideally it should know it's in the "before attribute value" state for "src" within the "img" tag. Ideally there should be a way to get the resulting attribute & element as a promise.
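A toy version of this state inspection can be sketched without any DOM or the proposed API at all. Everything below (the `classify` helper and its deliberately naive one-character state machine) is hypothetical, just to show the shape of the idea:

```js
// Toy sketch: classify each interpolation point of a tagged template
// as "data" (text content) or "attr-value", by scanning the static
// chunks with a tiny, deliberately naive state machine.
function classify(statics) {
  const states = [];
  let inTag = false;
  for (let i = 0; i < statics.length - 1; i++) {
    for (const ch of statics[i]) {
      if (ch === '<') inTag = true;
      else if (ch === '>') inTag = false;
    }
    // If the chunk ends inside a tag right after '=', the next
    // interpolation sits in attribute-value position; otherwise
    // treat it as text ("data") position.
    states.push(
      inTag && statics[i].trimEnd().endsWith('=') ? 'attr-value' : 'data'
    );
  }
  return states;
}

// A tag function that just reports the state at each interpolation point.
function inspect(strings, ...values) {
  return classify(strings);
}

const result = inspect`<p>${'someContent'}</p><img src=${'someImgSrc'}>`;
// result: ['data', 'attr-value']
```

A real parser has many more tokeniser states than this, which is exactly why having the browser expose them would be valuable.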
@dominiccooney HTML can have conformance errors, but there are recovery mechanisms for all of them, and user agents don't bail out on errors. So any input can be consumed by the HTML parser without a problem. I like @jakearchibald's API. However, I wonder whether we need to support a full-document streaming parser and what the API would look like for it. Also, with the streaming fragment approach, will it be possible to perform consecutive writes to a fragment (e.g. pipe one response to a fragment and afterwards another one)? If so, how will it behave: overwrite the content of the fragment, or insert at the end of the fragment?
What do you mean by state here? Parser insertion mode, tokeniser state or something else?
Hmm yeah. I'm not sure what the best pattern is to use for that.
Yeah, you can do this with streams, either with individual writes, or piping with `preventClose`. As in, if the parser eats:

```html
<p>Hello
```

…then you:

```js
document.querySelector('p').append(', how are you today?');
```

…you get:

```html
<p>Hello, how are you today?
```

…if the parser then receives ` everyone`, you get:

```html
<p>Hello everyone, how are you today?
```

…as the parser has a pointer to the first text node of the paragraph.
@jakearchibald There is a problem with this approach. Consider we have two streams: one writes
Thanks @jakearchibald for thinking of us. I can speak for my 6+ months on the template literals vs DOM pattern, so that maybe you can have as much info as possible about implementations/proposals/APIs etc. I'll try to split this post into topics.

**Not just a UID**

I am not using just a UID, I'm using a comment that contains some UID.

```js
// dumb example
function tag(statics, ...interpolations) {
  const out = [statics[0]];
  for (let i = 1; i < statics.length; i++)
    out.push('<!-- MY UID -->', statics[i]);
  return out.join('');
}
tag`<p>a ${'b'} c</p>`;
```

This gives me the ability to let the HTML parser split text content into chunks for me. The reason I'm using comments, beside letting the browser do the splitting job for me, is that browsers that don't support `<template>` in core break with content like the following:

```js
var brokenWorkAround = document.createElement('div');
brokenWorkAround.innerHTML = '<td>goodbye TD</td>';
brokenWorkAround.childNodes; // [#text]
brokenWorkAround.outerHTML;
// <div>goodbye TD</div>
```

You can read about this issue in all the polyfills from the webcomponents issues. As a summary, if every browser were natively compatible with the `<template>` element, this wouldn't be a problem. Right now we all need to traverse the whole tree after creating it, in search of special placeholders. This is fast enough as a one-off operation, and thank gosh template literals are unique so it's easy to perform the traversing only once, but it wouldn't scale on huge documents, especially now that I've learned that for browsers, due to legacy, even simple checks can be costly.

**Attributes are "doomed"**

Now that I've explained the basics for the content, let's talk about attributes. If you inject a comment as an attribute value and there are no quotes around it, the layout is destroyed.

```html
<nope nopity=<!-- nope -->>nayh</nope>
```

So, for attributes, having a similar mechanism to define a unique entity/value to be notified about would be ACE!!!! Right now the content is injected sanitized upfront. It works darn well but it's not ideal as a solution.

**Moreover, on attributes**

If you put a placeholder in attributes you have the following possible issues: HTML is very forgiving in many parts; attributes are quite the opposite in various scenarios. As a summary, if whatever mechanism would tell the browser that any attribute with such special content should be ignored, all these problems would disappear.

**Backward compatibility**

As much as I'd love to have help from the platform itself regarding the template literals pattern, I'm afraid it won't ever land in production until all browsers out there support it (or there is a reliable polyfill for it). That means that exposing the internal HTML parser through a new API can surely benefit projects of the future, but it would rarely land for all browsers within 5+ years.

This last point is just my consideration about the effort/results ratio. Thanks again for helping out regardless.
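The "injected sanitized upfront" approach for attributes can be sketched roughly like this; `sanitizeAttributes` and the marker are hypothetical illustrations, not hyperHTML's actual code:

```js
// Hypothetical sketch of "sanitizing upfront": quote bare attribute
// placeholders before the string ever reaches the HTML parser, so a
// comment-style marker can't destroy the layout.
const MARKER = '--uid-1234--'; // made-up unique marker

function sanitizeAttributes(html) {
  // Quote bare `attr=MARKER` occurrences; already-quoted ones
  // (`attr="MARKER"`) don't match because the `=` must be directly
  // followed by the marker.
  return html.replace(
    new RegExp(`([\\w-]+)=(${MARKER})`, 'g'),
    (_, name) => `${name}="${MARKER}"`
  );
}

const out = sanitizeAttributes(`<p class=${MARKER}>hello</p>`);
// out: '<p class="--uid-1234--">hello</p>'
```

This is the kind of one-pass regex pre-processing the comment above calls "blazing fast"; a platform-level mechanism would make it unnecessary.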
I don't think it's a problem. If you use `preventClose`:

```js
await textStream1.pipeTo(streamingFragment.writable, { preventClose: true });
await textStream2.pipeTo(streamingFragment.writable);
```

The streaming fragment would consume the streams as if they were a single concatenated stream.

```js
await textStream3.pipeTo(streamingFragment.writable);
```

The above would fail, as the writable has now closed.
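This concatenation behaviour can be tried today with plain web streams (Node 18+ and modern browsers expose `ReadableStream`/`WritableStream` globally); the sink below is a stand-in for the hypothetical streaming fragment:

```js
// Demonstrates { preventClose: true }: two sources piped sequentially
// into one sink behave like a single concatenated stream.
function streamOf(...chunks) {
  return new ReadableStream({
    start(controller) {
      for (const c of chunks) controller.enqueue(c);
      controller.close();
    }
  });
}

const received = [];
const sink = new WritableStream({
  write(chunk) { received.push(chunk); }
});

async function main() {
  // preventClose keeps the sink open after the first source finishes...
  await streamOf('<p>Hello').pipeTo(sink, { preventClose: true });
  // ...so a second source can continue writing; this pipe closes the sink.
  await streamOf(' world</p>').pipeTo(sink);

  // A third pipe rejects: the writable has already closed.
  let failed = false;
  try {
    await streamOf('<p>again</p>').pipeTo(sink);
  } catch {
    failed = true;
  }
  return failed;
}
```

The second `pipeTo` (without `preventClose`) closes the sink, so any later pipe rejects, matching the failure mode described above.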
P.S. just in case my wishes come true ... what both me and (most likely) Justin would love to have natively exposed is a way to query raw content by UID:

```html
<html lang=UID>
<body> Hello <!--UID-->! <p class=UID></p></body>
```

The JS counterpart would be:

```js
const result = document.queryRawContent(UID);
// [
//   the html lang attribute,
//   the comment childNodes[1] of the body,
//   the p class attribute
// ]
```

Now that, in core, would make my parser a no-brainer (beside the issue with comments and attributes, but RegExps upfront are very good at that and blazing fast). [edit] even while streaming it would work; actually it'd be even better, so it's one pass for the browser.
Also, since I know that for many, code is better than a thousand words, this is the TL;DR version of what hyperHTML does.

```js
function tag(statics, ...interpolations) {
  if (this.statics !== statics) {
    this.statics = statics;
    this.updates = parse.call(this, statics, '<!--WUT-->');
  }
  this.updates(interpolations);
}

function parse(statics, lookFor) {
  const updates = [];
  this.innerHTML = statics.join(lookFor);
  traverse(this, updates, lookFor);
  const update = (value, i) => updates[i](value);
  return interpolations => interpolations.forEach(update);
}

function traverse(node, updates, lookFor) {
  switch (node.nodeType) {
    case Node.ELEMENT_NODE:
      updates.forEach.call(node.attributes, attr => {
        if (attr.value === lookFor)
          updates.push(v => attr.value = v);
      });
      updates.forEach.call(node.childNodes,
        node => traverse(node, updates, lookFor));
      break;
    case Node.COMMENT_NODE:
      if (`<!--${node.textContent}-->` === lookFor) {
        const text = node.ownerDocument.createTextNode('');
        node.parentNode.replaceChild(text, node);
        updates.push(value => text.textContent = value);
      }
  }
}

const body = tag.bind(document.body);

setInterval(() => {
  body`
    <div class="${'my-class'}">
      <p> It's ${(new Date).toLocaleTimeString()} </p>
    </div>`;
}, 1000);
```

The slow path is the one-off `parse`; everything afterwards reuses the cached updates. OK, I'll let you discuss the rest now 😄
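The "parse once, update many times" split at the heart of this can be shown without any DOM; the `render`/`greet` helpers below are a hypothetical sketch that uses plain string concatenation in place of real DOM updates:

```js
// DOM-free sketch of the "parse once, update many" pattern: the
// template is analysed once per unique literal (keyed by the frozen
// statics array) and later calls only run a cheap update function.
const cache = new WeakMap();
let slowPaths = 0; // counts how often the expensive analysis runs

function render(statics, ...interpolations) {
  let update = cache.get(statics);
  if (!update) {
    slowPaths++; // "slow path": runs once per template literal site
    update = values =>
      statics.reduce((out, s, i) => out + (i ? values[i - 1] : '') + s, '');
    cache.set(statics, update);
  }
  return update(interpolations);
}

function greet(name) {
  return render`<p>${name}</p>`;
}

const a = greet('one'); // analyses and caches
const b = greet('two'); // cache hit: only the update function runs
// a === '<p>one</p>', b === '<p>two</p>', slowPaths === 1
```

This works because the same tagged-template call site always passes the same frozen statics array, which is exactly the property hyperHTML relies on.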
I think the UID scanner you're talking about might not be necessary. Consider:

```js
const fragment = whatever`
  <p>${someContent}</p>
  <img src=${someImgSrc}>
`;
```

Where `whatever` is something like:

```js
async function whatever(strings, ...values) {
  const streamingFragment = document.createStreamingFragment();
  const writer = streamingFragment.writable.getWriter();

  for (const str of strings) {
    // str is:
    // <p>
    // </p> <img src=
    // >
    // (with extra whitespace of course)
    await writer.write(str);
    let parserState = streamingFragment.getParserState();

    if (parserState.tokenState == 'data') {
      // This is the case for <p>, and >
      await writer.write('<!-- -->');
      parserState.currentTarget.lastChild; // this is the comment you just created.
      // Swap it out for the interpolated value.
    } else if (parserState.tokenState.includes('attr-value')) {
      // Await the creation of this attr node.
      parserState.attrNode.then(attr => {
        // Add the interpolated value, or remove it and add an event listener instead, etc.
      });
    }
  }
}
```
Yes, that might work. As long as these scenarios are allowed:

```js
const fragment = whatever`
  <ul>${...}</ul>
  ${...}
  <p data-a=${....} onclick=${....}>also ${...} and</p>
  <img a=${...} b=${...} src=${someImgSrc}>
  <table><tr>${...}</tr></table>
`;
```

which looks like it'd be the case.
@WebReflection Interpolation should be allowed anywhere.

```js
whatever`
  <${'img'} src="hi">
`;
```

In the above case
@jakearchibald Do you expect
@inikulin yeah, that's what I was hoping to expose, or something equivalent. Why can't we expose it?
what about the following?

```js
whatever`
  <${'button'} ${'disabled'}>
`;
```

I actually don't mind having that possible, because boolean attributes need boolean values. Anyway, having my example covered would already be awesome.
@WebReflection The tokeniser calls that the "Before attribute name state", so if we could expose that, it'd be possible.
Not sure if this is just extra noise or something valuable, but if it can simplify anything: viperHTML uses a similar mechanism to parse once on the Node.js side. The parser is the pretty awesome htmlparser2. Probably inspiring as an API? I use the comment trick there too, though, since there is a
@jakearchibald These states are part of the parser's intrinsic mechanism and are subject to change; we've even removed/introduced a few recently just to fix a conformance-error-related bug in the parser. So, exposing them to end users would require us to freeze the current list of states, which would significantly complicate further development of the parser spec. Moreover, I believe some of them would be quite confusing for end users, e.g.
@inikulin would a subset be reasonable? As an example,
I'm keen on exposing some parser state to help libraries, but I'm happy for us to add it later rather than block streaming parsing on it.
@WebReflection Yes, that could be a solution. But I have some use cases in mind that can be confusing for the end user. Consider
@inikulin if someone writes broken HTML, I don't expect anything different than throwing errors and breaking everything right away (when using a new parser API). Template literals are static; there's no way one of them would suddenly start failing the parser ... it either works or fails forever, since these are also frozen arrays. Accordingly, I understand this API is not necessarily for template literals only, but if the streamer goes bananas due to wrong output, it's the developer's fault. Today it's the developer's fault regardless, but she'll never notice due to the silent failure.
You would be surprised looking at the real-world markup around the web. Also, there is no such thing as "broken markup" anymore. There is non-conforming markup, but a modern HTML parser can swallow anything. So, to conclude: you suggest bailing out with an error in case of invalid markup in this new streaming API?
you missed the edit: "when using a new parser API"

If the alternative is not to have it, yes please. I'm tired of missed opportunities due to lazy developers that need to be cuddled by standards for their mistakes.
I'm not keen on this approach, to be honest; it brings us back to the times of XHTML. One of the advantages of HTML5 was its flexibility regarding parse errors and, hence, document authoring.
this API's goal is different, and developers want to know if they wrote a broken template. Not knowing hurts them, and since there is no HTML highlighting by default inside strings, it's also a safety belt for them. So throw like any failed asynchronous operation would throw, and let them decide if they want to fall back to innerHTML, or fix that template literal instead, and forever.
Fair enough. It'd be good to expose these states at some point, but it doesn't need to be v1.
@WebReflection I agree having events for separate pieces of the HTML as it goes through would be quite nice, but I'd say it's already a bit more advanced than "as small as possible", more like version 2. For version 1, it would be nice at least to be able to insert streaming content into the DOM, even without hooks for separate parts of it.
events are just attributes ... what I've written intercepts/pauses at DOM chunks and/or attributes, no matter which attribute it is or what it does ... attributes 😄
@WebReflection Sure, but as I said, it's a bit more advanced because it requires providing hooks from inside the parser. I want to start with something that will definitely be possible for vendors to implement with pretty much no changes or hooks that aren't already there, and then iterate on top of that.
These are really full-size docs. Anywhere between 10K and 200K. I don't know what the averages are, tbh.
#2142 – previous issue where a streaming parsing API was discussed
Another important question: do we want it to behave like a streaming
@inikulin The fragment could buffer text until it's appended, at which point it knows its context. Although I guess it's a bit weird that you wouldn't be able to look at stuff in the fragment. The API could take an option that would give it context ahead of time, so nodes could be created before insertion.
@jakearchibald What if we modify the API a bit? We'll introduce a new entity, let's call it `StreamingParser`:

```js
// If we provide a context element, then content is streamed directly into it.
let parser = new StreamingParser(contextElement);
let response = await fetch(url);

response.body
  .pipeTo(parser.stream);

// You can examine parsed content at any moment using the `parser.fragment`
// property, which is a fragment mapped to the parsed content in the context element.
console.log(parser.fragment.childNodes.length);

// If a context element is not provided, we don't stream content anywhere;
// however, you can still use `parser.fragment` to examine content or attach it to some node.
parser = new StreamingParser();
// ...
```
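A rough, hypothetical sketch of that surface, with naive string buffering standing in for real HTML parsing (only the `StreamingParser`, `stream`, and `fragment` names come from the comment above; everything else is assumption):

```js
// Toy sketch of the proposed StreamingParser surface. No real HTML
// parsing happens: chunks are just buffered, to show how `stream`
// (a WritableStream) and an inspectable `fragment` could relate.
class StreamingParser {
  constructor(contextElement = null) {
    this.contextElement = contextElement;
    // Stand-in for a DocumentFragment mapped to the parsed content.
    this.fragment = { childNodes: [] };
    this.stream = new WritableStream({
      write: chunk => {
        // A real implementation would tokenise here; we fake one
        // "node" per chunk so the fragment can be examined mid-stream.
        this.fragment.childNodes.push(chunk);
        if (this.contextElement) this.contextElement.push(chunk);
      }
    });
  }
}

async function demo() {
  const target = []; // stand-in for a context element
  const parser = new StreamingParser(target);

  const source = new ReadableStream({
    start(c) {
      c.enqueue('<p>Hello');
      c.enqueue(' world</p>');
      c.close();
    }
  });

  await source.pipeTo(parser.stream);
  return { nodes: parser.fragment.childNodes.length, target };
}
```

The point of the shape is that `parser.fragment` stays inspectable while the pipe is in flight, which is the property being debated in the following comments.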
If you don't provide the context element, how is the content parsed?
In that case
Is that a valid context for a parser? |
As in, if I push
It'd still be nice to have the nodes created before the target. A "context" option could do this. The option could take a
How will it behave if we pass a
@jakearchibald Seems like I got it: in case of
@inikulin whoa, I really thought I'd replied to this, sorry.
@jakearchibald Thanks for the clarification. We've just discussed possible behaviours with @RReverser, and we were wondering whether parsing should affect the context element's ambient context: e.g. in case we stream inside
Hmm, that's a tough one. It'd be difficult to do what the parser does while giving access to the nodes before they're inserted. As in:

```js
const streamingFragment = document.createStreamingFragment({context: 'table'});
const writer = streamingFragment.writable.getWriter();

await writer.write('hello');
// Is 'hello' anywhere in streamingFragment.childNodes?
```

In cases where the node would be moved outside of the context, we could do the same. I'd want to avoid as many of the special cases as possible.
Another concern we discussed with @inikulin (also related to the discussion in the last few comments) is that content being parsed might contain closing tags and so leave the parent context. In that regard, the behaviour of
In an offline discussion, @sebmarkbage brought up the helpful point that if we added
@domenic Hmm, I'm not sure how it would help with streaming parsing? It seems to mostly help with streaming generation of content?
@RReverser The parsing would also be done in a streaming fashion, just like it is currently done for iframes loaded from network-derived lowercase-"r" responses.
What I mean is, I don't see how this helps with actually parsing HTML from the JS side (and getting tokens etc.); it rather seems to help with generating and delivering HTML to the renderer.
Actually, never mind; I realised that half of this old thread was already about the "delivery to the renderer" problem and not actual parsing. That is useful too, but it seems confusing to mix both in the same discussion.
So I have an idea: how about something like this on

As for why a generic readable stream: such an API could be immensely useful not just for things like displaying Markdown documents from the server, but also for things like displaying large CI logs and large files, where in more advanced cases a developer might choose to use the scroll position plus the current outer height to determine a threshold at which to render more items, simply buffering the logs until they're ready to render them. (I could totally see a browser doing this for displayed text files whose sizes are over 100MB; they might even choose to buffer the rest to disk to save memory and just read from there after they've received everything from the network, only pre-loading things that are remotely close to the viewport.)

I'm aware of how old this bug is. I still want to resurrect it.
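The buffer-until-threshold idea can be illustrated with a generic readable stream; `renderWithThreshold` below is a hypothetical sketch, not any proposed API:

```js
// Hypothetical sketch: consume a ReadableStream of text but only
// "render" up to a character threshold, buffering the remainder for
// later (e.g. until the user scrolls near the end of what's shown).
async function renderWithThreshold(stream, threshold, renderChunk) {
  const reader = stream.getReader();
  const buffered = [];
  let rendered = 0;
  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;
    if (rendered < threshold) {
      renderChunk(value);
      rendered += value.length;
    } else {
      buffered.push(value); // keep for later, off the render path
    }
  }
  return buffered;
}

async function demo() {
  const chunks = ['line 1\n', 'line 2\n', 'line 3\n'];
  const stream = new ReadableStream({
    start(c) { chunks.forEach(ch => c.enqueue(ch)); c.close(); }
  });

  const shown = [];
  const rest = await renderWithThreshold(stream, 10, c => shown.push(c));
  return { shown, rest };
}
```

With a 10-character threshold, the first two chunks are "rendered" and the third stays buffered, mirroring the CI-log scenario described above.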
Let's close this issue. This is probably best started in https://wicg.io/ or a personal repository before it reaches a point where it can be more seriously considered.
TL;DR HTML should provide an API for parsing. Why? "Textual" HTML is a widely used syntax. HTML parsing is complex enough to want to use the browser's parser, plus browsers can do implementation tricks with how they create elements, etc.
Unfortunately, the way the HTML parser is exposed in the web platform is a hodge-podge. Streaming parsing is only available for main document loads; other things rely on strings, which puts pressure on memory. innerHTML is synchronous and could cause jank for large documents (although I would like to see data on this, because it is pretty fast).
Here are some strawman requirements:
Commentary:
One big question is when this API exposes the tree it is operating on. Main document parsing does expose the tree and handles mutations to it pretty happily; innerHTML parsing does not until the nodes are adopted into the target node's document (which may start running custom element stuff.)
One minor question is what to do with errors.
Being asynchronous has implications for documents and/or custom elements. If you allow creating stuff in the main document, then you have to run the custom element constructors sometime, so to make it not jank you probably can't run them together. This is probably a feature worth addressing.
See also:
Issue 2827