-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize block parser by avoiding repeated HTML parsing and matcher creation #39424
Conversation
Size Change: +293 B (0%) Total Size: 1.16 MB
ℹ️ View Unchanged
|
4d3ad0e
to
dc4ae2e
Compare
* | ||
* @return {Function} The same function memoized. | ||
*/ | ||
function memoize( fn ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about using memoize which is already used in our code base?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, memize
would work here, too. I'll migrate to that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Migrated the memoization to memize
in ba1812d.
* @return {Node} Parsed DOM node. | ||
*/ | ||
function parseHtml( innerHTML ) { | ||
return hpqParse( innerHTML, ( h ) => h ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit weird, it seems we're using hpq not in the way it was supposed to be used. Makes me wonder if there's some internal optimization to be made. Isn't this just the same thing as returning innerHTML
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
innerHTML
is a string, while hpq( html, identity )
returns the parsed Node
object. A usual matcher would select something from the DOM node (typically some HTML attribute or text node), but an identity matcher that just returns the parsed node works, too.
The performance test don't seem to indicate improvement here, do you think this is within the error margin? |
The metric affected should be mainly It should be more visible when looking at long term trends, e.g., at |
Yeah, I've been procrastinating the work on that. We probably need more eye on this, I'm happy to share my repo. We may have reached the limit of the free plan in planetscale.com |
Yeah I'd be happy to contribute here 👍 |
dc4ae2e
to
ba1812d
Compare
source: 'attribute', | ||
selector: '[data-custom-class-name] > *', | ||
attribute: 'class', | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we do the same for the "style" attribute which is also a common one shared by a lot of blocks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see the style
attribute being read with this method anywhere. It's usually stored in the block comment (<!-- wp:heading {"style"} -->
), not in the block HTML markup, is that right? Then the code path is different and doesn't involve HTML parsing and matching.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, so we only care about "memoizing" attribute definitions if the attribute is parsed from content. I wonder if there's a way to surface this for block authors. In documentation at least.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For block authors, I think this matters only in rare extreme cases. Like when the block has many deprecated versions with shared attribute definitions. For example, the core/gallery
block has six deprecated versions, and they share definitions for attributes like img[data-full-url]
. But deduplicating them would have very little impact.
Where would be the best place to document this? We can do that anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, since most block use block.json and don't add anything "common" (rarely), it's not that important. I'm also not sure how to document this, so not important.
ba1812d
to
3035be7
Compare
…reation (WordPress#39424) * getBlockAttributes: parse HTML only once * matcherFromSource: memoize matchers by schema object * Use memize to memoize
In #39424 when we added an optimization to only parse a block's innerHTML once we also changed the behavior that the innerHTML propety represented the raw HTML loaded by the parser. Instead, what we have since that change is the DOM of the parsed HTML. In this patch we're adding yet-another parameter to the bag of arguments in `getBlockAttribute()` so that the new `raw` type can read and reproduce that original source, e.g. when reading the string `1 < 0` the parsed value's `innerHTML` will be `1 < 0` even though the block's raw content was `1 < 0`.
In #39424 when we added an optimization to only parse a block's innerHTML once we also changed the behavior that the innerHTML propety represented the raw HTML loaded by the parser. Instead, what we have since that change is the DOM of the parsed HTML. In this patch we're adding yet-another parameter to the bag of arguments in `getBlockAttribute()` so that the new `raw` type can read and reproduce that original source, e.g. when reading the string `1 < 0` the parsed value's `innerHTML` will be `1 < 0` even though the block's raw content was `1 < 0`.
Two simple optimizations of the block parser that lead to significant speed improvement. The large-post.html sample post (1000+ blocks) took 600ms to parse before this patch, now it takes only 340ms.
getBlockAttributes
parses HTML only once: before this patch, each innergetBlockAttribute
call parsed the block HTML again before selecting a target value from the parsed DOM. This patch optimizes that to parse only once. Makes use of the fact thathpq()
is an identity when called with an already parsed DOM nodematcherFromSource
that converts aschema
object to a matcher function used to be called 3900 times on thelarge-post.html
, although there are only 41 unique attribute schemas. I speeded up the parser by memoizing by theschema
object argument. There are two places where I extracted a schema to a separate variable to make it referentially unique. After this memoization, there are only 75 unique schemas. The difference between 41 and 75 is caused by some identical schemas to be defined multiple times, for example, inblock.json
files.