parse5 is about half the performance of htmlparser2 #1259

Closed
benjamingr opened this issue Dec 13, 2018 · 11 comments
Comments

@benjamingr

I'm running a benchmark that parses MDN and GitHub with _useHtmlParser2 set to true and to false, and I'm getting considerably faster times with htmlparser2.

Is there ongoing work on this that you need help with? I don't feel great about passing _useHtmlParser2: true.

The benchmark simply loads MDN and all of its HTML subresources, or a GitHub issue and all of its HTML subresources, with Cheerio.

How do I work on this?

Thanks for the great library!
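
A comparison like the one described above can be sketched with a small timing harness. This is a minimal sketch, not the benchmark used in this issue: timeParser and its runs parameter are hypothetical names, and the parser is passed in as a plain function so the harness itself needs nothing beyond Node.js.

```javascript
const { performance } = require('perf_hooks');

// Hypothetical helper (not part of cheerio, parse5, or htmlparser2):
// times `parse(html)` over several runs and reports the middle sample.
function timeParser(name, parse, html, runs = 20) {
  parse(html); // warm-up run so one-off JIT and loading costs are not measured

  const samples = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    parse(html);
    samples.push(performance.now() - start);
  }

  // Sort ascending and take the middle element (upper median for even counts).
  samples.sort((a, b) => a - b);
  const medianMs = samples[Math.floor(samples.length / 2)];
  return { name, runs, medianMs };
}
```

Assuming cheerio is installed, the two configurations from this issue could be compared by passing html => cheerio.load(html) and html => cheerio.load(html, { _useHtmlParser2: true }) as the parse argument and comparing the reported medianMs values.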

@benjamingr changed the title from "parse5 is about half the performance if htmlparser2" to "parse5 is about half the performance of htmlparser2" on Dec 13, 2018
@matthewmueller
Member

Hey there, I agree this is a bit confusing right now. If you prefer to use htmlparser2, the more future-proof way to do this is:

const htmlparser2 = require('htmlparser2')
const cheerio = require('cheerio')
const dom = htmlparser2.parseDOM(file.contents, options)
const $ = cheerio.load(dom)

Cheerio is capable of handling both DOM structures, thanks to @jugglinmike!

@benjamingr
Author

Hey, both ended up being too slow for my particular need, so I ended up writing my own parser (https://github.com/testimio/mhtml-parser), because I only needed very primitive processing and structure rather than a whole DOM tree.

I used _useHtmlParser2: true in my cheerio code at https://github.com/testimio/mhtml-parser/blob/master/src/link-replacer.js#L87 and I still use cheerio for SVGs :)
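
The kind of primitive processing described above, rewriting links without building a DOM tree, might look roughly like the sketch below. It is not the code from mhtml-parser: replaceLinks and its mapping argument are hypothetical, and a regular expression is used only for illustration. It handles quoted href and src attributes and will misfire on markup such as comments, scripts, or unquoted attribute values.

```javascript
// Hypothetical helper, not taken from mhtml-parser: rewrites quoted href/src
// attribute values by looking each URL up in `mapping` (a Map of old URL -> new URL).
function replaceLinks(html, mapping) {
  // Matches href="..." or src='...'; the \2 backreference requires the same closing quote.
  // Limitation: this also matches attributes that merely end in -src, such as data-src.
  const attribute = /\b(href|src)\s*=\s*(["'])(.*?)\2/gi;

  return html.replace(attribute, (match, name, quote, url) => {
    const target = mapping.get(url);
    // Leave the attribute untouched when there is no replacement for this URL.
    return target === undefined ? match : `${name}=${quote}${target}${quote}`;
  });
}
```

The trade-off is the one raised in this thread: a single pass over the string avoids allocating a tree, but it gives up the spec-compliant handling that parse5 provides.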

@benjamingr
Author

Also, great news and thanks @jugglinmike !

I mostly wanted to raise a flag and provide feedback. I'm going to go ahead and close this issue now - but feel free to reopen it and thanks again for the library :)

@frank-dspeed

@benjamingr lol https://github.com/testimio/mhtml-parser is using cheerio under the hood

@luanmuniz
Contributor

Take a look at this thread: #863

There are a few explanations there, not everything, but it's a good start

@frank-dspeed

@inikulin thanks for the resource. I will create a parse5 modification API based on the basic DOM API. No one uses jQuery anymore; today document.querySelector is its successor.

@benjamingr
Author

benjamingr commented Jan 17, 2020

@frank-dspeed no it's not, it's only using it for benchmarks and for SVGs. RTFC :]

https://github.com/testimio/mhtml-parser/blob/master/src/link-replacer.js#L35-L43

@benjamingr
Author

And honestly JSDom is so slow for what I'm using it for that I will end up having to write my own parser (for my employer, not personally). It's just a big undertaking (~3-4 months) and we won't provide the same API (just like fast-mhtml above doesn't - it solves a subset).

We have ~60 second parse times for some large websites.

@inikulin
Contributor

I'm working on JS bindings for https://github.com/cloudflare/lol-html at the moment. It provides low-output-latency, spec-compliant tokenisation along with CSS selector support, and is orders of magnitude faster than parse5. Maybe it will be useful for your case.

@frank-dspeed

@inikulin I like Cloudflare, but they address some other stuff at present. I am building tag-html, a template engine and construction kit for ESNext cross-environment template needs. For me the parser is only sugar on top of the goals and patterns we have already achieved; it would allow some DOM manipulation to get done in the case of SSR or WebWorkers.

Your module is cool for server-side proxies, as Cloudflare is one.

@benjamingr
Author

@inikulin I would, how can I promote this?


5 participants