Missing content after parsing HTML #1927
@vdoan7773 Thanks for asking this question.

The short answer is that you're seeing a difference between how a browser "fixes" invalid HTML and how libxml2 (the underlying parser used by Nokogiri) "fixes" invalid HTML. Nokogiri inherits this behavior from the parser, and so there's nothing we can easily do to change it.

Longer answer: the presence of the bare character

outputs:

You can see that libxml2 is flagging the bare character. We've had lots of issues filed over the years pointing out differences between how libxml2 fixes broken markup compared to browsers, Xerces, etc. Unfortunately, fixing broken markup isn't defined in the HTML spec, and so it's implemented differently by different parsers.

Maybe the one actionable thing here is to file a bug report with Snyk asking them to emit well-formed, valid HTML. They should be html-encoding that version string before putting it into their web page.

Anyhoo, sorry I can't be of more help here. I hope I've explained what's going on, but let me know if you have other questions.
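The fix suggested above can be sketched in plain Ruby: html-encoding the string before it goes into the page turns a bare `<` into an entity, so no parser mistakes it for the start of a tag. This is a minimal sketch; the `version_string` value is made up for illustration.

```ruby
require "cgi"

# A hypothetical version-range string of the kind described above;
# the bare "<" makes any HTML it is pasted into invalid.
version_string = "<1.10.5"

# CGI.escapeHTML turns "<" into "&lt;", so parsers no longer see the
# start of a (bogus) tag and won't drop the text that follows it.
encoded = CGI.escapeHTML(version_string)
puts encoded  # prints "&lt;1.10.5"
```

Once encoded this way, browsers, libxml2, Xerces, etc. all agree on how to parse the page, and no error-recovery heuristics come into play.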
Describe the bug
Some content of the HTML is removed after parsing
To Reproduce
Run the following script:
Nokogiri (1.10.4)