Failure when parsing invalid HTML #10

Closed
yannham opened this issue Mar 10, 2017 · 2 comments
@yannham
Contributor

yannham commented Mar 10, 2017

I tried to parse the GitHub page of my project (https://github.com/yannham/mechaml/) using Lambdasoup, but got an unexpected error from the underlying Markup.ml. When I evaluate the following in a REPL (utop):

Soup.read_file "github.html" |> Soup.parse

where github.html is a dump of the GitHub page mentioned above, I get

Exception: Failure "require_current_element: None"

I expected Lambdasoup and Markup.ml to fail quietly on invalid HTML5, or at least not to fail with an uncaught exception.

Here is a snapshot of the source of the offending version of the page

aantron added the bug label Mar 11, 2017
@aantron
Owner

aantron commented Mar 11, 2017

Thanks. This is an internal error in Markup.ml that needs to be fixed.

It is caused by incorrect handling of an unmatched </form> tag in the (ill-formed) HTML input.

Note that Markup.ml should not exactly fail quietly. It should report the bad tag to ~report and then recover in a specific way: HTML5 requires a particular behavior here (see 'An end tag whose tag name is "form"'), so I hesitate to call the correct behavior a failure.
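
As a rough illustration of that reporting path, the sketch below passes a ~report callback to Markup.parse_html and collects whatever the parser reports while it recovers. The input string, the errors list, and the report function are made up for illustration; the exact set and wording of reported errors depend on the Markup.ml version, and this assumes a version in which the unmatched </form> no longer triggers the internal failure.

```ocaml
(* Requires the markup package, e.g. in utop: #require "markup";; *)

(* Hypothetical accumulator for the errors the parser reports. *)
let errors : string list ref = ref []

(* Callback for ~report: called with the location and the error
   each time the parser meets ill-formed input and recovers. *)
let report location error =
  errors := Markup.Error.to_string ~location error :: !errors

let () =
  (* Ill-formed input: a </form> end tag with no matching start tag. *)
  "<p>text</form></p>"
  |> Markup.string
  |> Markup.parse_html ~report
  |> Markup.signals
  |> Markup.drain;
  (* Print the collected errors in the order they were reported. *)
  List.iter print_endline (List.rev !errors)
```

With this approach the caller decides what to do with bad input: ignore the reports, log them, or abort by raising from the callback.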

@aantron
Owner

aantron commented May 8, 2017

This should be fixed now (in Markup.ml master). Sorry about the delay. I actually wrote most of this commit back in March, but then had to make a slightly ugly tradeoff, because the specification assumes a DOM-building parser and is not fully compatible with streaming parsing. While thinking about how to resolve that, I got swamped by other work. See the commit message for some detail on what I chose, though it is ultimately just some comments on esoteric HTML error recovery behavior.
