Use html5lib to parse HTML instead of using regexs #973

dstufft · 2013-06-02T13:53:45Z

HTML is not a regular language and thus cannot be accurately parsed
with a regex. The regexs that pip had were gnarly, and required
multiple levels of them in order to "work".

Using html5lib instead of regexs makes the parsing step simpler,
better tested, and much more robust.

There is still some "HTML" being parsed by regex, but there's not
a whole lot that can be done about that bit except remove it
completely. It depends on applying significance to html inside
of an html comment and is really horrible.

dstufft · 2013-06-02T13:54:30Z

This is broken into 2 commits to make it easier to review. The first commit deals only with actually vendoring html5lib (and six because it's a dep of html5lib) and the second commit is the actual changes to pip.

HTML is not a regular language and thus cannot be accurately parsed with a regex. The regexs that pip had were gnarly, and required multiple levels of them in order to "work". Using html5lib instead of regexs makes the parsing step simpler, better tested, and much more robust. There is still some use of regex to parse html. Namely the hack that looks for a <th> tag with a "rel" after it. This hack is required to work inside of HTML comments so it cannot be parsed as HTML.

dstufft · 2013-06-02T16:44:19Z

@mitsuhiko on IRC noted that this may break some peoples internal deploys.

For what it's worth, html5lib is designed to implement parsing the same way as the browsers do, and I've yet to find something it choked on. As far as I can tell something would need to have html so malformed that a browser is unlikely to be able to render it.

The XPath queries are very simple (Give me anything with the a tag) so there should not be any assumption of structure.

I do not believe there will be any major issue by merging this, however I wanted to note @mitsuhiko's concern for other reviewers.

carljm · 2013-06-02T19:27:13Z

pip/vendor/__init__.py

+from __future__ import absolute_import
+
+# Monkeypatch pip.vendor.six into just six
+#   This is kind of terrible, however it is the least bad of 3 bad options


I think (2), not (3), is the least bad of these options. Changing imports of six within html5lib is a simple, mechanical, easily-automated modification to upstream. It is not a significant barrier to upgrading the vendored html5lib in the future (just copy in the new html5lib and run a simple search-and-replace to change the imports), and certainly does not "pave the way for us to be in charge of maintaining it." Messing with sys.modules like this is much, much worse; people do import and use pip as an API, and so all the issues with (1) will still be a problem.

It might not be a bad idea to actually include a script that can do the search and replace correctly on the html5lib imports of six, for the benefit of future upgraders of html5lib.

I've opted not to include a script simply because I plan to write a utility to handle vendoring things better and I'll just include this in that rather then attempt to make a one off script.

* It was determined that simply rewriting the imports was less invasive than patching sys.modules.

Use html5lib to parse HTML instead of using regexs

Vendor html5lib because regex are a terrible way to parse html

cba4521

dstufft added 2 commits June 2, 2013 10:01

Document the reasons why we are monkeypatching six into sys.modules

f350de4

carljm reviewed Jun 2, 2013
View reviewed changes

Rewrite six dependency to import from pip.vendor.six instead of six

d767395

* It was determined that simply rewriting the imports was less invasive than patching sys.modules.

dstufft added a commit that referenced this pull request Jun 5, 2013

Merge pull request #973 from pypa/parse-html

8344fb8

Use html5lib to parse HTML instead of using regexs

dstufft merged commit 8344fb8 into develop Jun 5, 2013

dstufft deleted the parse-html branch June 5, 2013 18:51

lock bot added the auto-locked Outdated issues that have been locked by automation label Jun 5, 2019

lock bot locked as resolved and limited conversation to collaborators Jun 5, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use html5lib to parse HTML instead of using regexs #973

Use html5lib to parse HTML instead of using regexs #973

dstufft commented Jun 2, 2013

dstufft commented Jun 2, 2013

dstufft commented Jun 2, 2013

carljm Jun 2, 2013

carljm Jun 2, 2013

dstufft Jun 3, 2013

Use html5lib to parse HTML instead of using regexs #973

Use html5lib to parse HTML instead of using regexs #973

Conversation

dstufft commented Jun 2, 2013

dstufft commented Jun 2, 2013

dstufft commented Jun 2, 2013

carljm Jun 2, 2013

Choose a reason for hiding this comment

carljm Jun 2, 2013

Choose a reason for hiding this comment

dstufft Jun 3, 2013

Choose a reason for hiding this comment