read_html hanging? #4786

Closed
jseabold opened this issue Sep 9, 2013 · 21 comments
Labels
IO Data IO issues that don't fit into a more specific label IO HTML read_html, to_html, Styler.apply, Styler.applymap

@jseabold
Contributor

jseabold commented Sep 9, 2013

Is this a local problem? I'm on g41d10b5. Calling read_html on the following URL appears never to return, or at least it takes an inordinately long time.

http://www.nku.edu/~longa/geomed/ppa/doc/globals/Globals.htm

Same result with either the url or the raw html.

@jseabold
Contributor Author

jseabold commented Sep 9, 2013

Though it works if you copy out just the table's HTML, so it must be a parsing issue.
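
A rough sketch of that workaround, for anyone who wants to script it rather than copy by hand. It assumes the tables are not nested and that the raw HTML is already in a string; the helper name `extract_tables` is made up for illustration and is not part of pandas or bs4.

```python
import re

# Matches one <table ...> ... </table> span, case-insensitively and across
# newlines. Assumption: tables are not nested. The non-greedy match would
# cut a nested table short at the first closing tag.
TABLE_RE = re.compile(r"<table\b[^>]*>.*?</table\s*>", re.IGNORECASE | re.DOTALL)


def extract_tables(raw_html):
    """Return the markup of each top-level <table> found in raw_html."""
    return TABLE_RE.findall(raw_html)
```

Each returned fragment could then be handed to the parser on its own, which sidesteps whatever markup outside the tables is tripping it up.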

@jseabold
Contributor Author

jseabold commented Sep 9, 2013

Perhaps I just need to pass in some regex though. It seemed pretty simple, so I expected it to work.

@cpcloud
Member

cpcloud commented Sep 9, 2013

I can take a look later today.

@ghost ghost assigned cpcloud Sep 10, 2013
@cpcloud
Member

cpcloud commented Sep 10, 2013

There's something strange going on with html5lib. lxml + bs4 "parses" it but who knows if it's correct.

@cpcloud
Member

cpcloud commented Sep 10, 2013

HTML that lets you know you're in for some pain:

<meta content="Microsoft Word 97" name="Generator"/>

@jseabold
Contributor Author

Haha, indeed.

@cpcloud
Member

cpcloud commented Sep 10, 2013

@jseabold I've sort of narrowed it down:

ipdb> list
    481                 return [element for element in generator
    482                         if isinstance(element, Tag)]
    483             # Optimization to find all tags with a given name.
    484             elif isinstance(name, basestring):
    485                 return [element for element in generator
--> 486                         if isinstance(element, Tag) and element.name == name]
    487             else:
    488                 strainer = SoupStrainer(name, attrs, text, **kwargs)
    489         else:
    490             # Build a SoupStrainer
    491             strainer = SoupStrainer(name, attrs, text, **kwargs)

ipdb> element
<ol><font face="Arial"><i><b>

</b></i><li>The input data file should contain the X,Y coordinates and the value at each point (x<sub>I</sub>).</li>
<li>Input whether you have a spatial weights matrix file.</li>
<li><font face="Arial">If you do not have a spatial weights matrix, you’ll be asked to enter the </font><i><font size="4">A </font></i><font face="Arial">and</font><i><font size="4"> m </font></i><font face="Arial">parameters (see below).</font></li></font><li><font face="Arial">If you do not have a spatial weights matrix, you’ll be asked to enter the </font><i><font size="4">A </font></i><font face="Arial">and</font><i><font size="4"> m </font></i><font face="Arial">parameters (see below).</font></li><font face="Arial">
<li>You will be asked to enter the maximum distance, the number of steps, and whether you want bands or increments.</li></font></ol>
ipdb> n
> /home/phillip/.virtualenvs/pandas/lib/python2.7/site-packages/bs4/element.py(485)_find_all()
    484             elif isinstance(name, basestring):
--> 485                 return [element for element in generator
    486                         if isinstance(element, Tag) and element.name == name]

ipdb> n
> /home/phillip/.virtualenvs/pandas/lib/python2.7/site-packages/bs4/element.py(486)_find_all()
    485                 return [element for element in generator
--> 486                         if isinstance(element, Tag) and element.name == name]
    487             else:

ipdb> element
<i><b>

</b></i>
ipdb> n
> /home/phillip/.virtualenvs/pandas/lib/python2.7/site-packages/bs4/element.py(485)_find_all()
    484             elif isinstance(name, basestring):
--> 485                 return [element for element in generator
    486                         if isinstance(element, Tag) and element.name == name]

ipdb> n
> /home/phillip/.virtualenvs/pandas/lib/python2.7/site-packages/bs4/element.py(486)_find_all()
    485                 return [element for element in generator
--> 486                         if isinstance(element, Tag) and element.name == name]
    487             else:

ipdb> element
<ol><font face="Arial"><i><b>

</b></i><li>The input data file should contain the X,Y coordinates and the value at each point (x<sub>I</sub>).</li>
<li>Input whether you have a spatial weights matrix file.</li>
<li><font face="Arial">If you do not have a spatial weights matrix, you’ll be asked to enter the </font><i><font size="4">A </font></i><font face="Arial">and</font><i><font size="4"> m </font></i><font face="Arial">parameters (see below).</font></li></font><li><font face="Arial">If you do not have a spatial weights matrix, you’ll be asked to enter the </font><i><font size="4">A </font></i><font face="Arial">and</font><i><font size="4"> m </font></i><font face="Arial">parameters (see below).</font></li><font face="Arial">
<li>You will be asked to enter the maximum distance, the number of steps, and whether you want bands or increments.</li></font></ol>

That same <ol> element keeps coming back on every pass through the loop....

I would guess that the invalid HTML is what's causing this; let me look a little deeper.

@cpcloud
Member

cpcloud commented Sep 10, 2013

For whatever reason, there's a cycle in the parse tree (which makes it not a tree anymore), so that's why this never finishes: one of the element's children is also its parent, so the traversal never terminates. This might take me a while to fix, as it looks like a bs4 problem, since I can parse successfully with html5lib alone.
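
To illustrate why a cycle makes the walk run forever, and how a traversal could notice it, here is a minimal sketch. It uses a hypothetical `Node` class, not bs4's `Tag`, and is not how bs4 actually iterates. In a real tree every node is reached exactly once, so seeing the same node twice means the structure is no longer a tree.

```python
class Node:
    """Hypothetical stand-in for a parsed element (not bs4's Tag)."""

    def __init__(self, name):
        self.name = name
        self.children = []


def descendants(root):
    """Yield nodes depth-first; raise instead of looping if a node repeats."""
    seen = set()  # ids of nodes already visited
    stack = [root]
    while stack:
        node = stack.pop()
        if id(node) in seen:
            # A second visit means a cycle (or a node with two parents).
            raise ValueError(f"parse tree is not a tree: <{node.name}> seen twice")
        seen.add(id(node))
        yield node
        # Reversed so children come off the stack in document order.
        stack.extend(reversed(node.children))
```

With an `<ol>` whose descendant lists the `<ol>` itself as a child, `list(descendants(ol))` raises `ValueError` rather than hanging.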

@jseabold
Contributor Author

Thanks for having a look. Not a huge priority for me or anything and obviously not a typical web site. Just wasn't sure what was going on.

@cpcloud
Member

cpcloud commented Sep 10, 2013

39074443

@cpcloud
Member

cpcloud commented Sep 22, 2013

@jseabold can we close this?

@jseabold
Contributor Author

Sure, if there isn't any way to fail gracefully on malformed HTML here. Not likely to come up often so probably not worth any effort.
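
For what it's worth, one way a caller could fail gracefully today, without any change to pandas, is to run the parse in a child process with a time limit. This is only a sketch using the standard library; the helper name `run_script_with_timeout` is made up for illustration, and the parse is assumed to be expressible as a short Python script that prints its result.

```python
import subprocess
import sys


def run_script_with_timeout(script, timeout_s):
    """Run `script` in a child Python process.

    Returns the child's stdout as a string, or None if the child exceeded
    timeout_s seconds or exited with a non-zero status.
    """
    try:
        done = subprocess.run(
            [sys.executable, "-c", script],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # subprocess.run kills the child on expiry
        )
    except subprocess.TimeoutExpired:
        return None
    if done.returncode != 0:
        return None
    return done.stdout
```

The script passed in could, for example, call `pd.read_html(url)` and print the tables as CSV; a hung parser then costs at most `timeout_s` seconds instead of blocking the session.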

@cpcloud
Member

cpcloud commented Sep 22, 2013

eh i'll leave it... low prio tho

@cancan101
Contributor

This is another URL that has the same issue: http://www.sec.gov/Archives/edgar/data/47217/000104746913006802/a2215416z10-q.htm

@cpcloud
Member

cpcloud commented Oct 4, 2013

that's unfortunate

@cpcloud
Member

cpcloud commented Oct 4, 2013

darn ... i was hoping this would be a rare occurrence

@cancan101
Contributor

@cpcloud That was the same link. I just moved it to this thread. But I will keep looking for examples.

@cpcloud
Member

cpcloud commented Oct 4, 2013

oh sorry didn't scroll up ... getting late in NYC

@cancan101
Contributor

I just filed a bug on BeautifulSoup: https://bugs.launchpad.net/beautifulsoup/+bug/1271394

@cpcloud
Member

cpcloud commented Jan 22, 2014

👍

@ghost

ghost commented Jan 24, 2014

Since an infinite loop in the parser dependency is not something pandas can address directly, I'm closing this. Intrigued by how you can make self-referential HTML markup, but not that intrigued.

@ghost ghost closed this as completed Jan 24, 2014