read_html hanging? #4786
Comments
Though it works if you just copy out the table code HTML, so it must be a parsing issue.
Perhaps I just need to pass in some regex though. It seemed pretty simple, so I expected it to work.
I can take a look later today.
There's something strange going on with
HTML that lets you know you're in for some pain:
Haha, indeed.
@jseabold I've sort of narrowed it down:
that I would guess that the invalidity of the HTML is what's causing this, let me look a little deeper
For whatever reason, there's a cycle in the parse tree (which makes it not a tree anymore): one of the element's children is its own parent, so traversal never finishes. This might take me a while to fix, since it looks like a bs4 problem: I can parse the page successfully with html5lib alone.
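To illustrate the failure mode described above, here is a minimal sketch of how a child that points back at an ancestor makes a naive traversal loop forever, and how tracking the current path catches it. The `Node` class and `find_cycle` function are hypothetical stand-ins written for this example; they are not bs4's or pandas' actual data structures or code.

```python
class Node:
    """Hypothetical stand-in for a parsed element (not bs4's Tag class)."""

    def __init__(self, name):
        self.name = name
        self.children = []


def find_cycle(root):
    """Return the names along the first cycle found from root, or None."""
    on_path = set()  # ids of nodes on the current root-to-node path
    path = []        # the same nodes, in order, for reporting

    def visit(node):
        if id(node) in on_path:
            # We reached a node that is already one of our own ancestors.
            return [n.name for n in path] + [node.name]
        on_path.add(id(node))
        path.append(node)
        for child in node.children:
            found = visit(child)
            if found is not None:
                return found
        path.pop()
        on_path.discard(id(node))
        return None

    return visit(root)


# A broken "tree": the td element lists its own parent tr as a child.
table, tr, td = Node("table"), Node("tr"), Node("td")
table.children.append(tr)
tr.children.append(td)
td.children.append(tr)

print(find_cycle(table))  # ['table', 'tr', 'td', 'tr']
```

The sketch is recursive for brevity, so a very deep document could hit Python's recursion limit; an explicit stack would avoid that.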
Thanks for having a look. Not a huge priority for me or anything and obviously not a typical web site. Just wasn't sure what was going on.
@jseabold can we close this?
Sure, if there isn't any way to fail gracefully on malformed HTML here. Not likely to come up often so probably not worth any effort.
eh i'll leave it... low prio tho
This is another URL that has the same issue: http://www.sec.gov/Archives/edgar/data/47217/000104746913006802/a2215416z10-q.htm
that's unfortunate
darn ... i was hoping this would be a rare occurrence
@cpcloud That was the same link. I just moved it to this thread. But I will keep looking for examples.
oh sorry didn't scroll up ... getting late in NYC
I just filed a bug on BeautifulSoup: https://bugs.launchpad.net/beautifulsoup/+bug/1271394
👍
Since the parser dep going into an infinite loop is not something pandas can address directly,
Is this a local problem? I'm on g41d10b5. The following url appears never to return, or at least it's taking an inordinately long time.
http://www.nku.edu/~longa/geomed/ppa/doc/globals/Globals.htm
Same result with either the url or the raw html.