Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Travis 3.6 Slow Build Test Failures - BS4 Issue #29622

Closed
alimcmaster1 opened this issue Nov 14, 2019 · 13 comments
Closed

Travis 3.6 Slow Build Test Failures - BS4 Issue #29622

alimcmaster1 opened this issue Nov 14, 2019 · 13 comments
Assignees
Labels
CI Continuous Integration Testing pandas testing functions or related to the test suite Unreliable Test Unit tests that occasionally fail

Comments

@alimcmaster1
Copy link
Member

test_thousands_macau_index_col in test_html fails using beautifulsoup4 version 4.8.1 and passes using version 4.7.1.

Example failures:
https://travis-ci.org/pandas-dev/pandas/jobs/611742966

As pointed out by @jreback in this PR - #29603

Example failing build pre my changes in #29513
https://travis-ci.org/pandas-dev/pandas/jobs/611347428

Attempting to debug issue and will report to bs4. Should we pin in the meantime to version <= 4.7.1?

@alimcmaster1 alimcmaster1 added CI Continuous Integration Testing pandas testing functions or related to the test suite labels Nov 14, 2019
@simonjayhawkins
Copy link
Member

duplicate of #28125

@nschloe
Copy link
Contributor

nschloe commented Dec 18, 2019

Bumped into this as well, downgrading to beautifulsoup4 4.7.1 fixed it for me.

@jbrockmendel jbrockmendel added the Unreliable Test Unit tests that occasionally fail label Dec 25, 2019
@jreback jreback added this to the 1.0 milestone Dec 27, 2019
@jreback
Copy link
Contributor

jreback commented Dec 27, 2019

would really really like to fix / xfail this test. ! cc @alimcmaster1

@alimcmaster1
Copy link
Member Author

take

@leonardr
Copy link

leonardr commented Dec 29, 2019

This was brought to my attention through Beautiful Soup bug 1857765. I've fixed the problem on the Beautiful Soup side in revision 533. However, in normal usage Tag.attrs should never be None. Looking at the test that's failing I think there's a problem in pandas code, though the fix isn't obvious to me.

This code iterates over a list of tables and calls decompose() on anything inside that table with display:none. I think the problem happens when you have nested tables, and processing table N involves calling decompose() on table N+1 (because table N+1 is inside table N and has display:none).

Once you decompose() a Tag its __dict__ is wiped. That means calling Tag.attrs will invoke Tag.__getattr__, which will try to find a child tag called "attrs". That will return None, of course. So that's how Tag.attrs ends up None.

In general, once you call decompose() on a Tag object its behavior becomes undefined and you're not supposed to use it for anything. Here it looks like you're not intentionally using a decomposed Tag, but you're iterating over a list and some Tags may have been decomposed by the time you get to them.

Since you're only adding table to a list if a find call returns a value other than None, and find on a decomposed Tag will always return None, I don't think this problem has any real-world effects. It would be better to avoid dealing with a decomposed Tag altogether, but right now there's no good way to tell whether a Tag has been decomposed.

I could change the behavior of Tag so that attempts to use a decomposed Tag raise an exception, but that would just make the problem more obvious -- you'd get a different exception inside Tag.__unicode__. So maybe I should set a flag on a Tag that's been decomposed; that way you can check for it. Let me know what you think.

@alimcmaster1
Copy link
Member Author

Thanks very much @leonardr for taking a look into this. I agree with you we should look into this on the pandas side. Ill take a look and reply back on here

I can confirm your change in revision 553 fix our test on master. https://bazaar.launchpad.net/~leonardr/beautifuI lsoup/bs4/revision/553

@jreback I have xfailed our tests for now and then will remove when beautifulsoup changes are released: see PR #30544

@leonardr would be much appreciated if you ping us on here when you aim to next release

@leonardr
Copy link

I can do a release once we figure out how to deal with the underlying situation -- a Tag might have been decomposed and there's no way to tell.

@alimcmaster1
Copy link
Member Author

alimcmaster1 commented Dec 30, 2019

I can do a release once we figure out how to deal with the underlying situation -- a Tag might have been decomposed and there's no way to tell.

Agree thanks for the detailed explanation above. Defer to you really on how you want to handle this and we can update the pandas code accordingly.

Here it looks like you're not intentionally using a decomposed Tag, but you're iterating over a list and some Tags may have been decomposed by the time you get to them.

Could bs4 provide functionality to safely iterate over a list? Might be easier than callers handling a flag?

Otherwise the flag approach to identify if .decompose() has been called on the Tag sounds good or calling .decompose() a second time a no-op?

@leonardr

@leonardr
Copy link

leonardr commented Dec 31, 2019

Iterating over a list isn't dangerous per se; the problem happens when you try to take an action on a decomposed PageElement. That can happen any time you decompose a tag you've assigned to a variable.

While working out a solution I noticed that Tag.name will be None for a decomposed tag. If you want to get everything working again without waiting for a new Beautiful Soup release or updating dependencies, I think that's an easy and intuitive way to check whether a tag has been decomposed, especially in the context where you specifically looked for table tags.

I may still need to change Beautiful Soup, since there's no way to tell whether a NagivableString has been decomposed, but since that's going to change the API I'd like to wait a bit.

@TomAugspurger
Copy link
Contributor

@alimcmaster1 do you know what all remains to be done on pandas' side here? Pushing off the 1.0 milestone, as I don't think there's anything critical, but I haven't been following closely.

@TomAugspurger TomAugspurger modified the milestones: 1.0, Contributions Welcome Jan 8, 2020
@alimcmaster1
Copy link
Member Author

Agree nothing critical @TomAugspurger i can follow up shortly with the fix @leonardr suggested and hence remove the xfail added in #30544

@leonardr
Copy link

leonardr commented Apr 5, 2020

FYI I just released Beautiful Soup 4.9.0 with the fix discussed in this issue.

@alimcmaster1
Copy link
Member Author

Awesome thanks vm!! @leonardr

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI Continuous Integration Testing pandas testing functions or related to the test suite Unreliable Test Unit tests that occasionally fail
Projects
None yet
Development

No branches or pull requests

7 participants