Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adjust test_parallel bound #3281

Merged
merged 2 commits into from
Mar 22, 2022

Conversation

austereantelope
Copy link
Contributor

Hi,

A few days ago I created a PR #3278 where I proposed to tighten the bound of test_parallel's assertion bound (self.assertLess(neighbor_rank, 20)) from 20 to 2. I verified the new bound locally by running the test 500 times to make sure the test is not flaky (>99 % pass rate).

However after the PR is merged I noticed some CI runs are failing on this test. After taking a closer look at the test and my experiment script I realized the issue is caused by the parallelism in the test. The test uses 4 workers for training and the assertion value could be affected by the parallelism. However in my script I run each test using taskset -c coreAffinity which limits the test to use only one core which will miss the variance in the test caused by using 4 workers.

I re-ran the experiment without using taskset for a more accurate bound for the test. Below I include the experiment results

To improve this test, I propose to adjust the bound to 5 (the blue dotted line). The new bound has a catch rate of 0.78 (+0.2 increase compare to original bound of 20) while still has >99 % pass rate (test is not flaky, I ran the updated test 500 times and observed >99 % pass rate). I think this is a good balance between improving the bug-detection ability of the test while keep the flakiness of the test low.

@piskvorky @gojomo @mpenkov Do you guys think this makes sense? Please let me know if this looks good or if you have any other suggestions.

@codecov
Copy link

codecov bot commented Dec 25, 2021

Codecov Report

Merging #3281 (de90480) into develop (ac3bbcd) will decrease coverage by 2.06%.
The diff coverage is n/a.

❗ Current head de90480 differs from pull request most recent head 4934788. Consider uploading reports for the commit 4934788 to get more accurate results

@@             Coverage Diff             @@
##           develop    #3281      +/-   ##
===========================================
- Coverage    81.43%   79.37%   -2.07%     
===========================================
  Files          122       68      -54     
  Lines        21052    11781    -9271     
===========================================
- Hits         17144     9351    -7793     
+ Misses        3908     2430    -1478     
Impacted Files Coverage Δ
gensim/models/ldamulticore.py 75.29% <0.00%> (-15.62%) ⬇️
gensim/scripts/glove2word2vec.py 76.19% <0.00%> (-7.15%) ⬇️
gensim/corpora/wikicorpus.py 93.75% <0.00%> (-1.04%) ⬇️
gensim/matutils.py 77.23% <0.00%> (-0.90%) ⬇️
gensim/similarities/docsim.py 23.95% <0.00%> (-0.76%) ⬇️
gensim/models/rpmodel.py 89.47% <0.00%> (-0.53%) ⬇️
gensim/utils.py 71.86% <0.00%> (-0.12%) ⬇️
gensim/corpora/dictionary.py 94.17% <0.00%> (-0.09%) ⬇️
gensim/models/hdpmodel.py 71.27% <0.00%> (-0.08%) ⬇️
gensim/models/ldamodel.py 87.27% <0.00%> (-0.07%) ⬇️
... and 90 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 546de20...4934788. Read the comment docs.

@austereantelope
Copy link
Contributor Author

@piskvorky Please let me know if this looks good to you

@gojomo
Copy link
Collaborator

gojomo commented Jan 12, 2022

I appreciate the extra rigor & attention!

Still, my concerns from #3278 apply. The CI virtual-machines/containers/whatever have historically higher propensity to trigger even rare, threading-related failures. Even a 99% or 99.9% pass rate, on a single test, will be a recurring annoyance to people who suffer random failures after making unrelated changes. And, we have multiple tests with similar risks - so (for example) if we were to tune 10 independent tests to each have "99.5%" reliability, running the full test suite would still fail 5% of the time.

So: if rigorously tuning individual tests (which is a great idea!), the target should be something like 99.999% reliability for each - ideally as confirmed on the CI systems themselves.

@austereantelope
Copy link
Contributor Author

@gojomo Thanks for your feedback. I ran the test locally 10,000 times. The bound we suggested (5) has a pass rate of 0.9963. We can achieve 0.9999 pass rate if we set the bound to be 8. Do you think this is a reasonable pass rate for this test?

@piskvorky
Copy link
Owner

piskvorky commented Jan 14, 2022

@austereantelope I believe the underlying concern is capturing variance that's outside of merely re-running the tests more times, in a fixed environment.

For instance, I'm sure the existing tests worked reliably well on the computers of the people who originally wrote them. But then the tests gave us trouble when deployed elsewhere, such as the CI system, with different OSes / CPUs / thread scheduling / Travis vs Github Actions / whatever. And then we had to add hacks to relax their bounds or skip them altogether.

Or make the test data and/or algorithm genuinely more robust to such variance (can be a hard academic task, way more work).

To be clear, your PR here is interesting and a net benefit to Gensim already, in my estimation. The worry is just something similar may happen again.

@austereantelope
Copy link
Contributor Author

@piskvorky I agree that the underlying variability we see for this test is due to the parallelism (the test is called test_parallel) and it behaves differently for different machines (developer vs CI). However currently the bound of 2 is too tight even on my development machine. Unfortunately I don't think we can feasibility test what the approriate bound should be on CI (run 100,000 times across all possible CI configurations).

The bound I suggested in the initial PR (5, pass rate of 0.9963) or after running it locally 10,000 times (8, pass rate of 0.9999) might be a good middle ground for now, and I can continue monitor some CI failures to see if we need to further loosen the bound.

@piskvorky piskvorky requested a review from mpenkov January 15, 2022 05:54
@piskvorky piskvorky added this to the Next release milestone Feb 18, 2022
@mpenkov
Copy link
Collaborator

mpenkov commented Mar 22, 2022

Merging. Thank you @austereantelope !

@mpenkov mpenkov merged commit acd8308 into piskvorky:develop Mar 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants