adjust test_parallel bound #3281
Conversation
Codecov Report

```
@@ Coverage Diff @@
##           develop    #3281      +/-  ##
===========================================
- Coverage    81.43%   79.37%    -2.07%
===========================================
  Files          122       68      -54
  Lines        21052    11781    -9271
===========================================
- Hits         17144     9351    -7793
+ Misses        3908     2430    -1478
```

Continue to review full report at Codecov.
@piskvorky Please let me know if this looks good to you.
I appreciate the extra rigor & attention! Still, my concerns from #3278 apply. The CI virtual-machines/containers/whatever have a historically higher propensity to trigger even rare, threading-related failures. Even a 99% or 99.9% pass rate, on a single test, will be a recurring annoyance to people who suffer random failures after making unrelated changes. And we have multiple tests with similar risks - so (for example) if we were to tune 10 independent tests to each have "99.5%" reliability, running the full test suite would still fail 5% of the time. So: if rigorously tuning individual tests (which is a great idea!), the target should be something like 99.999% reliability for each - ideally as confirmed on the CI systems themselves.
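The suite-level arithmetic in the comment above can be checked with a short sketch (0.995 and 10 are the comment's illustrative numbers, not measured values):

```python
def suite_pass_rate(per_test_pass_rate: float, num_tests: int) -> float:
    """Chance that a suite of independent flaky tests all pass in one run."""
    return per_test_pass_rate ** num_tests

# Ten independent tests, each tuned to "99.5%" reliability: the full
# suite still fails roughly 5% of the time, matching the estimate above.
suite_failure = 1 - suite_pass_rate(0.995, 10)   # ~0.049
```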
@gojomo Thanks for your feedback. I ran the test locally 10,000 times. The bound we suggested (5) has a pass rate of 0.9963. We can achieve a 0.9999 pass rate if we set the bound to 8. Do you think this is a reasonable pass rate for this test?
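As an aside on how precisely 10,000 runs pin down that 0.9999 estimate, here is a minimal sketch using the standard Wilson score interval for a binomial proportion (the 9,999-of-10,000 split is inferred from the quoted pass rate and is an assumption):

```python
import math

def wilson_interval(passes: int, runs: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for an observed pass rate."""
    p = passes / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return center - half, center + half

# 9,999 passes out of 10,000 local runs (the bound-of-8 result):
lo, hi = wilson_interval(9999, 10000)
```

The interval comes out roughly (0.9994, 1.0), i.e. 10,000 runs cannot distinguish 99.99% reliability from the 99.999% target discussed above.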
@austereantelope I believe the underlying concern is capturing variance that's outside of merely re-running the tests more times, in a fixed environment. For instance, I'm sure the existing tests worked reliably well on the computers of the people who originally wrote them. But then the tests gave us trouble when deployed elsewhere, such as the CI system, with different OSes / CPUs / thread scheduling / Travis vs GitHub Actions / whatever. And then we had to add hacks to relax their bounds or skip them altogether. Or make the test data and/or algorithm genuinely more robust to such variance (which can be a hard academic task, way more work). To be clear, your PR here is interesting and a net benefit to Gensim already, in my estimation. The worry is just that something similar may happen again.
@piskvorky I agree that the underlying variability we see for this test is due to the parallelism (the test is called test_parallel) and it behaves differently on different machines (developer vs CI). However, the current bound of 2 is too tight even on my development machine. Unfortunately I don't think we can feasibly test what the appropriate bound should be on CI (running 100,000 times across all possible CI configurations). The bound I suggested in the initial PR (5, pass rate of 0.9963) or after running it locally 10,000 times (8, pass rate of 0.9999) might be a good middle ground for now, and I can continue monitoring CI failures to see if we need to further loosen the bound.
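The feasibility point above can be made concrete with the "rule of three": after n consecutive passing runs with zero failures, an approximate 95% upper bound on the failure probability is 3/n, so confirming very high reliability requires correspondingly many clean runs.

```python
import math

def runs_needed(target_failure_rate: float) -> int:
    """Rule of three: ~how many consecutive clean runs are needed so that
    a 95% upper bound on the failure rate drops to the target level."""
    return math.ceil(3 / target_failure_rate)

runs_needed(1e-5)   # a 99.999% target needs on the order of 300,000 runs
runs_needed(1e-3)   # a 99.9% target needs on the order of 3,000 runs
```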
Merging. Thank you @austereantelope ! |
Hi,

A few days ago I created PR #3278, where I proposed tightening `test_parallel`'s assertion bound (`self.assertLess(neighbor_rank, 20)`) from `20` to `2`. I verified the new bound locally by running the test 500 times to make sure the test is not flaky (>99% pass rate). However, after the PR was merged I noticed some CI runs failing on this test. After taking a closer look at the test and my experiment script, I realized the issue is caused by the parallelism in the test: the test trains with 4 workers, and the assertion value can be affected by that parallelism. My script ran each test under `taskset -c coreAffinity`, which limits the test to a single core and therefore misses the variance introduced by using 4 workers.

I re-ran the experiment without `taskset` to get a more accurate bound for the test. Below I include the experiment results.

To improve this test, I propose adjusting the bound to `5` (the blue dotted line). The new bound has a catch rate of 0.78 (a +0.2 increase compared to the original bound of `20`) while still keeping a >99% pass rate (the test is not flaky: I ran the updated test 500 times and observed a >99% pass rate). I think this is a good balance between improving the bug-detection ability of the test and keeping its flakiness low.

@piskvorky @gojomo @mpenkov Do you think this makes sense? Please let me know if this looks good or if you have any other suggestions.
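As a toy illustration of the catch-rate vs. pass-rate tradeoff weighed above (the `neighbor_rank` samples below are fabricated for illustration only, not the experiment's measurements):

```python
def rates(bound, good_ranks, bad_ranks):
    """Fraction of correct runs that pass, and of buggy runs that are caught."""
    pass_rate = sum(r < bound for r in good_ranks) / len(good_ranks)
    catch_rate = sum(r >= bound for r in bad_ranks) / len(bad_ranks)
    return pass_rate, catch_rate

# Fabricated example data (NOT the thread's measurements):
good_ranks = [0, 0, 1, 1, 2, 3, 4]      # neighbor_rank on correct code
bad_ranks = [3, 6, 10, 15, 25, 40, 80]  # neighbor_rank on a hypothetical bug

for bound in (2, 5, 8, 20):
    print(bound, rates(bound, good_ranks, bad_ranks))
```

Tightening the bound raises the catch rate but lowers the pass rate; the proposal above picks `5` as the balance point.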