Separate multi-node integ tests to their own job #458
Signed-off-by: Daniel Widdis <[email protected]>
Codecov Report
All modified and coverable lines are covered by tests ✅

@@            Coverage Diff            @@
##               main     #458   +/-   ##
=========================================
  Coverage     71.88%   71.88%
  Complexity      620      620
=========================================
  Files            78       78
  Lines          3126     3126
  Branches        236      236
=========================================
  Hits           2247     2247
  Misses          772      772
  Partials        107      107
Thanks for digging into this @dbwiddis. Great RCA.
Interesting, I had thought we were preventing the auto-redeployment of ML models via the cluster settings here during setup.
Maybe that was a recent change that I missed when spelunking old logs. But recovering indices and cluster state was happening and I'm pretty sure it impacted the timing of node startup.
Signed-off-by: Daniel Widdis <[email protected]> (cherry picked from commit e2fdc10) Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Separate multi-node integ tests to their own job (#458) (cherry picked from commit e2fdc10) Signed-off-by: Daniel Widdis <[email protected]> Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Joshua Palis <[email protected]>
Description
Multi-node integration tests were set to run on the same server that the single-node integration tests had just run on. On cluster startup, the nodes were reading the previous cluster state from the (dropped) single node and using the same stored indices.
Additionally, ML Commons recently made the default true for a feature that auto-deploys ML models after a node drop and recovery, so these recoveries were occurring.
While there is no guarantee that separating this out will fix the flaky macOS tests, I've seen enough hints in the logs to suggest that leftover bits from previous tests are slowing startup and potentially contributing to test failures.
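To illustrate the leftover-state problem described above, here is a toy sketch in Python. It is not the project's test harness: `run_suite`, the `.state` files, and the directory names are all hypothetical stand-ins for a cluster's data directory. It only shows that a suite reusing a directory finds the previous suite's files at startup, while a suite with its own directory (analogous to its own job) finds nothing to recover.

```python
import tempfile
from pathlib import Path


def run_suite(name, data_dir):
    """Simulate a test suite that starts a cluster on data_dir (hypothetical helper).

    Returns the file names found at startup, i.e. what a node would try to
    recover, and then writes this suite's own state file.
    """
    data_dir.mkdir(parents=True, exist_ok=True)
    recovered = sorted(p.name for p in data_dir.iterdir())
    (data_dir / f"{name}.state").write_text("cluster state")
    return recovered


with tempfile.TemporaryDirectory() as root:
    # Shared directory: the multi-node run sees the single-node leftovers.
    shared = Path(root) / "shared"
    run_suite("single-node", shared)
    leftovers_shared = run_suite("multi-node", shared)

    # Separate directories, analogous to separate jobs: nothing to recover.
    run_suite("single-node", Path(root) / "job-a")
    leftovers_isolated = run_suite("multi-node", Path(root) / "job-b")

print(leftovers_shared)    # ['single-node.state']
print(leftovers_isolated)  # []
```

In the shared case the second suite starts with state it did not create, which is the kind of recovery work that the logs suggested was slowing node startup.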
Issues Resolved
Might resolve flaky tests. At least will cut down on the debugging noise.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.