Test various upgrade scenarios #1580
Conversation
Ran the test manually, and it failed with this error: "invariant InvariantClusterStatusAvailableWithThreshold failed". It appears to me that the upgrade completed successfully. Here's the output of "kubectl-fdb analyze" against the cluster after the test failed with the above error:
Result of fdb-kubernetes-operator-pr on Linux CentOS 7
Could we split the tests up into separate PRs? Otherwise a single test will block the whole PR from being merged.
Uploaded the latest version; please take a look. Thanks!
Changes look fine to me; let's wait for the test result 👍
Result of fdb-kubernetes-operator-pr on Linux CentOS 7
Reports the following failures:
operator_ha_upgrade_test.go: the test "upgrading a cluster with operator pod chaos and without foundationdb pod chaos" failed because "Cluster.Generation" after the upgrade is 37 instead of 19. NOTE: I modified the "Upgrading a multi-DC cluster without chaos" test to check "Cluster.Generation" after the upgrade; again, "Cluster.Generation" is 34 instead of 19.
operator_upgrades_test.go: the test "upgrading a cluster where a storage and multiple stateless processes get restarted during the staging phase Upgrade" failed on this error:
More on the failure in operator_ha_upgrade_test.go: I don't see any helpful information in the test output, but I do see recoveries (on Splunk) seconds apart (for example at 2023-05-01T16:01:30Z and 2023-05-01T16:01:32Z; these are the timestamps of "MasterRecoveryState" events with "Status: reading_coordinated_state"). I think this is the result of server processes getting bounced at different but relatively close timestamps.
Ran this test locally multiple times, and all runs succeeded. So the failure reported by CI might not be related to this specific test.
Result of fdb-kubernetes-operator-pr on Linux CentOS 7
Seems like the last test run hit some issues. I'll try another run.
Once the e2e test pipeline passes, we can merge this PR 👍
Result of fdb-kubernetes-operator-pr on Linux CentOS 7
The failure is:
Result of fdb-kubernetes-operator-pr on Linux CentOS 7
Description
Write tests that cover the following scenarios:
Test status json in the context of version incompatible upgrades.
Test that restarts multiple processes (a storage process and multiple stateless processes) during the staging phase.
Test that checks the cluster generation number during an upgrade (see the sketch below).
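To make the last scenario a bit more concrete, a cluster-generation check after an upgrade could look roughly like the sketch below. This is only a sketch, not the code added by this PR: it uses a plain controller-runtime client with Ginkgo/Gomega instead of the repository's own e2e fixtures, and k8sClient, clusterName, namespace, and expectedGeneration are placeholders that the real suite would wire up itself.

```go
package upgrades_test

import (
	"context"
	"testing"

	fdbv1beta2 "github.com/FoundationDB/fdb-kubernetes-operator/api/v1beta2"
	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Placeholders for illustration only; the real e2e suite wires these up
// through its own fixtures.
var (
	k8sClient          client.Client
	clusterName        string
	namespace          string
	expectedGeneration int64
)

func TestClusterGenerationDuringUpgrade(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "cluster generation during an upgrade")
}

var _ = Describe("cluster generation during an upgrade", func() {
	It("matches the expected generation after the upgrade finishes", func() {
		cluster := &fdbv1beta2.FoundationDBCluster{}
		Expect(k8sClient.Get(context.Background(), types.NamespacedName{
			Name:      clusterName,
			Namespace: namespace,
		}, cluster)).NotTo(HaveOccurred())

		// Every spec change bumps metadata.generation, so a value far above
		// the expected one means the spec was rewritten more often than the
		// upgrade alone should require (compare the 37-vs-19 result above).
		Expect(cluster.ObjectMeta.Generation).To(BeNumerically("==", expectedGeneration))
	})
})
```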
Type of change
Discussion
Are there any design details that you would like to discuss further?
No
Testing
Ran the tests manually.
Documentation
Did you update relevant documentation within this repository?
N/A
If this change is adding new functionality, do we need to describe it in our user manual?
N/A
If this change is adding or removing subreconcilers, have we updated the core technical design doc to reflect that?
N/A
If this change is adding new safety checks or new potential failure modes, have we documented how to debug potential issues?
N/A
Follow-up
Are there any follow-up issues that we should pursue in the future?
No
Does this introduce new defaults that we should re-evaluate in the future?
No