
Test various upgrade scenarios #1580

Merged (8 commits) on May 3, 2023

Conversation

@sbodagala (Contributor) commented Apr 10, 2023

Description

Write tests that cover the following scenarios:

  • Test status JSON in the context of version-incompatible upgrades.

  • A test that restarts multiple processes (a storage process and multiple stateless processes) during the staging phase.

  • A test that checks the cluster generation number during an upgrade (a sketch of this check follows the list).
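
To make the generation check concrete, here is a minimal sketch of what such an assertion could look like in the Ginkgo/Gomega style of the e2e suite. The fixture helpers (fdbCluster.GetCluster, fdbCluster.UpgradeCluster) and the suite-level variables (targetVersion, expectedGenerationBumps) are illustrative assumptions, not this PR's actual code:

```go
// Hypothetical sketch of a cluster-generation check during an upgrade.
// fdbCluster, targetVersion, and expectedGenerationBumps are assumed
// suite-level fixtures/variables, used here for illustration only.
It("keeps the cluster generation within the expected bound", func() {
	// Record metadata.generation before triggering the upgrade.
	initialGeneration := fdbCluster.GetCluster().Generation

	// Trigger the version upgrade and wait for reconciliation.
	Expect(fdbCluster.UpgradeCluster(targetVersion, true)).NotTo(HaveOccurred())

	// The API server bumps metadata.generation on every spec change, so
	// an upgrade should produce a small, predictable number of bumps; a
	// large jump points at unintended spec churn during the upgrade.
	Expect(fdbCluster.GetCluster().Generation).To(
		BeNumerically("<=", initialGeneration+expectedGenerationBumps))
})
```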

Type of change

  • Other (adds an upgrade test; no code changes)

Discussion

Are there any design details that you would like to discuss further?
No

Testing

Ran the test manually.

Documentation

Did you update relevant documentation within this repository?
N/A

If this change is adding new functionality, do we need to describe it in our user manual?
N/A

If this change is adding or removing subreconcilers, have we updated the core technical design doc to reflect that?
N/A

If this change is adding new safety checks or new potential failure modes, have we documented how to debug potential issues?
N/A

Follow-up

Are there any follow-up issues that we should pursue in the future?
No

Does this introduce new defaults that we should re-evaluate in the future?
No

@sbodagala sbodagala requested a review from johscheuer April 10, 2023 14:51
@sbodagala (Contributor, Author)

Ran the test manually, and it failed with this error: "invariant InvariantClusterStatusAvailableWithThreshold failed".

It appears to me that the upgrade itself completed successfully. Here's the output of "kubectl-fdb analyze" on the cluster after the test failed with the above error:

kubectl-fdb analyze fdb-cluster-cz8blukk -n sre-s3xfe49r

Checking cluster: sre-s3xfe49r/fdb-cluster-cz8blukk
✔ Cluster is available
✔ Cluster is fully replicated
✔ Cluster is reconciled
✔ ProcessGroups are all in ready condition
✔ Pods are all running and available
Checking cluster: sre-s3xfe49r/fdb-cluster-cz8blukk with auto-fix: false
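
For context, InvariantClusterStatusAvailableWithThreshold presumably polls the cluster's machine-readable status for the duration of the test and only fails when the database stays unavailable for longer than an allowed window. A rough, self-contained sketch of that idea (an illustration of the concept, not the e2e suite's actual implementation):

```go
import (
	"errors"
	"time"
)

// checkAvailableWithThreshold polls cluster availability for the length
// of the test and fails only if the database stays unavailable longer
// than the allowed window. Sketch of the concept; not the suite's code.
func checkAvailableWithThreshold(isAvailable func() bool, testDuration, pollInterval, threshold time.Duration) error {
	var unavailableSince time.Time
	deadline := time.Now().Add(testDuration)

	for time.Now().Before(deadline) {
		if isAvailable() {
			// Available again: reset the unavailability window.
			unavailableSince = time.Time{}
		} else {
			if unavailableSince.IsZero() {
				unavailableSince = time.Now()
			}
			// Short blips (e.g. a recovery while processes restart during
			// the staging phase) are tolerated; only sustained
			// unavailability violates the invariant.
			if time.Since(unavailableSince) > threshold {
				return errors.New("cluster unavailable longer than threshold")
			}
		}
		time.Sleep(pollInterval)
	}

	return nil
}
```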

@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: dc0ef82
  • Duration 4:10:42
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: eeef335
  • Duration 4:11:17
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 7736889
  • Duration 3:08:27
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@sbodagala sbodagala changed the title Test upgrading a cluster when no storage processes are restarted Test various upgrade scenarios Apr 21, 2023
@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 2f560b6
  • Duration 3:06:47
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 8f32426
  • Duration 4:10:06
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@johscheuer (Member) left a comment

Could we split the tests up into separate PRs? Otherwise a single failing test will block the whole PR from being merged.

e2e/test_operator_upgrades/operator_upgrades_test.go (4 review threads; outdated, resolved)
@sbodagala (Contributor, Author)

Uploaded the latest version, please take a look. Thanks!

@johscheuer (Member) left a comment

Changes look fine to me; let's wait for the test result 👍

@johscheuer johscheuer self-requested a review April 28, 2023 15:51
@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: e379366
  • Duration 2:50:59
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@sbodagala (Contributor, Author)

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: e379366
  • Duration 2:50:59
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

Reports the following failures:

operator_ha_upgrade_test.go:

Test "upgrading a cluster with operator pod chaos and without foundationdb pod chaos"" failed because "Cluster.Generation" after upgrade is 37, instead of 19.

NOTE: I modified the "Upgrading a multi-DC cluster without chaos" test to check "Cluster.Generation" after the upgrade; again, "Cluster.Generation" is 34 instead of 19.
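
For reference, the Kubernetes API server increments metadata.generation on every spec update, so landing at 37 instead of an expected 19 implies roughly 18 extra spec writes during the upgrade. A hypothetical debugging helper that logs each bump while a test runs, to see where the extra writes come from (the getCluster callback stands in for the suite's fixture; fdbv1beta2 is the operator's API package):

```go
import (
	"log"
	"time"

	fdbv1beta2 "github.com/FoundationDB/fdb-kubernetes-operator/api/v1beta2"
)

// traceGenerationBumps polls the FoundationDBCluster object and logs
// every metadata.generation bump, to show which steps rewrite the spec
// during an upgrade. Hypothetical helper, not part of this PR.
func traceGenerationBumps(getCluster func() *fdbv1beta2.FoundationDBCluster, done <-chan struct{}) {
	var last int64
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-done:
			return
		case <-ticker.C:
			if gen := getCluster().Generation; gen != last {
				log.Printf("cluster generation changed: %d -> %d", last, gen)
				last = gen
			}
		}
	}
}
```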

operator_upgrades_test.go:

Test "upgrading a cluster where a storage and multiple stateless processes get restarted during the staging phase Upgrade" failed on this error:

<*errors.errorString | 0xc0006c1220>: {
s: "invariant InvariantClusterStatusAvailableWithThreshold failed",
}
invariant InvariantClusterStatusAvailableWithThreshold failed

@sbodagala (Contributor, Author)

More on the failure in operator_ha_upgrade_test.go: I don't see any helpful information in the test output, but I do see recoveries (on Splunk) seconds apart (for example, 2023-05-01T16:01:30Z and 2023-05-01T16:01:32Z; these are the timestamps of Type "MasterRecoveryState" events with "Status: reading_coordinated_state"). I think this is the result of server processes getting bounced at different, but relatively close, timestamps.

@sbodagala (Contributor, Author)

Test "upgrading a cluster where a storage and multiple stateless processes get restarted during the staging phase Upgrade" failed on this error:

<*errors.errorString | 0xc0006c1220>: {
s: "invariant InvariantClusterStatusAvailableWithThreshold failed",
}
invariant InvariantClusterStatusAvailableWithThreshold failed

I ran this test locally multiple times, and all runs succeeded, so the failure reported by CI might not be related to this specific test.

@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 2b65a07
  • Duration 4:10:22
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@johscheuer johscheuer closed this May 2, 2023
@johscheuer johscheuer reopened this May 2, 2023
@johscheuer (Member)

Seems like the last test run hit some issues; I'll try another run.

@johscheuer (Member) left a comment

Once the e2e test pipeline passes, we can merge this PR 👍

@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 2b65a07
  • Duration 3:56:49
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@johscheuer (Member) commented May 2, 2023

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 2b65a07
  • Duration 3:56:49
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
Summarizing 1 Failure:
  [FAIL] Operator Upgrades [AfterEach] upgrading a cluster where a storage and multiple stateless processes get restarted during the staging phase Upgrade from 6.3.25 to 7.1.27 with a storage and multiple stateless processes restarted during the staging phase [e2e]
  /codebuild/output/src828282037/src/github.com/FoundationDB/fdb-kubernetes-operator/e2e/fixtures/fixtures.go:62

The failure is:

2023/05/02 10:16:22 reconciled name=fdb-cluster-p5yg7xfw, namespace=test-operator-upgrades-845-1w5pkh73
  [FAILED] in [AfterEach] - /codebuild/output/src828282037/src/github.com/FoundationDB/fdb-kubernetes-operator/e2e/fixtures/fixtures.go:62 @ 05/02/23 10:16:37.822
• [FAILED] [748.764 seconds]
Operator Upgrades [AfterEach] upgrading a cluster where a storage and multiple stateless processes get restarted during the staging phase Upgrade from 6.3.25 to 7.1.27 with a storage and multiple stateless processes restarted during the staging phase [e2e]
  [AfterEach] /codebuild/output/src828282037/src/github.com/FoundationDB/fdb-kubernetes-operator/e2e/test_operator_upgrades/operator_upgrades_test.go:103
  [It] /codebuild/output/src828282037/src/github.com/FoundationDB/fdb-kubernetes-operator/e2e/fixtures/upgrade_test_configuration.go:104

  [FAILED] Unexpected error:
      <*errors.errorString | 0xc000098090>: {
          s: "invariant InvariantClusterStatusAvailableWithThreshold failed",
      }
      invariant InvariantClusterStatusAvailableWithThreshold failed
  occurred
  In [AfterEach] at: /codebuild/output/src828282037/src/github.com/FoundationDB/fdb-kubernetes-operator/e2e/fixtures/fixtures.go:62 @ 05/02/23 10:16:37.822
------------------------------

@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: e8cd310
  • Duration 2:43:26
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
