This repository has been archived by the owner on Oct 22, 2024. It is now read-only.

feature state and upgrade/downgrade testing #723

Merged
merged 12 commits into from
Sep 28, 2020
Conversation

pohly
Contributor

@pohly pohly commented Sep 2, 2020

Declaring the entire project as "alpha" is not correct anymore,
several features are considered stable and ready for
production. However, some may still be experimental. We need to
document this in more detail and ensure that we have corresponding tests.

Fixes: #631

@pohly
Contributor Author

pohly commented Sep 2, 2020

TODO:

  • get rid of the forked Kubernetes (issue update to Kubernetes 1.19.1 #722)
  • should the test really be called "version skew"? Probably not; it's more about downgrading and upgrading the entire driver. "Version skew" testing applies when there was a partial update (for example, controller from 0.6 and nodes still on 0.7).
  • actually test "version skew"?
  • decide which tests to run and what the Jenkins job timeouts should be

@pohly pohly requested a review from avalluri September 2, 2020 15:32
@pohly pohly force-pushed the version-skew branch 3 times, most recently from 0a0b700 to c41a07b Compare September 3, 2020 15:14
@pohly
Contributor Author

pohly commented Sep 3, 2020

should the test really be called "version skew"?

I've added actual skew testing, so the overall name of the new testsuite seems suitable. I kept it.

@pohly pohly force-pushed the version-skew branch 2 times, most recently from 4b8308c to fbca100 Compare September 4, 2020 14:29
@pohly
Contributor Author

pohly commented Sep 5, 2020

CI testing ran into an "interesting" error:

01:13:50.064 pmem-driver: WAITING: ImagePullBackOff - Back-off pulling image "intel/pmem-csi-driver:v0.7.19"

I suspect that was because the release had been tagged but the image hadn't been pushed yet by our CI. The logic that determines which release to test against must become smarter to account for that...

@pohly pohly force-pushed the version-skew branch 5 times, most recently from 54b9054 to 56dd110 Compare September 9, 2020 13:33
@pohly
Contributor Author

pohly commented Sep 9, 2020

The problem I asked about in acfeaf6#r436079052 was causing timeouts in the CI testing. I've reverted that commit.

@pohly pohly force-pushed the version-skew branch 5 times, most recently from 6b0682b to e6dcc22 Compare September 11, 2020 05:48
avalluri pushed a commit to avalluri/pmem-CSI that referenced this pull request Sep 12, 2020
This reverts commit acfeaf6.

As suspected during code review at the time, this change is
problematic because when the controller exits first, the nodes cannot
unregister. gRPC then tries to connect for up to 20 minutes before
eventually giving up (seen in CI testing of
intel#723), which slows down
testing enough that it times out.
@pohly pohly force-pushed the version-skew branch 4 times, most recently from c888bcf to 217a5d1 Compare September 15, 2020 17:39
avalluri pushed a commit to avalluri/pmem-CSI that referenced this pull request Sep 15, 2020
avalluri pushed a commit to avalluri/pmem-CSI that referenced this pull request Sep 15, 2020
pohly added a commit to pohly/pmem-CSI that referenced this pull request Sep 18, 2020
This reverts commit acfeaf6.

As suspected during code review at the time, this change is
problematic because when the controller exits first, the nodes cannot
unregister. gRPC then tries to connect for up to 20 seconds before
eventually giving up (seen in CI testing of
intel#723). When reinstalling the
driver a lot as in the version skew testing, that tends to add up.

Relying on graceful unregistration also doesn't help in other cases. A
better solution is to detect dead nodes on the controller side.
@avalluri
Contributor

@pohly We could revert 1a433b7. I see that is causing the operator to detect default values as changes.

@pohly
Contributor Author

pohly commented Sep 23, 2020

@pohly We could revert 1a433b7. I see that is causing the operator to detect default values as changes.

Let's discuss in #742 - this is not related to this PR.

avalluri pushed a commit that referenced this pull request Sep 23, 2020
@avalluri avalluri force-pushed the version-skew branch 2 times, most recently from ef190e8 to f9d8a0e Compare September 23, 2020 12:24
@pohly pohly force-pushed the version-skew branch 3 times, most recently from 66b52bc to 27d1494 Compare September 25, 2020 19:07
pohly added a commit to pohly/pmem-CSI that referenced this pull request Sep 26, 2020
This reverts commit acfeaf6.

As suspected during code review at the time, this change is
problematic because when the controller exits first, the nodes cannot
unregister. gRPC then tries to connect for up to 20 seconds before
eventually giving up (seen in CI testing of
#723). When reinstalling the
driver a lot as in the version skew testing, that tends to add up.

Relying on graceful unregistration also doesn't help in other cases. A
better solution is to detect dead nodes on the controller side.
It's common to have a valid deployment name and to want just the corresponding struct, without having to handle an error that can never occur.
Commit d9578b6
removed the @ before the shell command for debugging purposes; this
shouldn't have been committed.
When the YAML files were updated for Kubernetes 1.19,
external-provisioner was updated to 2.0.0 without also updating the
operator default. We don't have a test case for "operator default
matches YAML default" because the operator defaults always overwrite
the YAML values; this is hard to change, so we'll have to catch this
through code review.

What could have been caught was that the RBAC rules got out-of-sync,
except that object comparison was unnecessarily limited to just the
"spec" field of those objects which have it. Now almost all fields
are compared.

This highlighted that the RBAC rules in the operator were slightly
different than the ones from the reference YAML files. Now they are
identical.
quay.io has been replaced by k8s.gcr.io as the official registry for
the sidecars. The new external-provisioner v2.0.2 is only available
there. We want that new version because it avoids the unnecessary
"delete before detach" protection; that new feature may have been the
reason why volumes were not deleted during CI stress testing.
This has two facets:
- switching the entire driver deployment from one version to another
  while there is persistent state like active volumes
- combining components from different releases in a deployment,
  which can happen during a rolling update

Upgrade and downgrade testing is done in two directions: from 0.6 to
master and from master to 0.6. In both cases and for all deployment
methods and most volume types, some sanity checks are run:

- an unused volume must be deletable
- an unused volume must be usable for a pod
- a volume used by a pod can be removed

Different volume types are covered via test patterns, i.e. each volume
type goes through driver downgrade/upgrade separately. Persistent
filesystem types are covered in several varieties, including cache
volumes. Block and CSI inline volumes are only tested once.

This is a compromise between keeping individual tests small and
keeping the overall runtime acceptable, because reinstalling the
driver is slow. For the same reason, these tests don't run in our
pre-submit testing.

Skew testing is done by switching to an old release and replacing the
controller image.
Commit 5e996e4 introduced the usage of our exec helper code
into e2e.test. However, that didn't quite work yet and can be
improved:
- the log level must be increased to 5 to see the commands and their
  output
- the error from the helper code intentionally includes the command
  error output, so there is no need to repeat that in the error
  assertion
Mixing klog v1 and v2 did not work as seamlessly as expected: the
command line flag for the log level only affected klog v2, but not
v1. Using just v2 is cleaner anyway...
Enabling verbose logging showed that the initial cluster ready check
and pod logging were getting throttled by the default rate limiter in
the Kubernetes client. We probably don't want that and don't need to
worry about a runaway process or fairness, so we can avoid the
throttling and its associated log messages by setting a nop rate
limiter before creating the client.
Having "olm" as prefix was problematic because a `make test_e2e
TEST_E2E_FOCUS=operator.API.updating.provisionerImage.in.deployment.with.specific.values.*while.running`
then ran both variants of the tests (for "olm-operator" and
"operator"), without an easy way to limit testing to exactly one test
case. Anchoring the regex with ^ does not work because Ginkgo
internally puts some undocumented and hidden prefix before it.

It's better to avoid the issue by choosing distinct names for the
deployments.
With NodeCacheCapable=true in the Extender configuration, just the
node names are passed as arguments and expected in the results. Since
we don't need more than that, this mode is preferable because it is
more efficient. Logging gets streamlined for this mode.

The install instructions lacked documentation for Kubernetes 1.19 and
for setting up /var/lib/scheduler/scheduler-config.yaml.
This now really should be on the safe side.
@pohly pohly changed the title WIP: feature state and upgrade/downgrade testing feature state and upgrade/downgrade testing Sep 27, 2020
@pohly
Contributor Author

pohly commented Sep 27, 2020

@avalluri : both PR and branch testing have passed. Please review.

The PR contains more than just version skew testing because other things came up while working on this. I can pull out some changes into separate PRs if you want, but others then will cause merge conflicts.

Contributor

@avalluri avalluri left a comment


Looks good to me.

Successfully merging this pull request may close these issues.

remove "alpha" warning
2 participants