feature state and upgrade/downgrade testing #723
Force-pushed from 0a0b700 to c41a07b.
I've added actual skew testing, so the overall name of the new test suite seems suitable. I kept it.
Force-pushed from 4b8308c to fbca100.
CI testing ran into an "interesting" error:
I suspect that was because the release had been tagged, but the image wasn't pushed yet by our CI. Determining which release to test against must be made more intelligent to account for that...
Force-pushed from 54b9054 to 56dd110.
The problem I asked about in acfeaf6#r436079052 was causing timeouts in the CI testing. I've reverted that commit.
Force-pushed from 6b0682b to e6dcc22.
This reverts commit acfeaf6. As suspected during code review at the time, this change is problematic because when the controller exits first, the nodes cannot unregister. gRPC then tries to connect for up to 20 minutes before eventually giving up (seen in CI testing of intel#723), which slows down testing enough that it times out.
Force-pushed from c888bcf to 217a5d1.
This reverts commit acfeaf6. As suspected during code review at the time, this change is problematic because when the controller exits first, the nodes cannot unregister. gRPC then tries to connect for up to 20 seconds before eventually giving up (seen in CI testing of #723). When reinstalling the driver a lot as in the version skew testing, that tends to add up. Relying on graceful unregistration also doesn't help in other cases. A better solution is to detect dead nodes on the controller side.
Force-pushed from ef190e8 to f9d8a0e.
Force-pushed from 66b52bc to 27d1494.
It's common to have a valid deployment name and to just want the corresponding struct, without having to worry about an error that can never occur.
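A minimal sketch of such a helper, assuming a hypothetical `Parse`/`MustParse` pair (names and fields invented for illustration, not the actual PMEM-CSI code):

```go
package deploy

import "fmt"

// Deployment is a placeholder for the struct that describes a deployment.
type Deployment struct {
	Name string
	// ... other fields describing the deployment mode
}

// Parse stands in for the real parsing logic, which can fail for
// arbitrary input strings.
func Parse(name string) (*Deployment, error) {
	// ... real parsing logic lives here
	return &Deployment{Name: name}, nil
}

// MustParse returns the Deployment for a name that is known to be valid.
// It panics on errors, which by construction cannot happen for such names,
// so callers don't have to handle an error that can never occur.
func MustParse(name string) *Deployment {
	d, err := Parse(name)
	if err != nil {
		panic(fmt.Sprintf("unexpected error parsing deployment name %q: %v", name, err))
	}
	return d
}
```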
Commit d9578b6 removed the @ before the shell command for debugging purposes; this shouldn't have been committed.
When the YAML files were updated for Kubernetes 1.19, external-provisioner was updated to 2.0.0 without also updating the operator default. We don't have a test case for "operator default matches YAML default" because the operator defaults always overwrite the YAML values; this is hard to change, so it will have to be caught in code review. What could have been caught automatically was that the RBAC rules had gotten out of sync, except that object comparison was unnecessarily limited to just the "spec" field of those objects which have one. Now almost all fields are compared. This highlighted that the RBAC rules in the operator differed slightly from the ones in the reference YAML files. Now they are identical.
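As a hedged illustration of that kind of comparison (function name and the exact field list are assumptions, not the actual operator test code), the idea is to compare whole objects and only drop fields that the API server populates:

```go
package validate

import (
	"reflect"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// objectsMatch compares two API objects on (almost) all fields, not just
// "spec", while ignoring metadata that only the API server sets and that
// therefore can never match the reference YAML.
func objectsMatch(expected, actual *unstructured.Unstructured) bool {
	e := expected.DeepCopy()
	a := actual.DeepCopy()
	for _, obj := range []*unstructured.Unstructured{e, a} {
		unstructured.RemoveNestedField(obj.Object, "metadata", "resourceVersion")
		unstructured.RemoveNestedField(obj.Object, "metadata", "uid")
		unstructured.RemoveNestedField(obj.Object, "metadata", "creationTimestamp")
		unstructured.RemoveNestedField(obj.Object, "metadata", "managedFields")
		unstructured.RemoveNestedField(obj.Object, "status")
	}
	return reflect.DeepEqual(e.Object, a.Object)
}
```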
quay.io has been replaced by k8s.gcr.io as the official registry for the sidecars. The new external-provisioner v2.0.2 is only available there. We want that new version because it avoids the unnecessary "delete before detach" protection; that protection may have been the reason why volumes were not deleted during CI stress testing.
This has two facets:
- switching the entire driver deployment from one version to another while there is persistent state like active volumes
- combining components from different releases in a deployment, which can happen during a rolling update

Upgrade and downgrade testing is done in two directions: from 0.6 to master and from master to 0.6. In both cases, for all deployment methods and most volume types, some sanity checks are run (see the sketch below):
- an unused volume must be deletable
- an unused volume must be usable for a pod
- a volume used by a pod can be removed

Different volume types are covered via test patterns, i.e. each volume type goes through driver downgrade/upgrade separately. Persistent filesystem types are covered in several varieties, including cache volumes. Block and CSI inline volumes only get tested once. This is a compromise between keeping tests small and overall runtime, because reinstalling the driver is slow. Therefore these tests also don't run in our pre-submit testing.

Skew testing is done by switching to an old release and replacing the controller image.
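Schematically, the per-volume-type pattern might look like the following sketch; all helper names are hypothetical placeholders, not the actual test code:

```go
// testUpgradeDowngrade is a schematic sketch only: installDriver, createVolume
// and the check* helpers stand in for the real deployment and sanity-check code.
// In the real tests, block and CSI inline volumes go through this cycle only once.
func testUpgradeDowngrade(from, to string, volumeTypes []string) {
	for _, vt := range volumeTypes {
		installDriver(from)     // e.g. the 0.6 release
		vol := createVolume(vt) // persistent state that must survive the switch
		installDriver(to)       // e.g. the current development build

		checkUnusedVolumeDeletable(vol)
		checkUnusedVolumeUsableByPod(vt)
		checkVolumeRemovableWhileUsedByPod(vt)
	}
}
```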
Commit 5e996e4 introduced the usage of our exec helper code into e2e.test. However, that didn't quite work yet and can be improved:
- the log level must be increased to 5 to see the commands and their output
- the error from the helper code intentionally includes the command's error output, so there is no need to repeat that in the error assertion
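A hedged sketch of the resulting call site (the helper is passed in as a placeholder because its real name and package are not shown here):

```go
package example

import "k8s.io/kubernetes/test/e2e/framework"

// applyDeployment illustrates the second point: runCommand stands in for the
// shared exec helper, whose returned error already embeds the command output.
func applyDeployment(runCommand func(cmd string, args ...string) (string, error)) {
	// Log level >= 5 is needed to see the command and its output in the logs.
	_, err := runCommand("kubectl", "apply", "-f", "deploy.yaml")
	// No need to repeat the output in the assertion; a short description suffices.
	framework.ExpectNoError(err, "kubectl apply")
}
```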
Mixing klog v1 and v2 did not work as seamlessly as expected: the command line flag for the log level only affected klog v2, but not v1. Using just v2 is cleaner anyway...
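A minimal sketch of a klog/v2-only setup, assuming flags are registered on the default flag set:

```go
package main

import (
	"flag"

	"k8s.io/klog/v2"
)

func main() {
	// Register klog/v2's flags (-v, -logtostderr, ...) so that the command
	// line log level really controls all log output; no klog v1 import remains.
	klog.InitFlags(nil) // nil registers on flag.CommandLine
	flag.Parse()

	klog.V(5).Info("verbose output enabled")
}
```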
Enabling verbose logging showed that the initial cluster ready check and pod logging were getting throttled by the default rate limiter in the Kubernetes client. We probably don't want that and don't need to worry about a runaway process or fairness, so we can avoid the throttling and its associated log messages by setting a nop rate limiter before creating the client.
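A minimal sketch of that approach with client-go, assuming the clientset is created from a rest.Config:

```go
package example

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/util/flowcontrol"
)

// newUnthrottledClient creates a clientset whose requests are never delayed
// by client-side rate limiting, so the cluster-ready polling and pod log
// streaming are not throttled and no throttling messages get logged.
func newUnthrottledClient(config *rest.Config) (kubernetes.Interface, error) {
	cfg := rest.CopyConfig(config)
	// The "fake always" rate limiter accepts every request immediately.
	cfg.RateLimiter = flowcontrol.NewFakeAlwaysRateLimiter()
	return kubernetes.NewForConfig(cfg)
}
```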
Having "olm" as prefix was problematic because a `make test_e2e TEST_E2E_FOCUS=operator.API.updating.provisionerImage.in.deployment.with.specific.values.*while.running` then ran both variants of the tests (for "olm-operator" and "operator"), without an easy way to limit testing to exactly one test case. Anchoring the regex with ^ does not work because Ginkgo internally puts some undocumented and hidden prefix before it. It's better to avoid the issue by choosing distinct names for the deployments.
With NodeCacheCapable=true in the Extender configuration, only the node names are passed as arguments and expected in the results. Since we don't need more than that, this mode is preferable because it is more efficient. Logging gets streamlined for that mode. The install instructions lacked documentation for Kubernetes 1.19 and for setting up /var/lib/scheduler/scheduler-config.yaml.
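A hedged sketch of a filter handler in NodeCacheCapable mode, using the upstream extender API types; the capacity check is a hypothetical placeholder:

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	extenderv1 "k8s.io/kube-scheduler/extender/v1"
)

// nodeHasCapacity is a hypothetical placeholder for the real capacity check.
func nodeHasCapacity(nodeName string, pod *corev1.Pod) bool { return true }

// filter handles a scheduler extender filter request in NodeCacheCapable mode:
// only node names are received and returned, not full node objects.
func filter(args extenderv1.ExtenderArgs) *extenderv1.ExtenderFilterResult {
	filtered := []string{}
	failed := extenderv1.FailedNodesMap{}
	for _, node := range *args.NodeNames { // set by the scheduler in this mode
		if nodeHasCapacity(node, args.Pod) {
			filtered = append(filtered, node)
		} else {
			failed[node] = "insufficient PMEM"
		}
	}
	return &extenderv1.ExtenderFilterResult{
		NodeNames:   &filtered,
		FailedNodes: failed,
	}
}
```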
This now really should be on the safe side.
@avalluri: both PR and branch testing have passed. Please review. The PR contains more than just version skew testing because other things came up while working on this. I can pull out some changes into separate PRs if you want, but the others will then cause merge conflicts.
Looks good to me.
Declaring the entire project as "alpha" is no longer correct; several features are considered stable and ready for production. However, some may still be experimental. We need to document this in more detail and ensure that we have corresponding tests.
Fixes: #631