This repository has been archived by the owner on Oct 22, 2024. It is now read-only.

feature state and upgrade/downgrade testing #723

Merged
merged 12 commits into from
Sep 28, 2020
Conversation

pohly
Contributor

@pohly pohly commented Sep 2, 2020

Declaring the entire project as "alpha" is not correct anymore,
several features are considered stable and ready for
production. However, some may still be experimental. We need to
document this in more detail and ensure that we have corresponding tests.

Fixes: #631

@pohly
Contributor Author

pohly commented Sep 2, 2020

TODO:

  • get rid of the forked Kubernetes (issue update to Kubernetes 1.19.1 #722)
  • should the test really be called "version skew"? Probably not; it's more about downgrading and upgrading the entire driver. "Version skew" testing applies when there was a partial update (for example, controller from 0.6 and nodes still on 0.7).
  • actually test "version skew"?
  • decide which tests to run and what the Jenkins job timeouts should be

@pohly pohly requested a review from avalluri September 2, 2020 15:32
@pohly pohly force-pushed the version-skew branch 3 times, most recently from 0a0b700 to c41a07b Compare September 3, 2020 15:14
@pohly
Contributor Author

pohly commented Sep 3, 2020

should the test really be called "version skew"?

I've added actual skew testing, so the overall name of the new testsuite seems suitable. I kept it.

@pohly pohly force-pushed the version-skew branch 2 times, most recently from 4b8308c to fbca100 Compare September 4, 2020 14:29
@pohly
Contributor Author

pohly commented Sep 5, 2020

CI testing ran into an "interesting" error:

01:13:50.064 pmem-driver: WAITING: ImagePullBackOff - Back-off pulling image "intel/pmem-csi-driver:v0.7.19"

I suspect that was because the release had been tagged but the image hadn't been pushed yet by our CI. The logic that determines which release to test against must become smarter to account for that...

@pohly pohly force-pushed the version-skew branch 5 times, most recently from 54b9054 to 56dd110 Compare September 9, 2020 13:33
@pohly
Contributor Author

pohly commented Sep 9, 2020

The problem I asked about in acfeaf6#r436079052 was causing timeouts in the CI testing. I've reverted that commit.

@pohly pohly force-pushed the version-skew branch 5 times, most recently from 6b0682b to e6dcc22 Compare September 11, 2020 05:48
avalluri pushed a commit to avalluri/pmem-CSI that referenced this pull request Sep 12, 2020
This reverts commit acfeaf6.

As suspected during code review at the time, this change is
problematic because when the controller exits first, the nodes cannot
unregister. gRPC then tries to connect for up to 20 minutes before
eventually giving up (seen in CI testing of
intel#723), which slows down
testing enough that it times out.
@pohly pohly force-pushed the version-skew branch 4 times, most recently from c888bcf to 217a5d1 Compare September 15, 2020 17:39
avalluri pushed a commit to avalluri/pmem-CSI that referenced this pull request Sep 15, 2020
avalluri pushed a commit to avalluri/pmem-CSI that referenced this pull request Sep 15, 2020
pohly added a commit to pohly/pmem-CSI that referenced this pull request Sep 18, 2020
This reverts commit acfeaf6.

As suspected during code review at the time, this change is
problematic because when the controller exits first, the nodes cannot
unregister. gRPC then tries to connect for up to 20 seconds before
eventually giving up (seen in CI testing of
intel#723). When reinstalling the
driver a lot as in the version skew testing, that tends to add up.

Relying on graceful unregistration also doesn't help in other cases. A
better solution is to detect dead nodes on the controller side.
@avalluri
Contributor

@pohly We could revert 1a433b7. I see that is causing the operator to detect default values as changes.

@pohly
Contributor Author

pohly commented Sep 23, 2020

@pohly We could revert 1a433b7. I see that is causing the operator to detect default values as changes.

Let's discuss in #742 - this is not related to this PR.

avalluri pushed a commit that referenced this pull request Sep 23, 2020
@avalluri avalluri force-pushed the version-skew branch 2 times, most recently from ef190e8 to f9d8a0e Compare September 23, 2020 12:24
@pohly pohly force-pushed the version-skew branch 3 times, most recently from 66b52bc to 27d1494 Compare September 25, 2020 19:07
pohly added a commit to pohly/pmem-CSI that referenced this pull request Sep 26, 2020
This reverts commit acfeaf6.

As suspected during code review at the time, this change is
problematic because when the controller exits first, the nodes cannot
unregister. gRPC then tries to connect for up to 20 seconds before
eventually giving up (seen in CI testing of
#723). When reinstalling the
driver a lot as in the version skew testing, that tends to add up.

Relying on graceful unregistration also doesn't help in other cases. A
better solution is to detect dead nodes on the controller side.
It's common to have a valid deployment name and to want just the corresponding struct, without having to handle an error that can never occur.
Commit d9578b6
removed the @ before the shell command for debugging purposes; this
shouldn't have been committed.
When the YAML files were updated for Kubernetes 1.19,
external-provisioner was updated to 2.0.0 without also updating the
operator default. We don't have a test case for "operator default
matches YAML default" because the operator defaults always overwrite
the YAML values; this is hard to change, so we'll have to catch this
through code review.

What could have been caught was that the RBAC rules got out-of-sync,
except that object comparison was unnecessarily limited to just the
"spec" field of those objects which have it. Now almost all fields
are compared.

This highlighted that the RBAC rules in the operator were slightly
different than the ones from the reference YAML files. Now they are
identical.
quay.io has been replaced by k8s.gcr.io as the official registry for
the sidecars. The new external-provisioner v2.0.2 is only available
there. We want that new version because it avoids the unnecessary
"delete before detach" protection; that new feature may have been the
reason why volumes were not deleted during CI stress testing.
This has two facets:
- switching the entire driver deployment from one version to another
  while there is persistent state like active volumes
- combining components from different releases in a deployment,
  which can happen during a rolling update

Upgrade and downgrade testing is done in two directions: from 0.6 to
master and from master to 0.6. In both cases and for all deployment
methods and most volume types, some sanity checks are run:

- an unused volume must be deletable
- an unused volume must be usable for a pod
- a volume used by a pod can be removed

Different volume types are covered via test patterns, i.e. each volume
type goes through driver downgrade/upgrade separately. Persistent
filesystem types are covered in several varieties, including cache
volumes. Block and CSI inline volumes are only tested once.

This is a compromise between keeping individual tests small and
keeping the overall runtime acceptable, because reinstalling the
driver is slow. For the same reason, these tests don't run in our
pre-submit testing.

Skew testing is done by switching to an old release and replacing the
controller image.
Commit 5e996e4 introduced the usage of our exec helper code
into e2e.test. However, that didn't quite work yet and can be
improved:
- the log level must be increased to 5 to see the commands and their
  output
- the error from the helper code intentionally includes the command
  error output, so there is no need to repeat that in the error
  assertion
Mixing klog v1 and v2 did not work as seamlessly as expected: the
command line flag for the log level only affected klog v2, but not
v1. Using just v2 is cleaner anyway...
Enabling verbose logging showed that the initial cluster ready check
and pod logging were getting throttled by the default rate limiter in
the Kubernetes client. We probably don't want that and don't need to
worry about a runaway process or fairness, so we can avoid the
throttling and its associated log messages by setting a nop rate
limiter before creating the client.
Having "olm" as prefix was problematic because a `make test_e2e
TEST_E2E_FOCUS=operator.API.updating.provisionerImage.in.deployment.with.specific.values.*while.running`
then ran both variants of the tests (for "olm-operator" and
"operator"), without an easy way to limit testing to exactly one test
case. Anchoring the regex with ^ does not work because Ginkgo
internally puts some undocumented and hidden prefix before it.

It's better to avoid the issue by choosing distinct names for the
deployments.
With NodeCacheCapable=true in the Extender configuration, just the
node names are passed as arguments and expected in the results. Since
we don't need more than that, this mode is preferable because it is
more efficient. Logging gets streamlined for this mode.

The install instructions lacked documentation for Kubernetes 1.19 and
for setting up /var/lib/scheduler/scheduler-config.yaml.
This now really should be on the safe side.
@pohly pohly changed the title WIP: feature state and upgrade/downgrade testing feature state and upgrade/downgrade testing Sep 27, 2020
@pohly
Contributor Author

pohly commented Sep 27, 2020

@avalluri : both PR and branch testing have passed. Please review.

The PR contains more than just version skew testing because other things came up while working on this. I can pull out some changes into separate PRs if you want, but others then will cause merge conflicts.

Contributor

@avalluri avalluri left a comment


Looks good to me.

Successfully merging this pull request may close these issues.

remove "alpha" warning
2 participants