Skip to content
This repository has been archived by the owner on Oct 22, 2024. It is now read-only.

operator: revise deployment status API and operator reconcile loop #749

Merged
merged 5 commits into from
Nov 12, 2020

Conversation

avalluri
Copy link
Contributor

@avalluri avalluri commented Sep 27, 2020

Revised the Deployment.Status to accommodate the deployment state
conditions and driver state. Currently Deployment has 3 conditions named
CertsVerified, CertsReady, and DriverDeployed. It also records summary
of controller and node driver state, .i.e, no. of nodes the driver is
running.

In order to record real time status of the driver current had to rewrite
the current reconcile loop. The existing reconcile loop was keen on the
deployment CR changes and redeploy only the sub-objects that requires
to redeploy. Instead the new reconcile logic refresh all the objects
and CR status to keep the state consistent. The refresh chooses to
merge patching the objects to avoid all unnecessary updates.

There are two reconcile entry points:

  • CR reconcile loop: refreshes all the sub-objects and CR status
  • sub-object event handler: redeploy only the deleted/changed resource
    and updates CR status if required.

This also includes other code cleanups that come across.

TODOs:

  • E2E tests for validating if the operator restores the state of a
    broken deployment - i.e, recovering deleted/modified sub-objects.
  • Investigate the "unexpected object update" failures.
  • Update documentation bits to show new status fields.

FIXES: #611

Copy link
Contributor

@pohly pohly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've not looked at the actual code yet. Let's get the current tests to work first. For that, please rebase.

pkg/apis/pmemcsi/v1alpha1/deployment_types.go Show resolved Hide resolved

// +k8s:deepcopy-gen=true
type DriverStatus struct {
// Type represents type of the driver: controller or node
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we call this "type of the driver" elsewhere?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we talk about "driver deployment", I tend to think of it as the combination of all of its parts. "Driver component" seems more appropriate here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean to renameType to DriverComponent?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed the field. The status looks as below:

  Driver Components:
    Component:     Controller
    Last Updated:  2020-10-08T07:45:13Z
    Reason:        1 instance(s) of controller driver is running successfully
    Status:        Ready
    Component:     Node
    Last Updated:  2020-10-08T07:45:11Z
    Reason:        All 3 node driver pod(s) running successfully
    Status:        Ready

@avalluri avalluri force-pushed the operator-status branch 3 times, most recently from 3245084 to 8413e84 Compare October 9, 2020 09:36
@avalluri avalluri changed the title WIP: operator: revise deployment status API and operator reconcile loop operator: revise deployment status API and operator reconcile loop Oct 9, 2020
@avalluri
Copy link
Contributor Author

avalluri commented Oct 9, 2020

Split the change into 2 different commits; Revised reconcile loop and Revised status API.

@avalluri avalluri requested a review from pohly October 9, 2020 09:44
pkg/apis/pmemcsi/v1alpha1/deployment_types.go Show resolved Hide resolved
}
op := NewObjectPatch(ss, ss.DeepCopy())
d.getControllerStatefulSet(ss)
if err := d.updateSubObject(r, op); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of these redeploy functions seem to follow the same pattern. Are you sure that this cannot be moved into one function that works with interface{} and callbacks and then just gets called with different parameters?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewrote the code such that all redeploy* functions call a redeploy() with RedeployObject by setting the appropriate callback handlers. But one exception is redeploySecrets which is different from the pattern. It checks if the object is new after the fetching and before the update.

@avalluri avalluri force-pushed the operator-status branch 2 times, most recently from 1e77d5f to 7a8e83d Compare October 13, 2020 19:36
@avalluri avalluri requested a review from pohly October 13, 2020 20:40
Copy link
Contributor

@pohly pohly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make test failed in the CI.

https://cloudnative-k8sci.southcentralus.cloudapp.azure.com/blue/organizations/jenkins/pmem-csi/detail/PR-749/12/pipeline

[2020-10-28T08:27:44.157Z] •! Panic [0.006 seconds]
[2020-10-28T08:27:44.157Z] Operator
[2020-10-28T08:27:44.157Z] /mnt/workspace/pmem-csi_PR-749/pkg/apis/pmemcsi/v1alpha1/deployment_types_test.go:31
[2020-10-28T08:27:44.157Z]   API
[2020-10-28T08:27:44.157Z]   /mnt/workspace/pmem-csi_PR-749/pkg/apis/pmemcsi/v1alpha1/deployment_types_test.go:36
[2020-10-28T08:27:44.157Z]     should have valid json schema [It]
[2020-10-28T08:27:44.157Z]     /mnt/workspace/pmem-csi_PR-749/pkg/apis/pmemcsi/v1alpha1/deployment_types_test.go:112
[2020-10-28T08:27:44.157Z] 
[2020-10-28T08:27:44.157Z]     Test Panicked
[2020-10-28T08:27:44.157Z]     runtime error: invalid memory address or nil pointer dereference
[2020-10-28T08:27:44.157Z]     /go/src/runtime/panic.go:212

containerImage: opts.DriverImage,
deployments: map[string]*pmemcsiv1alpha1.Deployment{},
deploymentsMutex: sync.Mutex{},
reconcileMutex: sync.Mutex{},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can remove the initialization of deploymentsMutex and reconcileMutex, it's not needed.

reconcileHooks map[ReconcileHook]struct{}
deployments map[string]*pmemcsiv1alpha1.Deployment
deploymentsMutex sync.Mutex
reconcileMutex sync.Mutex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add comments explaining what is protected by these mutexes and why that is necessary. deploymentsMutex seems relatively straightforward (protect concurrent access to deployments) but the logic around locking reconcileMutex isn't.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the field comments.

docs/install.md Outdated Show resolved Hide resolved
docs/install.md Outdated Show resolved Hide resolved
docs/install.md Outdated
|---|---|
| CertsReady | Driver certificates/secrets are available. |
| CertsVerified | Verified that the provided certificates are valid. |
| DriverDeployed | All the componentes required for the PMEM-CSI deployment has been deployed. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/has/have/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

docs/install.md Outdated Show resolved Hide resolved
docs/install.md Show resolved Hide resolved
@@ -30,7 +30,7 @@ spec:
- jsonPath: .status.phase
name: Status
type: string
- jsonPath: .metatdata.creationTimestamp
- jsonPath: .metadata.creationTimestamp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a test for this and the other fields.

@avalluri avalluri force-pushed the operator-status branch 2 times, most recently from 90c8553 to 2a1d95d Compare November 1, 2020 02:54
out, err := exec.RunCommand(ssh, "kubectl", "get", "deployments.pmem-csi.intel.com", "--no-headers")
Expect(err).ShouldNot(HaveOccurred(), "kubectl get: %v", out)
fields := strings.Fields(string(out))
Expect(len(fields)).Should(BeEquivalentTo(expectedFiles), "fields mismatch: %v", out)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new test isn't passing:

/mnt/workspace/pmem-csi_PR-749/test/e2e/operator/deployment_api.go:164
Nov 1 03:50:34.611: fields mismatch: test-get-deployment-fields direct map[storage:unknown-node] unexisting/pmem-csi-driver 1s
Expected
<int>: 5
to be equivalent to
<int>: 6
/mnt/workspace/pmem-csi_PR-749/test/e2e/operator/deployment_api.go:185

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MatchRegexp (https://github.com/onsi/gomega/blob/v1.10.3/matchers.go#L171) might be a better way to check for the expected output.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new test isn't passing:

Fixed it now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MatchRegexp (https://github.com/onsi/gomega/blob/v1.10.3/matchers.go#L171) might be a better way to check for the expected output.

Changed the test to use regex.

Copy link
Contributor

@pohly pohly Nov 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit dangerous: it only works if the strings that you put into the regex have no special meaning. I'm on the edge whether the curly brackets in {"storage":"unknown-node"} are matched literally or might be interpreted as a repetition count. But it seems to work, so I suppose it's okay....

... except that the output with Kubernetes 1.17 is different:
https://cloudnative-k8sci.southcentralus.cloudapp.azure.com/blue/organizations/jenkins/pmem-csi/detail/PR-749/20/tests

Expected
<string>: test-get-deployment-fields direct map[storage:unknown-node] unexisting/pmem-csi-driver Running 14s
to match regular expression
<string>: test-get-deployment-fields\s+direct\s+{"storage":"unknown-node"}\s+unexisting/pmem-csi-driver\s+Running\s+[0-9]+(s|m)
/mnt/workspace/pmem-csi_PR-749/test/e2e/operator/deployment_api.go:188

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made changes to regex to support K8s 1.17 output., this also avoids usage of {} in the regex.

@avalluri avalluri force-pushed the operator-status branch 2 times, most recently from 63c687d to 02bc95f Compare November 4, 2020 06:09
@pohly
Copy link
Contributor

pohly commented Nov 11, 2020

@avalluri please rebase

avalluri and others added 5 commits November 11, 2020 15:57
The existing reconcile loop was keen on the deployment CR changes and
redeploy *only* the sub-objects that requires redeploying. This approach
has few drawbacks:

 - it does not handle the changes occurred to the sub-objects once after
   deployed. For example, it does not restore the deleted sub-resource
   which breaks the running deployment.
 - it is not possible to implement the upcoming new status API, that
   requires to update the deployment status with the sub-object(s) updated
   status.

Instead the new reconcile logic *refreshes* all the objects
and CR status to keep the state consistent. The refresh chooses to
merge/server-side patching the objects to avoid all unnecessary updates.

There are two reconcile entry points:
 - CR reconcile loop: to be called on CR updates and refreshes all the
   sub-objects and CR status
 - sub-object event handler: handles the changes(including delete) on
   deployment sub-objects and redeploy only the deleted/changed resource
   and updates CR status if required.

Also made controller data thread-safe.
Revised the Deployment.Status to accommodate the deployment state
conditions and the driver state. Currently, Deployment has 3 conditions
named CertsVerified, CertsReady, and DriverDeployed. It also records the
summary of controller and node driver state, .i.e, number  of nodes the
driver is running. The state The state get refreshed on each time the
reconcile loop run.
With recent changes deployment CR comparsion code is no more needed.
By constructing the objects with TypeMeta, we can retrieve
GroupVersionKind from them when needed.
An e2e test to validate the deployment fields listed by `kubectl get`.
@avalluri
Copy link
Contributor Author

@avalluri please rebase

Rebased.

Copy link
Contributor

@pohly pohly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's merge this.

One more comment about API versioning: strictly speaking, we are changing from v1alpha1 to v1alpha2 here. But because it's an alpha API, it's okayish and (in practice) easier for everyone to just change some fields.

But as we now intend to move to a beta operator API, the 0.9.0 release should have a v1beta1 deployment API, ideally in addition to the v1alpha1. At some point we need to test how to support more than one API version, so we should better start now...

@pohly pohly merged commit d01bc59 into intel:devel Nov 12, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

operator API v1alpha2: revise Status
2 participants