-
Notifications
You must be signed in to change notification settings - Fork 55
operator: revise deployment status API and operator reconcile loop #749
Conversation
bff2216
to
9b17451
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've not looked at the actual code yet. Let's get the current tests to work first. For that, please rebase.
|
||
// +k8s:deepcopy-gen=true | ||
type DriverStatus struct { | ||
// Type represents type of the driver: controller or node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we call this "type of the driver" elsewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we talk about "driver deployment", I tend to think of it as the combination of all of its parts. "Driver component" seems more appropriate here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean to renameType
to DriverComponent
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed the field. The status looks as below:
Driver Components:
Component: Controller
Last Updated: 2020-10-08T07:45:13Z
Reason: 1 instance(s) of controller driver is running successfully
Status: Ready
Component: Node
Last Updated: 2020-10-08T07:45:11Z
Reason: All 3 node driver pod(s) running successfully
Status: Ready
3245084
to
8413e84
Compare
Split the change into 2 different commits; Revised reconcile loop and Revised status API. |
} | ||
op := NewObjectPatch(ss, ss.DeepCopy()) | ||
d.getControllerStatefulSet(ss) | ||
if err := d.updateSubObject(r, op); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of these redeploy
functions seem to follow the same pattern. Are you sure that this cannot be moved into one function that works with interface{}
and callbacks and then just gets called with different parameters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rewrote the code such that all redeploy* functions call a redeploy()
with RedeployObject
by setting the appropriate callback handlers. But one exception is redeploySecrets
which is different from the pattern. It checks if the object is new after the fetching and before the update.
pkg/pmem-csi-operator/controller/deployment/controller_driver.go
Outdated
Show resolved
Hide resolved
pkg/pmem-csi-operator/controller/deployment/deployment_controller.go
Outdated
Show resolved
Hide resolved
pkg/pmem-csi-operator/controller/deployment/deployment_controller.go
Outdated
Show resolved
Hide resolved
pkg/pmem-csi-operator/controller/deployment/deployment_controller.go
Outdated
Show resolved
Hide resolved
1e77d5f
to
7a8e83d
Compare
7a8e83d
to
0aa500c
Compare
0aa500c
to
18aa6e3
Compare
pkg/pmem-csi-operator/controller/deployment/deployment_controller.go
Outdated
Show resolved
Hide resolved
pkg/pmem-csi-operator/controller/deployment/deployment_controller.go
Outdated
Show resolved
Hide resolved
e09b6ae
to
f65ccfb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make test
failed in the CI.
[2020-10-28T08:27:44.157Z] •! Panic [0.006 seconds]
[2020-10-28T08:27:44.157Z] Operator
[2020-10-28T08:27:44.157Z] /mnt/workspace/pmem-csi_PR-749/pkg/apis/pmemcsi/v1alpha1/deployment_types_test.go:31
[2020-10-28T08:27:44.157Z] API
[2020-10-28T08:27:44.157Z] /mnt/workspace/pmem-csi_PR-749/pkg/apis/pmemcsi/v1alpha1/deployment_types_test.go:36
[2020-10-28T08:27:44.157Z] should have valid json schema [It]
[2020-10-28T08:27:44.157Z] /mnt/workspace/pmem-csi_PR-749/pkg/apis/pmemcsi/v1alpha1/deployment_types_test.go:112
[2020-10-28T08:27:44.157Z]
[2020-10-28T08:27:44.157Z] Test Panicked
[2020-10-28T08:27:44.157Z] runtime error: invalid memory address or nil pointer dereference
[2020-10-28T08:27:44.157Z] /go/src/runtime/panic.go:212
cfc4a5b
to
5ba620e
Compare
containerImage: opts.DriverImage, | ||
deployments: map[string]*pmemcsiv1alpha1.Deployment{}, | ||
deploymentsMutex: sync.Mutex{}, | ||
reconcileMutex: sync.Mutex{}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can remove the initialization of deploymentsMutex
and reconcileMutex
, it's not needed.
reconcileHooks map[ReconcileHook]struct{} | ||
deployments map[string]*pmemcsiv1alpha1.Deployment | ||
deploymentsMutex sync.Mutex | ||
reconcileMutex sync.Mutex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add comments explaining what is protected by these mutexes and why that is necessary. deploymentsMutex
seems relatively straightforward (protect concurrent access to deployments
) but the logic around locking reconcileMutex
isn't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the field comments.
docs/install.md
Outdated
|---|---| | ||
| CertsReady | Driver certificates/secrets are available. | | ||
| CertsVerified | Verified that the provided certificates are valid. | | ||
| DriverDeployed | All the componentes required for the PMEM-CSI deployment has been deployed. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/has/have/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
5ba620e
to
bd08ad3
Compare
@@ -30,7 +30,7 @@ spec: | |||
- jsonPath: .status.phase | |||
name: Status | |||
type: string | |||
- jsonPath: .metatdata.creationTimestamp | |||
- jsonPath: .metadata.creationTimestamp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need a test for this and the other fields.
90c8553
to
2a1d95d
Compare
test/e2e/operator/deployment_api.go
Outdated
out, err := exec.RunCommand(ssh, "kubectl", "get", "deployments.pmem-csi.intel.com", "--no-headers") | ||
Expect(err).ShouldNot(HaveOccurred(), "kubectl get: %v", out) | ||
fields := strings.Fields(string(out)) | ||
Expect(len(fields)).Should(BeEquivalentTo(expectedFiles), "fields mismatch: %v", out) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new test isn't passing:
/mnt/workspace/pmem-csi_PR-749/test/e2e/operator/deployment_api.go:164
Nov 1 03:50:34.611: fields mismatch: test-get-deployment-fields direct map[storage:unknown-node] unexisting/pmem-csi-driver 1s
Expected
<int>: 5
to be equivalent to
<int>: 6
/mnt/workspace/pmem-csi_PR-749/test/e2e/operator/deployment_api.go:185
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MatchRegexp
(https://github.com/onsi/gomega/blob/v1.10.3/matchers.go#L171) might be a better way to check for the expected output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new test isn't passing:
Fixed it now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MatchRegexp
(https://github.com/onsi/gomega/blob/v1.10.3/matchers.go#L171) might be a better way to check for the expected output.
Changed the test to use regex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit dangerous: it only works if the strings that you put into the regex have no special meaning. I'm on the edge whether the curly brackets in {"storage":"unknown-node"}
are matched literally or might be interpreted as a repetition count. But it seems to work, so I suppose it's okay....
... except that the output with Kubernetes 1.17 is different:
https://cloudnative-k8sci.southcentralus.cloudapp.azure.com/blue/organizations/jenkins/pmem-csi/detail/PR-749/20/tests
Expected
<string>: test-get-deployment-fields direct map[storage:unknown-node] unexisting/pmem-csi-driver Running 14s
to match regular expression
<string>: test-get-deployment-fields\s+direct\s+{"storage":"unknown-node"}\s+unexisting/pmem-csi-driver\s+Running\s+[0-9]+(s|m)
/mnt/workspace/pmem-csi_PR-749/test/e2e/operator/deployment_api.go:188
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made changes to regex to support K8s 1.17 output., this also avoids usage of {}
in the regex.
63c687d
to
02bc95f
Compare
02bc95f
to
2ea6520
Compare
@avalluri please rebase |
The existing reconcile loop was keen on the deployment CR changes and redeploy *only* the sub-objects that requires redeploying. This approach has few drawbacks: - it does not handle the changes occurred to the sub-objects once after deployed. For example, it does not restore the deleted sub-resource which breaks the running deployment. - it is not possible to implement the upcoming new status API, that requires to update the deployment status with the sub-object(s) updated status. Instead the new reconcile logic *refreshes* all the objects and CR status to keep the state consistent. The refresh chooses to merge/server-side patching the objects to avoid all unnecessary updates. There are two reconcile entry points: - CR reconcile loop: to be called on CR updates and refreshes all the sub-objects and CR status - sub-object event handler: handles the changes(including delete) on deployment sub-objects and redeploy only the deleted/changed resource and updates CR status if required. Also made controller data thread-safe.
Revised the Deployment.Status to accommodate the deployment state conditions and the driver state. Currently, Deployment has 3 conditions named CertsVerified, CertsReady, and DriverDeployed. It also records the summary of controller and node driver state, .i.e, number of nodes the driver is running. The state The state get refreshed on each time the reconcile loop run.
With recent changes deployment CR comparsion code is no more needed.
By constructing the objects with TypeMeta, we can retrieve GroupVersionKind from them when needed.
An e2e test to validate the deployment fields listed by `kubectl get`.
2ea6520
to
2003d58
Compare
Rebased. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's merge this.
One more comment about API versioning: strictly speaking, we are changing from v1alpha1 to v1alpha2 here. But because it's an alpha API, it's okayish and (in practice) easier for everyone to just change some fields.
But as we now intend to move to a beta operator API, the 0.9.0 release should have a v1beta1 deployment API, ideally in addition to the v1alpha1. At some point we need to test how to support more than one API version, so we should better start now...
Revised the Deployment.Status to accommodate the deployment state
conditions and driver state. Currently Deployment has 3 conditions named
CertsVerified, CertsReady, and DriverDeployed. It also records summary
of controller and node driver state, .i.e, no. of nodes the driver is
running.
In order to record real time status of the driver current had to rewrite
the current reconcile loop. The existing reconcile loop was keen on the
deployment CR changes and redeploy only the sub-objects that requires
to redeploy. Instead the new reconcile logic refresh all the objects
and CR status to keep the state consistent. The refresh chooses to
merge patching the objects to avoid all unnecessary updates.
There are two reconcile entry points:
and updates CR status if required.
This also includes other code cleanups that come across.
TODOs:
broken deployment - i.e, recovering deleted/modified sub-objects.
FIXES: #611