Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Propagate VR condition error message to protected pvc conditions #1639

Merged
merged 4 commits into from
Nov 11, 2024

Conversation

nirs
Copy link
Member

@nirs nirs commented Nov 5, 2024

When a VR condition is not met, we set the protected PVC condition
message using the error message returned from isVRConditionMet(). When
using csi-addons > 0.10.0, we use now the message from the condition
instead of the default message.

Since the Validated condition is not reported by older version of
csi-addons, and we must wait until the Validated condition status is
known when VRG is deleted, isVRConditionMet() returns now also the state
of the condition, which can be:

  • missing: condition not found
  • stale: observed generation does not match object generation
  • unknown: the special "Unknown" value
  • known: status is True or False

When we validate the Validate condition we have these cases:

  • Condition is missing: continue to next condition.

  • Condition is met: continue to the next condition.

  • Condition not met and its status is False. This VR will never
    complete and it is safe to delete since replication will never start.
    If VRG is deleted, we return true since the VR reached the designed
    state. Otherwise we return false. In this case we updated the
    protected pvc condition with the message from the VR condition.

  • Condition is not met and is stale or unnown: we need to check again
    later. There is no point to check the completed condition since a VR
    cannot complete without validation.In this case we updated the
    protected pvc condition with the message generated by
    isVRConditionMet() for stale or unknown conditions.

Example protected pvc DataReady condition with propagated message when
VR validation failed:

conditions:
  - lastTransitionTime: "2024-11-06T15:33:06Z"
    message: 'failed to meet prerequisite: rpc error: code = FailedPrecondition
      desc = system is not in a state required for the operation''s execution:
      failed to enable mirroring on image "replicapool/csi-vol-fe2ca7f8-713c-4c51-bf52-0d4b2c11d329":
      parent image "replicapool/csi-snap-e2114105-b451-469b-ad97-eb3cbe2af54e"
      is not enabled for mirroring'
    observedGeneration: 1
    reason: Error
    status: "False"
    type: DataReady

Note

Using development build of csi-addons adding for testing.
We don't depend on the csi-addon release to merge this fix, but it will
be affective only when using csi-addons including this change:
csi-addons/kubernetes-csi-addons#691

@nirs nirs force-pushed the vr-status branch 3 times, most recently from a84c11b to c496fe9 Compare November 5, 2024 13:47
@nirs nirs changed the title Test vr status with a message Propagate VR condition error message to protected pvc conditions Nov 5, 2024
@nirs
Copy link
Member Author

nirs commented Nov 5, 2024

Works for VR conditions other then Validated. We propagate the messages form the VR conditions:

VR:

  status:
    conditions:
    - lastTransitionTime: "2024-11-05T14:32:43Z"
      message: failed to promote volume
      observedGeneration: 1
      reason: FailedToPromote
      status: "False"
      type: Completed
    - lastTransitionTime: "2024-11-05T14:32:43Z"
      message: failed to enable volume replication
      observedGeneration: 1
      reason: Error
      status: "True"
      type: Degraded
    - lastTransitionTime: "2024-11-05T14:32:43Z"
      message: volume is not resyncing
      observedGeneration: 1
      reason: NotResyncing
      status: "False"
      type: Resyncing
    - lastTransitionTime: "2024-11-05T14:32:43Z"
      message: 'failed to meet prerequisite: rpc error: code = FailedPrecondition
        desc = system is not in a state required for the operation''s execution: failed
        to enable mirroring on image "replicapool/csi-vol-f4737b6e-eeff-4137-8248-301cf37a3368":
        parent image "replicapool/csi-snap-e7c91292-a272-4278-9ee9-6be7a4c8bfe0" is
        not enabled for mirroring'
      observedGeneration: 1
      reason: PrerequisiteNotMet
      status: "False"
      type: Validated

VRG:

    protectedPVCs:
    - accessModes:
      - ReadWriteOnce
      conditions:
      - lastTransitionTime: "2024-11-05T14:32:43Z"
        message: failed to promote volume
        observedGeneration: 1
        reason: Error
        status: "False"
        type: DataReady
      - lastTransitionTime: "2024-11-05T14:32:44Z"
        message: PV cluster data already protected for PVC restored-pvc
        observedGeneration: 1
        reason: Uploaded
        status: "True"
        type: ClusterDataProtected
      - lastTransitionTime: "2024-11-05T14:32:44Z"
        message: failed to promote volume
        observedGeneration: 1
        reason: Error
        status: "False"
        type: DataProtected

Missing change: when Validated condition is False, we want to set the DataReady condition and DataProtected using the error message from the Validated condition. Currently we use the Validated condition only for checking if the VR is finished and can be removed.

@nirs nirs force-pushed the vr-status branch 4 times, most recently from a0d18cc to 54e7e67 Compare November 5, 2024 16:25
@nirs
Copy link
Member Author

nirs commented Nov 5, 2024

Propgartion to protected pvcs message works now for all VR conditions:

    protectedPVCs:
    - accessModes:
      - ReadWriteOnce
      conditions:
      - lastTransitionTime: "2024-11-05T16:36:15Z"
        message: 'failed to meet prerequisite: rpc error: code = FailedPrecondition
          desc = system is not in a state required for the operation''s execution:
          failed to enable mirroring on image "replicapool/csi-vol-348f65fd-c658-4764-b7e7-85c45974e97e":
          parent image "replicapool/csi-snap-1ef6bed0-57e3-458f-8a99-413b823dde59"
          is not enabled for mirroring'
        observedGeneration: 1
        reason: Error
        status: "False"
        type: DataReady
      - lastTransitionTime: "2024-11-05T16:36:16Z"
        message: PV cluster data already protected for PVC restored-pvc
        observedGeneration: 1
        reason: Uploaded
        status: "True"
        type: ClusterDataProtected
      - lastTransitionTime: "2024-11-05T16:36:15Z"
        message: 'failed to meet prerequisite: rpc error: code = FailedPrecondition
          desc = system is not in a state required for the operation''s execution:
          failed to enable mirroring on image "replicapool/csi-vol-348f65fd-c658-4764-b7e7-85c45974e97e":
          parent image "replicapool/csi-snap-1ef6bed0-57e3-458f-8a99-413b823dde59"
          is not enabled for mirroring'
        observedGeneration: 1
        reason: Error
        status: "False"
        type: DataProtected
      csiProvisioner: rook-ceph.rbd.csi.ceph.com
      labels:
        appname: busybox
        ramendr.openshift.io/owner-name: flatten-drpc
        ramendr.openshift.io/owner-namespace-name: ramen-ops
      name: restored-pvc
      namespace: flatten
      replicationID:
        id: ""
      resources:
        requests:
          storage: 1Gi
      storageClassName: rook-ceph-block
      storageID:
        id: rook-ceph-dr1-1

But we have 25 failed unit tests, need to understand why they fail.

@nirs
Copy link
Member Author

nirs commented Nov 5, 2024

We don't propagate the protected pvcs conditions to the drpc, so on the hub this does not help to debug the issue.

Maybe we can add list or errors messages from protected pvcs to make it easier to debug.

  status:
    actionDuration: 23.105201755s
    actionStartTime: "2024-11-05T17:02:08Z"
    conditions:
    - lastTransitionTime: "2024-11-05T17:02:01Z"
      message: Initial deployment completed
      observedGeneration: 1
      reason: Deployed
      status: "True"
      type: Available
    - lastTransitionTime: "2024-11-05T17:02:01Z"
      message: Ready
      observedGeneration: 1
      reason: Success
      status: "True"
      type: PeerReady
    - lastTransitionTime: "2024-11-05T17:02:02Z"
      message: VolumeReplicationGroup (ramen-ops/flatten-drpc) on cluster dr1 is reporting
        errors (All PVCs of the VolumeReplicationGroup are not ready) readying workload
        data, retrying till DataReady condition is met
      observedGeneration: 1
      reason: Error
      status: "False"
      type: Protected
    lastKubeObjectProtectionTime: "2024-11-05T17:02:04Z"
    lastUpdateTime: "2024-11-05T17:02:31Z"
    observedGeneration: 1
    phase: Deployed
    preferredDecision:
      clusterName: dr1
      clusterNamespace: dr1
    progression: Completed
    resourceConditions:
      conditions:
      - lastTransitionTime: "2024-11-05T17:02:02Z"
        message: All PVCs of the VolumeReplicationGroup are not ready
        observedGeneration: 1
        reason: Error
        status: "False"
        type: DataReady
      - lastTransitionTime: "2024-11-05T17:02:02Z"
        message: All PVCs of the VolumeReplicationGroup are not ready
        observedGeneration: 1
        reason: Error
        status: "False"
        type: DataProtected
      - lastTransitionTime: "2024-11-05T17:02:01Z"
        message: Nothing to restore
        observedGeneration: 1
        reason: Restored
        status: "True"
        type: ClusterDataReady
      - lastTransitionTime: "2024-11-05T17:02:04Z"
        message: Cluster data of all PVs are protected. Kube objects protected. Kube
          objects protected
        observedGeneration: 1
        reason: Uploaded
        status: "True"
        type: ClusterDataProtected
      resourceMeta:
        generation: 1
        kind: VolumeReplicationGroup
        name: flatten-drpc
        namespace: ramen-ops
        protectedpvcs:
        - restored-pvc
        resourceVersion: "15650"

@nirs nirs force-pushed the vr-status branch 2 times, most recently from d958c57 to f84aa7f Compare November 5, 2024 17:34
v.log.Info(fmt.Sprintf("%s (VolRep: %s/%s)", errorMsg, volRep.GetName(), volRep.GetNamespace()))
// Check the condition, but update the pvc conditions only if the check was successful.
conditionMet, ok, errorMsg := isVRConditionMet(volRep, volrep.ConditionValidated, metav1.ConditionTrue)
if !conditionMet && ok {
Copy link
Member

@BenamarMk BenamarMk Nov 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so if isVRConditionMet fails, and returns !met && !ok, then this if will evaluate to false because !(!met) && !ok is false. That is !(false) && false It seems wrong.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If ok is false, we don't know the condition value. The condition is missing, the observed generation does not match the object generation, or the value is unknown. For deleting the vr during deletion, we must wait for Status=False. When not deleting, we need to wait and try agin later.

What is wrong?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the error is ignored then. Meaning, we will not enter the if block.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point is; if this line fails, then we ignore the error.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not an error, we don't know the value of Validated condition. This was the behavior before this change.

We can treat this as an error, since before validated becomes true nothing will work, and it is a quick check, so we can set the conditions and skip checking the next condition.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I can keep the log for the error message, so we report missing, stale or unknown status, without affecting the conditions.

But note that this was logged only when deleting the vrg because previously we called this only during delete. Now we always call this.

Copy link
Member

@BenamarMk BenamarMk Nov 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, that's why I said either log it or remove those error log statements, line 1619, 1626, and 1633. But don't just throw it away.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can simplify if instead of returning ok boolean, we will return enum:

  • ConditionMissing
  • ConditionStale
  • ConditionUnknown
  • ConditionAvailable

With this I can update DataReady and DataProtected for all states except ConditionMissing, and I can return true during deletion only for ConditionAvailable.

Example cases:

state action
ConditionMissing check next condition
ConditionStale update pvc conditions and return false
ConditionUnknown update pvc conditions and return false
ConditionAvailable if condition not met return true if deleting, otherwise false. If true check next condition

With this validateVRStatus will look like the other helpers, always logging and setting the pv conditions, unless the condition is missing (old csi-addons).

@BenamarMk what do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That’s definitely a better long-term solution. However, it requires a lot of changes. You might want to focus on addressing just the error message.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change to return condition state is simple. Handling the condition is little bit more complicated but is more correct.

@nirs nirs force-pushed the vr-status branch 3 times, most recently from 2f93dcf to fced314 Compare November 5, 2024 21:15
@yati1998
Copy link

yati1998 commented Nov 6, 2024

@nirs I not very sure on how the final VRG will look like, can you please point out the comment from above?

@nirs
Copy link
Member Author

nirs commented Nov 6, 2024

@nirs I not very sure on how the final VRG will look like, can you please point out the comment from above?

This comment show the change in the vrg:
#1639 (comment)

@nirs nirs force-pushed the vr-status branch 3 times, most recently from 855be7f to ec846ff Compare November 6, 2024 17:09
@nirs nirs marked this pull request as ready for review November 6, 2024 17:09
@nirs nirs mentioned this pull request Nov 6, 2024
@yati1998
Copy link

yati1998 commented Nov 7, 2024

LGTM, expect currently we are just setting error message to dataProtect and dataReady, but based on various conditions from VR, it should be populated and show messages accordingly.
@nirs If I am not wrong you are planning to bring this change later on as bug fix right?

@nirs
Copy link
Member Author

nirs commented Nov 7, 2024

LGTM, expect currently we are just setting error message to dataProtect and dataReady, but based on various conditions from VR, it should be populated and show messages accordingly. @nirs If I am not wrong you are planning to bring this change later on as bug fix right?

Setting the error message is the purpose of change. In normal condition we know the exact state so using the message from the VR is not very useful. We may simplify the code later to just use the message from the VR.

Another issue duplicating the content of DataReady and DataProtected conditions, which does not seems right, but this is not a new issue, and changing it is not in the scope of this change.

Copy link
Member

@BenamarMk BenamarMk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than two small comments, the rest looks good to me.

conditionMissing = conditionState("missing")
conditionStale = conditionState("stale")
conditionUnknown = conditionState("unknown")
conditionKnown = conditionState("known")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while reviewing, I got confused mixing conditionKnown and conditionUnknown. I think it is better to rename conditionKnown to conditionExists

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

conditionExists is confusing with conditionMissing, but I can find a term that different from Unknown.

v.log.Info(fmt.Sprintf("%s (VolRep: %s/%s)", errorMsg, volRep.GetName(), volRep.GetNamespace()))
) (bool, conditionState) {
conditionMet, condState, errorMsg := isVRConditionMet(volRep, volrep.ConditionValidated, metav1.ConditionTrue)
if !conditionMet && condState != conditionMissing {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would still log the errorMsg when the condState is conditionMissing

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One issue, since we log the defaultMsg (like we do for other conditions), we will log:

 "VolumeReplication resource not validated"

Which is confusing when using csi-addons < 0.10.0. For example when we Completed=True, which means that everything is good. So I think the best way is to keep the current behavior, log only if the condition exists.

@yati1998
Copy link

yati1998 commented Nov 8, 2024

@nirs can we get one more approval on this and merge it?

nirs added 4 commits November 10, 2024 16:45
It was added in failed state (Status=False, Reason=PrerequisiteNotMet)
by default which is wrong. The issue was hidden since we check the
condition only when deleting the VRG. We want to inspect the condition
when the when the VRG is live, so we can report the condition status in
the protected pvcs conditions.

Signed-off-by: Nir Soffer <[email protected]>
Based on the messages addded in
csi-addons/kubernetes-csi-addons#691. We want to
propagate the error messages to the protected pvcs conditions.

Signed-off-by: Nir Soffer <[email protected]>
When a VR condition is not met, we set the protected PVC condition
message using the error message returned from isVRConditionMet(). When
using csi-addons > 0.10.0, we use now the message from the condition
instead of the default message.

Since the Validated condition is not reported by older version of
csi-addons, and we must wait until the Validated condition status is
known when VRG is deleted, isVRConditionMet() returns now also the state
of the condition, which can be:

- missing: condition not found
- stale: observed generation does not match object generation
- unknown: the special "Unknown" value
- known: status is True or False

When we validate the Validate condition we have these cases:

- Condition is missing: continue to next condition.

- Condition is met: continue to the next condition.

- Condition not met and its status is False. This VR will never
  complete and it is safe to delete since replication will never start.
  If VRG is deleted, we return true since the VR reached the designed
  state. Otherwise we return false. In this case we updated the
  protected pvc condition with the message from the VR condition.

- Condition is not met and is stale or unnown: we need to check again
  later. There is no point to check the completed condition since a VR
  cannot complete without validation.In this case we updated the
  protected pvc condition with the message generated by
  isVRConditionMet() for stale or unknown conditions.

Example protected pvc DataReady condition with propagated message when
VR validation failed:

    conditions:
      - lastTransitionTime: "2024-11-06T15:33:06Z"
        message: 'failed to meet prerequisite: rpc error: code = FailedPrecondition
          desc = system is not in a state required for the operation''s execution:
          failed to enable mirroring on image "replicapool/csi-vol-fe2ca7f8-713c-4c51-bf52-0d4b2c11d329":
          parent image "replicapool/csi-snap-e2114105-b451-469b-ad97-eb3cbe2af54e"
          is not enabled for mirroring'
        observedGeneration: 1
        reason: Error
        status: "False"
        type: DataReady

Signed-off-by: Nir Soffer <[email protected]>
Verify that when VR Validated condition is False, the condition message
is propagated to the protected pvc DataReady condition message.

To make this easy to test, we have a new helper for waiting until
protected pvc condition status and message are updated to specified
values.

We propagate the same message to the DataProtected condition, but this
behavior seems like unwanted behavior that should change and is not
worth testing.

Signed-off-by: Nir Soffer <[email protected]>
@nirs
Copy link
Member Author

nirs commented Nov 10, 2024

@nirs can we get one more approval on this and merge it?

We don't need more approvals. We kept it to give time for more reviewers.

@BenamarMk BenamarMk merged commit 01d9710 into RamenDR:main Nov 11, 2024
20 checks passed
@nirs nirs deleted the vr-status branch November 20, 2024 18:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants