
Allow restoration of the status metadata field #1687

Closed
rayanebel opened this issue Jul 23, 2019 · 30 comments · Fixed by #4785
Labels
  • Enhancement/User (End-User Enhancement to Velero)
  • Icebox (We see the value, but it is not slated for the next couple releases)
  • kind/requirement
  • Kubernetes Resources (Pertains to backup/restoration of Kubernetes resources)
  • Needs Product (Blocked needing input or feedback from Product)
  • Reviewed Q2 2021

Comments

@rayanebel

rayanebel commented Jul 23, 2019

Describe the problem/challenge you have
To manage our clusters, we use Rancher. Rancher runs on top of Kubernetes and saves all of its data in the same etcd server (it's like a big operator). We tried to use Velero to be able to restore our Rancher server in case of disaster recovery.

Backups run perfectly, but we are facing an issue during the restore phase because Rancher saves some state information in the status section, and Velero cleans all status and metadata before restoring the objects. So currently we use Velero for the backup and then restore the files with multiple kubectl command lines.

Describe the solution you'd like
I don't have the perfect solution, but I saw the resetMetadataAndStatus function in the Velero code:

https://github.com/heptio/velero/blob/e371ba78b0844b28359f902076d470ae5a5de0b9/pkg/restore/restore.go#L1117

Maybe as a first step we could add a simple flag like --include-status to give the user the choice of whether to run this function.

Maybe you have another solution or another idea?

Anything else you would like to add:

We also opened a ticket on the Rancher side: rancher/rancher#21647

Environment:

  • Velero version
    • Client: v1.0.0
    • Server: v0.11.0
  • Kubernetes:
    • Client: v1.14
    • Server: v1.12.8
  • Kubernetes installer & version: AKS
  • Cloud provider or hardware configuration: Azure AKS/Rancher (2.2.5)
@prydonius
Contributor

Thanks for filing the issue upstream with Rancher, I do think that it's odd for Rancher to store state in status. That said, I don't think it would hurt to add an annotation or flag (my preference would be to annotate items with something like velero.io/restore-status: true). It's possible that other Operators might use status in some meaningful way, so this could be generally useful.

@prydonius
Contributor

wdyt @skriss @carlisia @nrb?

@skriss
Contributor

skriss commented Aug 12, 2019

I believe there's another issue here, which is that status is actually ignored by the API server when a resource is created or updated, since it's not intended to be specified by the user. To restore status, I believe we'd have to explicitly capture it and write it to the /status subresource for whatever resource we are operating on. CRDs haven't always supported the /status subresource, so that may not be true for some older clusters/CRDs, but going forward it should be. I'm not sure off the top of my head exactly how we would implement this - there are likely complexities around not being able to create the spec and status atomically, and possibly other issues.
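
For illustration, here is a minimal sketch of the two-step write described above, using client-go's dynamic client. This is not Velero's actual code; the package, function name, and GVR handling are hypothetical. Note that the two API calls are not atomic, which is exactly the complexity mentioned.

package restoreutil

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/dynamic"
)

// restoreWithStatus creates a resource and then writes its backed-up status
// through the /status subresource, since the API server ignores status on Create.
func restoreWithStatus(ctx context.Context, client dynamic.Interface, gvr schema.GroupVersionResource, ns string, obj *unstructured.Unstructured) error {
    // Save the status from the backup before creating; Create will drop it.
    savedStatus, hasStatus, err := unstructured.NestedFieldCopy(obj.Object, "status")
    if err != nil {
        return err
    }

    created, err := client.Resource(gvr).Namespace(ns).Create(ctx, obj, metav1.CreateOptions{})
    if err != nil {
        return err
    }
    if !hasStatus {
        return nil
    }

    // Copy the saved status onto the created object (which now carries the
    // server-assigned resourceVersion) and write it via the /status subresource.
    if err := unstructured.SetNestedField(created.Object, savedStatus, "status"); err != nil {
        return err
    }
    _, err = client.Resource(gvr).Namespace(ns).UpdateStatus(ctx, created, metav1.UpdateOptions{})
    return err
}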

@skriss
Contributor

skriss commented Aug 12, 2019

We probably need to discuss as a team and see if this is something we want to invest further time in to investigate/prototype.

@i300543

i300543 commented Aug 25, 2019

Hello,
we are also trying to use Velero for our cluster backups, and we are facing a similar issue.
We would prefer to keep the state in the status field when it is not defined as a subresource:

  1. We would like to keep some state in the object that is visible to the user as status, but not modifiable by the user.
  2. If the CRD object doesn't explicitly define /status as a subresource, then a backup/restore operation should not remove the status. Removing it means that objects that rely on it will fail, while keeping it means that objects that don't rely on it still work correctly.
     Nothing is required of the Velero code other than "don't modify the backup as it was made."

Please let us know if and when you reach a decision.
Thanks!

SAP Webide dev team

@nrb
Contributor

nrb commented Aug 26, 2019

Thanks for adding your use case here @i300543.

Originally this code was designed under the assumption that status would be recoverable based on the spec. Can you help me understand why that's not the case with your CRDs?

@i300543

i300543 commented Aug 27, 2019

For various reasons, our application was developed this way, and changing it is not easy at this point. Why does Velero add a restriction to Kubernetes objects that is explicitly not enforced by Kubernetes itself? In Kubernetes, a CRD that doesn't explicitly define a /status subresource can use the status section the way our application does. Velero's assumption breaks our flow, while removing the assumption would not hurt any use case as far as we can see, no?

@skriss
Contributor

skriss commented Sep 3, 2019

@rayanebel @i300543 could you provide a couple specific examples of the types of information that are stored in status that need to be restored?

Also, in terms of UX - would you want to enable/disable restoring status on a per-CRD level (i.e. for all items of a given CRD type), or at some other level of granularity?

@skriss skriss added the Needs info Waiting for information label Sep 3, 2019
@rayanebel
Author

rayanebel commented Sep 4, 2019

hello @skriss

As Rancher works with custom resource definitions (xxxxxx.cattle.io) for all its objects, I think it would be great to restore status at the CRD level (for all items of a CRD, with a wildcard for example).

With Rancher, for example, when we create a cluster we get an object of kind clusters.management.cattle.io, which is added to Kubernetes with the following content:

apiVersion: management.cattle.io/v3
kind: Cluster
metadata:
  labels:
    cattle.io/creator: norman
  name: c-cpcnd
  resourceVersion: "11616341"
  selfLink: /apis/management.cattle.io/v3/clusters/c-cpcnd
  uid: e34c0df9-cefe-11e9-8978-067a6cd806b3
spec:
  description: ""
  desiredAgentImage: ""
  desiredAuthImage: ""
  displayName: pp-test-aks2
  dockerRootDir: /var/lib/docker
  enableClusterAlerting: false
  [...]
status:
  agentImage: rancher/rancher-agent:v2.2.8
  allocatable:
    cpu: 5793m
    memory: 14003528Ki
    pods: "330"
  apiEndpoint: https://xxxxxxxxxxxxxxxxxxxxxxxx:443
  appliedEnableNetworkPolicy: false
  appliedPodSecurityPolicyTemplateId: ""
  appliedSpec:
    description: ""
    desiredAgentImage: ""
    desiredAuthImage: ""
    displayName: pp-test-aks2
    dockerRootDir: /var/lib/docker
    enableClusterAlerting: false
    enableClusterMonitoring: false
    enableNetworkPolicy: false
    internal: false
    localClusterAuthEndpoint:
      enabled: false
  authImage: ""
  caCert: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  capabilities:
    loadBalancerCapabilities:
      enabled: true
      healthCheckSupported: true
      protocolsSupported:
      - TCP
      - UDP
      provider: Azure L4 LB
  capacity:
    cpu: "6"
    memory: 21339464Ki
    pods: "330"
  componentStatuses:
  - conditions:
    - message: 'Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: connect:
        connection refused'
      status: "False"
      type: Healthy
    name: controller-manager
  - conditions:
    - message: '{"health": "true"}'
      status: "True"
      type: Healthy
    name: etcd-0
  - conditions:
    - message: 'Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: connect:
        connection refused'
      status: "False"
      type: Healthy
    name: scheduler
  conditions:
  - status: "True"
    type: Pending
  - lastUpdateTime: "2019-09-04T10:50:03Z"
    status: "True"
    type: Provisioned
  - lastUpdateTime: "2019-09-04T10:50:19Z"
    status: "True"
    type: Waiting
  - lastUpdateTime: "2019-09-04T10:29:34Z"
    status: "True"
    type: BackingNamespaceCreated
  - lastUpdateTime: "2019-09-04T10:29:34Z"
    status: "True"
    type: DefaultProjectCreated
  - lastUpdateTime: "2019-09-04T10:29:34Z"
    status: "True"
    type: SystemProjectCreated
  - lastUpdateTime: "2019-09-04T10:29:34Z"
    status: "True"
    type: InitialRolesPopulated
  - lastUpdateTime: "2019-09-04T10:29:40Z"
    status: "True"
    type: CreatorMadeOwner
  - lastUpdateTime: "2019-09-04T10:31:35Z"
    status: "True"
    type: NoDiskPressure
  - lastUpdateTime: "2019-09-04T10:31:35Z"
    status: "True"
    type: NoMemoryPressure
  - lastUpdateTime: "2019-09-04T10:50:05Z"
    status: "True"
    type: GlobalAdminsSynced
  - lastUpdateTime: "2019-09-04T10:50:05Z"
    status: "False"
    type: AlertingEnabled
  - lastUpdateTime: "2019-09-04T10:50:19Z"
    status: "True"
    type: Ready
  - lastUpdateTime: "2019-09-04T10:50:25Z"
    status: "True"
    type: SystemAccountCreated
  - lastUpdateTime: "2019-09-04T10:50:26Z"
    status: "True"
    type: AgentDeployed
  - lastUpdateTime: "2019-09-04T10:50:26Z"
    status: "False"
    type: PrometheusOperatorDeployed
  - lastUpdateTime: "2019-09-04T10:50:49Z"
    status: "True"
    type: Updated
  - lastUpdateTime: "2019-09-04T10:50:43Z"
    status: "True"
    type: ServiceAccountMigrated
  driver: aks2
  limits:
    cpu: 700m
    memory: 3140Mi
    pods: "0"
  requested:
    cpu: 905m
    memory: 1114Mi
    pods: "17"
  serviceAccountToken: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  version:
    buildDate: "2019-08-19T11:05:16Z"
    compiler: gc
    gitCommit: 96fac5cd13a5dc064f7d9f4f23030a6aeface6cc
    gitTreeState: clean
    gitVersion: v1.14.6
    goVersion: go1.12.9
    major: "1"
    minor: "14"
    platform: linux/amd64

For example, Rancher needs to retrieve data like status.serviceAccountToken, status.caCert, or status.apiEndpoint.

@skriss skriss added the Enhancement/User End-User Enhancement to Velero label Sep 5, 2019
@prydonius
Contributor

Somewhat related to this, there are discussions just starting in SIG Apps to standardise the .status field which probably involves assumptions about not storing state in .status.

@skriss skriss added the Needs Product Blocked needing input or feedback from Product label Sep 13, 2019
@ChrisHaPunkt

ChrisHaPunkt commented Oct 22, 2019

Same here. In our migration scenario, Rancher recreates the System and Default projects because it is not aware that these were already created and restored from the backup.
So we end up with two System and two Default projects.
[screenshot: duplicated System and Default projects in the Rancher UI]

@dsu-igeek dsu-igeek removed the Needs info Waiting for information label Oct 22, 2020
@carlisia carlisia added the Kubernetes Resources Pertains to backup/restoration of Kubernetes resources label Oct 22, 2020
@carlisia
Contributor

Related: #1272

@nrb nrb changed the title [QUESTION] Let the possibility to keep and restore "status" metadata Allow restoration of the status metadata field Oct 22, 2020
@wcochran53

I have a similar issue: when the Custom Resource is restored, its status fields aren't populated, and thus the operator for this CR isn't able to reconcile. (I understand that the operator should not be using the status fields for this purpose, but I am in a situation where I have to live with what I have.) So I would like the status fields not to be dropped.

I understand this is possible if the Custom Resource Definition does not use the /status subresource, in which case a simple RestoreAction plugin can ensure that the status fields are restored. However, because in my case the Custom Resource Definition does use the /status subresource, I believe we can only update the Custom Resource's status through the /status endpoint. That becomes impossible with a RestoreAction plugin, because the resource is only created during a Velero restore after the RestoreAction plugin has run, and updating the status through the /status endpoint can only happen after resource creation.

I had wondered whether I could use two RestoreAction plugins: the first to recreate the Custom Resource, and the second to update the status via the /status endpoint. However, I noticed that the creation of the normal restored resource occurs after the RestoreAction plugins have been called, and this will fail if a resource of the same name already exists, resulting in the metadata being reset and the status cleared.

So I would still like some mechanism for maintaining the status of a Custom Resource whose CRD uses a status subresource, and if this could be done with a RestoreAction plugin, that would be great.

@gowrisankar22

@carlisia @nrb any update on this issue? Is the only option for restoring /status to not enable it as a subresource?

@carlisia
Contributor

carlisia commented May 4, 2021

We intend to make the experience better by at least documenting this issue. See: #3654 (comment).

Maybe @dsu-igeek has some further insights.

@eleanor-millman eleanor-millman added Reviewed Q2 2021 Icebox We see the value, but it is not slated for the next couple releases. labels May 4, 2021
@nrb
Contributor

nrb commented May 4, 2021

I think that @wcochran53's proposed solution with a RestoreItemAction plugin would be how I'd approach this. If someone feels strongly about implementing one that would inspect CRs for a status subresource, then we could definitely review the proposal.

@stale

stale bot commented Jul 8, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@RafaeLeal
Contributor

Hey, I'm suffering from the same issue.
We have a CI/CD system built with CRDs, and some of those reference external resources (things like GitHub webhooks, deploy keys, and so on), so we keep these external resources' IDs in the status field. Although we could make the controller search GitHub before creating a new webhook or deploy key, that would mean that restoring a backup generates a lot of requests to GitHub, which will probably run into rate-limit issues.
So we would prefer to restore the status field and let the controller do that search only for new resources.

I'd love to help with some initiative to support this

@RafaeLeal
Contributor

Well, I was taking a look at the code and reading the previous answers, and it seems that RestoreItemAction does not fit our needs because it runs prior to resource creation. We could add something like a PostRestoreItemAction, but to me that's too generic for a problem that seems to be inside Velero's domain (correct me if I'm mistaken).
I'd add a spec.restoreStatus field that would work as a resource filter, like this:

kind: Restore
metadata: 
  ...
spec:
  restoreStatus:
    includedResources:
      - webhooks
      - someothercrds
      - deployments # no reason to restrict this to CRDs I guess
    excludedResources: []

The default (restoreStatus being nil) would exclude everything, keeping the existing behavior. WDYT?
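
For illustration, here is a hypothetical sketch of how such a filter might be evaluated, mirroring the includedResources/excludedResources semantics Velero already uses for other filters; the type and field names are illustrative, not Velero's actual API.

package restoreutil

// RestoreStatusSpec mirrors the proposed spec.restoreStatus block.
type RestoreStatusSpec struct {
    IncludedResources []string
    ExcludedResources []string
}

// shouldRestoreStatus reports whether status should be restored for a resource.
// A nil spec keeps the existing behavior: status is always dropped.
func shouldRestoreStatus(rs *RestoreStatusSpec, resource string) bool {
    if rs == nil {
        return false
    }
    for _, r := range rs.ExcludedResources {
        if r == resource {
            return false
        }
    }
    // An empty include list means "include everything not excluded".
    if len(rs.IncludedResources) == 0 {
        return true
    }
    for _, r := range rs.IncludedResources {
        if r == resource || r == "*" {
            return true
        }
    }
    return false
}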

@reasonerjt
Contributor

This seems like a valid assumption to me:
#1687 (comment)
I personally want to refrain from adding so much flexibility that users may shoot themselves in the foot in the future.
Could you give us a concrete example of what information you put into the status field of your CR?

@RafaeLeal
Contributor


I think the simplest example I can give is a CRD we have called Webhook, which represents a GitHub webhook. In status, we keep the GitHub webhook ID so we can check whether it's configured correctly (with the correct events, secret, and URL).
So we have something like this:

apiVersion: workflows.dev/v1alpha1
kind: Webhook
metadata:
  name: example-webhook
spec:
  baseURL: https://listener.example.url/hook
  events:
  - push
  - pull_request
  repository:
    name: velero
    owner: vmware-tanzu
status:
  externalID: 123456789

If we don't apply status.externalID, we could end up duplicating the webhook on GitHub.
To prevent that, the controller could list the GitHub webhooks and pick the one with the same baseURL. That would prevent us from having two webhooks for the same baseURL, but that's not even the biggest issue in this scenario (though it could be in others). The problem is that the controller would make a GitHub request (listing webhooks) for every instance of this CRD when I restore, and that will hit GitHub's rate limits.

@reasonerjt
Contributor

@RafaeLeal
Thanks for the reply.
We may find a way to retain the externalID across backup and restore without changing the workflow in the restore controller.
Could you write a BackupItemAction plugin to move the data in status to an annotation on the CR, so the information is not lost after it's restored?
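
As a rough sketch of what such a plugin might look like, modeled on Velero's v1 BackupItemAction interface (the resource name and annotation key below are made up for illustration, and the exact plugin API may differ between Velero versions):

package statusbackup

import (
    "encoding/json"

    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "k8s.io/apimachinery/pkg/runtime"

    velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
    "github.com/vmware-tanzu/velero/pkg/plugin/velero"
)

// StatusToAnnotationAction copies an item's status into an annotation at
// backup time so a controller can read it back after restore.
type StatusToAnnotationAction struct{}

// AppliesTo limits the action to the CRD in question (example resource name).
func (a *StatusToAnnotationAction) AppliesTo() (velero.ResourceSelector, error) {
    return velero.ResourceSelector{IncludedResources: []string{"webhooks.workflows.dev"}}, nil
}

// Execute stashes the serialized status in an annotation on the backed-up item.
func (a *StatusToAnnotationAction) Execute(item runtime.Unstructured, backup *velerov1.Backup) (runtime.Unstructured, []velero.ResourceIdentifier, error) {
    obj := item.UnstructuredContent()
    status, found, err := unstructured.NestedFieldCopy(obj, "status")
    if err != nil || !found {
        return item, nil, err
    }

    raw, err := json.Marshal(status)
    if err != nil {
        return item, nil, err
    }

    u := &unstructured.Unstructured{Object: obj}
    annotations := u.GetAnnotations()
    if annotations == nil {
        annotations = map[string]string{}
    }
    annotations["example.io/saved-status"] = string(raw) // hypothetical annotation key
    u.SetAnnotations(annotations)

    return u, nil, nil
}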

@RafaeLeal
Contributor


I think that could solve my issue, but it's a hack rather than a general solution. It only works for me because I'm also the developer of the controller, and that's not the case for a lot of other CRDs. I think it makes sense to develop a more general solution, don't you think?
I also think it's a bit weird to have the controller look at metadata instead of spec.
I don't think it would need that many changes in the restore controller.

@hamza3202

hamza3202 commented Mar 17, 2022

I am facing the same issue with my CI/CD system.

We are using Argo Workflows, which stores information in status about when the workflow was run, where the logs are, and so on. After a restore, that information is missing.

However, if I manually apply the Kubernetes manifests (that Velero had backed up), the status field is restored, presumably because that CRD does not enable the /status subresource, in which case the API server accepts status on create.

@RafaeLeal
Contributor

Hey, I just proposed a solution in this PR:
#4785

I built, deployed, and tested it on a cluster: it works. We might need to tweak some tests, though.

Let me know what you think, and whether there's anything we should discuss in more depth.
I'm interested to know whether it would work well for you as well, @hamza3202

@reasonerjt reasonerjt added kind/requirement Icebox We see the value, but it is not slated for the next couple releases. Reviewed Q2 2021 Kubernetes Resources Pertains to backup/restoration of Kubernetes resources Needs Product Blocked needing input or feedback from Product Enhancement/User End-User Enhancement to Velero and removed Enhancement/User End-User Enhancement to Velero Needs Product Blocked needing input or feedback from Product Kubernetes Resources Pertains to backup/restoration of Kubernetes resources Reviewed Q2 2021 Icebox We see the value, but it is not slated for the next couple releases. labels May 20, 2022
@sunglim

sunglim commented Jun 17, 2024

Hi
May I ask a question about the restore process?
Restoring the CR without status [1] and updating the status [2] are not a single operation.
That means that if there is an operator observing the CR, then when Velero restores the CR without status [1], a reconciliation can start in the operator before Velero updates the status [2]. Is this a scenario that can happen?

@blackpiglet
Contributor

It's possible, but if my memory serves me, the client-go create method doesn't allow creating a k8s resource with status.

@kaovilai
Member

@blackpiglet I believe one could follow up with a patch call to add status.

@sseago
Collaborator

sseago commented Jun 24, 2024

@kaovilai that's exactly how we do it (see #4785).

The question was asked above about the possibility of some other controller processing the restored item between the create and patch calls. And yes, this is a possibility, but so far we haven't been able to come up with a way to avoid it, since you can't create the resource with status included.
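
For completeness, here is a minimal sketch of that create-then-patch sequence using the dynamic client; the helper is hypothetical, and the window between the two calls is exactly where another controller could reconcile the status-less object.

package restoreutil

import (
    "context"
    "encoding/json"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/dynamic"
)

// createThenPatchStatus creates the object, then merge-patches its backed-up
// status through the /status subresource. A controller watching the resource
// may reconcile it in the window between the two calls.
func createThenPatchStatus(ctx context.Context, client dynamic.Interface, gvr schema.GroupVersionResource, ns string, obj *unstructured.Unstructured) error {
    status, found, err := unstructured.NestedFieldCopy(obj.Object, "status")
    if err != nil {
        return err
    }

    created, err := client.Resource(gvr).Namespace(ns).Create(ctx, obj, metav1.CreateOptions{})
    if err != nil || !found {
        return err
    }

    patch, err := json.Marshal(map[string]interface{}{"status": status})
    if err != nil {
        return err
    }
    _, err = client.Resource(gvr).Namespace(ns).Patch(ctx, created.GetName(),
        types.MergePatchType, patch, metav1.PatchOptions{}, "status")
    return err
}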
