
Allow restoration of the status metadata field #1687

Closed
rayanebel opened this issue Jul 23, 2019 · 30 comments · Fixed by #4785
Labels
  • Enhancement/User (End-User Enhancement to Velero)
  • Icebox (We see the value, but it is not slated for the next couple releases)
  • kind/requirement
  • Kubernetes Resources (Pertains to backup/restoration of Kubernetes resources)
  • Needs Product (Blocked needing input or feedback from Product)
  • Reviewed Q2 2021

Comments

@rayanebel

rayanebel commented Jul 23, 2019

Describe the problem/challenge you have
To manage our clusters, we use Rancher. Rancher runs on top of Kubernetes and saves all of its data in the same etcd server (it's like a big operator). We tried to use Velero to be able to restore our Rancher server in case of disaster recovery.

Backups run perfectly, but we are facing an issue during the restore phase because Rancher saves some state information in the status section, and Velero cleans all status and metadata before restoring the objects. So currently we use Velero for the backup and then restore the files with multiple kubectl command lines.

Describe the solution you'd like
I don't have the perfect solution, but I saw the resetMetadataAndStatus function in the Velero code:

https://github.com/heptio/velero/blob/e371ba78b0844b28359f902076d470ae5a5de0b9/pkg/restore/restore.go#L1117

Maybe as a first step we could add a simple flag like --include-status to give the user the choice of whether to run this function.

Maybe you have another solution or another idea?

Anything else you would like to add:

We also opened a ticket on the Rancher side: rancher/rancher#21647

Environment:

  • Velero version
    • Client: v1.0.0
    • Server: v0.11.0
  • Kubernetes:
    • Client: v1.14
    • Server: v1.12.8
  • Kubernetes installer & version: AKS
  • Cloud provider or hardware configuration: Azure AKS/Rancher (2.2.5)
@prydonius
Contributor

Thanks for filing the issue upstream with Rancher, I do think that it's odd for Rancher to store state in status. That said, I don't think it would hurt to add an annotation or flag (my preference would be to annotate items with something like velero.io/restore-status: true). It's possible that other Operators might use status in some meaningful way, so this could be generally useful.

@prydonius
Contributor

wdyt @skriss @carlisia @nrb?

@skriss
Contributor

skriss commented Aug 12, 2019

I believe there's another issue here, which is that status is actually ignored by the API server when a resource is created or updated, since it's not intended to be specified by the user. To restore status, I believe we'd have to explicitly capture it and write it to the /status subresource for whatever resource we are operating on. CRDs haven't always supported the /status subresource, so that may not be true for some older clusters/CRDs, but going forward it should be. I'm not sure off the top of my head exactly how we would implement this - there are likely complexities around not being able to create the spec and status atomically, and possibly other issues.
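
For illustration, here is a minimal sketch of the two-step write described above, using client-go's dynamic client. This is not Velero's actual code; the package, function name, and GVR handling are hypothetical. Note that the two API calls are not atomic, which is exactly the complexity mentioned.

package restoreutil

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/dynamic"
)

// restoreWithStatus creates a resource and then writes its backed-up status
// through the /status subresource, since the API server ignores status on Create.
func restoreWithStatus(ctx context.Context, client dynamic.Interface, gvr schema.GroupVersionResource, ns string, obj *unstructured.Unstructured) error {
    // Save the status from the backup before creating; Create will drop it.
    savedStatus, hasStatus, err := unstructured.NestedFieldCopy(obj.Object, "status")
    if err != nil {
        return err
    }

    created, err := client.Resource(gvr).Namespace(ns).Create(ctx, obj, metav1.CreateOptions{})
    if err != nil {
        return err
    }
    if !hasStatus {
        return nil
    }

    // Copy the saved status onto the created object (which now carries the
    // server-assigned resourceVersion) and write it via the /status subresource.
    if err := unstructured.SetNestedField(created.Object, savedStatus, "status"); err != nil {
        return err
    }
    _, err = client.Resource(gvr).Namespace(ns).UpdateStatus(ctx, created, metav1.UpdateOptions{})
    return err
}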

@skriss
Contributor

skriss commented Aug 12, 2019

We probably need to discuss as a team and see if this is something we want to invest further time in to investigate/prototype.

@i300543

i300543 commented Aug 25, 2019

Hello,
we are also trying to use Velero for our cluster backups, and we are facing a similar issue.
We would prefer to keep the state in the status field when it is not defined as a subresource:

  1. We would like to keep some state in the object that is visible to the user as status, but not modifiable by the user.
  2. If the CRD object doesn't explicitly define /status as a subresource, then a backup/restore operation should not remove the status. Removing it means that objects that rely on it will fail, while keeping it means that objects that don't rely on it still work correctly.
     Nothing is required of the Velero code other than "don't modify the backup as it was made."

Please let us know if and when you reach a decision.
Thanks!

SAP Webide dev team

@nrb
Contributor

nrb commented Aug 26, 2019

Thanks for adding your use case here @i300543.

Originally this code was designed under the assumption that status would be recoverable based on the spec. Can you help me understand why that's not the case with your CRDs?

@i300543

i300543 commented Aug 27, 2019

For various reasons, our application was developed this way, and changing it is not easy at this point. Why does Velero add a restriction to Kubernetes objects that is explicitly not enforced by Kubernetes itself? In Kubernetes, a CRD that doesn't explicitly define a /status subresource can use the status section the way our application does. Velero's assumption breaks our flow, while removing the assumption would not hurt any use case as far as we can see, no?

@skriss
Contributor

skriss commented Sep 3, 2019

@rayanebel @i300543 could you provide a couple specific examples of the types of information that are stored in status that need to be restored?

Also, in terms of UX - would you want to enable/disable restoring status on a per-CRD level (i.e. for all items of a given CRD type), or at some other level of granularity?

@skriss skriss added the Needs info Waiting for information label Sep 3, 2019
@rayanebel
Author

rayanebel commented Sep 4, 2019

hello @skriss

As Rancher works with custom resource definitions (xxxxxx.cattle.io) for all its objects, I think it would be great to restore status at the CRD level (for all items of a CRD, with a wildcard for example).

With Rancher, for example, when we create a cluster we get an object of kind clusters.management.cattle.io, which is added to Kubernetes with the following content:

apiVersion: management.cattle.io/v3
kind: Cluster
metadata:
  labels:
    cattle.io/creator: norman
  name: c-cpcnd
  resourceVersion: "11616341"
  selfLink: /apis/management.cattle.io/v3/clusters/c-cpcnd
  uid: e34c0df9-cefe-11e9-8978-067a6cd806b3
spec:
  description: ""
  desiredAgentImage: ""
  desiredAuthImage: ""
  displayName: pp-test-aks2
  dockerRootDir: /var/lib/docker
  enableClusterAlerting: false
  [...]
status:
  agentImage: rancher/rancher-agent:v2.2.8
  allocatable:
    cpu: 5793m
    memory: 14003528Ki
    pods: "330"
  apiEndpoint: https://xxxxxxxxxxxxxxxxxxxxxxxx:443
  appliedEnableNetworkPolicy: false
  appliedPodSecurityPolicyTemplateId: ""
  appliedSpec:
    description: ""
    desiredAgentImage: ""
    desiredAuthImage: ""
    displayName: pp-test-aks2
    dockerRootDir: /var/lib/docker
    enableClusterAlerting: false
    enableClusterMonitoring: false
    enableNetworkPolicy: false
    internal: false
    localClusterAuthEndpoint:
      enabled: false
  authImage: ""
  caCert: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  capabilities:
    loadBalancerCapabilities:
      enabled: true
      healthCheckSupported: true
      protocolsSupported:
      - TCP
      - UDP
      provider: Azure L4 LB
  capacity:
    cpu: "6"
    memory: 21339464Ki
    pods: "330"
  componentStatuses:
  - conditions:
    - message: 'Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: connect:
        connection refused'
      status: "False"
      type: Healthy
    name: controller-manager
  - conditions:
    - message: '{"health": "true"}'
      status: "True"
      type: Healthy
    name: etcd-0
  - conditions:
    - message: 'Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: connect:
        connection refused'
      status: "False"
      type: Healthy
    name: scheduler
  conditions:
  - status: "True"
    type: Pending
  - lastUpdateTime: "2019-09-04T10:50:03Z"
    status: "True"
    type: Provisioned
  - lastUpdateTime: "2019-09-04T10:50:19Z"
    status: "True"
    type: Waiting
  - lastUpdateTime: "2019-09-04T10:29:34Z"
    status: "True"
    type: BackingNamespaceCreated
  - lastUpdateTime: "2019-09-04T10:29:34Z"
    status: "True"
    type: DefaultProjectCreated
  - lastUpdateTime: "2019-09-04T10:29:34Z"
    status: "True"
    type: SystemProjectCreated
  - lastUpdateTime: "2019-09-04T10:29:34Z"
    status: "True"
    type: InitialRolesPopulated
  - lastUpdateTime: "2019-09-04T10:29:40Z"
    status: "True"
    type: CreatorMadeOwner
  - lastUpdateTime: "2019-09-04T10:31:35Z"
    status: "True"
    type: NoDiskPressure
  - lastUpdateTime: "2019-09-04T10:31:35Z"
    status: "True"
    type: NoMemoryPressure
  - lastUpdateTime: "2019-09-04T10:50:05Z"
    status: "True"
    type: GlobalAdminsSynced
  - lastUpdateTime: "2019-09-04T10:50:05Z"
    status: "False"
    type: AlertingEnabled
  - lastUpdateTime: "2019-09-04T10:50:19Z"
    status: "True"
    type: Ready
  - lastUpdateTime: "2019-09-04T10:50:25Z"
    status: "True"
    type: SystemAccountCreated
  - lastUpdateTime: "2019-09-04T10:50:26Z"
    status: "True"
    type: AgentDeployed
  - lastUpdateTime: "2019-09-04T10:50:26Z"
    status: "False"
    type: PrometheusOperatorDeployed
  - lastUpdateTime: "2019-09-04T10:50:49Z"
    status: "True"
    type: Updated
  - lastUpdateTime: "2019-09-04T10:50:43Z"
    status: "True"
    type: ServiceAccountMigrated
  driver: aks2
  limits:
    cpu: 700m
    memory: 3140Mi
    pods: "0"
  requested:
    cpu: 905m
    memory: 1114Mi
    pods: "17"
  serviceAccountToken: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  version:
    buildDate: "2019-08-19T11:05:16Z"
    compiler: gc
    gitCommit: 96fac5cd13a5dc064f7d9f4f23030a6aeface6cc
    gitTreeState: clean
    gitVersion: v1.14.6
    goVersion: go1.12.9
    major: "1"
    minor: "14"
    platform: linux/amd64

For example, Rancher needs to retrieve data like status.serviceAccountToken, status.caCert, or status.apiEndpoint.

@skriss skriss added the Enhancement/User End-User Enhancement to Velero label Sep 5, 2019
@prydonius
Contributor

Somewhat related to this, there are discussions just starting in SIG Apps to standardise the .status field which probably involves assumptions about not storing state in .status.

@skriss skriss added the Needs Product Blocked needing input or feedback from Product label Sep 13, 2019
@ChrisHaPunkt

ChrisHaPunkt commented Oct 22, 2019

Same here. In our migration scenario, Rancher recreates the System and Default projects because it is not aware that these were already created and restored from the backup.
So we end up with two System and two Default projects.
[screenshot: duplicated System and Default projects in the Rancher UI]

@dsu-igeek dsu-igeek removed the Needs info Waiting for information label Oct 22, 2020
@carlisia carlisia added the Kubernetes Resources Pertains to backup/restoration of Kubernetes resources label Oct 22, 2020
@carlisia
Contributor

Related: #1272

@nrb nrb changed the title [QUESTION] Let the possibility to keep and restore "status" metadata Allow restoration of the status metadata field Oct 22, 2020
@wcochran53

I have a similar issue: when the Custom Resource is restored, its status fields aren't populated, and thus the operator for this CR isn't able to reconcile. (I understand that the operator should not be using the status fields for this purpose, but I am in a situation where I have to live with what I have.) So I would like the status fields not to be dropped.

I understand this is possible if the Custom Resource Definition does not use the /status subresource, in which case a simple RestoreAction plugin can ensure that the status fields are restored. However, because in my case the Custom Resource Definition does use the /status subresource, I believe we can only update the Custom Resource's status through the /status endpoint. That becomes impossible with a RestoreAction plugin, because the resource is only created during a Velero restore after the RestoreAction plugin has run, and updating the status through the /status endpoint can only happen after resource creation.

I had wondered whether I could use two RestoreAction plugins: the first to recreate the Custom Resource, and the second to update the status via the /status endpoint. However, I noticed that the creation of the normal restored resource occurs after the RestoreAction plugins have been called, and this will fail if a resource of the same name already exists, resulting in the metadata being reset and the status cleared.

So I would still like some mechanism for maintaining the status of a Custom Resource whose CRD uses a status subresource, and if this could be done with a RestoreAction plugin, that would be great.

@gowrisankar22

@carlisia @nrb any update on this issue? Is the only option for restoring /status to not enable it as a subresource?

@carlisia
Contributor

carlisia commented May 4, 2021

We intend to make the experience better by at least documenting this issue. See: #3654 (comment).

Maybe @dsu-igeek has some further insights.

@eleanor-millman eleanor-millman added Reviewed Q2 2021 Icebox We see the value, but it is not slated for the next couple releases. labels May 4, 2021
@nrb
Contributor

nrb commented May 4, 2021

I think that @wcochran53's proposed solution with a RestoreItemAction plugin would be how I'd approach this. If someone feels strongly about implementing one that would inspect CRs for a status subresource, then we could definitely review the proposal.

@stale

stale bot commented Jul 8, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@RafaeLeal
Contributor

Hey, I'm suffering from the same issue.
We have a CI/CD system built with CRDs, and some of those reference external resources (things like GitHub webhooks, deploy keys, and so on), so we keep these external resources' IDs in the status field. Although we could make the controller search GitHub before creating a new webhook or deploy key, that would mean that restoring a backup generates a lot of requests to GitHub, which will probably run into rate-limit issues.
So we would prefer to restore the status field and let the controller do that search only for new resources.

I'd love to help with some initiative to support this

@RafaeLeal
Contributor

Well, I was taking a look at the code and reading the previous answers, and it seems that RestoreItemAction does not fit our needs because it runs prior to resource creation. We could add something like a PostRestoreItemAction, but to me that's too generic for a problem that seems to be inside Velero's domain (correct me if I'm mistaken).
I'd add a spec.restoreStatus field that would work as a resource filter, like this:

kind: Restore
metadata: 
  ...
spec:
  restoreStatus:
    includedResources:
      - webhooks
      - someothercrds
      - deployments # no reason to restrict this to CRDs I guess
    excludedResources: []

The default (restoreStatus being nil) would exclude everything, keeping the existing behavior. WDYT?
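
For illustration, here is a hypothetical sketch of how such a filter might be evaluated, mirroring the includedResources/excludedResources semantics Velero already uses for other filters; the type and field names are illustrative, not Velero's actual API.

package restoreutil

// RestoreStatusSpec mirrors the proposed spec.restoreStatus block.
type RestoreStatusSpec struct {
    IncludedResources []string
    ExcludedResources []string
}

// shouldRestoreStatus reports whether status should be restored for a resource.
// A nil spec keeps the existing behavior: status is always dropped.
func shouldRestoreStatus(rs *RestoreStatusSpec, resource string) bool {
    if rs == nil {
        return false
    }
    for _, r := range rs.ExcludedResources {
        if r == resource {
            return false
        }
    }
    // An empty include list means "include everything not excluded".
    if len(rs.IncludedResources) == 0 {
        return true
    }
    for _, r := range rs.IncludedResources {
        if r == resource || r == "*" {
            return true
        }
    }
    return false
}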

@reasonerjt
Contributor

This seems like a valid assumption to me:
#1687 (comment)
I personally want to refrain from adding so much flexibility that users may shoot themselves in the foot in the future.
Could you give us a concrete example of what information you put into the status field of your CR?

@RafaeLeal
Contributor


I think the simplest example I can give is a CRD we have called Webhook, which represents a GitHub webhook. In status, we keep the GitHub webhook ID so we can check whether it's configured correctly (with the correct events, secret, and URL).
So we have something like this:

apiVersion: workflows.dev/v1alpha1
kind: Webhook
metadata:
  name: example-webhook
spec:
  baseURL: https://listener.example.url/hook
  events:
  - push
  - pull_request
  repository:
    name: velero
    owner: vmware-tanzu
status:
  externalID: 123456789

If we don't apply status.externalID, we could end up duplicating the webhook on GitHub.
To prevent that, the controller could list the GitHub webhooks and pick the one with the same baseURL. That would prevent us from having two webhooks for the same baseURL, but that's not even the biggest issue in this scenario (though it could be in others). The problem is that the controller would make a GitHub request (listing webhooks) for every instance of this CRD when I restore, and that will hit GitHub's rate limits.

@reasonerjt
Contributor

@RafaeLeal
Thanks for the reply.
We may find a way to retain the externalID across backup and restore without changing the workflow in the restore controller.
Could you write a BackupItemAction plugin to move the data in status to an annotation on the CR, so the information is not lost after it's restored?
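
As a rough sketch of what such a plugin might look like, modeled on Velero's v1 BackupItemAction interface (the resource name and annotation key below are made up for illustration, and the exact plugin API may differ between Velero versions):

package statusbackup

import (
    "encoding/json"

    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "k8s.io/apimachinery/pkg/runtime"

    velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
    "github.com/vmware-tanzu/velero/pkg/plugin/velero"
)

// StatusToAnnotationAction copies an item's status into an annotation at
// backup time so a controller can read it back after restore.
type StatusToAnnotationAction struct{}

// AppliesTo limits the action to the CRD in question (example resource name).
func (a *StatusToAnnotationAction) AppliesTo() (velero.ResourceSelector, error) {
    return velero.ResourceSelector{IncludedResources: []string{"webhooks.workflows.dev"}}, nil
}

// Execute stashes the serialized status in an annotation on the backed-up item.
func (a *StatusToAnnotationAction) Execute(item runtime.Unstructured, backup *velerov1.Backup) (runtime.Unstructured, []velero.ResourceIdentifier, error) {
    obj := item.UnstructuredContent()
    status, found, err := unstructured.NestedFieldCopy(obj, "status")
    if err != nil || !found {
        return item, nil, err
    }

    raw, err := json.Marshal(status)
    if err != nil {
        return item, nil, err
    }

    u := &unstructured.Unstructured{Object: obj}
    annotations := u.GetAnnotations()
    if annotations == nil {
        annotations = map[string]string{}
    }
    annotations["example.io/saved-status"] = string(raw) // hypothetical annotation key
    u.SetAnnotations(annotations)

    return u, nil, nil
}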

@RafaeLeal
Contributor


I think that could solve my issue, but it's a hack rather than a general solution. It only works for me because I'm also the developer of the controller, and that's not the case for a lot of other CRDs. I think it makes sense to develop a more general solution, don't you think?
I also think it's a bit weird to have the controller look at metadata instead of spec.
I don't think it would need that many changes in the restore controller.

@hamza3202

hamza3202 commented Mar 17, 2022

I am facing the same issue with my CI/CD system.

We are using Argo Workflows, which stores information in status about when the workflow was run, where the logs are, and so on. After a restore, that information is missing.

However, if I manually apply the Kubernetes manifests (that Velero had backed up), the status field is restored, presumably because that CRD does not enable the /status subresource, in which case the API server accepts status on create.

@RafaeLeal
Contributor

Hey, I just proposed a solution in this PR:
#4785

I built, deployed, and tested it on a cluster: it works. We might need to tweak some tests, though.

Let me know what you think, and whether there's anything we should discuss in more depth.
I'm interested to know whether it would work well for you as well, @hamza3202

@reasonerjt reasonerjt added kind/requirement Icebox We see the value, but it is not slated for the next couple releases. Reviewed Q2 2021 Kubernetes Resources Pertains to backup/restoration of Kubernetes resources Needs Product Blocked needing input or feedback from Product Enhancement/User End-User Enhancement to Velero and removed Enhancement/User End-User Enhancement to Velero Needs Product Blocked needing input or feedback from Product Kubernetes Resources Pertains to backup/restoration of Kubernetes resources Reviewed Q2 2021 Icebox We see the value, but it is not slated for the next couple releases. labels May 20, 2022
@sunglim

sunglim commented Jun 17, 2024

Hi
May I ask a question about the restore process?
Restoring the CR without status [1] and updating the status [2] are not a single operation.
That means that if there is an operator observing the CR, then when Velero restores the CR without status [1], a reconciliation can start in the operator before Velero updates the status [2]. Is this a scenario that can happen?

@blackpiglet
Contributor

It's possible, but if my memory serves me, the client-go create method doesn't allow creating a k8s resource with status.

@kaovilai
Member

@blackpiglet I believe one could follow up with a patch call to add status.

@sseago
Collaborator

sseago commented Jun 24, 2024

@kaovilai that's exactly how we do it (see #4785).

The question was asked above about the possibility of some other controller processing the restored item between the create and patch calls. And yes, this is a possibility, but so far we haven't been able to come up with a way to avoid it, since you can't create the resource with status included.
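
For completeness, here is a minimal sketch of that create-then-patch sequence using the dynamic client; the helper is hypothetical, and the window between the two calls is exactly where another controller could reconcile the status-less object.

package restoreutil

import (
    "context"
    "encoding/json"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/dynamic"
)

// createThenPatchStatus creates the object, then merge-patches its backed-up
// status through the /status subresource. A controller watching the resource
// may reconcile it in the window between the two calls.
func createThenPatchStatus(ctx context.Context, client dynamic.Interface, gvr schema.GroupVersionResource, ns string, obj *unstructured.Unstructured) error {
    status, found, err := unstructured.NestedFieldCopy(obj.Object, "status")
    if err != nil {
        return err
    }

    created, err := client.Resource(gvr).Namespace(ns).Create(ctx, obj, metav1.CreateOptions{})
    if err != nil || !found {
        return err
    }

    patch, err := json.Marshal(map[string]interface{}{"status": status})
    if err != nil {
        return err
    }
    _, err = client.Resource(gvr).Namespace(ns).Patch(ctx, created.GetName(),
        types.MergePatchType, patch, metav1.PatchOptions{}, "status")
    return err
}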
