restore: restore crd after accidental deletion #170
Conversation
Force-pushed from df48cdf to 386f6da
cmd/commands/restore.go (Outdated)

```go
// RestoreCmd represents the restore commands
var RestoreCmd = &cobra.Command{
	Use:   "restore",
	Short: "Reads the crd and crd name to restore. Ex: restore cephcluster my-cluster",
```
How about if the CR name is optional? If they only provide the CRD type name, then we can automatically restore all instances that have been marked for deletion.
Before making changes, I'm wondering whether this is a valid condition to support, since for deletion of a resource we need to pass the name of the resource.
Thoughts?
To clarify my suggestion, there are two cases:
- If they pass the (optional) resource name, then check if that resource is marked for deletion, and then restore it.
- If they don't pass the resource name, query for all resources of that type (the name would also be queried in this step), then restore the instances that are marked for deletion.
For CephCluster, there is most commonly only a single resource anyway, but for pools or other resource types perhaps there could be multiple accidentally marked for deletion.
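For what it's worth, a minimal sketch of what option 2 could look like with a generic dynamic client (assuming client-go; the function name, GVR, and namespace plumbing are placeholders, not the plugin's actual code):

```go
package restore

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// deletedResourceNames lists all CRs of the given type and returns the names of
// the instances already marked for deletion (non-nil deletionTimestamp).
func deletedResourceNames(ctx context.Context, dyn dynamic.Interface, gvr schema.GroupVersionResource, namespace string) ([]string, error) {
	list, err := dyn.Resource(gvr).Namespace(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	var names []string
	for _, item := range list.Items {
		if item.GetDeletionTimestamp() != nil {
			names = append(names, item.GetName())
		}
	}
	return names, nil
}
```

With something like that, passing only the CRD type would restore every name returned, and passing a name would just check that single instance.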
> To clarify my suggestion, there are two cases:
> - If they pass the (optional) resource name, then check if that resource is marked for deletion, and then restore it.
> - If they don't pass the resource name, query for all resources of that type (the name would also be queried in this step), then restore the instances that are marked for deletion.
> For CephCluster, there is most commonly only a single resource anyway, but for pools or other resource types perhaps there could be multiple accidentally marked for deletion.

I'm thinking about option 2: how to get the specific Rook resource (cephcluster, cephfilesystem, ...) and use the Rook API to fetch them with the right call.
I'm thinking it may be better not to use the Rook API since then the CRD settings would be tied to the version of the API that this plugin is referencing. For this scenario, we need to make a copy of the CR, then delete it, then create it again. So we need to make sure all settings are preserved, independent of the Rook version. So if we can use more generic queries for the CRs it will be better.
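A rough sketch of that version-independent approach using unstructured objects (client-go dynamic client assumed; `recreateCR` is a hypothetical helper for illustration, not code from this PR):

```go
package restore

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// recreateCR snapshots the live CR generically, strips server-managed metadata,
// and creates a fresh copy, so no Rook-typed API (and therefore no Rook version
// coupling) is involved. The real flow also has to remove finalizers from the
// old object so its pending deletion can complete first.
func recreateCR(ctx context.Context, dyn dynamic.Interface, gvr schema.GroupVersionResource, namespace, name string) error {
	cr, err := dyn.Resource(gvr).Namespace(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}

	backup := cr.DeepCopy()
	// Drop the fields owned by the API server so the object can be created as new.
	unstructured.RemoveNestedField(backup.Object, "metadata", "uid")
	unstructured.RemoveNestedField(backup.Object, "metadata", "resourceVersion")
	unstructured.RemoveNestedField(backup.Object, "metadata", "creationTimestamp")
	unstructured.RemoveNestedField(backup.Object, "metadata", "deletionTimestamp")
	unstructured.RemoveNestedField(backup.Object, "metadata", "deletionGracePeriodSeconds")
	backup.SetFinalizers(nil)

	_, err = dyn.Resource(gvr).Namespace(namespace).Create(ctx, backup, metav1.CreateOptions{})
	return err
}
```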
Also, I need help getting the right error message rather than just the error code from the code below:

kubectl-rook-ceph/pkg/exec/bash.go, lines 25 to 38 in e9dffaf:

```go
func ExecuteBashCommand(command string) string {
	cmd := exec.Command("/bin/bash",
		"-x", // Print commands and their arguments as they are executed
		"-e", // Exit immediately if a command exits with a non-zero status.
		"-m", // Terminal job control, allows job to be terminated by SIGTERM
		"-c", // Command to run
		command,
	)

	stdout, err := cmd.Output()
	if err != nil {
		logging.Fatal(err)
	}

	return string(stdout)
}
```

For example, when running a kubectl command to create a resource that already exists, the above code throws `exit code 1` and then exits, but what I really want is to print the Kubernetes error that says "already exists ..." and then exit. I tried a few things but failed. Any suggestion? @travisn @BlaineEXE

Note, I can use `os.Stdout`, `os.Stderr`, but I want to return the output from the method and not just log it to the terminal.
Do you see the "Already exists" error code in the stderr? Do we just need to look for that string since no error code is returned?
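One detail that may help here: when `cmd.Output()` fails with a non-zero exit, the error is usually an `*exec.ExitError` whose `Stderr` field holds what the process wrote to stderr. A minimal sketch (not the plugin's actual ExecuteBashCommand, which logs fatally instead of returning an error):

```go
package exec

import (
	"errors"
	"fmt"
	osexec "os/exec"
	"strings"
)

// runBash runs a command and, on failure, surfaces the stderr text (e.g. the
// Kubernetes "already exists" message) instead of only "exit status 1".
func runBash(command string) (string, error) {
	cmd := osexec.Command("/bin/bash", "-e", "-c", command)
	stdout, err := cmd.Output()
	if err != nil {
		var exitErr *osexec.ExitError
		if errors.As(err, &exitErr) {
			// Output() fills ExitError.Stderr when cmd.Stderr was not redirected.
			return "", fmt.Errorf("%s: %s", err, strings.TrimSpace(string(exitErr.Stderr)))
		}
		return "", err
	}
	return string(stdout), nil
}
```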
Force-pushed from 386f6da to b2f94b5
Testing result

Test: passing arguments cr and crName

$ ./bin/kubectl-rook-ceph restore cephblockpool replicapool
Warning: rook version 'rook: v1.12.0-alpha.0.227.g3e248bf9a' is running a pre-release version of Rook.
Info: Restoring crd cephblockpool name replicapool
Info: Scaling down the operator to 0
Info: Backing up kubernetes and crd resources
Info: Backed up crd cephblockpool named replicapool in file cephblockpool.yaml
Info: Backed up secret in file secret.yaml
Info: Backed up configmap in file configmap.yaml
Info: deleting validating webhook rook-ceph-webhook if present
Info: Fetching cephblockpool replicapool uid
Info: Successfully fetched uid 0f8b59df-816d-4cc8-b888-ec128b14c308 from cephblockpool replicapool
Info: removing ownerreferences from resources with matching uid 0f8b59df-816d-4cc8-b888-ec128b14c308
Info: removing finalizers from cephblockpool replicapool
Info: cephblockpool.ceph.rook.io/replicapool patched
Info: re-creating the cr cephblockpool from file replicapool.yaml created above
Info: cephblockpool.ceph.rook.io/replicapool created
Info: Scaling up the operator to 1
Info: CR should be successfully restored. Please watch the operator logs and check the crd
~/go/src/github.com/kubectl-rook-ceph
~/go/src/github.com/kubectl-rook-ceph
srai@fedora ~ (restore-crd) $ kc get cephblockpool
NAME PHASE
replicapool Progressing
~/go/src/github.com/kubectl-rook-ceph
srai@fedora ~ (restore-crd) $ kc get cephblockpool
NAME PHASE
replicapool Ready

Test: passing only cephcluster

$ ./bin/kubectl-rook-ceph restore cephcluster
Warning: rook version 'rook: v1.12.0-alpha.0.227.g3e248bf9a' is running a pre-release version of Rook.
Info: Restoring crd cephcluster name my-cluster
Info: Scaling down the operator to 0
Info: Backing up kubernetes and crd resources
Info: Backed up crd cephcluster named my-cluster in file cephcluster.yaml
Info: Backed up secret in file secret.yaml
Info: Backed up configmap in file configmap.yaml
Info: deleting validating webhook rook-ceph-webhook if present
Info: Fetching cephcluster my-cluster uid
Info: Successfully fetched uid a20a0b18-52d5-4844-a542-782e2d224f98 from cephcluster my-cluster
Info: removing ownerreferences from resources with matching uid a20a0b18-52d5-4844-a542-782e2d224f98
Info: removing owner references for service rook-ceph-exporter
Info: Removed ownerReference for service: rook-ceph-exporter
Info: removing finalizers from cephcluster my-cluster
Info: cephcluster.ceph.rook.io/my-cluster patched
Info: re-creating the cr cephcluster from file my-cluster.yaml created above
Info: cephcluster.ceph.rook.io/my-cluster created
Info: Scaling up the operator to 1
Info: CR should be successfully restored. Please watch the operator logs and check the crd
~/go/src/github.com/kubectl-rook-ceph
srai@fedora ~ (restore-crd) $ kc get cephcluster
NAME DATADIRHOSTPATH MONCOUNT AGE PHASE MESSAGE HEALTH EXTERNAL FSID
my-cluster /var/lib/rook 1 7m44s Ready Cluster created successfully HEALTH_OK 5dc8aca4-cb12-47ee-a83e-a9f773b39a0a
Yes, in the stderr I saw the right error string, but when checking the error from cmd.Output() it only has the error code.
@travisn I was thinking of adding a CI test for this as another PR, or do you think I should include it in this PR only?
> @travisn I was thinking of adding a CI test for this as another PR, or do you think I should include it in this PR only?
Let's have the CI test with this PR. It's critical that we test this feature fully before anyone tries to use it.
@travisn @Madhu-1 Let's discuss the error handling in this thread. Currently, the flow is: if we get any error after scaling down the rook operator pod and before scaling it back up, I scale the operator back up first and then return the error. What I'm gathering from the comments now is:

Let me know your thoughts.
IMHO scaling the operator back up makes sense only if it was scaled down by the tool, not by the user; the user might have scaled it down for some other reason.
Yes, for this command the tool scales down the operator.
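A small sketch of that rule (only scale back up what the tool itself scaled down); `getReplicas` and `scaleOperator` are placeholders for the plugin's real helpers, not its actual API:

```go
package restore

import "fmt"

// restoreWithOperatorScaledDown scales the operator down for the duration of the
// restore and scales it back up afterwards, but only if this tool was the one
// that scaled it down; an operator the user already scaled down is left alone.
func restoreWithOperatorScaledDown(getReplicas func() (int, error), scaleOperator func(replicas int) error, restore func() error) error {
	current, err := getReplicas()
	if err != nil {
		return err
	}

	scaledDownByTool := false
	if current > 0 {
		if err := scaleOperator(0); err != nil {
			return fmt.Errorf("failed to scale down operator: %w", err)
		}
		scaledDownByTool = true
	}

	// Runs whether the restore succeeds or returns an error.
	defer func() {
		if scaledDownByTool {
			if err := scaleOperator(current); err != nil {
				fmt.Printf("warning: failed to scale the operator back up: %v\n", err)
			}
		}
	}()

	return restore()
}
```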
Force-pushed from 180be3d to 63b6e66
pkg/restore/crd.go (Outdated)

```go
var getCrName = `
kubectl -n %s get %s -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.deletionGracePeriodSeconds}{"\n"}{end}' | awk '$2=="0" {print $1}'
```
Currently, the tool doesn't have the capability of detecting whether it is running with kubectl or oc. But I think it would be better if, in the future, the tool were able to detect the difference.
If we can't autodetect it yet, we need a CLI arg that lets them override it to `oc`.
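A sketch of what that override could look like as a cobra flag plus a PATH-based fallback; the flag name `--kube-cli` and the helper names are hypothetical, not what the plugin ships:

```go
package commands

import (
	"os/exec"

	"github.com/spf13/cobra"
)

// kubeCLI is the CLI used for cluster commands; defaults to kubectl but can be
// overridden (e.g. to oc on OpenShift). Hypothetical flag, for illustration only.
var kubeCLI string

func addKubeCLIFlag(cmd *cobra.Command) {
	cmd.PersistentFlags().StringVar(&kubeCLI, "kube-cli", "kubectl", "CLI to shell out to (kubectl or oc)")
}

// detectKubeCLI falls back to oc when kubectl is not on the PATH.
func detectKubeCLI() string {
	if _, err := exec.LookPath("kubectl"); err == nil {
		return "kubectl"
	}
	if _, err := exec.LookPath("oc"); err == nil {
		return "oc"
	}
	return "kubectl" // let the later exec call fail with a clear error
}
```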
Force-pushed from df3a05d to 04a6e9f
Depends on #180.
Force-pushed from ad52fa4 to 7e62bce
Force-pushed from 7e62bce to 0f10b5b
Force-pushed from eb37fb2 to 4cd49d0
Force-pushed from 4cd49d0 to 684e910
cmd/commands/restore.go

```go
// RestoreCmd represents the restore commands
var RestoreCmd = &cobra.Command{
	Use: "restore-deleted",
```
Use: "restore-deleted", | |
Use: "restore-deleted", |
any suggestion on the name?
`undelete` is another option. `restore-deleted` still sounds better to me, but we could take a vote. Other thoughts?
I'm okay with `restore-deleted`.
Force-pushed from 30e6fa1 to fb95687
Just a few small suggestions
docs/crd.md (Outdated)

While the underlying Ceph data and daemons continue to be available, the CRs will be stuck indefinitely in a Deleting state in which the operator will not continue to ensure cluster health. Upgrades will be blocked, further updates to the CRs are prevented, and so on. Since Kubernetes does not allow undeleting resources, the command below will allow repairing the CRs without even necessarily suffering cluster downtime.

> [!NOTE]
> If there are multiple deleted resources in the cluster and no specific resource is mentioned, in that case the first resource will be restored and in order to restore all resources we need to re-run the command.
Suggested change:

> If there are multiple deleted resources in the cluster and no specific resource is mentioned, the first resource will be restored. To restore all deleted resources, re-run the command multiple times.
Actually, could we just loop through all the resources?
Yes, we could loop through all the resources but IMO it's better to restore one resource at a time. It's better for monitoring and debugging if something goes wrong. Also, users may get overwhelmed with all the log messages.
Also, if we later decide to restore all the resources at once, it would be better to do so after Javier's PR, since he is adding code to list resources from the dynamic API, which will help in looping over the resources. Currently we are getting the resource from the command output and then reading it as a string.
Ok we can do this for now, it's really a corner case anyway.
with these changes we'll be able to restore any crd which is stuck in deletion due to some dependencies after accidental deletion.

example:
kubectl rook-ceph -n <ns> restore <cr> <cr_name>

Signed-off-by: subhamkrai <[email protected]>
Force-pushed from fb95687 to 110cbe2
with these changes we'll be able to restore any crd which is stuck in deletion due to some dependencies after accidental deletion.

example:
kubectl rook-ceph -n <ns> restore <cr> <cr_name>

fixes: #68