[Backport release-1.26] etcd snapshot cleanup fails if node name changes #4536
Validated on release-1.26 branch with commit e181a2d
Environment Details
Infrastructure:
Node(s) CPU architecture, OS, and Version:
Cluster Configuration:
Config.yaml:
Main etcd server (+ control plane) config:
Sample secondary etcd / control plane config.yaml:
Agent config:
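The config file contents were not captured above. For orientation only, a minimal server config.yaml with scheduled etcd snapshots and s3 upload enabled might look roughly like the sketch below; every value (node name, schedule, bucket, credentials) is an illustrative placeholder, not a setting from the original report.

```yaml
# Hypothetical rke2 server config.yaml - placeholder values only
node-name: etcd-server-1
etcd-snapshot-schedule-cron: "0 */5 * * *"   # snapshot every 5 hours
etcd-snapshot-retention: 5                   # keep the 5 most recent snapshots
etcd-s3: true
etcd-s3-bucket: my-rke2-snapshots            # placeholder bucket
etcd-s3-folder: cluster-a                    # placeholder folder
etcd-s3-region: us-east-1
etcd-s3-access-key: <access-key>
etcd-s3-secret-key: <secret-key>
```

Agent and secondary server configs would additionally point at the first server via `server:` and `token:`, which are omitted here.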
Testing Steps
Note: First round node-names:
Reproduce issue using version:
4a. Also check the s3 bucket/folder in AWS to see the snapshots listed.
7a. Also check the s3 bucket/folder in AWS to see the snapshots listed.
Replication Results:
Node names in order of update for the main etcd server:
Final list of snapshots - after multiple node name changes:
As we can see above, previous snapshots with different node-names are still listed and not cleaned up.
Validation Results:
After updating node-names 2 times, the snapshots listed are:
As we can see, the previous snapshots with old node-names are no longer retained and get cleaned up.
Additional context / logs:
This is a backport issue for #3714, automatically created via rancherbot by @vitorsavian
Original issue description:
Environmental Info:
RKE2 Version:
rke2 version v1.21.14+rke2r1 (514ae51)
go version go1.16.14b7
Node(s) CPU architecture, OS, and Version:
Cluster Configuration:
We have multiple rke2 clusters, but all of them have at least 3 control plane nodes and multiple workers
Describe the bug:
We have multiple rke2 clusters and all of them have automatic etcd snapshots enabled (taken every 5 hours). We also configured s3 uploading of those snapshots. Recently, we found that no s3 snapshots are uploaded anymore. We investigated the issue and found the following rke2-server output:
I checked the code and found that rke2 leverages the etcd snapshot capabilities from k3s for this. A function is executed periodically on all control plane nodes; it takes local snapshots, uploads them to s3 (if configured), and also reconciles a configmap which contains all snapshots and their metadata. Looking at the code, the reconciliation of that "sync" configmap appears to be based on the name of the node which executes the etcd snapshot. The same goes for the s3 retention functions (only old objects which contain the node name will be cleaned up). As we replace all nodes in our clusters whenever there is a new Flatcar version, the node names change quite often. This leads to orphaned entries in the configmap and also orphaned objects in the s3 buckets (although the latter could be worked around with a lifecycle policy).
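To make the orphaning concrete, below is a rough sketch (made-up names and metadata, not data from the affected clusters) of what the rke2-etcd-snapshots configmap in kube-system can end up looking like after a control plane node has been replaced. Assuming snapshot keys follow the usual <prefix>-<node-name>-<timestamp> pattern, entries belonging to the old node name are never matched by the per-node retention logic and therefore never pruned:

```yaml
# Illustrative only - keys and metadata are invented to show the shape of the problem
apiVersion: v1
kind: ConfigMap
metadata:
  name: rke2-etcd-snapshots
  namespace: kube-system
data:
  # Old node name: no current node "owns" these keys, so they are never cleaned up
  etcd-snapshot-old-server-1-1650000000: '{"nodeName":"old-server-1","s3":true}'
  etcd-snapshot-old-server-1-1650018000: '{"nodeName":"old-server-1","s3":true}'
  # Current node name: rotated normally by the retention function
  etcd-snapshot-new-server-1-1650036000: '{"nodeName":"new-server-1","s3":true}'
```

The same name-based matching applies to the s3 retention pass, which is why objects uploaded by replaced nodes also accumulate in the bucket.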
Are there any ideas about what could be done to fix this?
I found this bug report in the rancher repo, which describes the configmap growing too large.
Steps To Reproduce:
Enable etcd snapshots and s3 uploading. After replacing the control plane nodes with new machines (new names), there will be orphaned entries in the 'rke2-etcd-snapshots' configmap. Once the configmap grows too large, no new snapshots will be uploaded to s3 anymore.
Expected behavior:
The sync configmap should contain only the snapshots of the clusters' current nodes; entries from all other (removed or renamed) nodes should be cleaned up.