Add new cosa cloud-prune command for garbage collection #3798
Conversation
(force-pushed from 2a5764d to 11779a5)
Initial pass - I need to go into more depth.
src/remote_prune
Outdated
if "cloud-uploads" in actions.keys():
    cloud_uploads_duration = actions["cloud-uploads"]
    build_duration = actions["build"]
    # assumption we are keeping the duration in years
Probably easier if we don't make assumptions, i.e. maybe we just use a unit of days; even though it's a little harder to work with from a human perspective, it's easier not to make a mistake on the code side.
Do we want to keep one unit, i.e. just days or months or years? Or do we want to handle every possible input, such as 90d, 4m and 2y?
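To make the options concrete, here is a minimal sketch of what handling mixed-unit input could look like, normalized to days as suggested above. The helper name and the conversion factors (30 days/month, 365 days/year) are illustrative assumptions, not the code under review.

```python
import re

# Hypothetical helper: normalize duration strings like "90d", "4m" or "2y"
# to a single unit (days). The month/year conversion factors are
# assumptions for illustration only.
def duration_to_days(duration: str) -> int:
    match = re.fullmatch(r"(\d+)\s*([dmy])", duration.strip())
    if match is None:
        raise ValueError(f"Unrecognized duration: {duration!r}")
    value, unit = int(match.group(1)), match.group(2)
    factor = {"d": 1, "m": 30, "y": 365}[unit]
    return value * factor
```

Keeping a single internal unit means comparisons against build ages never need unit-aware logic downstream.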
src/remote_prune
Outdated
timestamp = build_id.split('.')[1]
buildDate = datetime.datetime(int(timestamp[0:4]), int(timestamp[4:6]), int(timestamp[-2:]))
It would be nice if we could support both FCOS version numbers and RHCOS version numbers here. Let's think about that now, while we are developing this, rather than having to update things later.
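A hedged sketch of what supporting both formats could look like. It assumes an FCOS-style ID keeps a YYYYMMDD timestamp in the second dot-field (e.g. "38.20230514.1.0") and an RHCOS-style ID keeps a YYYYMMDDHHMM timestamp at the start of the third dot-field (e.g. "413.92.202305021736-0"); verify these against real build IDs before relying on them.

```python
import datetime
import re

# Assumed formats (to be verified):
#   FCOS:  "38.20230514.1.0"       -> date in fields[1], YYYYMMDD
#   RHCOS: "413.92.202305021736-0" -> date in fields[2], YYYYMMDDHHMM
def parse_build_date(build_id: str) -> datetime.datetime:
    fields = build_id.split(".")
    if re.fullmatch(r"\d{8}", fields[1]):
        # FCOS-style: second field is an 8-digit date
        return datetime.datetime.strptime(fields[1], "%Y%m%d")
    # RHCOS-style: strip the trailing "-N" build counter
    ts = fields[2].split("-")[0]
    return datetime.datetime.strptime(ts, "%Y%m%d%H%M")
```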
src/remote_prune
Outdated
images = {
    "amis": metaData.get("amis") or [],
    "azure": metaData.get("azure") or [],
For FCOS at least we don't have any Azure cloud resources that we need to clean up (yet). For RHCOS I think we do have some object storage uploads.
src/remote_prune
Outdated
if len(errors) != 0:
    print(f"Found errors when removing build {build.id}:")
    for e in errors:
        print(e)
Should we actually raise an exception here instead of just being informational?
src/remote_prune
Outdated
from cosalib.aws import deregister_ami, delete_snapshot

Build = collections.namedtuple("Build", ["id", "timestamp", "images", "arches"])
Note that we do have a Build() class in cosalib as well. Not sure if it would be useful to use that here instead or not.
We do indeed have that, but it is tied to those files. It seemed better to create a new variable since we are using cloud_images here.
src/remote_prune
Outdated
with open(f"builds/builds.json", "w") as json_file:
    json_file.write(json.dumps(buildJsonData))
Just pointing out we will need some special handling of builds.json when we upload it back to S3.
(force-pushed from 11779a5 to 30a2b9c)
(force-pushed from 32c8543 to 0c44c3c)
Not a full review yet. Just a first pass.
(force-pushed from 60ddc54 to 00628d2)
We have been looking at this today with Adam and have the following thought: since all the cloud artifacts are stored in an S3 bucket, except the container images, should we split out the
(force-pushed from 00628d2 to 8a59d55)
It would make sense to keep the overall pruning of builds/resources within this file. For the exact logic for deletion of resources, it's definitely okay to make a separate script, similar to what we have for AWS and GCP.
(force-pushed from 8a59d55 to be2cb86)
src/cmd-cloud-prune
Outdated
snapshot_id = ami.get("snapshot")
if ami_id and region_name:
    try:
        deregister_ami(ami_id, region=region_name)
This is currently using the Python bindings. My inclination is to move away from that and just have it call ore instead, exactly like how remove_gcp_image() does it currently. For example, the golang code for deleting AMIs already exists and correctly handles the "is already deleted" case mentioned in my previous comment (see e.g. deregisterImageIfExists()). That should also allow us to directly pass the config file to ore and avoid the env var hack.
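Shelling out in the spirit of remove_gcp_image() might look like the sketch below. The ore subcommand and flag names here are illustrative assumptions about the CLI being added in this PR, not a documented interface; the point is only that credentials travel as a file argument rather than env vars.

```python
import subprocess

# Hedged sketch: call `ore` as a subprocess instead of using the Python
# bindings. Subcommand and flag names are assumptions for illustration.
def deregister_ami_via_ore(ami_id, region, credentials_file):
    subprocess.check_call([
        "ore", "aws", "delete-image",
        "--image", ami_id,
        "--region", region,
        "--credentials-file", credentials_file,
    ])
```

Delegating to ore also means the "already deleted" handling lives in one place, in the golang code.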
src/cmd-cloud-prune
Outdated
gcp_image = gcp.get("image")
json_key = gcp_cloud_config.get("gcp", {}).get("json-key")
project = gcp_cloud_config.get("gcp", {}).get("project")
if gcp_image and json_key and project:
I feel like we should error out here if the information is incomplete.
Do you mean the case where any of the parameters (gcp_image, json_key, project) is not present?
Yes. We can't add the cloud-uploads marker to a build if we couldn't actually remove a cloud image just because we were missing information about how to remove it.
src/cmd-cloud-prune
Outdated
for e in errors:
    raise Exception(e)
That's not going to work as expected. On the first iteration, it will raise an exception which will cause the program to exit. I think here instead we would print all of the errors and then raise once.
I see just printing the errors during the process as below. But how do you want to raise once? Would that be once we have gone through all the builds?
if len(errors) != 0:
    print(f"Found errors when removing cloud-uploads for {build.id}:")
    for e in errors:
        print(e)
You can add a raise Exception("some errors were encountered") right after the for-loop.
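Putting the two suggestions together, the pattern is: report every error, then fail once after the loop rather than raising inside it. A minimal sketch (the helper name is just for illustration):

```python
# Report all errors for a build, then raise a single exception after the
# loop so every failure is visible before the program exits.
def report_and_raise(build_id, errors):
    if len(errors) != 0:
        print(f"Found errors when removing cloud-uploads for {build_id}:")
        for e in errors:
            print(e)
        raise Exception("some errors were encountered")
```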
src/cmd-cloud-prune
Outdated
retry=retry_boto_exception,
retry_error_callback=retry_callback,
)
def s3_copy(s3_client, src, bucket, key, max_age, acl, extra_args={}, dry_run=False):
Let's move this to some library where we can share it with cmd-buildupload.
Some thoughts while I was working on this: I am confused by the
(force-pushed from be2cb86 to 0ebe9d2)
Thanks for looking at this! I think from our discussions today, we landed on just keeping the container image pruning separate for now because (1) it seems more natural there to make the source of truth be the registry repo, and (2) the GC policy is much more aggressive than the policy we have in mind here, so this would require adding another key to the policy. We could always look at re-merging them in the future if we feel it's worth it.

I'll note this is similar also to the ostree repo pruner, which treats the OSTree repo as canonical and is currently also handled by separate code (https://github.com/coreos/fedora-coreos-releng-automation/tree/main/fedora-ostree-pruner). I could similarly imagine having this pruning code be generalized to handle multiple repos under the Fedora umbrella (this relates to e.g. rpm-software-management/dnf5#833 (comment), but also we'll likely want tagged releases for the base bootc images too, which will then also need GC).
(force-pushed from 08e7d98 to e4a2ecd)
(force-pushed from e4a2ecd to 3823521)
(force-pushed from 8f438fd to 98d2753)
mantle/cmd/ore/aws/delete-image.go
Outdated
var (
    cmdDeleteImage = &cobra.Command{
        Use: "delete-image <name>...",
We're passing the image ID via a flag though, so the synopsis here is incorrect.
mantle/cmd/ore/aws/delete-image.go
Outdated
func runDeleteImage(cmd *cobra.Command, args []string) {
    exit := 0
    err := API.RemoveByImageID(resourceID)
Whether this function errors out on the AMI not existing should probably be a flag like --allow-missing, which we would pass as a bool here.
mantle/platform/api/aws/images.go
Outdated
@@ -733,6 +733,29 @@ func (a *API) FindImage(name string) (string, error) {
    return "", nil
}

// Deregisters the ami or snapshot.
func (a *API) RemoveByImageID(imageID string) error {
Minor: extraneous empty line
mantle/platform/api/aws/images.go
Outdated
// Deregisters the ami or snapshot.
func (a *API) RemoveByImageID(imageID string) error {

    if imageID != "" {
Why handle this case? It's a caller error to pass an empty ID.
src/cmd-cloud-prune
Outdated
try:
    deregister_aws_resource(ami_id, region=region_name, credentials_file=aws_credentials)
except Exception as e:
    errors.append(e)
if snapshot_id and region_name:
    try:
        deregister_aws_resource(snapshot_id, region=region_name, credentials_file=aws_credentials)
    except Exception as e:
        errors.append(e)
This construction seems odd. Isn't this calling ore twice, once for the AMI and once for the snapshot? I don't understand how this can work since the AWS API to delete those is different.
But anyway, I think it'd be cleaner to call ore just once and pass both the AMI ID and snapshot ID.
src/cmd-cloud-prune
Outdated
else:
    errors.append(f"Missing parameters to remove {gcp_image}")
Need the same for AWS above.
src/cmd-cloud-prune
Outdated
def get_period_in_months(duration):
    val, unit = duration.split(" ")
Should set maxsplit to 1.
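With maxsplit=1, the unpacking `val, unit = ...` cannot blow up if the unit part itself contains a space. A minimal sketch of the helper under discussion; the accepted unit spellings and the 12-months-per-year factor are assumptions for illustration:

```python
# Sketch of get_period_in_months with maxsplit=1, so "2 years" splits
# into exactly two parts even if the unit contains extra whitespace.
# Accepted unit names here are illustrative assumptions.
def get_period_in_months(duration: str) -> int:
    val, unit = duration.split(" ", 1)
    factor = {"y": 12, "year": 12, "years": 12,
              "m": 1, "month": 1, "months": 1}[unit.strip()]
    return int(val) * factor
```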
src/cmd-cloud-prune
Outdated
with open("builds/builds.json", "w") as json_file:
    json.dump(builds_json_data, json_file, indent=2)
# This copies the local builds.json and updates the S3 bucket version.
if args.upload_builds_json:
Not quite. This needs to be a different mode entirely. So basically:
- cosa cloud-prune --policy ... prunes resources and updates builds.json locally.
- cosa cloud-prune --upload-builds-json does not do any pruning. It downloads the latest builds.json, merges the modifications from the local builds.json into the latest, and uploads to S3.
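The two modes described above could be expressed as mutually exclusive argparse options, sketched below. The flag names follow the discussion; making them a required, mutually exclusive group is an assumption about the intended CLI shape.

```python
import argparse

# Hedged sketch: one mode prunes per --policy, the other only merges and
# uploads builds.json. Exactly one mode must be chosen per invocation.
parser = argparse.ArgumentParser(prog="cosa cloud-prune")
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument("--policy",
                   help="policy file; prune resources and update builds.json locally")
group.add_argument("--upload-builds-json", action="store_true",
                   help="merge local builds.json into the latest and upload to S3")

args = parser.parse_args(["--upload-builds-json"])
```

Keeping the modes exclusive ensures an upload run can never accidentally prune.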
src/cosalib/gcp.py
Outdated
@@ -26,6 +26,7 @@ def remove_gcp_image(gcp_id, json_key, project):
        '--json-key', json_key,
        '--project', project
    ])
    print(f"GCP: successfully removed image {gcp_id}")
except SystemExit:
    raise Exception("Failed to remove image")
I guess for consistency with the AWS one, let's add the gcp_id to this exception.
This script calls skopeo delete to prune image from a remote directory. Currently only supports the FCOS tag structure. This consumes the same policy.yaml defined in coreos#3798 See coreos/fedora-coreos-tracker#1367 See coreos/fedora-coreos-pipeline#995
One comment, LGTM otherwise.
Let's split up the commit also.
update_policy_cleanup(current_builds_json, remote_builds_json)
if not dry_run:
    # Make sure we have the merged json as local builds/builds.json
    save_builds_json(remote_builds_json)
We need to bump the timestamp here too. Actually, probably simpler to just fold the timestamp bumping into save_builds_json directly.
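Folding the timestamp bump into the save helper could look like the sketch below. The top-level "timestamp" key and its RFC 3339 format are assumptions about the builds.json schema, not confirmed by this thread.

```python
import json
import time

# Hedged sketch: every save refreshes the top-level timestamp, so callers
# can never write a merged builds.json with a stale one. The key name and
# format are assumptions about the schema.
def save_builds_json(builds_json_data, path="builds/builds.json"):
    builds_json_data["timestamp"] = time.strftime(
        "%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    with open(path, "w") as json_file:
        json.dump(builds_json_data, json_file, indent=2)
```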
Superficial LGTM
This enables us to deregister AWS AMIs and/or snapshots. It also handles the cases where a resource ID is unavailable or notFound by not returning an error.
When the allowMissing flag is passed to the ore gcloud delete-images command, it allows the process to continue without erroring out when a 404 error is returned, indicating that the resource was not found.
(force-pushed from 0c1b891 to d534b7e)
- Commit "Factor out the S3 functions to cosalib" needs to be before the introduction of cmd-cloud-prune, which uses it.
- Let's add the name of the new command in the commit title, e.g.: Add new cosa cloud-prune command for garbage collection
src/cosalib/prune.py
Outdated
OK with this, but as mentioned earlier, I think get_unreferenced_s3_builds() could still be useful. At least, I don't want to lose the overall idea behind it. I think basically cosa cloud-prune in the future should do this. Let's for now add a comment in cmd-cloud-prune like:
# XXX: We should also prune unreferenced build directories here. See also
# `get_unreferenced_s3_builds()` in the git log.
Since we use s3_check_exists and s3_copy in other places, we factored those functions out to cosalib.
(force-pushed from caea473 to d26e760)
Add the ability to run garbage collection on resources using cosa cloud-prune. This script would take policy.yaml and run the garbage collection accordingly for the stream specified.
With the addition of cmd-cloud-prune, this command is mostly deprecated, hence we are deleting it.
(force-pushed from d26e760 to aec49c7)
LGTM. Nice work!
Add the ability to run garbage collection on resources using
cosa cloud-prune. This script would take policy.yaml and run
the garbage collection accordingly for the stream specified.
Test it with:
cosa cloud-prune --stream next-devel --gcp-json-key gcp-account-demo.json --gcp-project fedora-coreos-*** --aws-config-file ./creds fcos-developer-pipelines/gursewak/next-devel