Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new cosa cloud-prune command for garbage collection #3798

Merged
merged 5 commits into from
Jul 12, 2024

Conversation

gursewak1997
Copy link
Member

@gursewak1997 gursewak1997 commented May 9, 2024

Add the ability to run garbage collection on resources using
cosa cloud-prune. This script would take policy.yaml and run
the garbage collection accordingly for the stream specified.


Test it with:
cosa cloud-prune --stream next-devel --gcp-json-key gcp-account-demo.json --gcp-project fedora-coreos-*** --aws-config-file ./creds fcos-developer-pipelines/gursewak/next-devel

Copy link
Member

@dustymabe dustymabe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

initial pass - i need to go into more depth.

src/remote_prune Outdated
if "cloud-uploads" in actions.keys():
cloud_uploads_duration = actions["cloud-uploads"]
build_duration = actions["build"]
# assumption we are keeping the duration in years
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably easier if we don't make assumptions.. i.e. maybe we just use a unit of days and even though it's a little harder to work with from a human perspective, it's easier not to make a mistake on the code side.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to keep one unit i.e. could be just days or months or years? OR do we want to handle every possible inputs such as 90d, 4m and 2y?

src/remote_prune Outdated Show resolved Hide resolved
src/remote_prune Outdated
Comment on lines 84 to 107
timestamp = build_id.split('.')[1]
buildDate = datetime.datetime(int(timestamp[0:4]), int(timestamp[4:6]), int(timestamp[-2:]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be nice here if we can support both FCOS version numbers and RHCOS version numbers here. Maybe let's think about that now, while we are developing this rather than having to update things later.

src/remote_prune Outdated

images = {
"amis": metaData.get("amis") or [],
"azure": metaData.get("azure") or [],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for FCOS at least we don't have any cloud resources for Azure that we need to cleanup (yet).

For RHCOS I think we do have some object storage uploads we do.

src/remote_prune Outdated
Comment on lines 181 to 184
if len(errors) != 0:
print(f"Found errors when removing build {build.id}:")
for e in errors:
print(e)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we actually raise an exception here instead of just being informational?

src/remote_prune Outdated Show resolved Hide resolved
src/remote_prune Outdated

from cosalib.aws import deregister_ami, delete_snapshot

Build = collections.namedtuple("Build", ["id", "timestamp", "images", "arches"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note that we do have a Build() class in cosalib as well. Not sure if that would be useful here to use that instead or not.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do indeed have that but they are changed according to those files. It seemed better to create a new variable since we are using cloud_images here.

src/remote_prune Outdated
Comment on lines 115 to 135
with open(f"builds/builds.json", "w") as json_file:
json_file.write(json.dumps(buildJsonData))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just pointing out we will need some special handling of builds.json when we upload it back to s3

src/remote_prune Outdated Show resolved Hide resolved
Copy link
Member

@jlebon jlebon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a full review yet. Just a first pass.

src/remote_prune Outdated Show resolved Hide resolved
src/remote_prune Outdated Show resolved Hide resolved
src/remote_prune Outdated Show resolved Hide resolved
src/remote_prune Outdated Show resolved Hide resolved
src/remote_prune Outdated Show resolved Hide resolved
src/remote_prune Outdated Show resolved Hide resolved
src/remote_prune Outdated Show resolved Hide resolved
src/remote_prune Outdated Show resolved Hide resolved
src/remote_prune Outdated Show resolved Hide resolved
src/remote_prune Outdated Show resolved Hide resolved
@gursewak1997 gursewak1997 force-pushed the cloudUploadsGC branch 3 times, most recently from 60ddc54 to 00628d2 Compare June 20, 2024 07:34
@jbtrystram
Copy link
Contributor

jbtrystram commented Jun 20, 2024

We have been looking at this today with Adam and have the following thought : since all the clouds artifacts are stored in a S3 bucket, except the container images : should we split out the builds.json parsing to a shared library and have a separate script that uses that and do the quay API calls rather than writing non-s3 logic in there ?

@gursewak1997
Copy link
Member Author

should we split out the builds.json parsing to a shared library and have a separate script that uses that and do the quay API calls rather than writing non-s3 logic in there ?

It would make sense to keep the overall pruning of builds/resources to be kept within this file. For the exact logic for deletion of resources, it's definitely okay to make a separate script; similar to what we have for AWS and GCP.
The S3 logic is just there to make sure the we get the information for builds.json and update the file accordingly once the pruning is done.

src/cmd-cloud-prune Outdated Show resolved Hide resolved
src/cmd-cloud-prune Outdated Show resolved Hide resolved
src/cmd-cloud-prune Outdated Show resolved Hide resolved
src/cmd-cloud-prune Show resolved Hide resolved
src/cmd-cloud-prune Outdated Show resolved Hide resolved
src/cmd-cloud-prune Outdated Show resolved Hide resolved
snapshot_id = ami.get("snapshot")
if ami_id and region_name:
try:
deregister_ami(ami_id, region=region_name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is currently using the Python bindings. My inclination is to move away from that and just have it call ore instead, exactly like how remove_gcp_image() does it currently. For example, the golang code for deleting AMIs already exists and correctly handles the "is already deleted" case mentioned in my previous comment (see e.g. deregisterImageIfExists()).

That should also allow us to directly pass the config file to ore and avoid the env var hack.

gcp_image = gcp.get("image")
json_key = gcp_cloud_config.get("gcp", {}).get("json-key")
project = gcp_cloud_config.get("gcp", {}).get("project")
if gcp_image and json_key and project:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like we should error out here if the information is incomplete.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean in the case of either of the parameters (gcp_image and json_key and project) not being present.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. We can't add the cloud-uploads marker to a build if we couldn't actually remove a cloud image just because we were missing information about how to remove it.

Comment on lines 242 to 243
for e in errors:
raise Exception(e)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not going to work as expected. On the first iteration, it will raise an exception which will cause the program to exit. I think here instead we would print all of the errors and then raise once.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see just printing the errors during the process as below. But how do you want to raise once? Would that be once we have gone through all the builds?

if len(errors) != 0:
    print(f"Found errors when removing cloud-uploads for {build.id}:")
    for e in errors:
        print(e)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can add a raise Exception("some errors were encountered") right after the for-loop.

retry=retry_boto_exception,
retry_error_callback=retry_callback,
)
def s3_copy(s3_client, src, bucket, key, max_age, acl, extra_args={}, dry_run=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move this to some library where we can share it with cmd-buildupload.

@jbtrystram
Copy link
Contributor

Some thoughts while I was working on this :

I am confused by the policy.yaml design:

  • I assume image-keep will skip the pruning for those artifacts. But how long ? Indefinitely ?
  • what does build refer to ? all the artifiacts for a build ?
  • Should we have containers added as a key under images ? This way we can iterate on it like any other resources ? That would prevent having different GC policy for those. I see there is an unused oscontainer key. Would that work ? Is it only used for RHCOS ?

@jlebon
Copy link
Member

jlebon commented Jun 26, 2024

Some thoughts while I was working on this :

Thanks for looking at this! I think from our discussions today, we landed on just keeping the container image pruning separate for now because (1) it seems more natural there to make the source of truth be the registry repo, (2) the GC policy is much more aggressive than the policy we have in mind here, so this would require adding another key to the builds.json's policy-cleanup which would in principle need to be added to all builds we've created so far, and (3) there's extra logic needed there for barrier releases which makes it not as good a fit for the model here.

We could always look at re-merging them in the future if we feel it's worth it. I'll note this is similar also to the ostree repo pruner which treats the OSTree repo as canonical and is currently also handled by separate code (https://github.com/coreos/fedora-coreos-releng-automation/tree/main/fedora-ostree-pruner). I could similarly imagine having this pruning code be generalized to handle multiple repos under the Fedora umbrella (this relates to e.g. rpm-software-management/dnf5#833 (comment), but also we'll likely want tagged releases for the base bootc images too which will then also need GC).


var (
cmdDeleteImage = &cobra.Command{
Use: "delete-image <name>...",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're passing the image ID via a flag though, so the synopsis here is incorrect.


func runDeleteImage(cmd *cobra.Command, args []string) {
exit := 0
err := API.RemoveByImageID(resourceID)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whether this function errors out on the AMI not existing should probably be a flag like --allow-missing which we would pass as a bool here.

mantle/cmd/ore/gcloud/delete-images.go Show resolved Hide resolved
@@ -733,6 +733,29 @@ func (a *API) FindImage(name string) (string, error) {
return "", nil
}

// Deregisters the ami or snapshot.
func (a *API) RemoveByImageID(imageID string) error {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: extraneous empty line

// Deregisters the ami or snapshot.
func (a *API) RemoveByImageID(imageID string) error {

if imageID != "" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why handle this case? It's a caller error to pass an empty ID.

Comment on lines 217 to 262
try:
deregister_aws_resource(ami_id, region=region_name, credentials_file=aws_credentials)
except Exception as e:
errors.append(e)
if snapshot_id and region_name:
try:
deregister_aws_resource(snapshot_id, region=region_name, credentials_file=aws_credentials)
except Exception as e:
errors.append(e)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This construction seems odd. Isn't this calling ore twice, once for the AMI and once for the snapshot? I don't understand how this can work since the AWS API to delete those is different.

But anyway, I think it'd be cleaner to call ore just once and pass both the AMI ID and snapshot ID.

Comment on lines 239 to 240
else:
errors.append(f"Missing paramaters to remove {gcp_image}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need the same for AWS above.



def get_period_in_months(duration):
val, unit = duration.split(" ")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should set maxsplit to 1.

with open("builds/builds.json", "w") as json_file:
json.dump(builds_json_data, json_file, indent=2)
# This copies the local builds.json and updates the S3 bucket version.
if args.upload_builds_json:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite. This needs to be a different mode entirely. So basically:

  • cosa cloud-prune --policy ... prunes resources, updates builds.json locally.
  • cosa cloud-prune --upload-builds-json does not do any pruning. It downloads the latest builds.json, merges the modifications from the local builds.json into the latest, and uploads to S3.

@@ -26,6 +26,7 @@ def remove_gcp_image(gcp_id, json_key, project):
'--json-key', json_key,
'--project', project
])
print(f"GCP: successfully removed image {gcp_id}")
except SystemExit:
raise Exception("Failed to remove image")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess for consistency with the AWS one let's add the gcp_id to this exception.

jbtrystram added a commit to jbtrystram/coreos-assembler that referenced this pull request Jul 12, 2024
This script calls skopeo delete to prune image from a remote
directory. Currently only supports the FCOS tag structure.

This consumes the same policy.yaml defined in
coreos#3798

See coreos/fedora-coreos-tracker#1367
See coreos/fedora-coreos-pipeline#995
jbtrystram added a commit to jbtrystram/coreos-assembler that referenced this pull request Jul 12, 2024
This script calls skopeo delete to prune image from a remote
directory. Currently only supports the FCOS tag structure.

This consumes the same policy.yaml defined in
coreos#3798

See coreos/fedora-coreos-tracker#1367
See coreos/fedora-coreos-pipeline#995
Copy link
Member

@jlebon jlebon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment, LGTM otherwise.
Let's split up the commit also.

update_policy_cleanup(current_builds_json, remote_builds_json)
if not dry_run:
# Make sure we have the merged json as local builds/builds.json
save_builds_json(remote_builds_json)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to bump the timestamp here too.

Actually, probably simpler to just fold the timestamp bumping into save_builds_json directly.

jbtrystram
jbtrystram previously approved these changes Jul 12, 2024
Copy link
Contributor

@jbtrystram jbtrystram left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Superficial LGTM

jbtrystram added a commit to jbtrystram/coreos-assembler that referenced this pull request Jul 12, 2024
This script calls skopeo delete to prune image from a remote
directory. Currently only supports the FCOS tag structure.

This consumes the same policy.yaml defined in
coreos#3798

See coreos/fedora-coreos-tracker#1367
See coreos/fedora-coreos-pipeline#995
This enables us to deregister aws AMIs and/or snapshots.
It also handles the cases of resource IDs unavailable and
notFound by not returning an error.
When the allowMissing flag is passed to the ore gcloud delete-images command,
it allows the process to continue without erroring out when a 404 error is
returned, indicating that the resource was not found.
Copy link
Member

@jlebon jlebon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Commit "Factor out the S3 functions to cosalib" needs to be before the introduction of cmd-cloud-prune which uses it.

  • Let's add the name of the new command in the commit title. E.g.:

    Add new cosa cloud-prune command for garbage collection

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK with this, but as mentioned earlier, I think get_unreferenced_s3_builds() could still be useful. At least, I don't want to lose the overall idea behind it. I think basically cosa cloud-prune in the future should do this. Let's for now add a comment in cmd-cloud-prune like:

# XXX: We should also prune unreferenced build directories here. See also
# `get_unreferenced_s3_builds()` in the git log.

Since we use s3_check_exists and s3_copy at other places
we factored out those functions to cosalib
@gursewak1997 gursewak1997 force-pushed the cloudUploadsGC branch 3 times, most recently from caea473 to d26e760 Compare July 12, 2024 20:20
Add the ability to run garbage collection on resources using
cosa cloud-prune. This script would take policy.yaml and run
the garbage collection accordingly for the stream specified.
With the addition of cmd-cloud-prune, this command is
mostly deprecated and hence deleting it.
Copy link
Member

@jlebon jlebon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Nice work!

@jlebon jlebon enabled auto-merge (rebase) July 12, 2024 21:07
@jlebon jlebon changed the title Adding garbage removal for cloud artifacts Add new cosa cloud-prune command for garbage collection Jul 12, 2024
@jlebon jlebon merged commit 8ade3ec into coreos:main Jul 12, 2024
5 checks passed
@gursewak1997 gursewak1997 deleted the cloudUploadsGC branch July 12, 2024 23:25
jbtrystram added a commit to jbtrystram/coreos-assembler that referenced this pull request Jul 16, 2024
This script calls skopeo delete to prune image from a remote
directory. Currently only supports the FCOS tag structure.

This consumes the same policy.yaml defined in
coreos#3798

See coreos/fedora-coreos-tracker#1367
See coreos/fedora-coreos-pipeline#995
jbtrystram added a commit to jbtrystram/coreos-assembler that referenced this pull request Jul 16, 2024
This script calls skopeo delete to prune image from a remote
directory. Currently only supports the FCOS tag structure.

This consumes the same policy.yaml defined in
coreos#3798

See coreos/fedora-coreos-tracker#1367
See coreos/fedora-coreos-pipeline#995
jbtrystram added a commit to jbtrystram/coreos-assembler that referenced this pull request Jul 16, 2024
This script calls skopeo delete to prune image from a remote
directory. Currently only supports the FCOS tag structure.

This consumes the same policy.yaml defined in
coreos#3798

See coreos/fedora-coreos-tracker#1367
See coreos/fedora-coreos-pipeline#995
jbtrystram added a commit that referenced this pull request Jul 17, 2024
This script calls skopeo delete to prune image from a remote
directory. Currently only supports the FCOS tag structure.

This consumes the same policy.yaml defined in
#3798

See coreos/fedora-coreos-tracker#1367
See coreos/fedora-coreos-pipeline#995
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants