Help preserve release-branch CI VM Images
For release-branches, CI VM images must be retained long-term since they
are difficult/impossible to rebuild.  A number of times, often due to
human error, these images have been accidentally lost.

Update automation tooling such that these images may be specially marked.
Upon encountering a permanently marked image, ensure it is never
deprecated or removed.  When found as deprecated, issue a loud error that
will be delivered to the podman-monitor list.

Operational mechanism: A "meta" job should be executing regularly via
cirrus-cron on every single important release-branch. This job uses the
imgts container image.  Therefore, the branch/tag/pr it is running for
may be retrieved and used to determine if the image should be marked as
permanent.  Otherwise, manually updating all images in this way is
possible but would be labor intensive.

Update documentation to reflect these changes.

Signed-off-by: Chris Evich <[email protected]>
cevich committed Jul 22, 2022
1 parent 4f34a04 commit 39becc6
Showing 5 changed files with 103 additions and 14 deletions.
15 changes: 11 additions & 4 deletions README.md
@@ -349,18 +349,25 @@ infinite-growth of the VM image count.
VM is utilized. It records the usage details, along with a timestamp
into the GCE VM image "labels" (metadata). Failure to update
metadata is considered critical, and the task will fail to prompt
immediate corrective action by automation maintainers.
immediate corrective action by automation maintainers. When this
container detects it's running on behalf of a release-branch, it
will make a best-effort attempt to flag all VM images for permanent
retention.

* `imgobsolete` is triggered periodically by cirrus-cron *only* on this
repository. It scans through all GCE VM Images, filtering any which
haven't been used within the last 30 days (according to `imgts`
updated labels). Identified images are deprecated by marking them
`obsolete` in GCE. This status blocks them from being used, but
does not actually remove them.
updated labels). Excluding any images which are marked for permanent
retention, disused images are deprecated by marking them as `obsolete`
in GCE. This will cause an error in any CI run which references them.
The images will still be recoverable manually, using the `gcloud`
utility.

* `imgprune` also runs periodically, immediately following `imgobsolete`.
It scans all currently obsolete GCE images, filtering any which were
deprecated more than 30 days ago (according to deprecation metadata).
It will fail with a loud error message should it encounter an image marked
obsolete **and** labeled for permanent retention. Otherwise, images
which have been obsolete for more than 30 days are permanently
removed.
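
Taken together, the three bullets above amount to a small per-image decision table. The following is only a sketch of that table, not code from this commit: the `retention_action` function, its arguments, and the `THRESHOLD_DAYS` variable are hypothetical names, and the label string format is assumed for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical illustration of the README rules; not part of the real tooling.
THRESHOLD_DAYS=30

# retention_action LABELS STATE AGE_DAYS
#   LABELS    label string, e.g. "last-used=1650000000,permanent=true" (assumed format)
#   STATE     "ACTIVE" or "OBSOLETE"
#   AGE_DAYS  days since last use (ACTIVE) or since deprecation (OBSOLETE)
retention_action() {
    local labels="$1" state="$2" age_days="$3"
    local permanent=0
    if grep -Eiq 'permanent=true' <<<"$labels"; then permanent=1; fi

    if ((permanent)); then
        # A permanent image must never be obsolete; finding one is an error.
        if [[ "$state" == "OBSOLETE" ]]; then echo "error"; else echo "keep"; fi
    elif [[ "$state" == "OBSOLETE" ]]; then
        # Obsolete images are only deleted after the grace period.
        if ((age_days > THRESHOLD_DAYS)); then echo "delete"; else echo "wait"; fi
    else
        # Active images unused for longer than the threshold get deprecated.
        if ((age_days > THRESHOLD_DAYS)); then echo "obsolete"; else echo "keep"; fi
    fi
}

retention_action "last-used=1650000000,permanent=true" ACTIVE 400
retention_action "last-used=1650000000" ACTIVE 45
retention_action "last-used=1650000000" OBSOLETE 45
```

The point of the sketch is the ordering: the permanent check comes first, so no amount of disuse can move a permanently labeled image toward deletion.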

10 changes: 10 additions & 0 deletions imgobsolete/entrypoint.sh
@@ -39,10 +39,20 @@ $GCLOUD compute images list --format="$FORMAT" --filter="$FILTER" | \
count_image
reason=""
created_ymd=$(date --date=$creationTimestamp --iso-8601=date)
permanent=$(egrep --only-matching --max-count=1 --ignore-case 'permanent=true' <<< $labels || true)
last_used=$(egrep --only-matching --max-count=1 'last-used=[[:digit:]]+' <<< $labels || true)

LABELSFX="labels: '$labels'"

# Any image marked with a `permanent=true` label should be retained forever.
# Typically this will be due to its use by CI in a release-branch. The image's
# `repo-ref` and `build-id` labels should provide clues as to where it's
# required (possibly in multiple repos), for any future auditing purposes.
if [[ -n "$permanent" ]]; then
msg "Retaining forever $name | $labels"
continue
fi

# No label was set
if [[ -z "$last_used" ]]
then # image lacks any tracking labels
5 changes: 4 additions & 1 deletion imgprune/entrypoint.sh
@@ -39,8 +39,11 @@ $GCLOUD compute images list --show-deprecated \
do
count_image
reason=""
permanent=$(egrep --only-matching --max-count=1 --ignore-case 'permanent=true' <<< $labels || true)
[[ -z "$permanent" ]] || \
die 1 "Refusing to delete a deprecated image labeled permanent=true. Please use gcloud utility to set image active, then research the cause of deprecation."
[[ "$dep_state" == "OBSOLETE" ]] || \
die 1 "Error: Unexpected depreciation-state encountered for $name: $dep_state; labels: $labels"
die 1 "Unexpected deprecation-state encountered for $name: $dep_state; labels: $labels"
reason="Obsolete as of $del_date; labels: $labels"
echo "$name $reason" >> $TODELETE
done
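
Both `imgobsolete` and `imgprune` detect the retention label with the same text match against `$labels`. The sketch below isolates that check so its behavior is easy to try by hand. The `permanent_label` helper and the sample label strings are hypothetical, and it uses `grep -E` with a quoted here-string in place of `egrep`, which recent GNU grep releases flag as obsolescent.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: prints the matched label text, or nothing at all.
permanent_label() {
    local labels="$1"
    # `|| true` keeps `set -e` from aborting when grep finds no match.
    grep -E --only-matching --max-count=1 --ignore-case 'permanent=true' <<<"$labels" || true
}

# Sample label strings are made up for illustration.
for labels in "last-used=1650000000;permanent=true" "last-used=1650000000" "PERMANENT=TRUE"; do
    if [[ -n "$(permanent_label "$labels")" ]]; then
        echo "retain:   $labels"
    else
        echo "eligible: $labels"
    fi
done
```

Because the pattern is unanchored, any label whose text merely ends in `permanent=true` would also match. That errs on the side of keeping images, which seems to be the safer failure mode here.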
2 changes: 1 addition & 1 deletion imgts/Containerfile
@@ -3,7 +3,7 @@ FROM quay.io/centos/centos:stream8
# Only needed for installing build-time dependencies
COPY /imgts/google-cloud-sdk.repo /etc/yum.repos.d/google-cloud-sdk.repo
RUN dnf -y --setopt=keepcache=true update && \
dnf -y --setopt=keepcache=true install epel-release python3 && \
dnf -y --setopt=keepcache=true install epel-release python3 jq && \
dnf -y --setopt=keepcache=true --exclude=google-cloud-sdk-366.0.0-1 \
install google-cloud-sdk

85 changes: 77 additions & 8 deletions imgts/entrypoint.sh
@@ -14,14 +14,17 @@ req_env_var GCPJSON GCPNAME GCPPROJECT IMGNAMES BUILDID REPOREF

gcloud_init

# Set this to 1 for testing
DRY_RUN="${DRY_RUN:-0}"

# These must be defined by the cirrus-ci job using the container
# shellcheck disable=SC2154
ARGS="
--update-labels=last-used=$(date +%s)
--update-labels=build-id=$BUILDID
--update-labels=repo-ref=$REPOREF
--update-labels=project=$GCPPROJECT
"
ARGS=(\
"--update-labels=last-used=$(date +%s)"
"--update-labels=build-id=$BUILDID"
"--update-labels=repo-ref=$REPOREF"
"--update-labels=project=$GCPPROJECT"
)

# Must be defined by the cirrus-ci job using the container
# shellcheck disable=SC2154
@@ -37,11 +40,73 @@ ERRIMGS=''
# It's possible for multiple simultaneous label updates to clash
CLASHMSG='Labels fingerprint either invalid or resource labels have changed'

# In an effort to avoid unintentional deletion of release-branch VM images,
# this image-use context must be detected. This function accepts a single
# argument: the Cirrus-CI build ID. It attempts to determine whether that
# build occurred on a release branch (one whose name looks like a version
# number), and if so returns zero. Otherwise it returns non-zero, including
# for all executions on behalf of PRs or tags.
is_release_branch_image(){
local buildId api query result prefix branch tag
buildId=$1
api="https://api.cirrus-ci.com/graphql"
query="{
\"query\": \"query {
build(id: $buildId) {
branch
tag
pullRequest
}
}\"
}"

# It's possible for an image to be missing its build ID label.
# For example, the first time imgts operates on a new image.
if ((${#buildId}<12)); then
warn 0 "Empty/invalid BuildId value found on image, ignoring: '$buildId'"
return 1
fi

prefix=".data.build"
result=$(curl --silent --location \
--request POST --data @- --url "$api" <<<"$query") \
|| \
die 3 "Error communicating with GraphQL API $api: $result"

# Best effort: It's possible GraphQL query errored, is invalid,
# or the build is so old, there is no record of it. Issue a
# warning and move on.
if ! jq -e "$prefix" <<<"$result" &> /dev/null; then
warn 0 "Response from Cirrus API query '$query' missing requested outputs: '$result'"
return 1
fi

branch=$(jq --raw-output "${prefix}.branch" <<<"$result")
tag=$(jq --raw-output "${prefix}.tag" <<<"$result" | sed 's/null//g')
# Cirrus-CI sets `branch=pull/#` for pull-requests, dependabot creates
# randomly named branches. Check for something looking like a version number.
# N/B: Cirrus will set $branch to a value for both PRs and Tags.
if [[ -z "$tag" && "$branch" =~ (v|release-)[0-9]+.* ]]; then
msg "Found build $buildId for branch $branch with images to keep forever."
return 0
fi

# Ignore image last used by a tag or pull-request
return 1
}
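
# The branch-name test above can be tried in isolation. This is a sketch only:
# `looks_like_release_branch` is a hypothetical helper, and anchoring the pattern
# at the start of the name assumes release branches are named like `v4.1` or
# `release-1.2`. Note that in a regular expression `|` binds more loosely than
# concatenation, so the two prefixes are grouped to make the digits mandatory
# after either one.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the commit.
looks_like_release_branch() {
    local branch="$1"
    # Grouping the alternatives requires digits after "v" or "release-".
    local pattern='^(v|release-)[0-9]+'
    [[ "$branch" =~ $pattern ]]
}

# Sample branch names; `pull/123` is how Cirrus-CI reports a pull-request.
for b in v4.1 release-1.2 main pull/123 dev; do
    if looks_like_release_branch "$b"; then
        echo "$b: release branch"
    else
        echo "$b: not a release branch"
    fi
done
```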

unset SET_PERM
if is_release_branch_image "$BUILDID"; then
ARGS+=("--update-labels=permanent=true")
SET_PERM=1
fi

if ((DRY_RUN)); then GCLOUD='echo'; fi

# Must be defined by the cirrus-ci job using the container
# shellcheck disable=SC2154
for image in $IMGNAMES
do
if ! OUTPUT=$($GCLOUD compute images update "$image" $ARGS 2>&1); then
if ! OUTPUT=$($GCLOUD compute images update "$image" "${ARGS[@]}" 2>&1); then
echo "$OUTPUT" > /dev/stderr
if grep -iq "$CLASHMSG" <<<"$OUTPUT"; then
# Updating the 'last-used' label is most important.
@@ -52,7 +117,11 @@ do
msg "Detected update error for '$image'" > /dev/stderr
ERRIMGS="$ERRIMGS $image"
else
echo "$OUTPUT" > /dev/stderr
# Display the URI to the updated image for reference
(
echo "$OUTPUT"
if ((SET_PERM)); then echo " IMAGE MARKED FOR PERMANENT RETENTION"; fi
) > /dev/stderr
fi
done
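
The switch from a whitespace-separated `ARGS` string to a bash array matters once any label value can contain whitespace, and it also makes appending the optional `permanent=true` argument straightforward. The following sketch shows the difference; `show_args` is a hypothetical stand-in for the real `gcloud` call, and the `REPOREF` value is contrived to contain a space.

```shell
#!/usr/bin/env bash
# Hypothetical demonstration; `show_args` prints each argument it receives
# on its own line, wrapped in angle brackets.
show_args() { printf '<%s>\n' "$@"; }

REPOREF="release v4.1"   # contrived value containing a space

ARGS_STRING="--update-labels=repo-ref=$REPOREF"
ARGS_ARRAY=("--update-labels=repo-ref=$REPOREF")
ARGS_ARRAY+=("--update-labels=permanent=true")

echo "string, expanded unquoted (word-split on the space):"
# shellcheck disable=SC2086
show_args $ARGS_STRING

echo "array, expanded quoted (one argument per element):"
show_args "${ARGS_ARRAY[@]}"
```

Setting `GCLOUD='echo'`, as the `DRY_RUN` branch does, gives a similar view of the final command line without touching any images.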
