
[Monitoring] Instrumenting cronjob exit codes #4270

Merged 4 commits into master from feature/cron-metrics on Sep 25, 2024

Conversation

vitorguidi (Collaborator) commented Sep 24, 2024

Motivation

Kubernetes signals that a cronjob has failed, after exhausting its retries, through events.

GKE does not make it easy to alert on this failure mode. For cronjobs without retries, the failure is evident in the GCP panel. For cronjobs with retries, the panel shows a success indicator for the last successful run, and the failure is only recorded under the events panel.

Alternatives considered

Options available to monitor failing cronjobs are:

  • the container/restart_count metric from GKE. This would be flaky, since a job might succeed on, say, its third attempt. It is also not easy to pinpoint the failing cronjob, since the label we get is the container name.
  • a log-based metric on the cluster events. The output of kubectl get events gets dumped to Cloud Logging, so we could create a metric on events of the form "Saw completed job: oss-fuzz-apply-ccs-28786460, status: Failed". However, this requires regex manipulation, has to be applied manually across all projects, makes it hard to derive the failing cronjob from the container name, and adds a hard dependency on Kubernetes.

Solution

The proposed solution is to reuse the built-in ClusterFuzz metrics implementation and add a gauge metric, CLUSTERFUZZ_CRON_EXIT_CODE, with the cron name as a label.

If the metric is 1, the cronjob is unhealthy; otherwise it is healthy. An alert must be set to fire whenever the metric reaches 1, for every label.

Since cronjobs are ephemeral, there is no need for a thread to continuously flush metrics. The option to use monitoring without a flushing thread was added. The same approach can be used to fix metrics for swarming/batch.
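
To make the semantics concrete, a minimal sketch of what a cron entrypoint could look like under this scheme is below. The metric constructor, field spec, and set() signature are assumptions modelled on the existing ClusterFuzz monitoring helpers, not the exact code in this PR; only monitor.initialize(use_flusher_thread=False) and monitor.stop() come from the diff.

```python
# Illustrative sketch only: metric/helper names and signatures are assumptions.
from clusterfuzz._internal.metrics import monitor  # assumed import path

# Gauge keyed only by the cron name; 0 = healthy, 1 = unhealthy.
CLUSTERFUZZ_CRON_EXIT_CODE = monitor.GaugeMetric(
    'cron/exit_code',
    description='Exit code of the last cron run.',
    field_spec=[monitor.StringField('cron_name')])


def run_cron(cron_name, cron_main):
  """Runs a single cron entrypoint and records its exit code as a gauge."""
  # Cron jobs are short-lived, so skip the background flusher thread.
  monitor.initialize(use_flusher_thread=False)
  exit_code = 0
  try:
    exit_code = 0 if cron_main() else 1
  except Exception:
    exit_code = 1
    raise
  finally:
    CLUSTERFUZZ_CRON_EXIT_CODE.set(exit_code, {'cron_name': cron_name})
    monitor.stop()  # flushes any remaining metrics before the process exits
  return exit_code
```

With this in place, a single alert on the metric reaching 1 per cron_name label covers every cronjob uniformly.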

Also, the flusher thread was changed to make sure that leftover metrics are flushed before it stops.

Note: this PR is part of this initiative

@@ -55,12 +57,35 @@ def main():
  except ImportError:
    pass

  # Few metrics get collected per job run, so
  # no need for a thread to continuously push those.
  monitor.initialize(use_flusher_thread=False)
Collaborator

Is there enough of a benefit to do this special casing?

If there's not much performance benefit, it's probably better to keep things simpler and uniform (i.e. always have a thread).

vitorguidi (Collaborator Author) commented Sep 25, 2024

It is mostly because of batch and swarming. Suppose a task takes 1m to run. We would have to wait for the entirety of FLUSH_INTERVAL_SECONDS (hardcoded to 10m) for its metrics to flush, which would cause queueing and waste money on idle jobs.

Another possibility would be to allow this interval to be injected, to avoid the special casing. That is hard to reason about, though: how would I choose an appropriate interval for cronjobs whose durations vary wildly (think ~1m for generating OSS-Fuzz certs versus ~18h for the manage VMs cron in OSS-Fuzz)? And what about batch/swarming?

This thread is a weird concept for ephemeral use cases, where only one task gets executed and the bot immediately quits. Hence the special casing. The thread still makes sense for long-lived bots, though.

Collaborator

This is an interesting tradeoff, but 10 minutes seems tolerable for most of our cron jobs, and it is probably overly conservative anyway and could be reduced.

vitorguidi (Collaborator Author)

For crons, maybe, but for batch/swarming, where some tasks run in under a minute, it is really hard for me to see the value of this thread.

Collaborator

I tend to agree that as long as we make sure the thread flushes before stopping, it seems like a micro-optimization to avoid having a flusher thread.

vitorguidi (Collaborator Author) commented Sep 26, 2024

What value of FLUSH_INTERVAL_SECONDS should be chosen for this thread to be a universal solution?
If there is a value that prevents getting rate limited by the GCP metrics endpoint while also not decreasing the throughput of batch, I am fine with it.

However, finding it would take experimentation, which is not a wise time investment. I would rather not have the trouble of choosing and stabilizing this value for short-lived use cases.

Keep in mind that once this lands in OSS-Fuzz with a reduced interval, it is going to be 100k+ VMs hammering the metrics endpoint.

Collaborator

I think the interval isn't really critical to this decision. For shorter batch / cron jobs, could we instrument the interpreter exit instead to flush metrics? i.e. have monitor.initialize() register an atexit handler.

That way, we keep things uniform in that we initialize this the same way, while not worrying about flush intervals.
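
A rough sketch of that atexit suggestion, purely illustrative (the helper names _start_flusher_thread and flush_metrics are placeholders, not the actual monitor module internals):

```python
import atexit


def initialize():
  _start_flusher_thread()  # assumed helper: periodic background flush
  # One final flush when the interpreter exits normally, so short-lived
  # crons and batch tasks do not need to wait out FLUSH_INTERVAL_SECONDS.
  atexit.register(flush_metrics)
```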

vitorguidi (Collaborator Author)

I can refactor this in the upcoming batch PR.
There is one last complexity: Kubernetes can kill a cronjob on timeout with a SIGTERM, followed by a SIGKILL. This SIGTERM has to be handled somehow:

https://stackoverflow.com/questions/59973600/what-does-shutdown-look-like-for-a-kubernetes-cron-job-pod-when-its-being-termi
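
For illustration, handling that SIGTERM might look roughly like this (flush_metrics is a placeholder name); a signal handler is needed because Python's default SIGTERM handling terminates the process without running atexit hooks:

```python
# Sketch only: flush metrics when Kubernetes sends SIGTERM on cron timeout,
# before the follow-up SIGKILL arrives.
import signal
import sys


def _handle_sigterm(signum, frame):
  flush_metrics()                  # push any buffered metrics immediately
  sys.exit(128 + signal.SIGTERM)   # conventional exit status for SIGTERM (143)


signal.signal(signal.SIGTERM, _handle_sigterm)
```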

    # Make sure there are no metrics left to flush after monitor.stop()
    if self.stop_event.wait(FLUSH_INTERVAL_SECONDS):
      should_stop = True
    flush_metrics()
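
For context, the surrounding flusher loop plausibly behaves as sketched below (a reconstruction, not the exact diff): the thread flushes every FLUSH_INTERVAL_SECONDS and, once stop_event is set by monitor.stop(), performs one last flush before exiting.

```python
# Reconstructed sketch of the flusher thread's run loop.
def run(self):
  should_stop = False
  while not should_stop:
    # wait() returns True once monitor.stop() sets the event.
    if self.stop_event.wait(FLUSH_INTERVAL_SECONDS):
      # Make sure there are no metrics left to flush after monitor.stop()
      should_stop = True
    flush_metrics()
```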
Collaborator

Can you add details of this change to the PR description as well? Is this addressing a problem where the final set of metrics was not being flushed?

vitorguidi merged commit 091c6c2 into master on Sep 25, 2024
7 checks passed
vitorguidi deleted the feature/cron-metrics branch on September 25, 2024 03:28
vitorguidi added a commit that referenced this pull request Oct 3, 2024
vitorguidi added a commit that referenced this pull request Oct 4, 2024
vitorguidi changed the title from "Instrumenting cronjob exit codes" to "[Monitoring] Instrumenting cronjob exit codes" on Nov 8, 2024