
[Monitoring] Instrumenting cronjob exit codes #4270

Merged 4 commits into master from feature/cron-metrics on Sep 25, 2024

Conversation

vitorguidi (Collaborator) commented Sep 24, 2024

Motivation

Kubernetes signals that a cronjob has failed, after exhausting its retries, through events.

GKE does not make it easy to alert on this failure mode. For cronjobs without retries, the failure is evident in the GCP panel. For cronjobs with retries, the panel shows a success indicator for the last successful run, and the failure is only recorded under the events panel.

Alternatives considered

Options available to monitor failing cronjobs are:

  • the container/restart_count metric from GKE. This would be flaky, since a job might succeed on, say, its third attempt. It is also not easy to pinpoint the failing cronjob, since the label we get is the container name.
  • a log-based metric on the cluster events. The output of kubectl get events gets dumped to Cloud Logging, so we could create a metric on events of the form "Saw completed job: oss-fuzz-apply-ccs-28786460, status: Failed". However, this requires regex manipulation, has to be applied manually across all projects, makes it hard to derive the failing cronjob from the container name, and adds a hard dependency on Kubernetes.

Solution

The proposed solution is to reuse the built-in ClusterFuzz metrics implementation and add a gauge metric, CLUSTERFUZZ_CRON_EXIT_CODE, with the cron name as a label.

If the metric is 1, the cronjob is unhealthy; otherwise it is healthy. An alert must be set to fire whenever the metric reaches 1, for every label.

Since cronjobs are ephemeral, there is no need for a thread to continuously flush metrics. The option to use monitoring without a flushing thread was added. The same approach can be used to fix metrics for swarming/batch.
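
To make the semantics concrete, a minimal sketch of what a cron entrypoint could look like under this scheme is below. The metric constructor, field spec, and set() signature are assumptions modelled on the existing ClusterFuzz monitoring helpers, not the exact code in this PR; only monitor.initialize(use_flusher_thread=False) and monitor.stop() come from the diff.

```python
# Illustrative sketch only: metric/helper names and signatures are assumptions.
from clusterfuzz._internal.metrics import monitor  # assumed import path

# Gauge keyed only by the cron name; 0 = healthy, 1 = unhealthy.
CLUSTERFUZZ_CRON_EXIT_CODE = monitor.GaugeMetric(
    'cron/exit_code',
    description='Exit code of the last cron run.',
    field_spec=[monitor.StringField('cron_name')])


def run_cron(cron_name, cron_main):
  """Runs a single cron entrypoint and records its exit code as a gauge."""
  # Cron jobs are short-lived, so skip the background flusher thread.
  monitor.initialize(use_flusher_thread=False)
  exit_code = 0
  try:
    exit_code = 0 if cron_main() else 1
  except Exception:
    exit_code = 1
    raise
  finally:
    CLUSTERFUZZ_CRON_EXIT_CODE.set(exit_code, {'cron_name': cron_name})
    monitor.stop()  # flushes any remaining metrics before the process exits
  return exit_code
```

With this in place, a single alert on the metric reaching 1 per cron_name label covers every cronjob uniformly.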

Also, the flusher thread was changed to make sure that leftover metrics are flushed before it stops.

Note: this PR is part of this initiative

@@ -55,12 +57,35 @@ def main():
  except ImportError:
    pass

  # Few metrics get collected per job run, so
  # no need for a thread to continuously push those.
  monitor.initialize(use_flusher_thread=False)
Collaborator

Is there enough of a benefit to do this special casing?

If there's not much performance benefit, it's probably better to keep things simpler and uniform (i.e. always have a thread).

vitorguidi (Collaborator Author) commented Sep 25, 2024

It is mostly because of batch and swarming. Suppose a task takes 1m to run. We would have to wait for the entirety of FLUSH_INTERVAL_SECONDS (hardcoded to 10m) for its metrics to flush, which would cause queueing and waste money on idle jobs.

Another possibility would be to allow this interval to be injected, to avoid the special casing. That is hard to reason about, though: how would I choose an appropriate interval for cronjobs whose durations vary wildly (think ~1m for generating OSS-Fuzz certs versus ~18h for the manage VMs cron in OSS-Fuzz)? And what about batch/swarming?

This thread is a weird concept for ephemeral use cases, where only one task gets executed and the bot immediately quits. Hence the special casing. The thread still makes sense for long-lived bots, though.

Collaborator

This is an interesting tradeoff, but 10 minutes seems tolerable for most of our cron jobs, and it is probably overly conservative anyway and could be reduced.

vitorguidi (Collaborator Author)

For crons, maybe, but for batch/swarming, where some tasks run in under a minute, it is really hard for me to see the value of this thread.

Collaborator

I tend to agree that as long as we make sure the thread flushes before stopping, it seems like a micro-optimization to avoid having a flusher thread.

vitorguidi (Collaborator Author) commented Sep 26, 2024

What value of FLUSH_INTERVAL_SECONDS should be chosen for this thread to be a universal solution?
If there is a value that prevents getting rate limited by the GCP metrics endpoint while also not decreasing the throughput of batch, I am fine with it.

However, finding it would take experimentation, which is not a wise time investment. I would rather not have the trouble of choosing and stabilizing this value for short-lived use cases.

Keep in mind that once this lands in OSS-Fuzz with a reduced interval, it is going to be 100k+ VMs hammering the metrics endpoint.

Collaborator

I think the interval isn't really critical to this decision. For shorter batch / cron jobs, could we instrument the interpreter exit instead to flush metrics? i.e. have monitor.initialize() register an atexit handler.

That way, we keep things uniform in that we initialize this the same way, while not worrying about flush intervals.
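
A rough sketch of that atexit suggestion, purely illustrative (the helper names _start_flusher_thread and flush_metrics are placeholders, not the actual monitor module internals):

```python
import atexit


def initialize():
  _start_flusher_thread()  # assumed helper: periodic background flush
  # One final flush when the interpreter exits normally, so short-lived
  # crons and batch tasks do not need to wait out FLUSH_INTERVAL_SECONDS.
  atexit.register(flush_metrics)
```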

vitorguidi (Collaborator Author)

I can refactor this in the upcoming batch PR.
There is one last complexity: Kubernetes can kill a cronjob on timeout with a SIGTERM, followed by a SIGKILL. This SIGTERM has to be handled somehow:

https://stackoverflow.com/questions/59973600/what-does-shutdown-look-like-for-a-kubernetes-cron-job-pod-when-its-being-termi
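
For illustration, handling that SIGTERM might look roughly like this (flush_metrics is a placeholder name); a signal handler is needed because Python's default SIGTERM handling terminates the process without running atexit hooks:

```python
# Sketch only: flush metrics when Kubernetes sends SIGTERM on cron timeout,
# before the follow-up SIGKILL arrives.
import signal
import sys


def _handle_sigterm(signum, frame):
  flush_metrics()                  # push any buffered metrics immediately
  sys.exit(128 + signal.SIGTERM)   # conventional exit status for SIGTERM (143)


signal.signal(signal.SIGTERM, _handle_sigterm)
```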

    # Make sure there are no metrics left to flush after monitor.stop()
    if self.stop_event.wait(FLUSH_INTERVAL_SECONDS):
      should_stop = True
    flush_metrics()
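
For context, the surrounding flusher loop plausibly behaves as sketched below (a reconstruction, not the exact diff): the thread flushes every FLUSH_INTERVAL_SECONDS and, once stop_event is set by monitor.stop(), performs one last flush before exiting.

```python
# Reconstructed sketch of the flusher thread's run loop.
def run(self):
  should_stop = False
  while not should_stop:
    # wait() returns True once monitor.stop() sets the event.
    if self.stop_event.wait(FLUSH_INTERVAL_SECONDS):
      # Make sure there are no metrics left to flush after monitor.stop()
      should_stop = True
    flush_metrics()
```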
Collaborator

Can you add details of this change to the PR description as well? Is this addressing a problem where the final set of metrics was not being flushed?

vitorguidi merged commit 091c6c2 into master on Sep 25, 2024
7 checks passed
vitorguidi deleted the feature/cron-metrics branch on September 25, 2024 03:28
vitorguidi added a commit that referenced this pull request Oct 3, 2024
vitorguidi added a commit that referenced this pull request Oct 4, 2024
vitorguidi changed the title from "Instrumenting cronjob exit codes" to "[Monitoring] Instrumenting cronjob exit codes" on Nov 8, 2024