[Monitoring] Instrumenting cronjob exit codes #4270
Conversation
@@ -55,12 +57,35 @@ def main():
  except ImportError:
    pass

  # Few metrics get collected per job run, so
  # no need for a thread to continuously push those
  monitor.initialize(use_flusher_thread=False)
Is there enough of a benefit to do this special casing?
If there's not much performance benefit, it's probably better to keep things simpler and more uniform (i.e. always have a thread).
It is mostly because of batch and swarming. Suppose a task takes 1m to run. We would have to wait the entirety of FLUSH_INTERVAL_SECONDS (10m, hardcoded) for its metrics to flush, which would cause queueing and waste money on idle jobs.
Another possibility would be to allow this interval to be injected, to avoid the special casing. That is hard to reason about, though: how would I choose an appropriate interval for cronjobs that vary wildly in duration (think ~1m for generating OSS-Fuzz certs, and ~18h for manage vms in OSS-Fuzz)? How about batch/swarming?
This thread is a weird concept for ephemeral use cases, where only one task gets executed and the bot immediately quits; hence the special casing. The thread still makes sense for long-lived bots, though.
This is an interesting tradeoff, but 10 minutes seems tolerable for most of our cron jobs, and that interval is probably overly conservative anyway and can be reduced.
For crons, maybe, but for batch/swarming, where some tasks run in under a minute, it is really hard for me to see the value in using this thread.
I tend to agree that as long as we make sure the thread flushes before stopping, it seems like a micro-optimization to avoid having a flusher thread.
What value of FLUSH_INTERVAL_SECONDS should be chosen for this thread to be a universal solution?
If there is a value that avoids getting rate limited by the GCP metrics endpoint while also not decreasing the throughput of batch, I am fine with it.
However, finding it would take experimentation, which is not a wise time investment. I would rather not have the trouble of choosing and stabilizing this value for short-lived use cases.
Keep in mind that once this lands in OSS-Fuzz with a reduced interval, it will be 100k+ VMs bombarding the metrics endpoint.
I think the interval isn't really critical to this decision. For shorter batch/cron jobs, could we instrument the interpreter exit instead to flush metrics? i.e. have monitor.initialize() register an atexit handler.
That way we keep things uniform, initializing the same way everywhere, without worrying about flush intervals.
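A rough sketch of that suggestion (assuming hypothetical initialize() and flush_metrics() helpers, not the real monitor API):

import atexit

def flush_metrics():
  # Hypothetical helper: push any buffered metric values to the backend.
  ...

def initialize():
  # Uniform initialization for every job type. The interpreter runs the
  # handler on normal exit, so short-lived jobs flush without waiting a
  # full FLUSH_INTERVAL_SECONDS.
  atexit.register(flush_metrics)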
I can refactor this in the upcoming batch PR.
There is one last complexity: Kubernetes can kill cronjobs that time out with a SIGTERM, followed by a SIGKILL. This SIGTERM has to be handled somehow.
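One possible way to handle that, as a sketch (flush_metrics() is the same hypothetical helper as above; atexit handlers do not run on an unhandled SIGTERM, so the handler exits explicitly):

import signal
import sys

def _handle_sigterm(signum, frame):
  # Flush buffered metrics before Kubernetes escalates to SIGKILL.
  flush_metrics()
  # Exit with the conventional 128 + signal number so the failure stays
  # visible in the exit code.
  sys.exit(128 + signal.SIGTERM)

signal.signal(signal.SIGTERM, _handle_sigterm)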
# Make sure there are no metrics left to flush after monitor.stop()
if self.stop_event.wait(FLUSH_INTERVAL_SECONDS):
  should_stop = True
flush_metrics()
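For context, a rough reconstruction of the run loop this snippet belongs to (the surrounding thread class and flush_metrics() are assumed):

def run(self):
  should_stop = False
  while not should_stop:
    # wait() returns True once monitor.stop() sets the event, so the loop
    # performs one final flush before exiting instead of dropping whatever
    # was recorded after the last interval.
    if self.stop_event.wait(FLUSH_INTERVAL_SECONDS):
      should_stop = True
    flush_metrics()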
Can you add details of this change to the PR description as well? Is this addressing a problem with the final set of metrics not being flushed?
This reverts commit 091c6c2.
Motivation
Kubernetes signals that a cronjob has failed, after retries, through events.
GKE does not make it easy to alert on this failure mode. For cronjobs without retries, the failure is evident in the GCP panel. For cronjobs with retries, the panel shows a success indicator for the last successful run, and the failure is only registered under the events panel.
Alternatives considered
Options available to monitor failing cronjobs are:
Solution
The proposed solution is to reuse the built-in ClusterFuzz metrics implementation and add a gauge metric, CLUSTERFUZZ_CRON_EXIT_CODE, with the cron name as a label.
If the metric is at 1, the cronjob is unhealthy; otherwise, it is healthy. An alert must be set to fire when the metric reaches 1, for every label.
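A minimal sketch of how such a gauge might be defined and recorded, assuming a ClusterFuzz-style monitor API (GaugeMetric, StringField, set()) and a hypothetical run_cron() runner; the metric name and import path are illustrative:

from clusterfuzz._internal.metrics import monitor

CLUSTERFUZZ_CRON_EXIT_CODE = monitor.GaugeMetric(
    'cron/exit_code',
    description='Exit code of the last cronjob run.',
    field_spec=[monitor.StringField('cron_name')])

def run_and_record(cron_name):
  exit_code = run_cron(cron_name)  # hypothetical: returns 0 on success
  # Normalize so 1 means unhealthy, matching the alert described above.
  CLUSTERFUZZ_CRON_EXIT_CODE.set(
      int(exit_code != 0), labels={'cron_name': cron_name})
  return exit_code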
Since cronjobs are ephemeral, there is no need for a thread to continuously flush metrics, so an option to use monitoring without a flusher thread was added. The same approach can be used to fix metrics for swarming/batch.
Also, the flusher thread was changed to make sure leftover metrics are flushed before it stops.
Note: this PR is part of this initiative