ClusterFuzz monitoring improvements #4271

Closed
vitorguidi opened this issue Sep 25, 2024 · 0 comments

@vitorguidi (Collaborator)

This issue serves as an umbrella for the monitoring initiative.

vitorguidi added a commit that referenced this issue Sep 25, 2024
### Motivation

Kubernetes signals that cronjobs fail, after retries, through
[events](https://kubernetes.io/docs/reference/kubernetes-api/cluster-resources/event-v1/).

GKE does not make it easy to alert on this failure mode. For cronjobs
without retries, the failure will be evident in the GCP panel. For
cronjobs with retries, there will be a success indicator for the last
successful cronjob, and the failure will be registered under the events
panel.

### Alternatives considered

Options available to monitor failing cronjobs are:

- The container/restart_count metric from
[GKE](https://cloud.google.com/monitoring/api/metrics_kubernetes). This
is flaky, since a job might succeed on the third attempt. It is also not
easy to pinpoint the cronjob, since we only get the container name as a
label.
- Alerting on a log-based metric derived from the cluster events. The
output of kubectl get events gets dumped to Cloud Logging, so we could
create a metric on events of the form "Saw completed job:
oss-fuzz-apply-ccs-28786460, status: Failed". However, this requires
regex manipulation, has to be applied manually across all projects, and
still makes it hard to derive the failing cronjob from the container
name. It also adds a hard dependency on Kubernetes.

### Solution

The proposed solution is to reuse the built-in ClusterFuzz metrics
implementation and add a gauge metric, CLUSTERFUZZ_CRON_EXIT_CODE, with
the cron name as a label.

If the metric is at 1, the cronjob is unhealthy; otherwise it is
healthy. An alert must be set, for every label, for when the metric
reaches 1.
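A minimal sketch of the idea, using hypothetical helper names rather than the real ClusterFuzz monitor API:

```python
import subprocess


class GaugeMetric:
  """Tiny stand-in for a monitoring gauge keyed by label values."""

  def __init__(self, name, field_names):
    self.name = name
    self.field_names = field_names
    self._values = {}

  def set(self, value, labels):
    # Keep only the latest value per label combination.
    self._values[tuple(labels[f] for f in self.field_names)] = value


# Hypothetical declaration mirroring the metric described above.
CLUSTERFUZZ_CRON_EXIT_CODE = GaugeMetric(
    'cron/exit_code', field_names=['cron_name'])


def run_cron(cron_name, argv):
  """Runs a cron command and records 0 (healthy) or 1 (unhealthy)."""
  returncode = subprocess.run(argv).returncode
  CLUSTERFUZZ_CRON_EXIT_CODE.set(
      0 if returncode == 0 else 1, labels={'cron_name': cron_name})
  return returncode
```

An alerting policy would then fire whenever the gauge reports 1 for any cron_name label.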
 
Since cronjobs are ephemeral, there is no need for a thread to
continuously flush metrics. The option to use monitoring without a
flushing thread was added. The same approach can be used to fix metrics
for swarming/batch.
 
Also, the flusher thread was changed to make sure that leftover metrics
are flushed before it stops.
 
Note: this PR is part of this
[initiative](#4271)
vitorguidi added a commit that referenced this issue Oct 7, 2024
As things currently stand, metrics are flushed by a background thread
every 10m. However, for the batch/swarming use case, we will lose
metrics for jobs that finish before this interval. To handle that,
monitor.stop will be called before the bot exits.

Finally, SIGTERM is handled in run_bot to avoid losing metrics when
instances get preempted.
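A rough sketch of the shutdown path, using stand-in functions in place of the real monitor initialization and monitor.stop() calls:

```python
import signal
import sys


def _init_monitoring():
  # Stand-in for monitor initialization (the real code wires up flushing).
  print('monitoring initialized')


def _stop_and_flush_monitoring():
  # Stand-in for monitor.stop(); expected to flush any buffered metrics.
  print('flushing leftover metrics')


def _handle_sigterm(signum, frame):
  # Preempted instances receive SIGTERM; flush before the process dies.
  _stop_and_flush_monitoring()
  sys.exit(0)


def run_bot(work):
  _init_monitoring()
  signal.signal(signal.SIGTERM, _handle_sigterm)
  try:
    work()
  finally:
    # Batch/swarming jobs can finish well before the 10-minute background
    # flush, so flush explicitly on the way out.
    _stop_and_flush_monitoring()
```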

This PR is part of #4271.
vitorguidi added a commit that referenced this issue Oct 10, 2024
Whitelisting a service account is not enough to get GCP Uptime checks to
verify that the endpoints on the ClusterFuzz frontend are available. This
change annotates GET handlers with oauth, so the healthcheck does not get
stuck at the login page.

This PR is part of #4271
vitorguidi added a commit that referenced this issue Oct 11, 2024
### Motivation

In order to get healthchecks on appengine handlers, we need to use GCP
Uptime checks. They authenticate through an [oauth id
token](https://cloud.google.com/monitoring/uptime-checks#create). As
things currently stand, oauth on appengine only supports access tokens,
so this PR solves that.


### Alternatives considered

It would be ideal if we only made one API call, to validate either an id
or an access token. However, the
[specification](https://auth0.com/docs/secure/tokens/id-tokens/id-token-structure)
for oauth2 does not present a standard way of, given a token,
differentiating between the two. For that reason, two API calls to GCP
are made.
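A sketch of the two-call validation against Google's tokeninfo endpoint; the real handler code differs, this only illustrates why a second call is needed:

```python
import requests

_TOKENINFO_URL = 'https://oauth2.googleapis.com/tokeninfo'


def validate_token(token):
  """Validates a bearer token as an ID token first, then as an access token.

  OAuth2 offers no standard way to tell the two apart from the token alone,
  so up to two calls to Google are made.
  """
  for param in ('id_token', 'access_token'):
    response = requests.get(_TOKENINFO_URL, params={param: token}, timeout=10)
    if response.ok:
      return response.json()  # token claims (email, aud, expiry, ...)
  return None  # invalid under both interpretations
```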

Part of #4271
vitorguidi added a commit that referenced this issue Oct 11, 2024
### Motivation

Uptime healthchecks are breaking for fuzzer/jobs/corpora. This happens
because the check_user_access annotation is placed BEFORE the oauth one,
which leads to the verification being
[asserted](https://github.com/google/clusterfuzz/blob/master/src/appengine/libs/access.py#L89)
before credentials are fetched.

This PR fixes the annotation order, making authentication happen before
authorization.
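A simplified illustration of the ordering issue, with toy decorators standing in for the real oauth and check_user_access annotations:

```python
import functools


def oauth(func):
  """Toy authentication decorator: fetches credentials onto the handler."""
  @functools.wraps(func)
  def wrapper(self, *args, **kwargs):
    self.email = 'uptime-check@example.iam.gserviceaccount.com'  # stand-in
    return func(self, *args, **kwargs)
  return wrapper


def check_user_access(func):
  """Toy authorization decorator: requires credentials to already exist."""
  @functools.wraps(func)
  def wrapper(self, *args, **kwargs):
    if not getattr(self, 'email', None):
      raise PermissionError('authorization ran before authentication')
    return func(self, *args, **kwargs)
  return wrapper


class FuzzersHandler:
  # The topmost decorator is the outermost wrapper and runs first, so oauth
  # must come before check_user_access for authentication to precede
  # authorization.
  @oauth
  @check_user_access
  def get(self):
    return 'ok'
```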

Part of #4271
vitorguidi added a commit that referenced this issue Oct 14, 2024
### Motivation

Opening and closing bugs is the last step in the clusterfuzz user
journey, and we currently do not have that instrumentation. This PR:

* Enables monitoring in the kubernetes environment
* Enables monitoring on cron jobs
* Moves the wrap_with_monitoring context manager from run_bot.py to
monitoring.py, so it can be reused in run_cron.py (sketched below)
* Collects bugs opened metrics from the triage cronjob
* Collects bugs closed metrics from the cleanup cronjob

The only relevant label for these metrics is the fuzzer name.

For bug filing, we measure how many attempts:
* Succeeded
* Failed
* Got throttled

For bug closing, we measure how many attempts:
* Succeeded
* Failed
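A self-contained sketch of what the shared context manager and the new counters could look like; names and label sets are illustrative, not the exact ClusterFuzz metric definitions:

```python
import collections
import contextlib

# Stand-ins for the real counter metrics; the fuzzer name (plus the outcome)
# is the only label.
_ISSUE_FILING_COUNT = collections.Counter()
_ISSUE_CLOSING_COUNT = collections.Counter()


@contextlib.contextmanager
def wrap_with_monitoring():
  """Initializes metrics before the work runs and flushes them afterwards.

  Living in monitoring.py lets both run_bot.py and run_cron.py reuse it.
  """
  print('monitoring initialized')
  try:
    yield
  finally:
    print('flushed:', dict(_ISSUE_FILING_COUNT), dict(_ISSUE_CLOSING_COUNT))


def record_issue_filing(fuzzer, outcome):
  # outcome: 'succeeded', 'failed' or 'throttled'.
  _ISSUE_FILING_COUNT[(fuzzer, outcome)] += 1


def record_issue_closing(fuzzer, outcome):
  # outcome: 'succeeded' or 'failed'.
  _ISSUE_CLOSING_COUNT[(fuzzer, outcome)] += 1


if __name__ == '__main__':
  # e.g. run_cron.py wrapping the triage/cleanup cron body.
  with wrap_with_monitoring():
    record_issue_filing('libFuzzer_example', 'succeeded')
    record_issue_closing('libFuzzer_example', 'failed')
```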

Part of #4271
vitorguidi added a commit that referenced this issue Oct 15, 2024
### Motivation

Sync admins only replicates admins to Datastore if they conform to
'user:{email}', but it is useful to also allow service accounts of the
form 'serviceAccount:{email}' for monitoring purposes (uptime).

This PR fixes that.

Part of #4271
vitorguidi added a commit that referenced this issue Oct 16, 2024
### Motivation

As a request from the Chrome team, it would be nice to know which of the
following a bot corresponds to:
* OS type
* OS version
* Release (prod/candidate)

This PR implements that.

Part of #4271
vitorguidi added a commit that referenced this issue Oct 23, 2024
### Motivation

We currently have no way to tell whether the Chrome test syncer is
running successfully. This PR implements a counter metric that can be
alerted on if it is missing for 12h (or some other arbitrary threshold).

The monitoring wrapper must be applied in main for this script, since it
is the entrypoint.
Reference:
https://github.com/google/clusterfuzz/blob/master/docker/chromium/tests-syncer/Dockerfile#L17
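A minimal, self-contained sketch of the success counter and the entrypoint wiring; the names are illustrative, and the alert itself would be configured in Cloud Monitoring as "no increase for 12h":

```python
import collections
import contextlib

# Illustrative success counter; an absence alert fires if it stops
# increasing for ~12h.
_TESTS_SYNC_SUCCESS = collections.Counter()


@contextlib.contextmanager
def _monitoring():
  # Stand-in for the monitoring wrapper; flushes the counter on exit.
  try:
    yield
  finally:
    print('flushed:', dict(_TESTS_SYNC_SUCCESS))


def sync_tests():
  # ... actual test syncing would run here ...
  _TESTS_SYNC_SUCCESS['success'] += 1


def main():
  # This script is the Docker entrypoint, so monitoring is wired up in main().
  with _monitoring():
    sync_tests()


if __name__ == '__main__':
  main()
```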

Part of #4271
vitorguidi added a commit that referenced this issue Oct 23, 2024
### Motivation 

#4312 added the oauth handler to several GET endpoints, in order for GCP
Uptime to probe them. However, the decorator assumed that all handlers
would be of the form func(self), not declaring args or kwargs.

This is not true for the following signatures:
```
coverage_report.py
...
  def get(self, report_type=None, argument=None, date=None, extra=None):
  
fuzzer_stats.py
...
  def get(self, extra=None):
```

This PR adds *args and **kwargs to the wrapper, so it can work for these
endpoints.
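A sketch of the generalized wrapper; the decorator simply forwards whatever positional and keyword arguments the handler declares (toy names, not the real handler code):

```python
import functools


def oauth(func):
  """Toy version of the oauth annotation that forwards extra arguments."""
  @functools.wraps(func)
  def wrapper(self, *args, **kwargs):
    # ... credential check would happen here ...
    return func(self, *args, **kwargs)
  return wrapper


class CoverageReportHandler:
  @oauth
  def get(self, report_type=None, argument=None, date=None, extra=None):
    # Works now that the wrapper forwards *args/**kwargs.
    return (report_type, argument, date, extra)
```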

Part of #4271 

Error groups:

[coverage](https://pantheon.corp.google.com/errors/detail/CKrE1Jfd88vKIQ;service=;version=;filter=%5B%22handler%22%5D;time=P7D;locations=global?e=-13802955&inv=1&invt=AbfeYw&mods=logs_tg_prod&project=clusterfuzz-external)

[stats](https://pantheon.corp.google.com/errors/detail/CMiEwKaYs4DfEA;service=;version=;filter=%5B%22handler%22%5D;time=P7D;locations=global?e=-13802955&inv=1&invt=AbfeYw&mods=logs_tg_prod&project=clusterfuzz-external)
jonathanmetzman added a commit that referenced this issue Dec 20, 2024
There is no clear way to partition ErrorType enums under the criteria
proposed, so this change is being reverted.

Part of #4271

Co-authored-by: Vitor Guidi <[email protected]>
vitorguidi added a commit that referenced this issue Dec 23, 2024
#4516)

### Motivation

#4458 implemented an error rate for utasks, only considering exceptions.
In #4499, outcomes were split among success, failure and maybe_retry
conditions. There we learned that the volume of retryable outcomes is
negligible, so it makes sense to count them as failures.

Listing out all the success conditions under _MetricRecorder is not
desirable. However, we are consciously taking on this technical debt so
we can deliver #4271.

A refactor of uworker main will be performed later, so we can split the
success and failure conditions, both of which are currently mixed in
uworker_output.ErrorType.

Reference for tech debt acknowledgement: #4517
vitorguidi added a commit that referenced this issue Dec 26, 2024
#4516)

### Motivation

#4458 implemented an error rate for utasks, only considering exceptions.
In #4499, outcomes were split among success, failure and maybe_retry
conditions. There we learned that the volume of retryable outcomes is
negligible, so it makes sense to count them as failures.

Listing out all the success conditions under _MetricRecorder is not
desirable. However, we are consciously taking on this technical debt so
we can deliver #4271.

A refactor of uworker main will be performed later, so we can split the
success and failure conditions, both of which are currently mixed in
uworker_output.ErrorType.

Reference for tech debt acknowledgement: #4517
vitorguidi added a commit that referenced this issue Dec 26, 2024
### Motivation

As a final step for the monitoring project, this PR creates a dashboard
for overall system health.

The reasoning behind it is to have:
* Business level metrics (fuzzing hours, generated testcases, issues
filed, issues closed, testcases for which processing is pending)
* Testcase metrics (untriaged testcase age and count)
* SQS metrics (queue size, and published messages, per topic)
* Datastore/GCS metrics (number of requests, error rate, and latencies)
* Utask level metrics (duration, number of executions, error rate,
latency)

These are sufficient to apply the [RED
methodology](https://grafana.com/blog/2018/08/02/the-red-method-how-to-instrument-your-services/)
(rate, error and duration), provide high level metrics we can alert on,
and aid in troubleshooting outages with a well defined methodology.

There were two options for committing this to version control: terraform
or butler definitions. The first was chosen, since it is the preferred
long-term solution and is also simpler to implement, as it supports
copy-pasting the JSON definition from GCP.

### Attention points

This should be automatically imported from main.tf, so it should be
sufficient to just place the .tf file in the same folder and have
butler deploy handle the terraform apply step.

### How to review

Head over to go/cf-chrome-health-beta, internally. The actual dashboard
definition is not expected to be reviewed; it is just a dump of
what was manually created in GCP.

Part of #4271
vitorguidi added a commit that referenced this issue Dec 26, 2024
This brings back #4497. The issue was that, in promql definitions, the
'{}' characters were missing a double escape (\\\\{ and \\\\}).
Also, the only parameter to the terraform object should be
dashboard_json.

Testing: terraform validate


![image](https://github.com/user-attachments/assets/f67434d0-c8e4-4bc0-be38-27188c98ea49)

Part of #4271
vitorguidi added a commit that referenced this issue Dec 27, 2024
There is no clear way to partition ErrorType enums under the criteria
proposed, so this change is being reverted.

Part of #4271
vitorguidi added a commit that referenced this issue Dec 27, 2024
#4516)

### Motivation

#4458 implemented an error rate for utasks, only considering exceptions.
In #4499, outcomes were split among success, failure and maybe_retry
conditions. There we learned that the volume of retryable outcomes is
negligible, so it makes sense to count them as failures.

Listing out all the success conditions under _MetricRecorder is not
desirable. However, we are consciously taking on this technical debt so
we can deliver #4271.

A refactor of uworker main will be performed later, so we can split the
success and failure conditions, both of which are currently mixed in
uworker_output.ErrorType.

Reference for tech debt acknowledgement: #4517
vitorguidi added a commit that referenced this issue Dec 27, 2024
### Motivation

As a final step for the monitoring project, this PR creates a dashboard
for overall system health.

The reasoning behind it is to have:
* Business level metrics (fuzzing hours, generated testcases, issues
filed, issues closed, testcases for which processing is pending)
* Testcase metrics (untriaged testcase age and count)
* SQS metrics (queue size, and published messages, per topic)
* Datastore/GCS metrics (number of requests, error rate, and latencies)
* Utask level metrics (duration, number of executions, error rate,
latency)

These are sufficient to apply the [RED
methodology](https://grafana.com/blog/2018/08/02/the-red-method-how-to-instrument-your-services/)
(rate, error and duration), provide high level metrics we can alert on,
and aid in troubleshooting outages with a well defined methodology.

There were two options for committing this to version control: terraform
or butler definitions. The first was chosen, since it is the preferred
long-term solution and is also simpler to implement, as it supports
copy-pasting the JSON definition from GCP.

### Attention points

This should be automatically imported from main.tf, so it should be
sufficient to just place the .tf file in the same folder and have
butler deploy handle the terraform apply step.

### How to review

Head over to go/cf-chrome-health-beta, internally. The actual dashboard
definition is not expected to be reviewed; it is just a dump of
what was manually created in GCP.

Part of #4271
vitorguidi added a commit that referenced this issue Dec 27, 2024
This brings back #4497. The issue was that, in promql definitions, the
'{}' characters were missing a double escape (\\\\{ and \\\\}).
Also, the only parameter to the terraform object should be
dashboard_json.

Testing: terraform validate


![image](https://github.com/user-attachments/assets/f67434d0-c8e4-4bc0-be38-27188c98ea49)

Part of #4271
vitorguidi added a commit that referenced this issue Dec 27, 2024
Chrome is split across two projects internally, and the way terraform is
currently organized will create the dashboard in the wrong project.

This PR makes the dashboard reference the secondary project id, thus
solving the problem.

Part of #4271
vitorguidi added a commit that referenced this issue Dec 27, 2024
### Motivation

As a final step for the monitoring project, this PR creates a dashboard
for overall system health.

The reasoning behind it is to have:
* Business level metrics (fuzzing hours, generated testcases, issues
filed, issues closed, testcases for which processing is pending)
* Testcase metrics (untriaged testcase age and count)
* SQS metrics (queue size, and published messages, per topic)
* Datastore/GCS metrics (number of requests, error rate, and latencies)
* Utask level metrics (duration, number of executions, error rate,
latency)

These are sufficient to apply the [RED
methodology](https://grafana.com/blog/2018/08/02/the-red-method-how-to-instrument-your-services/)
(rate, error and duration), provide high level metrics we can alert on,
and aid in troubleshooting outages with a well defined methodology.

There were two options for committing this to version control: terraform
or butler definitions. The first was chosen, since it is the preferred
long-term solution and is also simpler to implement, as it supports
copy-pasting the JSON definition from GCP.

### Attention points

This should be automatically imported from main.tf, so it should be
sufficient to just place the .tf file in the same folder and have
butler deploy handle the terraform apply step.

### How to review

Head over to go/cf-chrome-health-beta, internally. The actual dashboard
definition is not expected to be reviewed; it is just a dump of
what was manually created in GCP.

Part of #4271
vitorguidi added a commit that referenced this issue Dec 27, 2024
This brings back #4497. The issue was that, in promql definitions, the
'{}' characters were missing a double escape (\\\\{ and \\\\}).
Also, the only parameter to the terraform object should be
dashboard_json.

Testing: terraform validate


![image](https://github.com/user-attachments/assets/f67434d0-c8e4-4bc0-be38-27188c98ea49)

Part of #4271
vitorguidi added a commit that referenced this issue Dec 27, 2024
Chrome is split across two projects internally, and the way terraform is
currently organized will create the dashboard in the wrong project.

This PR makes the dashboard reference the secondary project id, thus
solving the problem.

Part of #4271
vitorguidi added a commit that referenced this issue Jan 2, 2025
Appengine crons are currently not instrumented, so we cannot get
testcase metrics for the google3 deployment. This PR fixes that.

Part of #4271
jonathanmetzman pushed a commit that referenced this issue Jan 8, 2025
### Motivation

We currently have no metric that tracks the error rate for each task.
This PR implements that: the error rate can be obtained by dividing the
sum of the metric with outcome=failure by the overall sum.

This is useful for SLI alerting.
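In formula form (using a stand-in metric name, not the exact one defined in the PR):

$$
\text{task error rate} = \frac{\sum \text{outcome\_count}\{\text{outcome}=\text{failure}\}}{\sum \text{outcome\_count}}
$$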

Part of #4271
jonathanmetzman pushed a commit that referenced this issue Jan 8, 2025
…COUNT (#4500)

### Motivation

The UNTRIAGED_TESTCASE_COUNT metric worked as expected; however, it is
hard to figure out which job a testcase belongs to and why it is stuck.
This PR adds a job label, and a status label which can take one of four
values:

PENDING_CRITICAL_TASKS = 'pending_critical_tasks'
PENDING_PROGRESSION = 'pending_progression'
PENDING_GROUPING = 'pending_grouping'
PENDING_FILING = 'pending_filing'

Since (job, status) tuples might disappear from one triage run to the
next, the gauge should be considered valid for, at most, the interval at
which the triage cronjob runs; otherwise it might overcount.
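A sketch of how the gauge could be emitted per (job, status) pair inside the triage cron; the helpers and classification below are hypothetical stand-ins, not the real metric definition:

```python
import collections


def untriaged_status(testcase):
  # Stand-in classification; the real logic inspects the testcase state and
  # returns one of pending_critical_tasks, pending_progression,
  # pending_grouping or pending_filing.
  return 'pending_critical_tasks'


def emit_untriaged_testcase_count(untriaged_testcases, gauge_set):
  """Aggregates counts per (job, status) and writes them to the gauge.

  gauge_set(value, labels) is a stand-in for the gauge's set() method.
  Because (job, status) pairs can vanish between triage runs, the values
  should only be trusted for roughly one triage interval.
  """
  counts = collections.Counter(
      (t['job'], untriaged_status(t)) for t in untriaged_testcases)
  for (job, status), count in counts.items():
    gauge_set(count, labels={'job': job, 'status': status})
```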

Part of #4271
jonathanmetzman pushed a commit that referenced this issue Jan 8, 2025
…ss, maybe_retry and failure outcomes (#4499)

### Motivation

#4458 implemented a task outcome metric, so we can track error rates in
utasks, by job/task/subtask.

As failures are expected in ClusterFuzz, initially only unhandled
exceptions were considered actual errors. Chrome folks asked for a
better partitioning of error codes, which is implemented here as the
following outcomes (see the sketch after this list):

* success: the task has unequivocally succeeded, producing a sane result
* maybe_retry: some transient error happened, and the task is
potentially being retried. This might capture some unretriable failure
condition, but it is a compromise we are willing to make in order to
decrease false positives.
* failure: the task has unequivocally failed.
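A sketch of how the outcome could be derived and recorded; the names and the transient-error set below are hypothetical, while the real mapping is driven by uworker_output.ErrorType:

```python
# Hypothetical set of transient error names; the real values come from
# uworker_output.ErrorType.
_TRANSIENT_ERRORS = {'BUILD_SETUP_FAILED', 'BOT_PREEMPTED'}


def classify_outcome(error_type):
  """Maps a task result to one of success / maybe_retry / failure."""
  if error_type is None:
    return 'success'
  if error_type in _TRANSIENT_ERRORS:
    return 'maybe_retry'
  return 'failure'


def record_task_outcome(counter, task, job, subtask, error_type):
  # counter[(labels...)] += 1 stands in for the real counter metric.
  outcome = classify_outcome(error_type)
  counter[(task, job, subtask, outcome)] += 1
```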

Part of #4271
jonathanmetzman pushed a commit that referenced this issue Jan 8, 2025
…zzer generated test cases (#4481)

### Motivation

[#4364](#4364) implemented tracking for the time it takes ClusterFuzz to
complete several steps of the manually uploaded testcase lifecycle.

As per Chrome's request, the metric now contains an 'origin' label,
which indicates whether the testcase was 'manually_uploaded' or
generated by a 'fuzzer'.

The code was also simplified by reusing the get_age_in_seconds method
from the TestCase entity.

It also adds a 'stuck_in_triage' boolean field to the testcase entity,
to make it easier to figure out which testcases are stuck, so follow-up
actions can be taken.
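A sketch of the new field, assuming a Cloud NDB model; only the added field is shown and the class is illustrative, not the full entity:

```python
from google.cloud import ndb


class Testcase(ndb.Model):
  # Existing fields omitted. New flag to make stuck testcases easy to query,
  # e.g. Testcase.query(Testcase.stuck_in_triage == True).
  stuck_in_triage = ndb.BooleanProperty(default=False)
```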

Part of #4271
jonathanmetzman pushed a commit that referenced this issue Jan 8, 2025
There is no clear way to partition ErrorType enums under the criteria
proposed, so this change is being reverted.

Part of #4271
jonathanmetzman pushed a commit that referenced this issue Jan 8, 2025
#4516)

### Motivation

#4458 implemented an error rate for utasks, only considering exceptions.
In #4499, outcomes were split among success, failure and maybe_retry
conditions. There we learned that the volume of retryable outcomes is
negligible, so it makes sense to count them as failures.

Listing out all the success conditions under _MetricRecorder is not
desirable. However, we are consciously taking on this technical debt so
we can deliver #4271.

A refactor of uworker main will be performed later, so we can split the
success and failure conditions, both of which are currently mixed in
uworker_output.ErrorType.

Reference for tech debt acknowledgement: #4517
jonathanmetzman pushed a commit that referenced this issue Jan 8, 2025
### Motivation

As a final step for the monitoring project, this PR creates a dashboard
for overall system health.

The reasoning behind it is to have:
* Business level metrics (fuzzing hours, generated testcases, issues
filed, issues closed, testcases for which processing is pending)
* Testcase metrics (untriaged testcase age and count)
* SQS metrics (queue size, and published messages, per topic)
* Datastore/GCS metrics (number of requests, error rate, and latencies)
* Utask level metrics (duration, number of executions, error rate,
latency)

These are sufficient to apply the [RED
methodology](https://grafana.com/blog/2018/08/02/the-red-method-how-to-instrument-your-services/)
(rate, error and duration), provide high level metrics we can alert on,
and aid in troubleshooting outages with a well defined methodology.

There were two options for committing this to version control: terraform
or butler definitions. The first was chosen, since it is the preferred
long-term solution and is also simpler to implement, as it supports
copy-pasting the JSON definition from GCP.

### Attention points

This should be automatically imported from main.tf, so it should be
sufficient to just place the .tf file in the same folder and have
butler deploy handle the terraform apply step.

### How to review

Head over to go/cf-chrome-health-beta, internally. The actual dashboard
definition is not expected to be reviewed; it is just a dump of
what was manually created in GCP.

Part of #4271
jonathanmetzman pushed a commit that referenced this issue Jan 8, 2025
This brings back #4497. The issue was that, in promql definitions, the
'{}' characters were missing a double escape (\\\\{ and \\\\}).
Also, the only parameter to the terraform object should be
dashboard_json.

Testing: terraform validate


![image](https://github.com/user-attachments/assets/f67434d0-c8e4-4bc0-be38-27188c98ea49)

Part of #4271
jonathanmetzman pushed a commit that referenced this issue Jan 8, 2025
Chrome is split across two projects internally, and the way terraform is
currently organized will create the dashboard in the wrong project.

This PR makes the dashboard reference the secondary project id, thus
solving the problem.

Part of #4271
jonathanmetzman pushed a commit that referenced this issue Jan 8, 2025
Appengine crons are currently not instrumented, so we cannot get
testcase metrics for the google3 deployment. This PR fixes that.

Part of #4271