Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Monitoring] Testcase upload metrics for the triage lifecycle #4364

Merged
merged 26 commits into from
Nov 13, 2024

Conversation

vitorguidi
Copy link
Collaborator

@vitorguidi vitorguidi commented Oct 30, 2024

Motivation

Chrome security shepherds manually upload testcases through appengine, triggering analyze task and, in case of a legitimate crash, the followup progression tasks:

  • Minimize
  • Analyze
  • Impact
  • Regression
  • Cleanup cronjob, when updating a bug to inform the user that all above stages were finished

This PR adds instrumentation to track the time elapsed between the user upload, and the completion of the above events.

Attention points

  • TestcaseUploadMetadata.timestamp was being mutated on the preprocess stage for analyze task. This mutation was removed, so that this entity can be the source of truth for when a testcase was in fact uploaded by the user.

  • The job name could be retrieved from the JOB_NAME env var within the uworker, however this does not work for the cleanup use case. For this reason, the job name is fetched from datastore instead.

  • The query_testcase_upload_metadata method was moved from analyze_task.py to a helpers file, so it could be reused across tasks and on the cleanup cronjob

Testing strategy

Every task mentioned was executed locally, with a valid uploaded testcase. The codepath for the metric emission was hit and produced the desired output, both for the tasks and the cronjob.

Part of #4271

@vitorguidi vitorguidi changed the title [WIP] Testcase upload metrics for the triage lifecycle Testcase upload metrics for the triage lifecycle Nov 1, 2024
@vitorguidi vitorguidi changed the title Testcase upload metrics for the triage lifecycle [Monitoring] Testcase upload metrics for the triage lifecycle Nov 8, 2024
@jonathanmetzman
Copy link
Collaborator

TestcaseUploadMetadata.timestamp was being mutated on the preprocess stage for analyze task. This mutation was removed, so that this entity can be the source of truth for when a testcase was in fact uploaded by the user.

Why was it OK to remove this?

Copy link
Collaborator

@jonathanmetzman jonathanmetzman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Collaborator

@alhijazi alhijazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM % Jonathan's comments

@vitorguidi
Copy link
Collaborator Author

TestcaseUploadMetadata.timestamp was being mutated on the preprocess stage for analyze task. This mutation was removed, so that this entity can be the source of truth for when a testcase was in fact uploaded by the user.

Why was it OK to remove this?

The only place where this gets used is here

query = datastore_query.Query(data_types.TestcaseUploadMetadata)

This would only change the presentation of the uploaded testcases page in appengine, so I expect no badness from this movement.

@vitorguidi vitorguidi merged commit 2073870 into master Nov 13, 2024
7 checks passed
@vitorguidi vitorguidi deleted the feat/upload-time branch November 13, 2024 15:37
vitorguidi added a commit that referenced this pull request Nov 22, 2024
Chrome security shepherds manually upload testcases through appengine,
triggering analyze task and, in case of a legitimate crash, the followup
progression tasks:
* Minimize
* Analyze
* Impact
* Regression
* Cleanup cronjob, when updating a bug to inform the user that all above
stages were finished

This PR adds instrumentation to track the time elapsed between the user
upload, and the completion of the above events.

* TestcaseUploadMetadata.timestamp was being mutated on the preprocess
stage for analyze task. This mutation was removed, so that this entity
can be the source of truth for when a testcase was in fact uploaded by
the user.

* The job name could be retrieved from the JOB_NAME env var within the
uworker, however this does not work for the cleanup use case. For this
reason, the job name is fetched from datastore instead.

* The ```query_testcase_upload_metadata``` method was moved from
analyze_task.py to a helpers file, so it could be reused across tasks
and on the cleanup cronjob

Every task mentioned was executed locally, with a valid uploaded
testcase. The codepath for the metric emission was hit and produced the
desired output, both for the tasks and the cronjob.

Part of #4271
vitorguidi added a commit that referenced this pull request Dec 12, 2024
…estcase count (#4494)

### Motivation

#4364 implemented a metric to track the percentile distributions of
untriaged testcase age. It was overcounting testcases:
* Testcases for which a bug was already filed
* Testcases for which the crash was unimportant

This PR solves this issue, and adds the UNTRIAGED_TESTCASE_COUNT metric,
drilled down by job and platform, so we can also know how many testcases
are stuck, and not only their age distribution.
vitorguidi added a commit that referenced this pull request Dec 16, 2024
…zzer generated test cases (#4481)

### Motivation

[#4364](#4364) implemented the
tracking for the time it takes clusterfuzz to complete several steps of
the manually uploaded testcase lifecycle.

As per Chrome's request, the metric will now contain an 'origin' label,
which indicates if the testcase was 'manually_uploaded' or generated by
a 'fuzzer'.

The code was also simplified, by reusing the get_age_in_seconds method
from the TestCase entity.

Also, it adds the 'stuck_in_triage' boolean field to the testcase
entity, to facilitate figuring out what testcases are in a stuck state,
so follow up actions can be taken.

Part of #4271
vitorguidi added a commit that referenced this pull request Dec 16, 2024
…zzer generated test cases (#4481)

[#4364](#4364) implemented the
tracking for the time it takes clusterfuzz to complete several steps of
the manually uploaded testcase lifecycle.

As per Chrome's request, the metric will now contain an 'origin' label,
which indicates if the testcase was 'manually_uploaded' or generated by
a 'fuzzer'.

The code was also simplified, by reusing the get_age_in_seconds method
from the TestCase entity.

Also, it adds the 'stuck_in_triage' boolean field to the testcase
entity, to facilitate figuring out what testcases are in a stuck state,
so follow up actions can be taken.

Part of #4271
vitorguidi added a commit that referenced this pull request Dec 23, 2024
…stuck in analyze (#4547)

### Motivation

We currently have no way to tell if analyze task was successfully
executed. The TESTCASE_UPLOAD_TRIAGE_DURATION metric from #4364 would
only track duration for tasks that did finish.

An analyze_pending field is added to the Testcase entity in datastore,
which is set to False by default, to True for manually uploaded
testcases, and to False once analyze task postprocess runs.

It also increments the UNTRIAGED_TESTCASE_AGE metric from #4381 with a
status label, so we can know at what step the testcase is stuck, thus
allowing us to alert if analyze is taking longer to finish than
expected.

The alert itself could be, for instance, P50 age of untriaged testcase
(status=analyze_pending) > 3h.

Also, this retroactively addresses comments from #4481:

* Fixes docstring for emit_testcase_triage_duration_metric
* Removes assertions
* Renames TESTCASE_UPLOAD_TRIAGE_DURATION to TESTCASE_TRIAGE_DURATION,
since it now accounts for fuzzer generated testcases
* Use a boolean "from_fuzzer" field, instead of "origin" string, in
TESTCASE_TRIAGE_DURATION
vitorguidi added a commit that referenced this pull request Dec 26, 2024
…stuck in analyze (#4547)

### Motivation

We currently have no way to tell if analyze task was successfully
executed. The TESTCASE_UPLOAD_TRIAGE_DURATION metric from #4364 would
only track duration for tasks that did finish.

An analyze_pending field is added to the Testcase entity in datastore,
which is set to False by default, to True for manually uploaded
testcases, and to False once analyze task postprocess runs.

It also increments the UNTRIAGED_TESTCASE_AGE metric from #4381 with a
status label, so we can know at what step the testcase is stuck, thus
allowing us to alert if analyze is taking longer to finish than
expected.

The alert itself could be, for instance, P50 age of untriaged testcase
(status=analyze_pending) > 3h.

Also, this retroactively addresses comments from #4481:

* Fixes docstring for emit_testcase_triage_duration_metric
* Removes assertions
* Renames TESTCASE_UPLOAD_TRIAGE_DURATION to TESTCASE_TRIAGE_DURATION,
since it now accounts for fuzzer generated testcases
* Use a boolean "from_fuzzer" field, instead of "origin" string, in
TESTCASE_TRIAGE_DURATION
vitorguidi added a commit that referenced this pull request Dec 27, 2024
…stuck in analyze (#4547)

### Motivation

We currently have no way to tell if analyze task was successfully
executed. The TESTCASE_UPLOAD_TRIAGE_DURATION metric from #4364 would
only track duration for tasks that did finish.

An analyze_pending field is added to the Testcase entity in datastore,
which is set to False by default, to True for manually uploaded
testcases, and to False once analyze task postprocess runs.

It also increments the UNTRIAGED_TESTCASE_AGE metric from #4381 with a
status label, so we can know at what step the testcase is stuck, thus
allowing us to alert if analyze is taking longer to finish than
expected.

The alert itself could be, for instance, P50 age of untriaged testcase
(status=analyze_pending) > 3h.

Also, this retroactively addresses comments from #4481:

* Fixes docstring for emit_testcase_triage_duration_metric
* Removes assertions
* Renames TESTCASE_UPLOAD_TRIAGE_DURATION to TESTCASE_TRIAGE_DURATION,
since it now accounts for fuzzer generated testcases
* Use a boolean "from_fuzzer" field, instead of "origin" string, in
TESTCASE_TRIAGE_DURATION
jonathanmetzman pushed a commit that referenced this pull request Jan 8, 2025
…estcase count (#4494)

### Motivation

#4364 implemented a metric to track the percentile distributions of
untriaged testcase age. It was overcounting testcases:
* Testcases for which a bug was already filed
* Testcases for which the crash was unimportant

This PR solves this issue, and adds the UNTRIAGED_TESTCASE_COUNT metric,
drilled down by job and platform, so we can also know how many testcases
are stuck, and not only their age distribution.
jonathanmetzman pushed a commit that referenced this pull request Jan 8, 2025
…zzer generated test cases (#4481)

### Motivation

[#4364](#4364) implemented the
tracking for the time it takes clusterfuzz to complete several steps of
the manually uploaded testcase lifecycle.

As per Chrome's request, the metric will now contain an 'origin' label,
which indicates if the testcase was 'manually_uploaded' or generated by
a 'fuzzer'.

The code was also simplified, by reusing the get_age_in_seconds method
from the TestCase entity.

Also, it adds the 'stuck_in_triage' boolean field to the testcase
entity, to facilitate figuring out what testcases are in a stuck state,
so follow up actions can be taken.

Part of #4271
jonathanmetzman pushed a commit that referenced this pull request Jan 8, 2025
…stuck in analyze (#4547)

### Motivation

We currently have no way to tell if analyze task was successfully
executed. The TESTCASE_UPLOAD_TRIAGE_DURATION metric from #4364 would
only track duration for tasks that did finish.

An analyze_pending field is added to the Testcase entity in datastore,
which is set to False by default, to True for manually uploaded
testcases, and to False once analyze task postprocess runs.

It also increments the UNTRIAGED_TESTCASE_AGE metric from #4381 with a
status label, so we can know at what step the testcase is stuck, thus
allowing us to alert if analyze is taking longer to finish than
expected.

The alert itself could be, for instance, P50 age of untriaged testcase
(status=analyze_pending) > 3h.

Also, this retroactively addresses comments from #4481:

* Fixes docstring for emit_testcase_triage_duration_metric
* Removes assertions
* Renames TESTCASE_UPLOAD_TRIAGE_DURATION to TESTCASE_TRIAGE_DURATION,
since it now accounts for fuzzer generated testcases
* Use a boolean "from_fuzzer" field, instead of "origin" string, in
TESTCASE_TRIAGE_DURATION
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants