[Monitoring] Testcase upload metrics for the triage lifecycle #4364

vitorguidi · 2024-10-30T19:22:15Z

Motivation

Chrome security shepherds manually upload testcases through appengine. Each upload triggers the analyze task and, if the crash is legitimate, the follow-up progression tasks:

Minimize
Analyze
Impact
Regression
Cleanup cronjob, when it updates a bug to inform the user that all of the above stages have finished

This PR adds instrumentation to track the time elapsed between the user upload and the completion of each of the above events.
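
As a rough illustration of what is being measured, the sketch below computes the hours elapsed between an upload timestamp and the completion of one step. The function name `triage_duration_hours` and the step strings are hypothetical; they are not taken from the ClusterFuzz code.

```python
import datetime

# Hypothetical step names, for illustration only.
STEPS = ('analyze', 'minimize', 'impact', 'regression', 'issue_updated')


def triage_duration_hours(uploaded_at, completed_at, step):
  """Returns the hours elapsed between the upload and the end of `step`."""
  if step not in STEPS:
    raise ValueError(f'Unknown step: {step}')
  elapsed = completed_at - uploaded_at
  return elapsed.total_seconds() / 3600


uploaded = datetime.datetime(2024, 10, 30, 12, 0, 0)
finished = datetime.datetime(2024, 10, 30, 13, 30, 0)
print(triage_duration_hours(uploaded, finished, 'minimize'))  # 1.5
```

The measurement is only meaningful if the upload timestamp stays fixed after the upload, which is why the first attention point below matters.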

Attention points

TestcaseUploadMetadata.timestamp was being mutated on the preprocess stage for analyze task. This mutation was removed, so that this entity can be the source of truth for when a testcase was in fact uploaded by the user.
The job name could be retrieved from the JOB_NAME env var within the uworker; however, this does not work for the cleanup use case. For this reason, the job name is fetched from datastore instead.
The query_testcase_upload_metadata method was moved from analyze_task.py to a helpers file, so that it can be reused across tasks and in the cleanup cronjob.
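
The three points above can be sketched together, under loud assumptions: the dictionaries below are in-memory stand-ins for datastore, and `UploadMetadata`, `Testcase`, `query_upload_metadata`, `build_metric_sample` and the job name `linux_asan_chrome` are hypothetical names chosen for illustration, not the actual ClusterFuzz entities or helpers.

```python
import dataclasses
import datetime


@dataclasses.dataclass(frozen=True)
class UploadMetadata:
  """Stand-in for the upload metadata; frozen so timestamp cannot change."""
  testcase_id: int
  timestamp: datetime.datetime


@dataclasses.dataclass
class Testcase:
  """Stand-in for the testcase entity; only the fields used here."""
  testcase_id: int
  job_type: str


# In-memory stand-ins for datastore lookups (an assumption of this sketch).
UPLOADS = {}
TESTCASES = {}


def query_upload_metadata(testcase_id):
  """Shared helper: returns the upload metadata, or None if there is none."""
  return UPLOADS.get(testcase_id)


def build_metric_sample(testcase_id, step, now):
  """Builds one sample from the testcase id alone, with no env var lookup."""
  metadata = query_upload_metadata(testcase_id)
  if metadata is None:
    return None  # No upload record, so there is nothing to measure.
  testcase = TESTCASES[testcase_id]  # The job comes from the stored entity.
  hours = (now - metadata.timestamp).total_seconds() / 3600
  return {'step': step, 'job': testcase.job_type, 'hours': hours}


UPLOADS[7] = UploadMetadata(7, datetime.datetime(2024, 10, 30, 12, 0))
TESTCASES[7] = Testcase(7, 'linux_asan_chrome')
print(build_metric_sample(7, 'analyze',
                          datetime.datetime(2024, 10, 30, 14, 0)))
```

Because the sample is built from the testcase id only, the same function can be called from a task and from a cronjob that has no job-specific environment.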

Testing strategy

Every task mentioned was executed locally, with a valid uploaded testcase. The codepath for the metric emission was hit and produced the desired output, both for the tasks and the cronjob.

Part of #4271

src/clusterfuzz/_internal/metrics/monitoring_metrics.py

src/clusterfuzz/_internal/bot/tasks/commons/testcase_utils.py

src/clusterfuzz/_internal/bot/tasks/utasks/minimize_task.py

src/clusterfuzz/_internal/bot/tasks/utasks/analyze_task.py

jonathanmetzman · 2024-11-08T19:50:01Z

TestcaseUploadMetadata.timestamp was being mutated on the preprocess stage for analyze task. This mutation was removed, so that this entity can be the source of truth for when a testcase was in fact uploaded by the user.

Why was it OK to remove this?

jonathanmetzman

lgtm

alhijazi

LGTM % Jonathan's comments

vitorguidi · 2024-11-13T15:08:08Z

TestcaseUploadMetadata.timestamp was being mutated on the preprocess stage for analyze task. This mutation was removed, so that this entity can be the source of truth for when a testcase was in fact uploaded by the user.

Why was it OK to remove this?

The only place where this gets used is here

clusterfuzz/src/appengine/handlers/upload_testcase.py

Line 95 in 082b008

query = datastore_query.Query(data_types.TestcaseUploadMetadata)

Removing the mutation would only change how the uploaded testcases page is presented in appengine, so I do not expect it to cause any problems.

…estcase count (#4494)

### Motivation

#4364 implemented a metric to track the percentile distributions of untriaged testcase age. It was overcounting testcases:

* Testcases for which a bug was already filed
* Testcases for which the crash was unimportant

This PR solves this issue, and adds the UNTRIAGED_TESTCASE_COUNT metric, drilled down by job and platform, so we can also know how many testcases are stuck, and not only their age distribution.
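
The filtering and grouping described in this message could look roughly like the sketch below. The dictionary keys (`has_bug`, `is_important`, `job`, `platform`) and the function name `count_untriaged` are hypothetical and only illustrate the idea; they do not mirror the real entity fields.

```python
import collections


def count_untriaged(testcases):
  """Counts untriaged testcases per (job, platform) pair.

  Testcases that already have a bug filed, or whose crash is unimportant,
  are skipped so that they are not counted as stuck.
  """
  counts = collections.Counter()
  for testcase in testcases:
    if testcase['has_bug'] or not testcase['is_important']:
      continue
    counts[(testcase['job'], testcase['platform'])] += 1
  return counts


sample = [
    {'job': 'job_a', 'platform': 'linux', 'has_bug': False,
     'is_important': True},
    {'job': 'job_a', 'platform': 'linux', 'has_bug': True,
     'is_important': True},
    {'job': 'job_b', 'platform': 'windows', 'has_bug': False,
     'is_important': False},
]
print(count_untriaged(sample))
```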

…zzer generated test cases (#4481)

### Motivation

[#4364](#4364) implemented the tracking for the time it takes clusterfuzz to complete several steps of the manually uploaded testcase lifecycle. As per Chrome's request, the metric will now contain an 'origin' label, which indicates if the testcase was 'manually_uploaded' or generated by a 'fuzzer'.

The code was also simplified, by reusing the get_age_in_seconds method from the TestCase entity.

Also, it adds the 'stuck_in_triage' boolean field to the testcase entity, to facilitate figuring out what testcases are in a stuck state, so follow up actions can be taken.

Part of #4271

…stuck in analyze (#4547)

### Motivation

We currently have no way to tell if analyze task was successfully executed. The TESTCASE_UPLOAD_TRIAGE_DURATION metric from #4364 would only track duration for tasks that did finish.

An analyze_pending field is added to the Testcase entity in datastore, which is set to False by default, to True for manually uploaded testcases, and to False once analyze task postprocess runs.

It also increments the UNTRIAGED_TESTCASE_AGE metric from #4381 with a status label, so we can know at what step the testcase is stuck, thus allowing us to alert if analyze is taking longer to finish than expected. The alert itself could be, for instance, P50 age of untriaged testcase (status=analyze_pending) > 3h.

Also, this retroactively addresses comments from #4481:

* Fixes docstring for emit_testcase_triage_duration_metric
* Removes assertions
* Renames TESTCASE_UPLOAD_TRIAGE_DURATION to TESTCASE_TRIAGE_DURATION, since it now accounts for fuzzer generated testcases
* Use a boolean "from_fuzzer" field, instead of "origin" string, in TESTCASE_TRIAGE_DURATION
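
A minimal sketch of the analyze_pending transitions and of the example alert condition, assuming a simplified stand-in for the Testcase entity. The function names `on_manual_upload`, `on_analyze_postprocess` and `p50_exceeds` are hypothetical, and the median is computed locally here only to illustrate the "P50 > 3h" condition; a real alert would be evaluated by the monitoring system.

```python
import dataclasses
import statistics


@dataclasses.dataclass
class Testcase:
  """Simplified stand-in for the Testcase entity."""
  analyze_pending: bool = False  # False by default, as described above.


def on_manual_upload(testcase):
  """A manual upload marks the testcase as waiting for analyze."""
  testcase.analyze_pending = True


def on_analyze_postprocess(testcase):
  """Analyze postprocess clears the flag."""
  testcase.analyze_pending = False


def p50_exceeds(ages_hours, threshold_hours=3.0):
  """Returns True when the median age is above the threshold."""
  return bool(ages_hours) and statistics.median(ages_hours) > threshold_hours


testcase = Testcase()
on_manual_upload(testcase)
print(testcase.analyze_pending)  # True
on_analyze_postprocess(testcase)
print(testcase.analyze_pending)  # False
print(p50_exceeds([1.0, 4.0, 5.0]))  # True
```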

jonathanmetzman reviewed Oct 31, 2024

src/clusterfuzz/_internal/metrics/monitoring_metrics.py Outdated

vitorguidi requested review from jonathanmetzman, alhijazi and oliverchang November 1, 2024 19:32

vitorguidi changed the title ~~[WIP] Testcase upload metrics for the triage lifecycle~~ Testcase upload metrics for the triage lifecycle Nov 1, 2024

vitorguidi commented Nov 1, 2024

src/clusterfuzz/_internal/bot/tasks/commons/testcase_utils.py Outdated

jonathanmetzman reviewed Nov 4, 2024

vitorguidi changed the title ~~Testcase upload metrics for the triage lifecycle~~ [Monitoring] Testcase upload metrics for the triage lifecycle Nov 8, 2024

vitorguidi added 22 commits November 8, 2024 17:25

Adding testcase upload metric

304b0fe

Fix lint, update metric description

a6ede4b

Adding helper methods to be reused across several tasks

c1c1540

Track time elapsed between testcase upload and analyze task start

9f29762

Renaming metric to TESTCASE_UPLOAD_TRIAGE_DURATION

0a175bc

Emiting TESTCASE_UPLOAD_TRIAGE_DURATION on analyze completion

b3eadc3

Refactoring metric emission to only take testcase id as argument

238d0d0

Adding triage duration to minimize task

10d8c5e

Making testcase metric emission a warning log to avoid noisy error groups

2462b8f

Adding regression completed

75f701a

Adding impact completed

b12b1fb

Adding issue updated

bcb2d38

Fix lint

19890b1

Fixing unit tests

c840c82

Stop UploadedTestcaseMetadata.timestamp from being mutated on analyze task

d97bd1c

Using correct JOB_NAME env var

442b372

Using the job type from the Testcase entity in datastore

140f08d

Fix lint

5e2b34c

Fix lint

eb8524e

Move testcase_utils to _internal/common

8f81600

Fix nits

9f2df37

Fix lint

bc9e364

Fix reference to old function name

cdb5960

vitorguidi force-pushed the feat/upload-time branch from d59292c to cdb5960 Compare November 8, 2024 17:29

Fix lint

f81b23d

jonathanmetzman approved these changes Nov 8, 2024

alhijazi approved these changes Nov 12, 2024

Merge branch 'master' into feat/upload-time

83a5d04

Moving steps to constants

a962985

vitorguidi merged commit 2073870 into master Nov 13, 2024
7 checks passed

vitorguidi deleted the feat/upload-time branch November 13, 2024 15:37

This was referenced Dec 10, 2024

[Monitoring] Extend TESTCASE_UPLOAD_TRIAGE_DURATION to account for fuzzer generated test cases #4481

Merged

[Monitoring] Fix untriaged testcase age oversampling, add untriaged testcase count #4494

Merged

vitorguidi mentioned this pull request Dec 23, 2024

[Monitoring] Enrich UNTRIAGED_TESTCASE_AGE metric to track testcases stuck in analyze #4547

Merged
