KAFKA-18223 Flaky test report script #17938

Merged 7 commits on Dec 12, 2024

@santhoshct (Contributor) commented Nov 25, 2024

Summary

This pull request introduces a new script, develocity_reports.py, that generates detailed reports on flaky tests. It uses the Develocity API to fetch and analyze test results, focusing on identifying and reporting quarantined tests with high failure rates. The script is intended to help developers and CI quickly identify problematic tests that require attention, which should improve the overall quality of our codebase.

Changes

  • New Script Addition: Introduced develocity_reports.py to the .github/scripts directory.
  • Functionality:
    • Fetches test results from the Develocity API for a specified project and test type.
    • Analyzes test outcomes to identify flaky and failed tests.
    • Generates reports highlighting high-priority quarantined tests based on failure rates and quarantine duration.
    • Provides detailed timelines and statistics for each test and test case.
  • Logging: Integrated logging to track the script's execution and handle exceptions gracefully.
  • Configuration: Allows configuration of API base URL, authentication token, project name, and thresholds for quarantine and failure rates.
  • Output: Produces a console report summarizing the most problematic tests, including detailed statistics and recent execution timelines.
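As a rough illustration of the selection step described above, here is a minimal sketch. It assumes that flaky runs count toward the failure rate, i.e. (failed + flaky) / total runs, which is consistent with the figures in the sample output under Testing (4 failed and 143 flaky out of 1013 runs gives 14.51%). The `ClassStats` type, the `high_priority` helper, and both threshold defaults are hypothetical and are not taken from develocity_reports.py.

```python
from dataclasses import dataclass


@dataclass
class ClassStats:
    """Outcome counts for one test class (hypothetical structure)."""
    name: str
    failed: int
    flaky: int
    passed: int
    quarantined_days: int

    @property
    def total_runs(self) -> int:
        return self.failed + self.flaky + self.passed

    @property
    def failure_rate(self) -> float:
        # Assumption: flaky runs count as failures, which reproduces the
        # percentages shown in the sample report.
        if self.total_runs == 0:
            return 0.0
        return (self.failed + self.flaky) / self.total_runs


def high_priority(tests, min_days=7, min_failure_rate=0.10):
    """Return tests quarantined for at least min_days whose failure rate is
    at or above min_failure_rate, worst first. Thresholds are placeholders."""
    selected = [
        t for t in tests
        if t.quarantined_days >= min_days and t.failure_rate >= min_failure_rate
    ]
    return sorted(selected, key=lambda t: t.failure_rate, reverse=True)


stats = ClassStats(
    "org.apache.kafka.tiered.storage.integration.OffloadAndTxnConsumeFromLeaderTest",
    failed=4, flaky=143, passed=866, quarantined_days=14)
print(f"{stats.name}: {stats.failure_rate:.2%}")
```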

Updates

  • Added support for two more report types: identifying flaky test regressions and clearing tests from quarantine.
  • Added support for caching build info in the GitHub Actions cache. This speeds up report generation because the script no longer pulls build info for the entire date range every time; it pulls only the delta date range.
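The delta-range idea can be sketched as follows. This is only an illustration under assumptions: the cache is assumed to record the timestamp up to which build info has already been fetched, and the `delta_range` function and its parameters are hypothetical names, not the script's actual interface.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def delta_range(
        cached_until: Optional[datetime],
        now: datetime,
        lookback_days: int) -> Optional[Tuple[datetime, datetime]]:
    """Return the (start, end) window that still has to be fetched, or None
    when the cache already covers everything up to `now`."""
    window_start = now - timedelta(days=lookback_days)
    if cached_until is None or cached_until < window_start:
        # Cold or stale cache: fetch the whole reporting window.
        return window_start, now
    if cached_until >= now:
        # Nothing new to fetch.
        return None
    # Warm cache: fetch only what was added since the last run.
    return cached_until, now


now = datetime(2024, 12, 11, tzinfo=timezone.utc)
print(delta_range(None, now, 7))                      # full 7-day window
print(delta_range(now - timedelta(days=2), now, 7))   # last 2 days only
```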

Testing

  • Tested manually. Example output:
org.apache.kafka.tiered.storage.integration.OffloadAndTxnConsumeFromLeaderTest
==============================================================================
Quarantined for 14 days
Container Failure Rate: 14.51%
Recent Failure Rate: 14.51%

Container Statistics:
  Total Runs: 1013
  Failed: 4
  Flaky: 143
  Passed: 866

Container Recent Executions:
  Date/Time (UTC)      Outcome    Build ID
  ------------------------------------------------
  2024-11-25 02:51  passed     5kqy57pu3uwxs
  2024-11-25 03:18  passed     rgjbtcmlfk7so
  2024-11-25 03:18  passed     hhko6esalsqco
  2024-11-25 03:30  passed     i35nqmpusibpw
  2024-11-25 03:31  flaky      jddx23jdksg5m

Test Cases (Last 7 Days):
  ------------------------------------------------

  → executeTieredStorageTest(String, String)[1]
    Failure Rate: 10.92%
    Runs: 476 | Failed:   0 | Flaky:  52 | Passed: 424

    Recent Executions:
    Date/Time (UTC)      Outcome    Build ID
    --------------------------------------------
    2024-11-25 03:18  passed     hhko6esalsqco
    2024-11-25 03:30  passed     i35nqmpusibpw
    2024-11-25 03:31  flaky      jddx23jdksg5m

  → executeTieredStorageTest(String, String)[2]
    Failure Rate: 5.25%
    Runs: 476 | Failed:   0 | Flaky:  25 | Passed: 451

    Recent Executions:
    Date/Time (UTC)      Outcome    Build ID
    --------------------------------------------
    2024-11-25 03:18  passed     hhko6esalsqco
    2024-11-25 03:30  passed     i35nqmpusibpw
    2024-11-25 03:31  passed     jddx23jdksg5m
  • Testing for the new report types.
Summary for PR:
==============

1. Flaky Test Regressions
-------------------------
No flaky test regressions found.

2. Cleared Tests (Ready for Unquarantine)
----------------------------------------
Several tests show consistent passing behavior:
- org.apache.kafka.clients.producer.KafkaProducerTest (99.02% success, 410 runs)
- kafka.network.DynamicConnectionQuotaTest (98.58% success, 422 runs)
- kafka.api.SslConsumerTest (98.82% success, 422 runs)
- kafka.api.SaslSslConsumerTest (99.05% success, 422 runs)
- org.apache.kafka.connect.integration.OffsetsApiIntegrationTest (84.60% success, 422 runs)

3. Quarantined Tests Analysis
----------------------------
Test: org.apache.kafka.tiered.storage.integration.OffloadAndTxnConsumeFromLeaderTest

Key Metrics:
- Quarantined for: 7 days
- Overall Failure Rate: 14.21%
- Total Runs: 366 (Failed: 4, Flaky: 48, Passed: 314)

Test Cases Analysis:
1. executeTieredStorageTest[1]:
   - Failure Rate: 11.20%
   - Distribution: 366 runs (2 Failed, 39 Flaky, 325 Passed)

2. executeTieredStorageTest[2]: 
   - Failure Rate: 6.28%
   - Distribution: 366 runs (0 Failed, 23 Flaky, 343 Passed)

Detailed logs and complete test history are available in the attached report file.

Test Analysis Report (2024-12-03 08:23:37 UTC).txt

Updated report with Test Report summary section:

Test Analysis Report (2024-12-11 11:47:27 UTC)
====================================================================================================

Summary of Most Problematic Tests
==================================================

org.apache.kafka.clients.consumer.internals.ConsumerHeartbeatRequestManagerTest
  → testUnsupportedVersion()                                     100.00%

org.apache.kafka.connect.integration.OffsetsApiIntegrationTest
  → testGetSinkConnectorOffsets()                                50.00%
  → testAlterSinkConnectorOffsetsDifferentKafkaClusterTargeted() 9.93%
  → testGetSinkConnectorOffsetsDifferentKafkaClusterTargeted()   6.14%
  → testResetSinkConnectorOffsets()                              5.23%
  → testResetSinkConnectorOffsetsOverriddenConsumerGroupId()     5.05%
  → testAlterSinkConnectorOffsetsOverriddenConsumerGroupId()     0.72%

org.apache.kafka.tiered.storage.integration.TransactionsWithTieredStoreTest
  → testReadCommittedConsumerShouldNotSeeUndecidedData(String, String)[2] 29.96%
  → testBumpTransactionalEpochWithTV2Enabled(String, String, boolean)[1] 22.43%
  → testBumpTransactionalEpochWithTV2Enabled(String, String, boolean)[2] 20.40%
  → testBumpTransactionalEpochWithTV2Disabled(String, String, boolean)[1] 17.69%

kafka.api.TransactionsTest
  → testBumpTransactionalEpochWithTV2Enabled(String, String, boolean)[1] 22.43%
  → testBumpTransactionalEpochWithTV2Enabled(String, String, boolean)[2] 20.04%
  → testBumpTransactionalEpochWithTV2Disabled(String, String, boolean)[1] 9.39%

kafka.api.PlaintextConsumerTest
  → testCloseLeavesGroupOnInterrupt(String, String)[2]           22.38%
  → testCoordinatorFailover(String, String)[2]                   2.17%
  → testCloseLeavesGroupOnInterrupt(String, String)[1]           1.62%
  → testCoordinatorFailover(String, String)[1]                   0.90%

kafka.coordinator.group.CoordinatorPartitionWriterTest
  → testDeleteRecordsResponseContainsError()                     14.29%
  → testDeleteRecordsSuccess()                                   14.29%

==================================================

Detailed Test Reports
====================================================================================================

Flaky Test Regressions
--------------------------------------------------
No flaky test regressions found.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@github-actions github-actions bot added the build Gradle build or GitHub Actions label Nov 25, 2024
…ded support for reporting types flaky test regression and clear tests from quarantine
Member

@mumrah mumrah left a comment

@santhoshct thanks for working on this! This is an excellent start 👍

The report is very detailed, which is great, but can we also include a summary at the top? For example, in the report I just ran it would be great to see something like this for the worst flaky tests:

org.apache.kafka.message.checker.MetadataSchemaCheckerToolTest
  → testVerifyEvolutionGit()  83.33%

org.apache.kafka.tiered.storage.integration.TransactionsWithTieredStoreTest
  → testReadCommittedConsumerShouldNotSeeUndecidedData(String, String)[2] 46.17%
  → testBumpTransactionalEpochWithTV2Enabled(String, String, boolean)[1]  ...
  → testBumpTransactionalEpochWithTV2Enabled(String, String, boolean)[2]  ...
  → testBumpTransactionalEpochWithTV2Disabled(String, String, boolean)[1] ...
  → testBumpTransactionalEpochWithTV2Disabled(String, String, boolean)[2] ...

kafka.api.TransactionsTest
  → testBumpTransactionalEpochWithTV2Enabled(String, String, boolean)[1]  26.09%
  → testBumpTransactionalEpochWithTV2Enabled(String, String, boolean)[2]  ... 
  → testBumpTransactionalEpochWithTV2Disabled(String, String, boolean)[1] ...
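A summary in the shape suggested above could be produced along these lines. This is a minimal sketch: the `format_summary` helper, its input mapping, and the column width are assumptions for illustration, and the two sample rates are the ones quoted in the comment.

```python
def format_summary(rates_by_class, width=60):
    """Render a summary grouped by test class.

    rates_by_class maps a class name to {test case name: failure rate},
    with rates expressed as fractions between 0 and 1.
    """
    lines = []
    # Order classes by their worst test case, highest failure rate first.
    ordered = sorted(
        rates_by_class.items(),
        key=lambda kv: max(kv[1].values(), default=0.0),
        reverse=True)
    for class_name, cases in ordered:
        lines.append(class_name)
        # Within a class, list the flakiest test cases first.
        for case, rate in sorted(cases.items(), key=lambda kv: kv[1], reverse=True):
            lines.append(f"  → {case:<{width}} {rate:.2%}")
        lines.append("")
    return "\n".join(lines)


sample = {
    "kafka.api.TransactionsTest": {
        "testBumpTransactionalEpochWithTV2Enabled(String, String, boolean)[1]": 0.2609,
    },
    "org.apache.kafka.message.checker.MetadataSchemaCheckerToolTest": {
        "testVerifyEvolutionGit()": 0.8333,
    },
}
summary = format_summary(sample)
print(summary)
```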

@@ -0,0 +1,863 @@
import os
Member

Need a license here. See other scripts for example

Contributor Author

Added the license

@@ -1,3 +1,4 @@
<<<<<<< HEAD
Member

Looks like leftovers from a merge conflict

Contributor Author

removed.

"""
return f'project:{project} buildStartTime:[{chunk_start.isoformat()} TO {chunk_end.isoformat()}] gradle.requestedTasks:{test_type}'

def process_chunk(self, chunk_start: datetime, chunk_end: datetime, project: str,
Member

For this and other methods with many arguments, use the following PEP-8 style:

    def process_chunk(
            self,
            chunk_start: datetime,
            chunk_end: datetime,
            project: str,
            test_type: str,
            remaining_build_ids: set,
            max_builds_per_request: int) -> Dict[str, BuildInfo]:

Contributor Author

Corrected this.
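For context, the `chunk_start` and `chunk_end` parameters suggest that the date range is processed in pieces. A minimal sketch of that pattern follows. The `iter_chunks` and `build_query` helpers, the chunk size, and the `kafka` and `test` sample values are assumptions; only the query string format is copied from the line under review.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterator, Tuple


def iter_chunks(
        start: datetime,
        end: datetime,
        chunk: timedelta) -> Iterator[Tuple[datetime, datetime]]:
    """Yield consecutive (chunk_start, chunk_end) pairs covering [start, end)."""
    cursor = start
    while cursor < end:
        chunk_end = min(cursor + chunk, end)
        yield cursor, chunk_end
        cursor = chunk_end


def build_query(
        project: str,
        chunk_start: datetime,
        chunk_end: datetime,
        test_type: str) -> str:
    # Same query format as the f-string shown in the diff above.
    return (f'project:{project} '
            f'buildStartTime:[{chunk_start.isoformat()} TO {chunk_end.isoformat()}] '
            f'gradle.requestedTasks:{test_type}')


start = datetime(2024, 12, 1, tzinfo=timezone.utc)
end = start + timedelta(days=3)
chunks = list(iter_chunks(start, end, timedelta(days=2)))
for chunk_start, chunk_end in chunks:
    print(build_query("kafka", chunk_start, chunk_end, "test"))
```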

reverse=True
)

print(f"\nFound {len(sorted_tests)} high-priority quarantined test containers:")
Member

The count here should be the number of flaky test cases rather than flaky test classes (containers).

Member

Also, we should not use the term "container" in the report since it's kind of confusing. As far as I know, for our purposes a container is always a test class.

Contributor Author

Corrected this to test class.
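The distinction between the two counts can be illustrated with a short sketch. The `count_classes_and_cases` helper and the mapping it takes are hypothetical, and the class names and test case lists are sample data only.

```python
def count_classes_and_cases(cases_by_class):
    """Return (number of test classes, number of test cases) for a mapping of
    test class name -> list of flaky test case names."""
    class_count = len(cases_by_class)
    case_count = sum(len(cases) for cases in cases_by_class.values())
    return class_count, case_count


# Illustrative sample data.
flaky = {
    "kafka.api.TransactionsTest": [
        "testBumpTransactionalEpochWithTV2Enabled(String, String, boolean)[1]",
        "testBumpTransactionalEpochWithTV2Enabled(String, String, boolean)[2]",
        "testBumpTransactionalEpochWithTV2Disabled(String, String, boolean)[1]",
    ],
    "kafka.coordinator.group.CoordinatorPartitionWriterTest": [
        "testDeleteRecordsResponseContainsError()",
        "testDeleteRecordsSuccess()",
    ],
}
class_count, case_count = count_classes_and_cases(flaky)
print(f"\nFound {case_count} high-priority quarantined test cases "
      f"in {class_count} test classes:")
```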


# Show test case timeline
if test_case.timeline:
    print("\n Recent Executions:")
Member

For "Recent" things, let's indicate how far back we're showing in the output.

Contributor Author

Added info about the runs.

2. Corrected the method signatures to PEP 8 style.
3. Added the license header to the script.
4. Replaced the Develocity-specific term "container" with the more generic "test class".
5. Added more info to the recent executions to make the output more descriptive.
@mumrah mumrah changed the title from "KIP 1090 - Reporting integration with Develocity API" to "KAFKA-18223 Flaky test report script" on Dec 12, 2024
Member

@mumrah mumrah left a comment

@santhoshct I've run it locally and the output looks great! I'm going to go ahead and merge this so we can let people start trying it out.

@mumrah mumrah merged commit 5bb1ea4 into apache:trunk Dec 12, 2024
15 checks passed
tedyu pushed a commit to tedyu/kafka that referenced this pull request Jan 6, 2025
Adds a Python script to generate a detailed flaky test report using the Develocity API

Reviewers: David Arthur <[email protected]>
Labels
build Gradle build or GitHub Actions ci-approved