
Break down gradle check task to smaller maintainable verification tasks to improve flaky test failures #1975

Closed
dreamer-89 opened this issue Jan 26, 2022 · 21 comments
Labels
Build Libraries & Interfaces enhancement Enhancement or improvement to existing feature or request

Comments

@dreamer-89
Member

dreamer-89 commented Jan 26, 2022

Today, upon a pull request, a gradle check verification task is started which runs unit and integration tests along with other verification tasks. The gradle check run itself consistently takes a lot of time (~30 min on c5.18xlarge instances). This time increases significantly when smaller machines are used for tests, which is the reason for moving away from GHA for checks. The check also fails quite often for a variety of reasons. Keeping the existing gradle check as a single verification task is not sustainable, and it thus needs to be divided into smaller logical task groups.

Specifically, the following work is needed:

  • Break down gradle check into logical task groups (see the sketch after this list). This is needed because:
    • Individual runs take a lot of time
    • Failures are difficult for new developers to debug
    • Failures are not possible to reproduce on developer machines
  • Move the individual task groups into GitHub Actions, targeting <1 hr runtime for each task. There is ongoing work in this direction, but due to the complexity and runtime hardware needs of the gradle check run, non-GHA options are also being considered.
  • Evaluate macOS runners, which provide better compute (3-core 14 GB macOS vs 2-core 7 GB Ubuntu) virtual machines for resource-intensive tests.
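
As a rough illustration, here is a minimal sketch of what separate CI jobs could invoke; the task names are taken from the project's verification tasks, but the grouping itself is only an assumption for illustration:

# Hypothetical split of gradle check into independent CI jobs (grouping is illustrative)
./gradlew precommit              # job 1: static checks, no test clusters needed
./gradlew test                   # job 2: unit tests
./gradlew internalClusterTest    # job 3: multi-node internal cluster tests
./gradlew bwcTest                # job 4: backwards-compatibility tests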

Exit Criteria:

  1. The final set of jobs should take less time than the existing gradle check.
  2. The final set of jobs should provide the same or better test coverage.
  3. It should still be possible to run ./gradlew check with the same test coverage as today.

Describe alternatives you've considered
#2496

References:

  1. https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources
  2. https://github.com/opensearch-project/opensearch-infra/issues/136
@anasalkouz anasalkouz added enhancement Enhancement or improvement to existing feature or request Build Libraries & Interfaces and removed untriaged labels Feb 8, 2022
@dreamer-89 dreamer-89 changed the title Break down gradle check task to smaller maintainable verification tasks running on GHA Break down gradle check task to smaller maintainable verification tasks May 16, 2022
@dreamer-89
Member Author

Taken from #3210

To reduce flaky test failures, we can run the verification tasks in parallel. The tests can run on either Jenkins or GHA (the latter only if the runtime stays low there, since GitHub-hosted runners provide 2 x86_64 CPU cores and 7 GB of RAM). The current tasks are:

Verification tasks
------------------
branchConsistency - Ensures this branch is internally consistent. For example, that versions constants match released versions.
bwcTest - Runs backwards compatibility tests.
check - Runs all checks.
distroTest.deb - Runs deb tests within vagrant
distroTest.deb-arm64 - Runs deb-arm64 tests within vagrant
distroTest.deb-no-jdk - Runs deb-no-jdk tests within vagrant
distroTest.docker-arm64 - Runs docker-arm64 tests within vagrant
distroTest.linux-archive - Runs linux-archive tests within vagrant
distroTest.linux-archive-arm64 - Runs linux-archive-arm64 tests within vagrant
distroTest.linux-archive-no-jdk - Runs linux-archive-no-jdk tests within vagrant
distroTest.rpm - Runs rpm tests within vagrant
distroTest.rpm-arm64 - Runs rpm-arm64 tests within vagrant
distroTest.rpm-no-jdk - Runs rpm-no-jdk tests within vagrant
distroTest.windows-archive - Runs windows-archive tests within vagrant
distroTest.windows-archive-arm64 - Runs windows-archive-arm64 tests within vagrant
distroTest.windows-archive-no-jdk - Runs windows-archive-no-jdk tests within vagrant
distroUpgradeTest.v2.0.0.deb - Runs v2.0.0.deb tests within vagrant
distroUpgradeTest.v2.0.0.deb-arm64 - Runs v2.0.0.deb-arm64 tests within vagrant
distroUpgradeTest.v2.0.0.rpm - Runs v2.0.0.rpm tests within vagrant
distroUpgradeTest.v2.0.0.rpm-arm64 - Runs v2.0.0.rpm-arm64 tests within vagrant
distroUpgradeTest.v2.1.0.deb - Runs v2.1.0.deb tests within vagrant
distroUpgradeTest.v2.1.0.deb-arm64 - Runs v2.1.0.deb-arm64 tests within vagrant
distroUpgradeTest.v2.1.0.rpm - Runs v2.1.0.rpm tests within vagrant
distroUpgradeTest.v2.1.0.rpm-arm64 - Runs v2.1.0.rpm-arm64 tests within vagrant
integTest - Runs rest tests against an opensearch cluster.
internalClusterTest
jacocoTestCoverageVerification - Verifies code coverage metrics based on specified rules for the test task.
jacocoTestReport - Generates code coverage report for the test task.
javaRestTest - Runs the REST tests against an external cluster
precommit - Runs all non-test checks
run - Runs opensearch in the foreground
spotlessApply - Applies code formatting steps to sourcecode in-place.
spotlessCheck - Checks that sourcecode satisfies formatting steps.
spotlessDiagnose
spotlessJava
spotlessJavaApply
spotlessJavaCheck
spotlessJavaDiagnose
test - Runs the test suite.
testAggregateTestReport - Generates aggregated test report.
yamlRestTest - Runs the REST tests against an external cluster
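
For reference, a listing like the one above can typically be regenerated with Gradle's built-in task report, filtered to the verification group (the --group option assumes a reasonably recent Gradle version):

# Print the verification-group tasks that gradle check aggregates
./gradlew tasks --group verification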

As a POC, running the BWC tests independently on GHA (a smaller instance) showed an improvement in runtime and slightly fewer timeout failures.
Sample run with BWC test: https://github.com/owaiskazi19/OpenSearch/runs/4705950671?check_suite_focus=true
Sample run without BWC test: https://github.com/owaiskazi19/OpenSearch/runs/4807896738?check_suite_focus=true

Integration tests are currently tightly coupled in the codebase, so separating them out will be a significant amount of work.
Ref: https://rnorth.org/posts/faster-parallel-github-builds/

@dreamer-89
Member Author

Just had a discussion with @peterzhuamazon @prudhvigodithi @zelinh around the feasibility of keeping the existing gradle check on GitHub (to preserve the existing developer experience) while running multiple jobs in different docker containers. Each job would run a mutually exclusive subset of the verification tasks that gradle check runs today.

Today, gradle check runs a set of verification tasks; say A, B, C, ..., Z. The proposal is to have new tasks, say T1, ..., Tn, where each task runs a subset of the available verification tasks. A rough sketch of the container-per-group idea follows.
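
A minimal sketch, assuming a stock Gradle image and illustrative task groups (the image tag and the groupings are placeholders, not the agreed design; in practice each container would also need its own checkout and Gradle cache):

# Run mutually exclusive verification-task subsets in separate containers within one run
docker run --rm -v "$PWD":/workspace -w /workspace gradle:jdk17 ./gradlew precommit &
docker run --rm -v "$PWD":/workspace -w /workspace gradle:jdk17 ./gradlew test &
docker run --rm -v "$PWD":/workspace -w /workspace gradle:jdk17 ./gradlew ':server:internalClusterTest' &
wait    # wait for all containers; individual exit codes would still need to be collected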

@peterzhuamazon
Member

@dreamer-89 Correction:
it does not need to be multiple workflows, but multiple containers within each run.

@dreamer-89
Member Author

dreamer-89 commented Jun 22, 2022

Triggered a gradle check scan to get insights on existing tests and tasks.

https://scans.gradle.com/s/ietqu6zf4ha3m

Distribution of tasks in gradlew check

Type             Count           Time
All tasks        3710            4h 10m 17.634s
Tasks avoided    156 (4.2%)      15.150s
Tasks executed   2266 (61.1%)    4h 7m 56.431s
Lifecycle        789             1m 3.180s
No source        340             35.915s
Skipped          159             26.958s
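
For reference, a build scan like the one linked above can typically be produced by adding Gradle's --scan flag to the invocation (publishing requires accepting the build scan terms of service):

# Produce a build scan for the check task and publish it to scans.gradle.com
./gradlew check --scan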

@dreamer-89
Member Author

Ideally, the test-related tasks in gradle check (which bring in the flakiness) should be split based on the type of resources a task provisions (single-node cluster, multi-node cluster, container, etc.) so that different tasks have a logical resource separation. The other option is to split based on individual projects (qa, plugins, server, modules, etc.), which makes the split intuitive (a rough sketch of that option follows).
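
A minimal two-way sketch of the project-based option, reusing the task-exclusion pattern used elsewhere in this thread; a real split would likely need more groups:

# Hypothetical project-based split: server in one job, everything else in another
./gradlew ':server:check'             # core server project only
./gradlew check -x ':server:check'    # qa, modules, plugins, libs, etc.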

@dreamer-89
Member Author

dreamer-89 commented Jun 26, 2022

From a quick analysis of the existing flaky test issues, 80%-90% of the failing tests fall in the :server:internalClusterTest bucket, probably due to resource contention. This was identified previously here and here. This means that separating out the internalClusterTest task should reduce gradle check time drastically, and internalClusterTest itself should also fail less often.

Next steps

Will run check and the internalClusterTest task separately and compare the build failure rate against runs where check runs everything. If resource contention in the combined run is the actual issue, then we should see fewer failures with the split.
  1. ./gradlew check -x ':server:internalClusterTest'
  2. ./gradlew ':server:internalClusterTest'

OS: ubuntu
Instance: c5.24xlarge

@dreamer-89
Member Author

dreamer-89 commented Jun 26, 2022

Ran internalClusterTest separately 50 times; the task failed 23 times, of which ~20 failures were due to testHighWatermarkNotExceeded alone. This test is notoriously flaky and has been muted. It needs a deeper dive.

ubuntu@ip-172-31-41-165:~/OpenSearch$ for num in {0..50}; do printf "Iteration $num\n\n\n" >> /tmp/june25/temp_$num.txt; ./gradlew ':server:internalClusterTest' >> /tmp/june25/temp_$num.txt 2>&1; sleep 10     ; done;
...
...

@dreamer-89
Member Author

dreamer-89 commented Jun 27, 2022

Ran the internalClusterTest task again after muting testHighWatermarkNotExceeded. This time the build failed only thrice.

ubuntu@ip-172-31-41-165:~$ grep "BUILD SUCCESSFUL" /tmp/june26/temp_* | wc -l
48
ubuntu@ip-172-31-41-165:~$ grep "BUILD FAILED" /tmp/june26/temp_* | wc -l
3

@dreamer-89
Member Author

Ran the complete gradle check, but got only 3 failures.

for num in {0..50}; do printf "Iteration $num\n\n\n" >> /tmp/june27/temp_check_$num.txt; ./gradlew check  >> /tmp/june27/temp_check_$num.txt 2>&1; sleep 10     ; done;
...
ubuntu@ip-172-31-41-165:~$ grep "^BUILD SUCCESSFUL" /tmp/june27/temp_check_* | wc -l
48
ubuntu@ip-172-31-41-165:~$ grep "^BUILD FAILED" /tmp/june27/temp_check_* | wc -l
3

@dreamer-89
Member Author

dreamer-89 commented Jun 28, 2022

Separating out internalClusterTest from gradle check did not help much. On the contrary, the entire gradle check failed only thrice when run 50 times in a row. Not sure whether the lower failure count can be attributed to the beefier c5.24xlarge instance type used for the experiment.

@dreamer-89
Member Author

Running ./gradlew check again on both c5.24xlarge & c5.18xlarge to verify whether instance type is linked to the flakiness.

@peterzhuamazon
Member

Running ./gradlew check again on both c5.24xlarge & c5.18xlarge to verify whether instance type is linked to the flakiness.

Please let me know if I need to switch from c5.18xlarge to c5.24xlarge.

Thanks.

@dreamer-89
Member Author

c5.24xlarge

Has only 3 failures out of 50 runs

ubuntu@ip-172-31-41-165:~$ grep "^BUILD FAILED" /tmp/june28/temp_check_* | wc -l
3
ubuntu@ip-172-31-41-165:~$ grep "^BUILD SUCCESSFUL" /tmp/june28/temp_check_* | wc -l
48

c5.18xlarge

The job has been running for more than a day. The instance/tests are having issues that cause frequent test suite timeouts (20 min), with the overall job running for > 1 hr. ps -ef shows a bunch of fixture processes still active even after terminating the script that runs gradle check. Rebooting the instance and retrying.

> Task :qa:full-cluster-restart:v2.0.2#oldClusterTest
org.opensearch.upgrades.FullClusterRestartIT > classMethod FAILED
    java.lang.Exception: Suite timeout exceeded (>= 1200000 msec).
        at __randomizedtesting.SeedInfo.seed([F0D8CAF88E6F7EAD]:0)
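
For the leftover processes, a minimal cleanup sketch (the "fixture" process-name pattern is an assumption based on the observation above; the matches should be verified before killing anything):

# List test-fixture processes that survived the aborted gradle check run
ps -ef | grep '[f]ixture'
# Terminate them before retrying (verify the matches first)
pkill -f 'fixture' || true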

@peterzhuamazon
Member


Seems like I can start the next step by switching to c5.24xlarge.

@saratvemulapalli
Member

Coming from opensearch-project/opensearch-build#1572, @dreamer-89 I see you are assigned to this. Are you planning on picking this up?

@dreamer-89
Member Author

dreamer-89 commented Jul 12, 2022

opensearch-project/opensearch-build#1572

@saratvemulapalli: Thanks for sharing this. The prerequisite for splitting gradle check was to first reduce flaky test failures (apologies, the issue title doesn't mention this). Perhaps we can track opensearch-project/opensearch-build#1572 in a separate issue?

The next step on this issue is to perform one more round of experiments comparing flaky test failures in check vs. the split (check minus internalClusterTest, plus internalClusterTest run separately) on machines with different hardware capacity, as sketched below.
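
A minimal sketch of how that round could be scripted, reusing the loop pattern from the earlier experiments (output paths and iteration count are illustrative):

# Run the split variant repeatedly and count failed gradle invocations,
# for comparison against the full `./gradlew check` numbers gathered earlier
mkdir -p /tmp/split
for num in {1..50}; do
  ./gradlew check -x ':server:internalClusterTest' > /tmp/split/check_$num.txt 2>&1
  ./gradlew ':server:internalClusterTest'          > /tmp/split/ict_$num.txt 2>&1
  sleep 10
done
grep "^BUILD FAILED" /tmp/split/*.txt | wc -l    # count of failed gradle invocations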

@saratvemulapalli
Member

Thanks for the additional information, @dreamer-89.
Maybe let's organize these (if we haven't already) and link them back to one meta issue:

  1. Take care of flaky tests
  2. Proposal to split gradle check into multiple smaller tasks
  3. Add support for new tests in CI (starting with packaging tests)

@minalsha
Contributor

@dreamer-89 can you share the meta issue discussed above?

@dreamer-89
Member Author

Apologies @minalsha @saratvemulapalli for the delay on this. Created a meta issue for gradle improvements, including the gradle check split: #4053

@dreamer-89 dreamer-89 changed the title Break down gradle check task to smaller maintainable verification tasks Break down gradle check task to smaller maintainable verification tasks to improve flaky test failures Aug 11, 2022
@minalsha
Contributor

@dreamer-89 can we close this in lieu of meta issue #4053 ?

@dreamer-89
Member Author

@dreamer-89 can we close this in lieu of meta issue #4053 ?

Thank you @minalsha for the question. Yes, we can close this issue in favour of #4053. Note that this issue tracked improvements to gradle check failure rates under different task splits, rather than the splitting of the check task itself. From the exercises I performed, breaking down gradle check into smaller sub-tasks did not improve the test failure rate.
