Fix db_modes_test #7728

kj4ezj · 2019-08-05T16:38:03Z

Change Description

db_modes_test has been our flakiest test, with a 4.4% failure rate (66 of 1,496 runs) on the original BASH variant and a 2.1% failure rate (145 of 6,974 runs) on the Python rewrite.

Working on a branch off release/1.8.x that still had the BASH variant of this test, I fixed the instability by doubling the duration of all sleep statements. I experienced zero failures of db_modes_test in 1,000 runs on the zach-1.8-db-modes-test-timeout branch, giving me extremely high confidence (7 nines) that the instability is fixed.

Next Steps

This is a poor solution because the test still relies on timing alone. On sufficiently slow hardware, nodeos will not finish initializing before the sleep statements expire, causing false failures. A better solution would be to rewrite the test so that it detects when nodeos is done initializing, which would save time on faster hardware and prevent instability on slower hardware. We decided it is not worth investing time in rewriting db_modes_test the "correct" way right now because Hong Kong is already writing a new testing framework that will provide these features. When that framework is available, we should reimplement db_modes_test on it.
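
The readiness-based approach described above could look roughly like the following sketch. It is not the API of the planned framework: `wait_until`, `is_ready`, and the timeout values are hypothetical names chosen for illustration, and a real check would need to probe something nodeos actually exposes.

```python
import time


def wait_until(is_ready, timeout=30.0, interval=0.25):
    """Poll is_ready() until it returns True or `timeout` seconds elapse.

    Returns True as soon as the check passes, so fast hardware does not
    wait longer than it needs to; returns False once the deadline passes.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Illustration only: a stand-in readiness check that succeeds on the third poll.
polls = {"count": 0}


def fake_ready():
    polls["count"] += 1
    return polls["count"] >= 3


print(wait_until(fake_ready, timeout=5.0, interval=0.01))  # True
```

The timeout only bounds the worst case, so it can be generous for slow agents without slowing down fast ones.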

My changes here proved reliable even on the slow dual-core, 4 GB of RAM Travis CI macOS agents, so I believe we will not encounter test instability on any commonly-used developer hardware.

Old Metrics

Authenticate with AWS as shown here, sync metrics, then aggregate them:

aws s3 sync s3://auto-eos-test-metrics ~/Work/test-metrics --exclude '*' --include '*.json'
cd ~/Work/test-metrics
# merge each day's metrics files into one compact file per day
for MM in $(seq -f '%02g' 1 8); do
    echo "Processing 2019-$MM..."
    for DD in $(seq -f '%02g' 1 31); do
        echo "Processing 2019-$MM-$DD..."
        cat 2019-$MM-$DD*.json | jq -c '.' > 2019-$MM-$DD.day.json
    done
done
# runs of the BASH variant
cat 2019-0[1234]*.day.json 2019-05-1*.day.json 2019-05-2[0123].day.json | jq '.[] | select(.testName == "db_modes_test")' | jq -s '.' > ~/Work/db-modes-test/bash-metrics.json
# runs of the Python variant
cat 2019-05-2[56789].day.json 2019-05-3*.day.json 2019-0[678]*.day.json | jq '.[] | select(.testName == "db_modes_test")' | jq -s '.' > ~/Work/db-modes-test/python-metrics.json

Now query the metrics for the BASH variant...

$ cat bash-metrics.json | jq -c '.[] | select(.branch == "develop" or .branch == "release/1.8.x")' | jq -s '. | length' # count number of runs
1496
$ cat bash-metrics.json | jq -c '.[] | select(.branch == "develop" or .branch == "release/1.8.x") | select(.testResult == "Failed")' | jq -s '. | length' # count number of failures
66
$ cat bash-metrics.json | jq -c '.[] | select(.branch == "develop" or .branch == "release/1.8.x") | select(.testResult == "Exception")' | jq -s '. | length' # count number of exceptions
0
$ cat bash-metrics.json | jq -c '.[] | select(.branch == "develop" or .branch == "release/1.8.x") | select(.testResult == "Failed") | .errorMsg' | sort | uniq -c
      3 "test diangostics are not enabled for this pipeline"
     63 "uncategorized"
$ cat bash-metrics.json | jq -r '.[] | select(.branch == "develop" or .branch == "release/1.8.x") | select(.testResult == "Failed") | select(.errorMsg == "test diangostics are not enabled for this pipeline") | .url'
https://buildkite.com/EOSIO/eosio-debug/builds/95#2b62dc5c-43b9-4fcf-93e6-335a94f84bf0
https://buildkite.com/EOSIO/eosio-debug/builds/95#bde90b6d-a32b-4d61-a000-a8473c449eb4
https://buildkite.com/EOSIO/eosio-debug/builds/95#ff749c14-8673-4ed0-92a8-ed26ab7f0721
$ cat bash-metrics.json | jq -r '.[] | select(.branch == "develop" or .branch == "release/1.8.x") | select(.testResult == "Failed") | .os' | sort | uniq -c
     16 CentOS 7
      2 Fedora 27
     21 Mojave
     16 Ubuntu 16.04
     11 Ubuntu 18.04

...and the Python variant:

$ cat python-metrics.json | jq -c '.[] | select(.branch == "develop" or .branch == "release/1.8.x")' | jq -s '. | length' # count number of runs
6974
$ cat python-metrics.json | jq -c '.[] | select(.branch == "develop" or .branch == "release/1.8.x") | select(.testResult == "Failed")' | jq -s '. | length' # count number of failures
145
$ cat python-metrics.json | jq -c '.[] | select(.branch == "develop" or .branch == "release/1.8.x") | select(.testResult == "Exception")' | jq -s '. | length' # count number of exceptions
0
$ cat python-metrics.json | jq -c '.[] | select(.branch == "develop" or .branch == "release/1.8.x") | select(.testResult == "Failed") | .errorMsg' | sort | uniq -c
     60 "ctest: 8"
      3 "test diangostics are not enabled for this pipeline"
     82 "uncategorized"
$ cat python-metrics.json | jq -r '.[] | select(.branch == "develop" or .branch == "release/1.8.x") | select(.testResult == "Failed") | select(.errorMsg == "test diangostics are not enabled for this pipeline") | .url'
https://buildkite.com/EOSIO/eosio-debug/builds/95#2b62dc5c-43b9-4fcf-93e6-335a94f84bf0
https://buildkite.com/EOSIO/eosio-debug/builds/95#bde90b6d-a32b-4d61-a000-a8473c449eb4
https://buildkite.com/EOSIO/eosio-debug/builds/95#ff749c14-8673-4ed0-92a8-ed26ab7f0721
$ cat python-metrics.json | jq -r '.[] | select(.branch == "develop" or .branch == "release/1.8.x") | select(.testResult == "Failed") | .os' | sort | uniq -c
     24 Amazon Linux 2
     24 CentOS 7
      2 Fedora 27
     46 Mojave
     30 Ubuntu 16.04
     19 Ubuntu 18.04
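
The headline failure rates follow directly from the counts above. The sketch below mirrors the jq filters in Python; the record fields (`branch`, `testResult`, `os`) are the ones used in the queries, while `summarize` and `failure_rate` are illustrative names, not part of any existing tooling.

```python
from collections import Counter

BRANCHES = {"develop", "release/1.8.x"}


def summarize(records):
    """Return (runs, failures, failures_by_os) for the branches of interest."""
    runs = [r for r in records if r.get("branch") in BRANCHES]
    failed = [r for r in runs if r.get("testResult") == "Failed"]
    return len(runs), len(failed), Counter(r.get("os") for r in failed)


def failure_rate(failures, runs):
    """Failure rate as a percentage."""
    return 100.0 * failures / runs


# Counts reported by the queries above.
print(f"BASH:   {failure_rate(66, 1496):.1f}%")   # 4.4%
print(f"Python: {failure_rate(145, 6974):.1f}%")  # 2.1%
```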

New Metrics

I tested these changes on both Buildkite and Travis CI.

Buildkite:

$ curl -s "https://api.buildkite.com/v2/organizations/EOSIO/pipelines/eosio-beta/builds?access_token=$BUILDKITE_API_KEY&per_page=100&branch=zach-1.8-db-modes-test-timeout" | jq '. | length'
100
$ curl -s "https://api.buildkite.com/v2/organizations/EOSIO/pipelines/eosio-beta/builds?access_token=$BUILDKITE_API_KEY&per_page=100&page=2&branch=zach-1.8-db-modes-test-timeout" | jq '. | length'
53
$ curl -s "https://api.buildkite.com/v2/organizations/EOSIO/pipelines/eosio-beta/builds?access_token=$BUILDKITE_API_KEY&state=failed&per_page=100&branch=zach-1.8-db-modes-test-timeout" | jq '. | length'
1
$ curl -s "https://api.buildkite.com/v2/organizations/EOSIO/pipelines/eosio-beta/builds?access_token=$BUILDKITE_API_KEY&state=failed&per_page=100&branch=zach-1.8-db-modes-test-timeout" | jq -r '.[0].jobs | .[] | select(.state == "failed") | .web_url'
https://buildkite.com/EOSIO/eosio-beta/builds/600#7006130b-da3b-4b6f-924b-63e5fd6ad23a

The one failed build turned out to be unrelated to db_modes_test.
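
The build counts above were gathered by requesting pages by hand (100 builds on the first page, 53 on the second). A generic way to walk every page is sketched below. `fetch_page` is a hypothetical callable standing in for the HTTP request; only the `per_page` and `page` parameters come from the commands above.

```python
def collect_all(fetch_page, per_page=100):
    """Call fetch_page(page, per_page) for page 1, 2, ... until a short page."""
    items = []
    page = 1
    while True:
        batch = fetch_page(page, per_page)
        items.extend(batch)
        if len(batch) < per_page:
            return items
        page += 1


# Stand-in for the API: 153 fake builds served in pages.
FAKE_BUILDS = list(range(153))


def fake_fetch(page, per_page):
    start = (page - 1) * per_page
    return FAKE_BUILDS[start:start + per_page]


print(len(collect_all(fake_fetch)))  # 153
```

Stopping on the first short page costs one extra empty request when the total is an exact multiple of the page size, which keeps the loop simple.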

Travis CI:

$ cd ~/Work/eos
$ git checkout zach-1.8-db-modes-test-timeout
$ travis login
$ travis history --all --date -b zach-1.8-db-modes-test-timeout | tee travis-results.log
$ cat travis-results.log | grep -c '' # count all builds
51
$ cat travis-results.log | grep -c 'passed' # count passing builds
47

The four builds that did not pass are two cancelled builds, one errored build (10 minutes with no log output), and one failure from a different test.

Count all builds and multiply by 5 because each build tests on five operating systems:

$ echo $(( ( 152 + 51 - 3 ) * 5 )) # count total number of runs
1000

Math

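Confidence (as a percentage) after 1,000 clean runs that this solution is better than the BASH variant, computed as the probability that the old failure rate of 66 in 1,496 would have produced at least one failure in 1,000 runs:

(1-(1-(66 / 1496))^1000)*100 = 99.999999999999999997462

Confidence after 1,000 clean runs that this solution is better than the Python variant, computed the same way from its failure rate of 145 in 6,974:

(1-(1-(145 / 6974))^1000)*100 = 99.999999924981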
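
Both figures can be reproduced with exact arithmetic. Ordinary double-precision floats round the BASH result to exactly 100, so this sketch uses rational numbers and reports the shortfall from 100% instead. It assumes runs are independent, and the function name is illustrative.

```python
from fractions import Fraction


def chance_of_all_passing(failures, runs, trials=1000):
    """Probability that `trials` independent runs all pass at the old failure rate."""
    p_fail = Fraction(failures, runs)
    return (1 - p_fail) ** trials


for name, failures, runs in [("BASH", 66, 1496), ("Python", 145, 6974)]:
    miss = chance_of_all_passing(failures, runs)
    # Confidence is 100 * (1 - miss); printing the shortfall keeps the digits visible.
    print(f"{name}: confidence = 100% - {float(100 * miss):.3e}%")
```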

Consensus Changes

None.

API Changes

None.

Documentation Additions

None.

kj4ezj · 2019-08-05T17:21:25Z

During demos on Friday, @arhag expressed concerns about including db_modes_test in the parallel test group. After demos, we met and peer-reviewed the code together. We both believe that it is written such that it can be parallelized, but he asked me to perform a test to be sure. I made a test schedule which runs this test twelve times simultaneously in the parallel test group:

zach-1.8-db-modes-test-parallelism
- Code
- Testing
zach-db-modes-test-parallelism
- Code
- Testing


kj4ezj added 3 commits August 2, 2019 13:55
- Restore spoonincode's original db_modes_test.sh in favor of db_modes_test.py (ef25995)
- Fix db_modes_test by doubling the timeout (9cb8f2d)
- Move db_modes_test from non-parallelizable test group to parallelizable test group (be6aa85)

kj4ezj changed the base branch from master to release/1.8.x August 5, 2019 16:45

kj4ezj requested review from arhag, spoonincode and heifner August 5, 2019 17:23

kj4ezj marked this pull request as ready for review August 5, 2019 17:23

kj4ezj mentioned this pull request Aug 5, 2019

Fix db_modes_test #7729

Merged

3 tasks

kj4ezj removed the request for review from heifner August 5, 2019 17:28

kj4ezj mentioned this pull request Aug 5, 2019

Increase db_modes_test Timeout #7697

Closed

3 tasks

Give db_modes_test the highest cost so ctest runs it first in the parallel test group (28c5a26)

spoonincode approved these changes Aug 5, 2019

kj4ezj merged commit 163fa86 into release/1.8.x Aug 5, 2019

kj4ezj deleted the zach-1.8-fix-db-modes-test branch August 5, 2019 20:15

allenhan2 mentioned this pull request Jun 3, 2020

[CI/CD Failure Triage: dbModesTest] Unit Tests : db_modes_test : Terminated #8732

Closed
