This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

Fix db_modes_test #7728

Merged
merged 4 commits into from
Aug 5, 2019

Conversation

kj4ezj
Contributor

@kj4ezj kj4ezj commented Aug 5, 2019

Change Description

The db_modes_test has been our flakiest test, with a 4.4% failure rate on the original BASH variant of this test and a 2.1% failure rate on the Python rewrite.

Working on a branch off release/1.8.x that still had the BASH variant of this test, I fixed the instability by doubling the duration of all sleep statements. I experienced zero failures of db_modes_test in 1,000 runs on the zach-1.8-db-modes-test-timeout branch, giving me extremely high confidence (7 nines) that the instability is fixed.

Next Steps

This is a poor solution because the test still relies on timing alone. On sufficiently slow hardware, nodeos will not finish initializing before the sleep statements expire, which causes false failures. A better solution would be to rewrite the test so that it detects when nodeos has finished initializing, which would save time on faster hardware and prevent test instability on slower hardware. We decided it is not worth investing time in rewriting db_modes_test the "correct" way right now because Hong Kong is already writing a new testing framework which will provide these features. When that framework is available, we should reimplement db_modes_test the correct way on that framework.

My changes here proved reliable even on the slow Travis CI macOS agents (dual-core, 4 GB of RAM), so I do not expect test instability on any commonly used developer hardware.
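As an illustration of the readiness-based approach described above, here is a minimal sketch of a polling helper that replaces a fixed sleep with a bounded wait. The name `wait_until` is hypothetical and is not part of db_modes_test or the planned framework. The readiness check in the usage comment assumes nodeos is serving its HTTP chain API on 127.0.0.1:8888, which may not hold for the configurations this test starts.

```shell
#!/usr/bin/env bash
# wait_until is a hypothetical helper, not part of db_modes_test.
# Usage: wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND once per second until it succeeds (returns 0) or the
# timeout elapses (returns 1).
wait_until() {
    local timeout_s=$1
    shift
    local deadline=$(( $(date +%s) + timeout_s ))
    until "$@"; do
        if [ "$(date +%s)" -ge "$deadline" ]; then
            return 1
        fi
        sleep 1
    done
    return 0
}

# Example (assumption: the chain API is enabled and listening on port 8888):
#   wait_until 30 curl -sf http://127.0.0.1:8888/v1/chain/get_info > /dev/null
```

With a helper like this, fast machines continue as soon as the check passes, and slow machines get the full timeout before the test reports a failure.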

See Also

Pull request 7729 against eos:develop

Old Metrics

Authenticate with AWS as shown here, sync metrics, then aggregate them:

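aws s3 sync s3://auto-eos-test-metrics ~/Work/test-metrics --exclude '*' --include '*.json'
cd ~/Work/test-metrics
# Concatenate each day's metrics files into one compact file per day.
for MM in $(seq -f '%02g' 1 8); do
    echo "Processing 2019-$MM..."
    for DD in $(seq -f '%02g' 1 31); do
        echo "Processing 2019-$MM-$DD..."
        # Pass the identity filter explicitly instead of relying on jq's default.
        cat 2019-$MM-$DD*.json | jq -c '.' > 2019-$MM-$DD.day.json
    done
done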
cat 2019-0[1234]*.day.json 2019-05-1*.day.json 2019-05-2[0123].day.json | jq '.[] | select(.testName == "db_modes_test")' | jq -s > ~/Work/db-modes-test/bash-metrics.json
cat 2019-05-2[56789].day.json 2019-05-3*.day.json 2019-0[678]*.day.json | jq '.[] | select(.testName == "db_modes_test")' | jq -s > ~/Work/db-modes-test/python-metrics.json

Now, working in ~/Work/db-modes-test, query the metrics for the BASH variant...

$ cat bash-metrics.json | jq -c '.[] | select(.branch == "develop" or .branch == "release/1.8.x")' | jq -s '. | length' # count number of runs
1496
$ cat bash-metrics.json | jq -c '.[] | select(.branch == "develop" or .branch == "release/1.8.x") | select(.testResult == "Failed")' | jq -s '. | length' # count number of failures
66
$ cat bash-metrics.json | jq -c '.[] | select(.branch == "develop" or .branch == "release/1.8.x") | select(.testResult == "Exception")' | jq -s '. | length' # count number of exceptions
0
$ cat bash-metrics.json | jq -c '.[] | select(.branch == "develop" or .branch == "release/1.8.x") | select(.testResult == "Failed") | .errorMsg' | sort | uniq -c
      3 "test diangostics are not enabled for this pipeline"
     63 "uncategorized"
$ cat bash-metrics.json | jq -r '.[] | select(.branch == "develop" or .branch == "release/1.8.x") | select(.testResult == "Failed") | select(.errorMsg == "test diangostics are not enabled for this pipeline") | .url'
https://buildkite.com/EOSIO/eosio-debug/builds/95#2b62dc5c-43b9-4fcf-93e6-335a94f84bf0
https://buildkite.com/EOSIO/eosio-debug/builds/95#bde90b6d-a32b-4d61-a000-a8473c449eb4
https://buildkite.com/EOSIO/eosio-debug/builds/95#ff749c14-8673-4ed0-92a8-ed26ab7f0721
$ cat bash-metrics.json | jq -r '.[] | select(.branch == "develop" or .branch == "release/1.8.x") | select(.testResult == "Failed") | .os' | sort | uniq -c
     16 CentOS 7
      2 Fedora 27
     21 Mojave
     16 Ubuntu 16.04
     11 Ubuntu 18.04

...and the Python variant:

$ cat python-metrics.json | jq -c '.[] | select(.branch == "develop" or .branch == "release/1.8.x")' | jq -s '. | length' # count number of runs
6974
$ cat python-metrics.json | jq -c '.[] | select(.branch == "develop" or .branch == "release/1.8.x") | select(.testResult == "Failed")' | jq -s '. | length' # count number of failures
145
$ cat python-metrics.json | jq -c '.[] | select(.branch == "develop" or .branch == "release/1.8.x") | select(.testResult == "Exception")' | jq -s '. | length' # count number of exceptions
0
$ cat python-metrics.json | jq -c '.[] | select(.branch == "develop" or .branch == "release/1.8.x") | select(.testResult == "Failed") | .errorMsg' | sort | uniq -c
     60 "ctest: 8"
      3 "test diangostics are not enabled for this pipeline"
     82 "uncategorized"
$ cat python-metrics.json | jq -r '.[] | select(.branch == "develop" or .branch == "release/1.8.x") | select(.testResult == "Failed") | select(.errorMsg == "test diangostics are not enabled for this pipeline") | .url'
https://buildkite.com/EOSIO/eosio-debug/builds/95#2b62dc5c-43b9-4fcf-93e6-335a94f84bf0
https://buildkite.com/EOSIO/eosio-debug/builds/95#bde90b6d-a32b-4d61-a000-a8473c449eb4
https://buildkite.com/EOSIO/eosio-debug/builds/95#ff749c14-8673-4ed0-92a8-ed26ab7f0721
$ cat python-metrics.json | jq -r '.[] | select(.branch == "develop" or .branch == "release/1.8.x") | select(.testResult == "Failed") | .os' | sort | uniq -c
     24 Amazon Linux 2
     24 CentOS 7
      2 Fedora 27
     46 Mojave
     30 Ubuntu 16.04
     19 Ubuntu 18.04
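
The failure rates quoted in the change description follow from the run and failure counts above. A minimal sketch of that arithmetic, where `failure_rate` is a hypothetical helper name and not part of the metrics tooling:

```shell
#!/usr/bin/env bash
# failure_rate is a hypothetical helper: prints failures/runs as a
# percentage with one decimal place.
failure_rate() {
    local failures=$1 runs=$2
    awk -v f="$failures" -v r="$runs" 'BEGIN { printf "%.1f\n", 100 * f / r }'
}

failure_rate 66 1496    # BASH variant: 66 failures in 1496 runs
failure_rate 145 6974   # Python variant: 145 failures in 6974 runs
```

This prints 4.4 and 2.1, matching the percentages in the change description.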

New Metrics

I tested these changes on both Buildkite and Travis CI.

Buildkite:

$ curl -s "https://api.buildkite.com/v2/organizations/EOSIO/pipelines/eosio-beta/builds?access_token=$BUILDKITE_API_KEY&per_page=100&branch=zach-1.8-db-modes-test-timeout" | jq '. | length'
100
$ curl -s "https://api.buildkite.com/v2/organizations/EOSIO/pipelines/eosio-beta/builds?access_token=$BUILDKITE_API_KEY&per_page=100&page=2&branch=zach-1.8-db-modes-test-timeout" | jq '. | length'
53
$ curl -s "https://api.buildkite.com/v2/organizations/EOSIO/pipelines/eosio-beta/builds?access_token=$BUILDKITE_API_KEY&state=failed&per_page=100&branch=zach-1.8-db-modes-test-timeout" | jq '. | length'
1
$ curl -s "https://api.buildkite.com/v2/organizations/EOSIO/pipelines/eosio-beta/builds?access_token=$BUILDKITE_API_KEY&state=failed&per_page=100&branch=zach-1.8-db-modes-test-timeout" | jq -r '.[0].jobs | .[] | select(.state == "failed") | .web_url'
https://buildkite.com/EOSIO/eosio-beta/builds/600#7006130b-da3b-4b6f-924b-63e5fd6ad23a

The one error ended up being unrelated to db_modes_test.

Travis CI:

$ cd ~/Work/eos
$ git checkout zach-1.8-db-modes-test-timeout
$ travis login
$ travis history --all --date -b zach-1.8-db-modes-test-timeout | tee travis-results.log
$ cat travis-results.log | grep -c '' # count all builds
51
$ cat travis-results.log | grep -c 'passed' # count passing builds
47

The four issues we are seeing here are two cancelled builds, one errored build (10 minutes with no log output), and one failure from a different test.

Add up the builds from both systems, subtract the three excluded builds, and multiply by 5 because each build tests on five operating systems:

$ echo $(( ( 152 + 51 - 3 ) * 5 )) # count total number of runs
1000

Math

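Confidence after 1,000 failure-free runs that this solution is better than the BASH variant, expressed as the percent chance that a test with the BASH variant's failure rate would have failed at least once in 1,000 runs: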

(1-(1- (66 / 1496))^1000)*100 = 99.999999999999999997462

Confidence after 1,000 failure-free runs that this solution is better than the Python variant, computed the same way from the Python variant's failure rate:

(1-(1-(145 / 6974))^1000)*100 = 99.999999924981
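
The two figures above can be checked with awk. Both percentages are so close to 100 that double-precision arithmetic would round the first one to exactly 100, so this sketch prints the complement instead: the chance that 1,000 runs all pass at the old failure rate. It assumes every run is independent and fails with the same probability. `chance_of_zero_failures` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# chance_of_zero_failures is a hypothetical helper: prints (1 - f/r)^n, the
# probability that n independent runs all pass when each run fails with
# probability f/r.
chance_of_zero_failures() {
    local failures=$1 runs=$2 n=$3
    awk -v f="$failures" -v r="$runs" -v n="$n" \
        'BEGIN { printf "%.2e\n", (1 - f / r) ^ n }'
}

chance_of_zero_failures 66 1496 1000    # baseline: BASH variant
chance_of_zero_failures 145 6974 1000   # baseline: Python variant
```

The results are about 2.5e-20 and 7.5e-10. Subtracting each from 1 and multiplying by 100 gives the percentages shown above.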

Consensus Changes

None.

API Changes

None.

Documentation Additions

None.

@kj4ezj kj4ezj changed the base branch from master to release/1.8.x August 5, 2019 16:45
@kj4ezj
Contributor Author

kj4ezj commented Aug 5, 2019

During demos on Friday, @arhag expressed concerns about including db_modes_test in the parallel test group. After demos, we met and peer-reviewed the code together. We both believe that it is written such that it can be parallelized, but he asked me to perform a test to be sure. I made a test schedule which runs this test twelve times simultaneously in the parallel test group:

@kj4ezj kj4ezj requested review from arhag, spoonincode and heifner August 5, 2019 17:23
@kj4ezj kj4ezj marked this pull request as ready for review August 5, 2019 17:23
@kj4ezj kj4ezj mentioned this pull request Aug 5, 2019
3 tasks
@kj4ezj kj4ezj removed the request for review from heifner August 5, 2019 17:28
@kj4ezj kj4ezj mentioned this pull request Aug 5, 2019
3 tasks
@kj4ezj kj4ezj merged commit 163fa86 into release/1.8.x Aug 5, 2019
@kj4ezj kj4ezj deleted the zach-1.8-fix-db-modes-test branch August 5, 2019 20:15