
BFS7.7.1 Failing Jenkins Job Issue #57

Closed
2samueld opened this issue Nov 17, 2021 · 10 comments
Labels
bfs7.7.1 · bug (Something isn't working)

Comments

@2samueld

Description:

Currently, Jenkins jobs for bfs7.7.1 fail randomly. The identified issues are listed below:

  1. Issue with optimizing bundles.
  2. Jenkins switches the Node version from 10.19.0 to 8.14.0
    • Kibana does not support the current Node.js version v8.14.0. Please use Node.js v10.19.0.
  3. Functional tests fail because elements are not located by WebDriver (Jenkins job 16)
    • Chrome UI Functional Tests.test/functional/apps/console/_console·ts.console app console app "before all" hook for "should show the default request"
    • Chrome UI Functional Tests.test/functional/apps/context/index·js.context app "before all" hook in "context app"
    • Chrome UI Functional Tests.test/functional/apps/status_page/index·js.status page "before each" hook for "should show the kibana plugin as ready"
    • Chrome UI Functional Tests.test/functional/apps/timelion/_expression_typeahead·js.timelion app expression typeahead "before all" hook for "should display function suggestions filtered by function name"
2samueld added the bug label Nov 17, 2021
@Tengda-He
Owner

Let's focus on addressing these problems; this is top priority. Thanks.

@2samueld
Author

Update on issues 2 & 3 above:

  1. The Node environment has been fixed with PR "Change docker image naming convention for BFS7.7.1" #59
  2. The "element not located by WebDriver" error has not appeared again. Multiple Jenkins jobs have been run since then, and the functional tests always pass.

@2samueld
Author

2samueld commented Nov 23, 2021

Analysis on Issue 1:

Problem:
Optimization fails due to modules not being found. These modules (such as timelion and statusPage) are located in the kibana/optimize/bundles folder. When the script node scripts/kibana is run in CiGroups, the following process happens (a minimal sketch of the decision logic follows the list):

  1. Loop through every compilation module to start building
  2. Compare the new build entry path with the previous build entry path
  3. Run DLL compilation (optimization) if the previous build doesn't have an entry path or if the new build entry path differs from the previous one
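
The sketch below shows, in plain JavaScript, roughly what that decision boils down to. The function and variable names are illustrative only, not the actual Kibana implementation:

    // Illustrative only: decide whether DLL compilation (optimization) is needed
    // by comparing the previous build's entry path with the new one.
    function needsDllCompilation(previousEntryPath, newEntryPath) {
      // The previous build has no entry path: compile.
      if (!previousEntryPath) {
        return true;
      }
      // The new entry path differs from the previous one: compile.
      return previousEntryPath !== newEntryPath;
    }

    // Hypothetical usage:
    console.log(needsDllCompilation(null, 'optimize/bundles/timelion.entry.js')); // true
    console.log(needsDllCompilation(
      'optimize/bundles/timelion.entry.js',
      'optimize/bundles/timelion.entry.js'
    )); // false, so optimization is skipped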

Diagnosis:
Logs were added in three functions, registerBeforeCompileHook, registerCompilationHook, and registerDoneHook, found in the kibana/src/optimize/dynamic_dll_plugin/dynamic_dll_plugin.js file.

  • registerBeforeCompileHook: this function is called when the DLL plugin task starts. It ensures modules are registered before compilation. A log was added to track whether all the modules were registered.
  • registerCompilationHook: this is the function that is called when node scripts/kibana is run, as described above. Logs were added to check whether the previous and new build entry paths are the same, whether compilation is required, and whether the DLL bundles exist.
  • registerDoneHook: this function ensures the compilation checks are finished. Logs were added to check whether there were any modules that needed compilation but were omitted.

Update:

  • The stats.compilation.needsDLLCompilation variable comes back undefined in registerDoneHook: whenever the optimization failure occurs, this variable is undefined (see the sketch below).
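
For reference, the kind of diagnostic log added to registerDoneHook looks roughly like the snippet below. This is an illustrative sketch using webpack 4's hook API; the helper name is hypothetical and this is not the exact Kibana code:

    // Illustrative sketch: tap webpack's "done" hook on a compiler instance and log
    // the flag that the investigation above found to be undefined in the failing runs.
    function addDllDoneDebugLog(compiler) {
      compiler.hooks.done.tapAsync('DynamicDllPluginDoneDebug', (stats, cb) => {
        // In the failing jobs this prints "undefined" instead of a boolean.
        console.log(
          '[dll-debug] needsDLLCompilation =',
          stats.compilation.needsDLLCompilation
        );
        cb();
      });
    }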

@seraphjiang
Collaborator

What does "random" mean?

Besides ciGroup5, did we see any other ciGroup fail randomly?
What's different between the failed ciGroup5 and the other ciGroups that passed?

Did we ever pass ciGroup5? Did it pass in a local run?

@2samueld
Author

@seraphjiang In the job below, ciGroup9 fails because of an optimization failure. The difference between the failed ciGroup and the ciGroups that passed is the optimization issue, where some modules in the optimize/bundles directory are not found.

https://jenkins.bfs.sichend.people.aws.dev/blue/organizations/jenkins/Kibana/detail/bfs7.7.1/20/pipeline/86

@seraphjiang
Collaborator

How about running a single ciGroup separately? Could we always pass the tests then?
If so, could we try running the ciGroups sequentially instead of in parallel?

@2samueld
Author

2samueld commented Nov 23, 2021

Running the tests separately/sequentially takes a couple of hours to finish the functional tests, compared to running them in parallel, which takes about half an hour to 45 minutes.

As for the question "will running CiGroups separately always pass the tests?": there isn't enough data to show that it will. A couple of Jenkins jobs that run the functional tests sequentially will be started this evening; that should give us a picture of the test results.

Update (on running sequential functional tests):

Five Jenkins jobs were run and they all failed due to two functional test case failures in kibana/test/functional/apps/management/_index_pattern_filter.js. A PR will be created to fix this issue. Once the PR is merged, a Jenkins job will be started to confirm that all functional tests pass in BFS7.7.1.

Jenkins job logs:

  1. https://jenkins.bfs.sichend.people.aws.dev/blue/rest/organizations/jenkins/pipelines/Kibana/branches/bfs7.7.1/runs/100/nodes/39/steps/41/log/?start=0
  2. https://jenkins.bfs.sichend.people.aws.dev/blue/rest/organizations/jenkins/pipelines/Kibana/branches/bfs7.7.1/runs/99/nodes/39/steps/41/log/?start=0
  3. https://jenkins.bfs.sichend.people.aws.dev/blue/rest/organizations/jenkins/pipelines/Kibana/branches/bfs7.7.1/runs/98/nodes/39/steps/41/log/?start=0
  4. https://jenkins.bfs.sichend.people.aws.dev/blue/rest/organizations/jenkins/pipelines/Kibana/branches/bfs7.7.1/runs/93/nodes/39/steps/41/log/?start=0
  5. https://jenkins.bfs.sichend.people.aws.dev/blue/rest/organizations/jenkins/pipelines/Kibana/branches/bfs7.7.1/runs/95/nodes/41/steps/43/log/?start=0

Error log:

1) management
       
         index pattern filter
           should filter indexed fields:
     TimeoutError: Waiting for element to be located By(xpath, //a[descendant::*[text()='logstash-*']])
Wait timed out after 10002ms
      at /var/lib/jenkins/workspace/Kibana_bfs7.7.1/node_modules/selenium-webdriver/lib/webdriver.js:842:17
      at process._tickCallback (internal/process/next_tick.js:68:7)

               └- ✖ fail: "management  index pattern filter should filter indexed fields"
               │
             └-> "after each" hook
               │ info Taking screenshot "/var/lib/jenkins/workspace/Kibana_bfs7.7.1/test/functional/screenshots/failure/management  index pattern filter _after each_ hook.png"
               │ info Current URL is: http://localhost:5620/app/kibana#/management/kibana/index_patterns?_g=()
               │ info Saving page source to: /var/lib/jenkins/workspace/Kibana_bfs7.7.1/test/functional/failure_debug/html/management  index pattern filter _after each_ hook.html

  1) management
       
         index pattern filter
           "after each" hook for "should filter indexed fields":
     retry.try timeout: Error: retry.try timeout: TimeoutError: Waiting for element to be located By(css selector, [data-test-subj="deleteIndexPatternButton"])
Wait timed out after 10047ms
    at /var/lib/jenkins/workspace/Kibana_bfs7.7.1/node_modules/selenium-webdriver/lib/webdriver.js:842:17
    at process._tickCallback (internal/process/next_tick.js:68:7)
    at onFailure (/var/lib/jenkins/workspace/Kibana_bfs7.7.1/test/common/services/retry/retry_for_success.ts:28:9)
    at retryForSuccess (/var/lib/jenkins/workspace/Kibana_bfs7.7.1/test/common/services/retry/retry_for_success.ts:68:13)
  Error: retry.try timeout: Error: retry.try timeout: TimeoutError: Waiting for element to be located By(css selector, [data-test-subj="deleteIndexPatternButton"])
  Wait timed out after 10047ms
      at /var/lib/jenkins/workspace/Kibana_bfs7.7.1/node_modules/selenium-webdriver/lib/webdriver.js:842:17
      at process._tickCallback (internal/process/next_tick.js:68:7)
      at onFailure (test/common/services/retry/retry_for_success.ts:28:9)
      at retryForSuccess (test/common/services/retry/retry_for_success.ts:68:13)
      at onFailure (test/common/services/retry/retry_for_success.ts:28:9)
      at retryForSuccess (test/common/services/retry/retry_for_success.ts:68:13)

               └- ✖ fail: "management  index pattern filter "after each" hook for "should filter indexed fields""

@sichend
Collaborator

sichend commented Nov 29, 2021

I have spent some time since Friday researching this issue:

     │ proc [kibana]  FATAL  Error: Optimizations failure.

     │ proc [kibana]    1926 modules

     │ proc [kibana]     

     │ proc [kibana]     ERROR in ./optimize/bundles/timelion.entry.js

     │ proc [kibana]     Module build failed (from ./node_modules/thread-loader/dist/cjs.js):

     │ proc [kibana]     Thread Loader (Worker 0)

     │ proc [kibana]     ENOENT: no such file or directory, open '/var/lib/jenkins/workspace/Kibana_bfs7.7.1/optimize/bundles/timelion.entry.js'

The cause is a combination of two factors: the optimization process during Kibana startup and the Jenkins pipeline Docker plugin. Please see more details below.

Background knowledge:

  1. Kibana startup. With the current configuration in the functional testing, Kibana goes through the optimize stage during each startup. The optimizer overwrites the files in the root workspace */optimize/bundles/* with the newly compiled plugins.
  2. For each project, Jenkins maintains a workspace and downloads all the source code into it for each build. When the Jenkins pipeline runs commands in a Docker environment, it mounts the current project workspace from the Jenkins build agent into the container as the workspace, with read and write permissions. This type of mount persists any changes made in the container, e.g. caches from the build process, to the host file system.

Problem Root Cause:
Let's take a careful look at the Docker commands used to execute the functional tests in parallel:

// CI group 1
docker run -t -d -u 111:115 -w /var/lib/jenkins/workspace/Kibana_bfs7.7.1 -v /var/lib/jenkins/workspace/Kibana_bfs7.7.1:/var/lib/jenkins/workspace/Kibana_bfs7.7.1:rw,z -v /var/lib/jenkins/workspace/Kibana_bfs7.7.1@tmp:/var/lib/jenkins/workspace/Kibana_bfs7.7.1@tmp:rw,z  bfs7.7.1-test-image:108 cat

docker top 03ba0a2e99baeea696fbe9bbe1168cc2b199a273cb9a8ccf4a6dd1c6c0c1f22c -eo pid,comm

docker exec -it 03ba0a2e99baeea696fbe9bbe1168cc2b199a273cb9a8ccf4a6dd1c6c0c1f22c  "node scripts/functional_tests.js --config test/functional/config.js --include ciGroup1"

// CI group 2
docker run -t -d -u 111:115 -w /var/lib/jenkins/workspace/Kibana_bfs7.7.1 -v /var/lib/jenkins/workspace/Kibana_bfs7.7.1:/var/lib/jenkins/workspace/Kibana_bfs7.7.1:rw,z -v /var/lib/jenkins/workspace/Kibana_bfs7.7.1@tmp:/var/lib/jenkins/workspace/Kibana_bfs7.7.1@tmp:rw,z bfs7.7.1-test-image:108 cat
docker top 76a08b05326c3f895a8ec75ef8fa032d8ae303648e5602657ea0c9fa59b7dbc9 -eo pid,comm

docker exec -it 76a08b05326c3f895a8ec75ef8fa032d8ae303648e5602657ea0c9fa59b7dbc9  "node scripts/functional_tests.js --config test/functional/config.js --include ciGroup2"

Now we can clearly see that the Jenkins pipeline mounts the same workspace into multiple parallel containers with read/write permissions at the same time. Each parallel container starts up a Kibana instance and goes through the optimization process, overwriting the same files at the same location on the host. This creates a race condition among all the parallel containers during the Kibana optimization processes. Hence we see the "plugin file not found" errors, with different files affected each time.

How to solve this problem?
Two factors combine to cause this issue, so fixing either of them should resolve the flaky tests. Here are some suggestions, yet to be tested:

  1. From the Jenkins side, you can probably fix this by duplicating the workspace for each parallel container and specifying a separate volume mount for each container. Here is the documentation to start with: link
  2. From the Kibana side, disabling the optimization should solve the problem. However, I will let you experts decide whether this is an acceptable approach for this issue.

Please let me know if you have any questions. Thanks.

@sichend
Collaborator

sichend commented Nov 30, 2021

To speed up removing this blocker, I tested three different approaches yesterday and would like to share some test results. Please see the details below.

  1. Disabling optimization by changing the test Kibana environment configuration. This ended up being unsuccessful. Considering the intricacy of the Kibana system and startup process, this might not be a good way to move forward.

  2. Changing the working directory of the Kibana environment inside the Docker containers. There are two approaches.

    • Make a copy of the current Jenkins workspace for each parallel container, mount it into that container, run the Kibana environment, and clean up the workspace copies afterwards. This approach is quick and dirty: since the entire Kibana code base is fairly large, copying the directory 12 times consumes a fair amount of time and disk space. Here is the test run link; total run time is 37 minutes. Here is the sample code for this approach:
            stage("${currentCiGroup}") {
                // Give each parallel ciGroup its own full copy of the workspace
                sh "rm -rf ${env.WORKSPACE}_${currentCiGroup}"
                sh "cp -r ${env.WORKSPACE} ${env.WORKSPACE}_${currentCiGroup}"

                withEnv([
                    "TEST_BROWSER_HEADLESS=1",
                    "CI=1",
                    "CI_GROUP=${currentCiGroup}",
                    "GCS_UPLOAD_PREFIX=fake",
                    "TEST_KIBANA_HOST=localhost",
                    "TEST_KIBANA_PORT=6610",
                    "TEST_ES_TRANSPORT_PORT=9403",
                    "TEST_ES_PORT=9400",
                    "CI_PARALLEL_PROCESS_NUMBER=${currentStep}",
                    "JOB=ci${currentStep}",
                    "CACHE_DIR=${currentCiGroup}"
                ]) {
                    // Mount the per-ciGroup workspace copy into the container
                    testImage.inside("-v \"${env.WORKSPACE}_${currentCiGroup}:${env.WORKSPACE}_${currentCiGroup}\"") {
                        // Each sh step runs in its own shell, so the cd has to be combined
                        // with the test command for it to take effect
                        sh "cd ${env.WORKSPACE}_${currentCiGroup} && node scripts/functional_tests.js --config test/functional/config.js --include ${currentCiGroup}"
                    }
                }
                // Remove the workspace copy when the ciGroup finishes
                sh "rm -rf ${env.WORKSPACE}_${currentCiGroup}"
            }
    
    • The second approach is an improvement on the first. Considering that the race condition happens only in the */optimize/ directory, we can simply ignore all other directories. We only need to create an empty directory and mount it into each container as the */optimize/ directory of the workspace. In this case, all the containers share everything except the /optimize/ directory, which avoids the inefficiency of duplicating workspaces. This run completes in 24 minutes, compared to 37 minutes for the previous approach. Here is the run link and test code:
            stage("${currentCiGroup}") {
                // Create an empty per-ciGroup directory on the host
                sh "rm -rf ${env.WORKSPACE}_${currentCiGroup}"
                sh "mkdir ${env.WORKSPACE}_${currentCiGroup}"

                withEnv([
                    "TEST_BROWSER_HEADLESS=1",
                    "CI=1",
                    "CI_GROUP=${currentCiGroup}",
                    "GCS_UPLOAD_PREFIX=fake",
                    "TEST_KIBANA_HOST=localhost",
                    "TEST_KIBANA_PORT=6610",
                    "TEST_ES_TRANSPORT_PORT=9403",
                    "TEST_ES_PORT=9400",
                    "CI_PARALLEL_PROCESS_NUMBER=${currentStep}",
                    "JOB=ci${currentStep}",
                    "CACHE_DIR=${currentCiGroup}"
                ]) {
                    // Mount the empty directory over the shared workspace's /optimize directory,
                    // so each container gets its own optimize output while sharing everything else
                    testImage.inside("-v \"${env.WORKSPACE}_${currentCiGroup}:${env.WORKSPACE}/optimize\"") {
                        sh "node scripts/functional_tests.js --config test/functional/config.js --include ${currentCiGroup}"
                    }
                }
                // Clean up the per-ciGroup optimize directory
                sh "rm -rf ${env.WORKSPACE}_${currentCiGroup}"
            }
    

My conclusion is that the second approach, changing the Jenkinsfile, is cleaner, more efficient, and requires the fewest changes to the original test cases and environment. We can consider using this approach. However, please take on the challenge of improving it further. In my test code, I created a directory outside the current Jenkins workspace; that was a quick and dirty way to PoC. We should contain everything within the current Jenkins workspace. Therefore, a better approach is to create a directory ${env.WORKSPACE}/.optimize/${currentCiGroup}, mount that directory into each container, and eventually remove these temp directories.

Please let me know if you have questions.

@2samueld
Author

Fix for parallel racing functional test #80

With the second approach applied as follows, the parallel functional tests pass:

ciGroupsMap["${currentCiGroup}"] = {
    // Per-ciGroup optimize directory, kept inside the Jenkins workspace
    sh "rm -rf ${env.WORKSPACE}/.optimize/${currentCiGroup}"
    sh "mkdir -p ${env.WORKSPACE}/.optimize/${currentCiGroup}"
    stage("${currentCiGroup}") {
        withEnv([
            "TEST_BROWSER_HEADLESS=1",
            "JOB=ci${currentStep}",
            "CACHE_DIR=${currentCiGroup}"
        ]) {
            // Mount the per-ciGroup directory over the workspace's /optimize directory
            image.inside("-v \'${env.WORKSPACE}/.optimize/${currentCiGroup}:${env.WORKSPACE}/optimize\'") {
                sh "node scripts/functional_tests.js --config test/functional/config.js --include ${currentCiGroup}"
            }
        }
        // Clean up the per-ciGroup optimize directory
        sh "rm -rf ${env.WORKSPACE}/.optimize/${currentCiGroup}"
    }
}

Jenkins jobs were run to confirm stable results; the past 24 test jobs passed: https://jenkins.bfs.sichend.people.aws.dev/job/Kibana/job/bfs7.7.1/
