[CI] :qa:full-cluster-restart:v7.3.2#upgradedClusterTest Failed on Windows #46014

original-brownbear · 2019-08-27T08:45:51Z

The test goal failed with this error:


* Where:
Build file 'C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2016\qa\full-cluster-restart\build.gradle' line: 63
 
* What went wrong:
Execution failed for task ':qa:full-cluster-restart:v7.3.2#upgradedClusterTest'.
> Unable to delete directory 'C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2016\qa\full-cluster-restart\build\testclusters\v7.3.2-0\distro'
Failed to delete some children. This might happen because a process has files open or has its working directory set in the target directory.
- C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2016\qa\full-cluster-restart\build\testclusters\v7.3.2-0\distro\lib\elasticsearch-7.3.2-SNAPSHOT.jar
- C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2016\qa\full-cluster-restart\build\testclusters\v7.3.2-0\distro\lib\elasticsearch-cli-7.3.2-SNAPSHOT.jar
- C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2016\qa\full-cluster-restart\build\testclusters\v7.3.2-0\distro\lib\elasticsearch-core-7.3.2-SNAPSHOT.jar
- C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2016\qa\full-cluster-restart\build\testclusters\v7.3.2-0\distro\lib\elasticsearch-geo-7.3.2-SNAPSHOT.jar
- C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2016\qa\full-cluster-restart\build\testclusters\v7.3.2-0\distro\lib\elasticsearch-launchers-7.3.2-SNAPSHOT.jar
- C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2016\qa\full-cluster-restart\build\testclusters\v7.3.2-0\distro\lib\elasticsearch-plugin-classloader-7.3.2-SNAPSHOT.jar
- C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2016\qa\full-cluster-restart\build\testclusters\v7.3.2-0\distro\lib\elasticsearch-secure-sm-7.3.2-SNAPSHOT.jar
- C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2016\qa\full-cluster-restart\build\testclusters\v7.3.2-0\distro\lib\elasticsearch-x-content-7.3.2-SNAPSHOT.jar
- C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2016\qa\full-cluster-restart\build\testclusters\v7.3.2-0\distro\lib\HdrHistogram-2.1.9.jar
- C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2016\qa\full-cluster-restart\build\testclusters\v7.3.2-0\distro\lib\hppc-0.8.1.jar
- C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2016\qa\full-cluster-restart\build\testclusters\v7.3.2-0\distro\lib\jackson-core-2.8.11.jar
- C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2016\qa\full-cluster-restart\build\testclusters\v7.3.2-0\distro\lib\jackson-dataformat-cbor-2.8.11.jar
- C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2016\qa\full-cluster-restart\build\testclusters\v7.3.2-0\distro\lib\jackson-dataformat-smile-2.8.11.jar
- C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2016\qa\full-cluster-restart\build\testclusters\v7.3.2-0\distro\lib\jackson-dataformat-yaml-2.8.11.jar

Not sure where this is coming from, but it seems first a cluster fails to form and then the cleanup fails because Gradle tries to delete files still in use by a running test cluster.

Build Scan


elasticmachine · 2019-08-27T08:45:53Z

Pinging @elastic/es-core-infra

matriv · 2019-08-29T11:22:08Z

Two more failures:

matriv · 2019-08-29T11:24:15Z

Same error for :x-pack:plugin:ccr:qa:restart:followClusterRestartTest
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-windows-compatibility/os=windows-2019/107/console

henningandersen · 2019-09-02T06:25:37Z

Similar error here. Though not identical, it looks like the same underlying issue:

https://gradle-enterprise.elastic.co/s/5tp5vjeeruqqk/failure?openFailures=WzBd&openStackTraces=WzFd#top=0

Unable to delete directory 'C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro'
  Failed to delete some children. This might happen because a process has files open or has its working directory set in the target directory.
  - C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro\modules\x-pack-ml\platform\windows-x86_64\bin\boost_date_time-vc141-mt-1_65_1.dll
  - C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro\modules\x-pack-ml\platform\windows-x86_64\bin\boost_filesystem-vc141-mt-1_65_1.dll

This failed around 5 times over the weekend with the same or a similar error.

alpar-t · 2019-09-02T09:47:27Z

Looking at one of the stack traces of the failure:

at org.elasticsearch.gradle.testclusters.ElasticsearchNode.syncWithLinks(ElasticsearchNode.java:887)
at org.elasticsearch.gradle.testclusters.ElasticsearchNode.createWorkingDir(ElasticsearchNode.java:869)
at org.elasticsearch.gradle.testclusters.ElasticsearchNode.start(ElasticsearchNode.java:395)
at org.elasticsearch.gradle.testclusters.ElasticsearchNode.restart(ElasticsearchNode.java:474)
at org.elasticsearch.gradle.testclusters.ElasticsearchNode.goToNextVersion(ElasticsearchNode.java:486)
at org.gradle.api.internal.DefaultDomainObjectCollection.all(DefaultDomainObjectCollection.java:158)
at org.elasticsearch.gradle.testclusters.ElasticsearchCluster.goToNextVersion(ElasticsearchCluster.java:277)

A restart operation is implemented as a stop followed by a start.
The start does some cleanup if files exist. It's this cleanup that fails because some files can't be deleted, possibly because they are still open.
The stop does a recursive walk of processes so it's unlikely that processes linger.
I have encountered this in testing some time ago here.
@rjernst would you agree to add a wait and retry in case the cleanup fails here?

hendrikmuhs · 2019-09-03T10:54:13Z

another one: https://gradle-enterprise.elastic.co/s/jsaajm7ezmr6q

rjernst · 2019-09-03T17:38:50Z

> The start does some cleanup if files exist. It's this cleanup that fails because some files can't be deleted possibly because they are still open.

How can they still be open? It looks like we always do a SIGKILL in this case, and even if we were doing a SIGTERM, there is logic in stop to wait on the pid. Retrying seems like it would be masking a different issue we need to identify.

alpar-t · 2019-09-04T09:10:03Z

@rjernst
My understanding is that the fact that the process is no longer there doesn't guarantee that its resources were closed. Not on Windows. So if we stop Elasticsearch and try to remove the files it was using in quick succession, there's no guarantee that those files were closed.
Of course, there's no way to know whether the OS simply hasn't released the resources yet or whether something is lurking, so I understand your concern. But I think a retry with a short timeout (e.g. we try to delete for at most 5 seconds before giving up) would work if the OS is lagging behind, and is unlikely to help if processes are lurking.
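For illustration, the retry proposed here could look roughly like the following. This is only a sketch under assumptions: `RetryingDelete`, `deleteRecursively` and `deleteWithRetry` are hypothetical names, not the testclusters code, and the timeout and pause values are placeholders.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of the Elasticsearch build code.
final class RetryingDelete {

    /** Deletes a directory tree, children before their parent directories. */
    static void deleteRecursively(Path root) throws IOException {
        if (Files.notExists(root)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order puts files and subdirectories before their parents.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.deleteIfExists(p);
        }
    }

    /**
     * Retries the delete until it succeeds or the timeout elapses.
     * The last IOException is rethrown so a genuinely stuck file still fails the build.
     */
    static void deleteWithRetry(Path root, Duration timeout, Duration pause)
            throws IOException, InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            try {
                deleteRecursively(root);
                return;
            } catch (IOException e) {
                if (System.nanoTime() - deadline >= 0) {
                    throw e; // give up: something is probably still holding the files
                }
                Thread.sleep(pause.toMillis());
            }
        }
    }
}
```

With a 5 second timeout this would cover an OS that is slow to release handles, while a lingering process would still surface as a failure once the deadline passes.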

droberts195 · 2019-09-09T11:08:27Z

There is a variation on this in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-windows-compatibility/os=windows-2019/129/console

> Unable to delete directory 'C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro'
    Failed to delete some children. This might happen because a process has files open or has its working directory set in the target directory.
    - C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro\modules\x-pack-ml\platform\windows-x86_64\bin\boost_date_time-vc141-mt-x64-1_71.dll
    - C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro\modules\x-pack-ml\platform\windows-x86_64\bin\boost_filesystem-vc141-mt-x64-1_71.dll
    - C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro\modules\x-pack-ml\platform\windows-x86_64\bin\boost_iostreams-vc141-mt-x64-1_71.dll
    - C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro\modules\x-pack-ml\platform\windows-x86_64\bin\boost_program_options-vc141-mt-x64-1_71.dll
    - C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro\modules\x-pack-ml\platform\windows-x86_64\bin\boost_regex-vc141-mt-x64-1_71.dll
    - C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro\modules\x-pack-ml\platform\windows-x86_64\bin\controller.exe
    - C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro\modules\x-pack-ml\platform\windows-x86_64\bin\libapr-1.dll
    - C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro\modules\x-pack-ml\platform\windows-x86_64\bin\libapriconv-1.dll
    - C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro\modules\x-pack-ml\platform\windows-x86_64\bin\libaprutil-1.dll
    - C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro\modules\x-pack-ml\platform\windows-x86_64\bin\libMlCore.dll
    - C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro\modules\x-pack-ml\platform\windows-x86_64\bin\libxml2.dll
    - C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro\modules\x-pack-ml\platform\windows-x86_64\bin\log4cxx.dll
    - C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro\modules\x-pack-ml\platform\windows-x86_64\bin\msvcp140.dll
    - C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro\modules\x-pack-ml\platform\windows-x86_64\bin\vcruntime140.dll
    - C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro\modules\x-pack-ml\platform\windows-x86_64\bin\zlib1.dll
    - C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro\modules\x-pack-ml\platform\windows-x86_64\bin
    - and more ...

In this case the process that's stopping the files being deleted is the ML controller process.

Doing a TerminateProcess on the ES JVM will not instantly kill the controller. Instead the controller notices that its stdin has been disconnected and exits. But obviously there can be a few milliseconds between brutally killing the ES JVM (but not the controller) and the controller reacting.

> but I think a retry with a short timeout (e.g. we try to delete for 5 seconds max before we give up)

5 seconds should be ample for the controller to notice that the ES JVM has died and exit by itself (especially given that this problem is fairly rare even without any retries).

The only other alternative would be for the testclusters code to kill the controller explicitly as well as the JVM before cleaning up installation directories. But if multiple clusters are running in parallel that may not be easy. It would need to kill the controller whose parent process is the JVM. If it killed all processes named controller.exe then it would cripple the other clusters running in parallel on the same machine. So retrying for up to 5 seconds is probably simpler.
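As a rough sketch of the alternative described above, killing only the processes that descend from one specific JVM rather than everything named controller.exe, the Java 9+ `ProcessHandle` API can enumerate descendants of a given process. `ProcessTreeKiller` and `killTree` are hypothetical names, and whether this is reliable on the Windows CI workers is an assumption this thread does not confirm.

```java
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the Elasticsearch build code.
final class ProcessTreeKiller {

    /**
     * Forcibly kills the given process and its descendants, then waits for each to exit.
     * Only descendants of this root are touched, so other clusters running in
     * parallel on the same machine are left alone.
     */
    static List<ProcessHandle> killTree(ProcessHandle root, long timeoutSeconds) throws Exception {
        // Snapshot the descendants first: once the parent is gone the
        // parent/child relationship can no longer be looked up.
        List<ProcessHandle> descendants = root.descendants().collect(Collectors.toList());

        root.destroyForcibly();
        for (ProcessHandle child : descendants) {
            child.destroyForcibly();
        }

        // Wait for the processes to actually terminate before touching their files.
        root.onExit().get(timeoutSeconds, TimeUnit.SECONDS);
        for (ProcessHandle child : descendants) {
            child.onExit().get(timeoutSeconds, TimeUnit.SECONDS);
        }
        return descendants;
    }
}
```

Even with this in place, file handles may be released slightly after the process exits, so it does not by itself rule out the need for a retry.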

droberts195 · 2019-09-09T11:48:11Z

Another one like the original issue description, i.e. due to .jar files rather than the ML controller: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-windows-compatibility/os=windows-2016/129/console

When restarting the cluster we clean up old distribution files that might still be in use by the OS. Windows closes resources of dead processes asynchronously, so we do a couple of retries to get around it. Closes elastic#46014

…up (#46539)
* Retry deleting distro dir on windows
  When restarting the cluster we clean up old distribution files that might still be in use by the OS. Windows closes resources of dead processes asynchronously, so we do a couple of retries to get around it. Closes #46014
* Avoid having to delete the distro folder.

* Use versions specific distribution folders so we don't need to clean up (#46539)
* Retry deleting distro dir on windows
  When restarting the cluster we clean up old distribution files that might still be in use by the OS. Windows closes resources of dead processes asynchronously, so we do a couple of retries to get around it. Closes #46014
* Avoid having to delete the distro folder.
* Remove the use of ClusterFormationTasks form RestTestTask (#47022)
  This PR removes a use-case of the ClusterFormationTasks and converts a project that flew under the radar so far. There's probably more clean-up possible here, but for now the goal is to be able to remove that code after `RunTask` is also updated.
* Migrate some 7.x only projects
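The idea named in the commit title above, giving each version its own distribution folder so that a restart never has to delete files a dying process may still hold, can be sketched as follows. `DistroDirs` and `distroDir` are hypothetical names and the exact directory layout is an assumption, not the merged implementation.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the Elasticsearch build code.
final class DistroDirs {

    /**
     * Returns a distribution directory that is unique per version, so moving a node
     * to the next version writes to a fresh folder instead of cleaning up the old one.
     */
    static Path distroDir(Path workingDir, String version) {
        return workingDir.resolve("distro").resolve(version);
    }

    public static void main(String[] args) {
        Path workingDir = Paths.get("build", "testclusters", "v7.3.2-0");
        // The old and new versions resolve to different folders.
        System.out.println(distroDir(workingDir, "7.3.2"));
        System.out.println(distroDir(workingDir, "8.0.0"));
    }
}
```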

original-brownbear added the :Delivery/Build and >test-failure labels Aug 27, 2019

alpar-t self-assigned this Sep 5, 2019

alpar-t mentioned this issue Sep 10, 2019

Use versions specific distribution folders so we don't need to clean up #46539

Merged

alpar-t closed this as completed in #46539 Sep 16, 2019

mark-vieira added the Team:Delivery label Nov 11, 2020
