[CI] :qa:full-cluster-restart:v7.3.2#upgradedClusterTest Failed on Windows #46014

Closed
original-brownbear opened this issue Aug 27, 2019 · 10 comments · Fixed by #46539
Labels: :Delivery/Build (Build or test infrastructure), Team:Delivery (Meta label for Delivery team), >test-failure (Triaged test failures from CI)

Comments

@original-brownbear
Member

The test goal failed with this error:


* Where:
--
Build file 'C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2016\qa\full-cluster-restart\build.gradle' line: 63
 
* What went wrong:
Execution failed for task ':qa:full-cluster-restart:v7.3.2#upgradedClusterTest'.
> Unable to delete directory 'C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2016\qa\full-cluster-restart\build\testclusters\v7.3.2-0\distro'
Failed to delete some children. This might happen because a process has files open or has its working directory set in the target directory.
- C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2016\qa\full-cluster-restart\build\testclusters\v7.3.2-0\distro\lib\elasticsearch-7.3.2-SNAPSHOT.jar
- C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2016\qa\full-cluster-restart\build\testclusters\v7.3.2-0\distro\lib\elasticsearch-cli-7.3.2-SNAPSHOT.jar
- C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2016\qa\full-cluster-restart\build\testclusters\v7.3.2-0\distro\lib\elasticsearch-core-7.3.2-SNAPSHOT.jar
- C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2016\qa\full-cluster-restart\build\testclusters\v7.3.2-0\distro\lib\elasticsearch-geo-7.3.2-SNAPSHOT.jar
- C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2016\qa\full-cluster-restart\build\testclusters\v7.3.2-0\distro\lib\elasticsearch-launchers-7.3.2-SNAPSHOT.jar
- C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2016\qa\full-cluster-restart\build\testclusters\v7.3.2-0\distro\lib\elasticsearch-plugin-classloader-7.3.2-SNAPSHOT.jar
- C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2016\qa\full-cluster-restart\build\testclusters\v7.3.2-0\distro\lib\elasticsearch-secure-sm-7.3.2-SNAPSHOT.jar
- C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2016\qa\full-cluster-restart\build\testclusters\v7.3.2-0\distro\lib\elasticsearch-x-content-7.3.2-SNAPSHOT.jar
- C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2016\qa\full-cluster-restart\build\testclusters\v7.3.2-0\distro\lib\HdrHistogram-2.1.9.jar
- C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2016\qa\full-cluster-restart\build\testclusters\v7.3.2-0\distro\lib\hppc-0.8.1.jar
- C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2016\qa\full-cluster-restart\build\testclusters\v7.3.2-0\distro\lib\jackson-core-2.8.11.jar
- C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2016\qa\full-cluster-restart\build\testclusters\v7.3.2-0\distro\lib\jackson-dataformat-cbor-2.8.11.jar
- C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2016\qa\full-cluster-restart\build\testclusters\v7.3.2-0\distro\lib\jackson-dataformat-smile-2.8.11.jar
- C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2016\qa\full-cluster-restart\build\testclusters\v7.3.2-0\distro\lib\jackson-dataformat-yaml-2.8.11.jar

Not sure where this is coming from, but it seems that a cluster first fails to form, and then the cleanup fails because Gradle tries to delete files that are still in use by a running test cluster.

Build Scan

@original-brownbear added the :Delivery/Build and >test-failure labels Aug 27, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@matriv
Contributor

matriv commented Aug 29, 2019

Same error for :x-pack:plugin:ccr:qa:restart:followClusterRestartTest
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-windows-compatibility/os=windows-2019/107/console

@henningandersen
Contributor

henningandersen commented Sep 2, 2019

Similar error here. Though not identical, it looks like the same underlying issue:

https://gradle-enterprise.elastic.co/s/5tp5vjeeruqqk/failure?openFailures=WzBd&openStackTraces=WzFd#top=0

Unable to delete directory 'C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro'
  Failed to delete some children. This might happen because a process has files open or has its working directory set in the target directory.
  - C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro\modules\x-pack-ml\platform\windows-x86_64\bin\boost_date_time-vc141-mt-1_65_1.dll
  - C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro\modules\x-pack-ml\platform\windows-x86_64\bin\boost_filesystem-vc141-mt-1_65_1.dll

Failed around 5 times over the weekend with the same or a similar error.

@alpar-t
Contributor

alpar-t commented Sep 2, 2019

Looking at one of the stack traces of the failure:

at org.elasticsearch.gradle.testclusters.ElasticsearchNode.syncWithLinks(ElasticsearchNode.java:887)
at org.elasticsearch.gradle.testclusters.ElasticsearchNode.createWorkingDir(ElasticsearchNode.java:869)
at org.elasticsearch.gradle.testclusters.ElasticsearchNode.start(ElasticsearchNode.java:395)
at org.elasticsearch.gradle.testclusters.ElasticsearchNode.restart(ElasticsearchNode.java:474)
at org.elasticsearch.gradle.testclusters.ElasticsearchNode.goToNextVersion(ElasticsearchNode.java:486)
at org.gradle.api.internal.DefaultDomainObjectCollection.all(DefaultDomainObjectCollection.java:158)
at org.elasticsearch.gradle.testclusters.ElasticsearchCluster.goToNextVersion(ElasticsearchCluster.java:277)

A restart operation is implemented as a stop followed by a start.
The start does some cleanup if files exist. It's this cleanup that fails because some files can't be deleted, possibly because they are still open.
The stop does a recursive walk of processes, so it's unlikely that processes linger.
I encountered this in testing some time ago here.
@rjernst would you agree to adding a wait and retry in case the cleanup fails here?

@hendrikmuhs

@rjernst
Member

rjernst commented Sep 3, 2019

The start does some cleanup if files exist. It's this cleanup that fails because some files can't be deleted, possibly because they are still open.

How can they still be open? It looks like we always do a SIGKILL in this case, and even if we were doing a SIGTERM, there is logic in stop to wait on the pid. Retrying seems like it would be masking a different issue we need to identify.

@alpar-t
Contributor

alpar-t commented Sep 4, 2019

@rjernst
My understanding is that the fact that the process is no longer there doesn't guarantee that its resources were closed, at least not on Windows. So if we stop Elasticsearch and try to remove the files it was using in quick succession, there is no guarantee that those files were closed.
Of course, there's no way to know whether the OS simply hasn't released the resources yet or whether something is lurking, so I understand your concern, but I think a retry with a short timeout (e.g. we try to delete for 5 seconds max before we give up) would work if the OS is lagging behind, and is not likely to help if there are processes lurking.
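
The bounded retry proposed above could look roughly like the following. This is a minimal sketch, not the testclusters implementation: the class `RetryingDelete`, its method names, and the pause between attempts are all hypothetical, and only the "give up after about 5 seconds" idea comes from the comment. It rethrows the last failure once the deadline passes, so a genuinely lingering process is still reported rather than masked.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.time.Duration;

// Hypothetical helper for illustration; names are assumptions, not testclusters APIs.
final class RetryingDelete {

    /** Deletes a directory tree, children first. Throws if any entry cannot be removed. */
    static void deleteRecursively(Path root) throws IOException {
        if (Files.notExists(root)) {
            return; // nothing to clean up
        }
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                Files.delete(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc) throws IOException {
                if (exc != null) {
                    throw exc;
                }
                Files.delete(dir);
                return FileVisitResult.CONTINUE;
            }
        });
    }

    /**
     * Retries the delete until it succeeds or maxWait has elapsed.
     * The last IOException is rethrown once the deadline has passed.
     */
    static void deleteWithRetry(Path root, Duration maxWait, Duration pause)
            throws IOException, InterruptedException {
        long deadline = System.nanoTime() + maxWait.toNanos();
        while (true) {
            try {
                deleteRecursively(root);
                return;
            } catch (IOException e) {
                if (System.nanoTime() - deadline >= 0) {
                    throw e; // still locked after the grace period: surface the failure
                }
                Thread.sleep(pause.toMillis());
            }
        }
    }
}
```

A call such as `RetryingDelete.deleteWithRetry(distroDir, Duration.ofSeconds(5), Duration.ofMillis(200))` would match the 5-second budget mentioned above; the 200 ms pause is an arbitrary choice.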

@alpar-t self-assigned this Sep 5, 2019
@droberts195
Contributor

There is a variation on this in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-windows-compatibility/os=windows-2019/129/console

> Unable to delete directory 'C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro'
    Failed to delete some children. This might happen because a process has files open or has its working directory set in the target directory.
    - C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro\modules\x-pack-ml\platform\windows-x86_64\bin\boost_date_time-vc141-mt-x64-1_71.dll
    - C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro\modules\x-pack-ml\platform\windows-x86_64\bin\boost_filesystem-vc141-mt-x64-1_71.dll
    - C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro\modules\x-pack-ml\platform\windows-x86_64\bin\boost_iostreams-vc141-mt-x64-1_71.dll
    - C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro\modules\x-pack-ml\platform\windows-x86_64\bin\boost_program_options-vc141-mt-x64-1_71.dll
    - C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro\modules\x-pack-ml\platform\windows-x86_64\bin\boost_regex-vc141-mt-x64-1_71.dll
    - C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro\modules\x-pack-ml\platform\windows-x86_64\bin\controller.exe
    - C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro\modules\x-pack-ml\platform\windows-x86_64\bin\libapr-1.dll
    - C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro\modules\x-pack-ml\platform\windows-x86_64\bin\libapriconv-1.dll
    - C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro\modules\x-pack-ml\platform\windows-x86_64\bin\libaprutil-1.dll
    - C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro\modules\x-pack-ml\platform\windows-x86_64\bin\libMlCore.dll
    - C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro\modules\x-pack-ml\platform\windows-x86_64\bin\libxml2.dll
    - C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro\modules\x-pack-ml\platform\windows-x86_64\bin\log4cxx.dll
    - C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro\modules\x-pack-ml\platform\windows-x86_64\bin\msvcp140.dll
    - C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro\modules\x-pack-ml\platform\windows-x86_64\bin\vcruntime140.dll
    - C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro\modules\x-pack-ml\platform\windows-x86_64\bin\zlib1.dll
    - C:\Users\jenkins\workspace\elastic+elasticsearch+master+multijob-windows-compatibility\os\windows-2019\x-pack\plugin\ccr\qa\restart\build\testclusters\follow-cluster-0\distro\modules\x-pack-ml\platform\windows-x86_64\bin
    - and more ...

In this case the process that's preventing the files from being deleted is the ML controller process.

Doing a TerminateProcess on the ES JVM will not instantly kill the controller. Instead the controller notices that its stdin has been disconnected and exits. But obviously there can be a few milliseconds between brutally killing the ES JVM (but not the controller) and the controller reacting.

but I think a retry with a short timeout (e.g. we try to delete for 5 seconds max before we give up)

5 seconds should be ample for the controller to notice that the ES JVM has died and exit by itself (especially given that this problem is fairly rare even without any retries).

The only other alternative would be for the testclusters code to kill the controller explicitly as well as the JVM before cleaning up installation directories. But if multiple clusters are running in parallel that may not be easy. It would need to kill the controller whose parent process is the JVM. If it killed all processes named controller.exe then it would cripple the other clusters running in parallel on the same machine. So retrying for up to 5 seconds is probably simpler.
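
For comparison, the alternative described above (killing only the controller that belongs to a given JVM, not every `controller.exe` on the machine) can be sketched with the Java 9+ `ProcessHandle` API. This is an illustration under assumptions, not the testclusters code: `ProcessTreeKiller` and `killTree` are hypothetical names. The descendants are captured before the parent is killed, because the parent-child link may no longer be discoverable once the parent is gone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration; not part of testclusters.
final class ProcessTreeKiller {

    /**
     * Force-kills the given process and all of its descendants, then waits for
     * every one of them to exit. Returns true if all exited within the timeout.
     */
    static boolean killTree(ProcessHandle root, long timeoutMillis) throws InterruptedException {
        // Snapshot the descendants first, while the tree is still intact.
        List<ProcessHandle> tree = new ArrayList<>();
        root.descendants().forEach(tree::add);
        tree.add(root);

        for (ProcessHandle p : tree) {
            p.destroyForcibly();
        }

        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        for (ProcessHandle p : tree) {
            long remaining = Math.max(0L, deadline - System.nanoTime());
            try {
                p.onExit().get(remaining, TimeUnit.NANOSECONDS);
            } catch (TimeoutException | ExecutionException e) {
                return false; // at least one process did not exit in time
            }
        }
        return true;
    }
}
```

Because this only touches the descendants of one handle, clusters running in parallel on the same machine are left alone. Note that, per the earlier comment about Windows, a process having exited does not by itself guarantee that its file handles have been released.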

@droberts195
Contributor

Another one like the original issue description, i.e. due to .jar files rather than the ML controller: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-windows-compatibility/os=windows-2016/129/console

alpar-t added a commit to alpar-t/elasticsearch that referenced this issue Sep 10, 2019
When restarting the cluster we clean up old distribution files that might
still be in use by the OS.
Windows closes resources of dead processes async, so we do a couple of
retries to get around it.

Closes elastic#46014
alpar-t added a commit that referenced this issue Sep 16, 2019
…up (#46539)

* Retry deleting distro dir on windows

When restarting the cluster we clean up old distribution files that might
still be in use by the OS.
Windows closes resources of dead processes async, so we do a couple of
retries to get around it.

Closes #46014

* Avoid having to delete the distro folder.
alpar-t added a commit to alpar-t/elasticsearch that referenced this issue Oct 5, 2019
…up (elastic#46539)
alpar-t added a commit that referenced this issue Oct 7, 2019
* Use versions specific distribution folders so we don't need to clean up (#46539)

* Retry deleting distro dir on windows

When restarting the cluster we clean up old distribution files that might
still be in use by the OS.
Windows closes resources of dead processes async, so we do a couple of
retries to get around it.

Closes #46014

* Avoid having to delete the distro folder.

* Remove the use of ClusterFormationTasks from RestTestTask (#47022)

This PR removes a use-case of the ClusterFormationTasks and converts a
project that flew under the radar so far.
There's probably more clean-up possible here, but for now the goal is
to be able to remove that code after `RunTask` is also updated.

* Migrate some 7.x only projects
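
The idea of version-specific distribution folders named in the commit title above can be illustrated as follows. This is a sketch under assumptions only: the class `VersionedDistroDir`, the `distro/<version>` layout, and the example versions are hypothetical and are not taken from the actual change. The point is that each version gets its own folder, so moving to the next version never requires deleting files that the previous process may still hold open.

```java
import java.nio.file.Path;

// Hypothetical illustration; the real directory layout is not shown in this thread.
final class VersionedDistroDir {

    /** Returns a distribution directory that is unique per version under the node's working dir. */
    static Path forVersion(Path nodeWorkingDir, String version) {
        return nodeWorkingDir.resolve("distro").resolve(version);
    }
}
```

With this layout, an upgrade from `7.3.2-SNAPSHOT` to a newer version resolves to a different folder, and the old one can be left in place until the whole build directory is cleaned.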
@mark-vieira added the Team:Delivery label Nov 11, 2020
9 participants