[CI] AzureStorageCleanupThirdPartyTests.testCreateSnapshot failure #47202

Closed
henningandersen opened this issue Sep 27, 2019 · 12 comments · Fixed by #47284
Labels: :Distributed Coordination/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs), >test-failure (Triaged test failures from CI)

Comments

@henningandersen
Contributor

Example failure:

https://gradle-enterprise.elastic.co/s/zb2xff2k72smu/console-log#L926

Same build as in #47201, so it could be related. Looking at the build-stats, it also started failing yesterday (Sept 26th):

https://build-stats.elastic.co/app/kibana#/discover?_g=(refreshInterval:(pause:!t,value:0),time:(from:now-90d,mode:quick,to:now))&_a=(columns:!(_source),index:e58bf320-7efd-11e8-bf69-63c8ef516157,interval:auto,query:(language:lucene,query:AzureStorageCleanupThirdPartyTests),sort:!(time,desc))

Failure:

org.elasticsearch.repositories.RepositoryException: [test-repo] concurrent modification of the index-N file, expected current generation [0], actual current generation [1] - possibly due to simultaneous snapshot deletion requests

Trying to reproduce locally, I get a different failure.

@henningandersen henningandersen added :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI labels Sep 27, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@tlrx tlrx self-assigned this Sep 27, 2019
henningandersen added a commit to henningandersen/elasticsearch that referenced this issue Sep 27, 2019
@henningandersen
Contributor Author

testCleanup and testListChildren also fail, see:

https://gradle-enterprise.elastic.co/s/ba5gc5zzf7hgc/console-log#L516

testCleanup failure:

org.elasticsearch.repositories.RepositoryException: [test-repo] concurrent modification of the index-N file, expected current generation [0], actual current generation [1] - possibly due to simultaneous snapshot deletion requests

and testListChildren failure:

java.lang.AssertionError:
Expected: iterable with items ["foo"] in any order
but: not matched: "tests-FzbJKGU8R_6ldXpsqGAzcg"

All failures so far only against master.

henningandersen added a commit that referenced this issue Sep 27, 2019
Muted testCreateSnapshot, testCleanup and testListChildren

Relates #47202
@original-brownbear
Member

This looks like an infra issue with these tests running concurrently on the same bucket (e.g. the above ... we're running against an empty bucket there, so the only way a verification file that isn't supposed to exist could be there is if some other actor created it).

@mark-vieira @atorok did anything change infra-wise here that may cause the various 3rd party Azure tests to run in parallel now? (I see we had some changes to REST tests, but I don't know enough about this to judge whether they may have introduced new parallelism.)
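
To illustrate the failure mode described above, here is a minimal toy model in Java. It is not Elasticsearch's repository code: it only assumes that a writer must pass the generation it last read, and that two test suites point at the same base path. The class `SharedRepoSketch` and its methods are hypothetical names made up for this sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: NOT Elasticsearch's repository implementation.
final class SharedRepoSketch {
    // Hypothetical state: base path -> generation of the latest index-N file.
    private final Map<String, Long> generations = new HashMap<>();

    // Generation a writer sees before it starts (0 for an untouched path in this sketch).
    long currentGeneration(String basePath) {
        return generations.getOrDefault(basePath, 0L);
    }

    // Writes a new index-N file only if nobody advanced the generation in the meantime.
    long writeIndex(String basePath, long expectedGeneration) {
        long actual = currentGeneration(basePath);
        if (actual != expectedGeneration) {
            throw new IllegalStateException(
                "concurrent modification of the index-N file, expected current generation ["
                    + expectedGeneration + "], actual current generation [" + actual + "]");
        }
        generations.put(basePath, actual + 1);
        return actual + 1;
    }
}
```

If two suites both read generation 0 from the same base path and one of them writes first, the second `writeIndex` call throws with the same shape of message as the CI failure. With distinct base paths, both writes succeed.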

@mark-vieira
Contributor

I think the change is that we run all CI builds in parallel now. That said, how things are parallelized still depends on how the build and projects are set up. The :plugins:repository-azure:thirdPartyTest task only runs a single test class, so there's no parallelization there. However, it was running at the same time as :plugins:repository-azure:qa:microsoft-azure-storage:integTestRunner. Do these two test suites both utilize the same Azure services? If so, should we perhaps configure them with unique buckets so they don't clobber each other?

@alpar-t
Contributor

alpar-t commented Sep 30, 2019

Maybe this is similar to #46813? Moving the test fixtures that just run with a JavaExec into a test-fixtures-powered container would help both to verify that multiple tests are not using a fixture in a way that could break with --parallel, and to allow running multiple instances so that separate tests stay independent.

@original-brownbear
Member

@atorok @mark-vieira thanks for taking a look! The failing tests aren't running fixtures; they are 3rd party tests running against real Azure. But yea, the problem remains the same, as Mark points out :)

Do these two test suites both utilize the same Azure services? If so, should we perhaps configure them with unique buckets so they don't clobber each other?

Jup, this is what we need to do; your analysis is spot on :) What's the preferred way of doing that exclusion these days?

@tlrx
Member

tlrx commented Sep 30, 2019

Instead of different buckets, we might just use different base paths?

@original-brownbear
Member

@tlrx yea, we could do that (it would be an easy change: just add some random things to the paths in the build.gradle where we define them). That might be our best move ... much better than needlessly slowing things down by making them run sequentially :)
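
A rough sketch of the idea in Java, under stated assumptions: the real change belongs in the Gradle build files and may look quite different. `UniqueBasePaths`, `uniqueBasePath` and `listChildren` are hypothetical names invented for illustration, and the blob store is modelled as a plain list of keys.

```java
import java.util.List;
import java.util.UUID;
import java.util.stream.Collectors;

// Illustration only: hypothetical helpers, not part of the Elasticsearch build.
final class UniqueBasePaths {
    // Appends a random suffix so that concurrently running suites get distinct prefixes.
    static String uniqueBasePath(String suiteName) {
        return suiteName + "_" + UUID.randomUUID();
    }

    // Returns the keys stored under basePath, with the "basePath/" prefix stripped.
    static List<String> listChildren(List<String> allKeys, String basePath) {
        String prefix = basePath + "/";
        return allKeys.stream()
            .filter(key -> key.startsWith(prefix))
            .map(key -> key.substring(prefix.length()))
            .collect(Collectors.toList());
    }
}
```

With one base path per suite, a listing under the first suite's path no longer sees entries written by the second suite, which is the kind of stray entry the testListChildren assertion tripped over.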

@tlrx
Member

tlrx commented Sep 30, 2019

@original-brownbear I think so. I'll open a PR in this direction.

@tlrx
Member

tlrx commented Sep 30, 2019

I opened #47284

@original-brownbear
Member

@tlrx one thing to note here is that we need that same fix for GCS and S3 as well, don't we? I think it's just by random Gradle luck that we run into the issue for Azure only right now.
I don't see a reason why we couldn't theoretically run into the same thing for the other two 3rd party runs.

@tlrx
Member

tlrx commented Sep 30, 2019

@original-brownbear Yes, I started to look at this too (but tackling one provider at a time)

tlrx added a commit that referenced this issue Sep 30, 2019
This commit changes the repositories' base paths used in Azure/S3/GCS
integration tests so that they don't conflict with each other when tests
run in parallel on real storage services.

Closes #47202
tlrx added a commit that referenced this issue Oct 1, 2019
…7300)

This commit changes the repositories' base paths used in Azure/S3/GCS
integration tests so that they don't conflict with each other when tests
run in parallel on real storage services.

Closes #47202
tlrx added a commit that referenced this issue Oct 1, 2019
This commit changes the repositories' base paths used in Azure/S3/GCS
integration tests so that they don't conflict with each other when tests
run in parallel on real storage services.

Closes #47202