Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

testAbortedSnapshotDuringInitDoesNotStart fails with ClassCastException #38226

Closed
dnhatn opened this issue Feb 1, 2019 · 6 comments · Fixed by #38368
Closed

testAbortedSnapshotDuringInitDoesNotStart fails with ClassCastException #38226

dnhatn opened this issue Feb 1, 2019 · 6 comments · Fixed by #38368
Assignees
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI

Comments

@dnhatn
Copy link
Member

dnhatn commented Feb 1, 2019

This test is failing with ClassCastException.

  1> org.elasticsearch.transport.RemoteTransportException: [node_sd3][127.0.0.1:34425][internal:cluster/snapshot/update_snapshot_status]
  1> Caused by: org.elasticsearch.transport.ResponseHandlerFailureTransportException: java.lang.ClassCastException: org.elasticsearch.snapshots.SnapshotShardsService$UpdateIndexShardSnapshotStatusResponse cannot be cast to org.elasticsearch.transport.TransportResponse$Empty
  1> Caused by: java.lang.ClassCastException: org.elasticsearch.snapshots.SnapshotShardsService$UpdateIndexShardSnapshotStatusResponse cannot be cast to org.elasticsearch.transport.TransportResponse$Empty
  1> 	at org.elasticsearch.snapshots.SnapshotShardsServic

I can't reproduce this locally but this test failed 4 times today.

 ./gradlew :server:integTest \
  -Dtests.seed=F8B30D4F31180D3F \
  -Dtests.class=org.elasticsearch.snapshots.SharedClusterSnapshotRestoreIT \
  -Dtests.method="testAbortedSnapshotDuringInitDoesNotStart" \
  -Dtests.security.manager=true \
  -Dtests.locale=cs \
  -Dtests.timezone=AET \
  -Dcompiler.java=11 \
  -Druntime.java=8

CI:

@dnhatn dnhatn added :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI labels Feb 1, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@original-brownbear
Copy link
Member

Nasty ... I'll be able to get to this tomorrow afternoon. Should be an easy fix though.

dnhatn added a commit to dnhatn/elasticsearch that referenced this issue Feb 2, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Feb 2, 2019
* The response type here is not empty and was always wrong but this only became visible now that 0a604e3 was introduced
   * As a result of 0a604e3 we started actually handling the response
of this request and logging/handling exceptions before that we simply dropped the classcast exception here quietly using the empty response handler
* Closes elastic#38226
original-brownbear added a commit that referenced this issue Feb 3, 2019
* Fix Incorrect Transport Response Handler Type
* The response type here is not empty and was always wrong but this only became visible now that 0a604e3 was introduced
   * As a result of 0a604e3 we started actually handling the response
of this request and logging/handling exceptions before that we simply dropped the classcast exception here quietly using the empty response handler
* fix busy assert not handling `Exception`
* Closes #38226
* Closes #38256
@original-brownbear
Copy link
Member

Reopening since this is still being reported in #38264 (comment)

@cbuescher
Copy link
Member

Muted again on master with 15510da.

@original-brownbear
Copy link
Member

I tracked this down now, this is a real bug. The fix here is to do a refactoring similar to https://github.com/elastic/elasticsearch/compare/master...ywelsch:snapshot-refactored?expand=1#diff-a0853be4492c052f24917b5c1464003dR975 and remove the duplicate spots where we call endSnapshot() to avoid concurrently calling the method.
I'll try to code it up today.

@cbuescher
Copy link
Member

Muted by 715e581 on master

original-brownbear added a commit that referenced this issue Feb 5, 2019
…38368)

* The problem in #38226 is that in some corner cases multiple calls to `endSnapshot` were made concurrently, leading to non-deterministic behavior (`beginSnapshot` was triggering a repository finalization while one that was triggered by a `deleteSnapshot` was already in progress)
   * Fixed by:
      * Making all `endSnapshot` calls originate from the cluster state being in a "completed" state (apart from on short-circuit on initializing an empty snapshot). This forced putting the failure string into `SnapshotsInProgress.Entry`.
      * Adding deduplication logic to `endSnapshot`
* Also:
  * Streamlined the init behavior to work the same way (keep state on the `SnapshotsService` to decide which snapshot entries are stale)
* closes #38226
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants