testAbortedSnapshotDuringInitDoesNotStart fails with ClassCastException #38226

dnhatn · 2019-02-01T20:59:08Z

This test is failing with ClassCastException.

  1> org.elasticsearch.transport.RemoteTransportException: [node_sd3][127.0.0.1:34425][internal:cluster/snapshot/update_snapshot_status]
  1> Caused by: org.elasticsearch.transport.ResponseHandlerFailureTransportException: java.lang.ClassCastException: org.elasticsearch.snapshots.SnapshotShardsService$UpdateIndexShardSnapshotStatusResponse cannot be cast to org.elasticsearch.transport.TransportResponse$Empty
  1> Caused by: java.lang.ClassCastException: org.elasticsearch.snapshots.SnapshotShardsService$UpdateIndexShardSnapshotStatusResponse cannot be cast to org.elasticsearch.transport.TransportResponse$Empty
  1> 	at org.elasticsearch.snapshots.SnapshotShardsServic

I can't reproduce this locally but this test failed 4 times today.

 ./gradlew :server:integTest \
  -Dtests.seed=F8B30D4F31180D3F \
  -Dtests.class=org.elasticsearch.snapshots.SharedClusterSnapshotRestoreIT \
  -Dtests.method="testAbortedSnapshotDuringInitDoesNotStart" \
  -Dtests.security.manager=true \
  -Dtests.locale=cs \
  -Dtests.timezone=AET \
  -Dcompiler.java=11 \
  -Druntime.java=8

CI:

elasticmachine · 2019-02-01T20:59:09Z

Pinging @elastic/es-distributed

original-brownbear · 2019-02-01T21:01:17Z

Nasty ... I'll be able to get to this tomorrow afternoon. Should be an easy fix though.

Fix Incorrect Transport Response Handler Type

* The response type here is not empty and was always wrong, but this only became visible once 0a604e3 was introduced.
* As a result of 0a604e3 we started actually handling the response of this request and logging/handling exceptions. Before that, we silently dropped the ClassCastException here by using the empty response handler.
* Fix busy assert not handling `Exception`.
* Closes #38226
* Closes #38256
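For readers unfamiliar with how a wrongly typed handler can go unnoticed, here is a minimal standalone sketch. It is not Elasticsearch code: `Response`, `EmptyResponse`, `UpdateStatusResponse`, `ResponseHandler`, and `deliver` are hypothetical stand-ins for the types named in the stack trace. It only assumes that the handler's declared type is erased at compile time, so the mismatch surfaces as a `ClassCastException` when the response is delivered, and that whether anyone sees it depends on what the handler does with exceptions.

```java
final class HandlerTypeMismatchSketch {

    // Hypothetical stand-ins for the two response types in the stack trace.
    interface Response {}
    static final class EmptyResponse implements Response {}
    static final class UpdateStatusResponse implements Response {}

    // A handler declares the response type it expects.
    interface ResponseHandler<T extends Response> {
        void handleResponse(T response);
        void handleException(Exception e);
    }

    // Mimics a transport layer handing a deserialized response to a handler.
    // The unchecked cast compiles; the type check only happens inside the
    // handler's compiler-generated bridge method at call time.
    @SuppressWarnings("unchecked")
    static <T extends Response> void deliver(Response actual, ResponseHandler<T> handler) {
        try {
            handler.handleResponse((T) actual);
        } catch (ClassCastException e) {
            handler.handleException(e);
        }
    }

    // Handler typed for EmptyResponse, but the remote side sends UpdateStatusResponse.
    // Returns the exception the handler saw, or null if none was reported.
    static Exception sendWithWrongHandler() {
        final Exception[] seen = new Exception[1];
        deliver(new UpdateStatusResponse(), new ResponseHandler<EmptyResponse>() {
            @Override public void handleResponse(EmptyResponse response) {}
            @Override public void handleException(Exception e) { seen[0] = e; }
        });
        return seen[0];
    }

    // Handler typed for the response that is actually sent.
    static Exception sendWithMatchingHandler() {
        final Exception[] seen = new Exception[1];
        deliver(new UpdateStatusResponse(), new ResponseHandler<UpdateStatusResponse>() {
            @Override public void handleResponse(UpdateStatusResponse response) {}
            @Override public void handleException(Exception e) { seen[0] = e; }
        });
        return seen[0];
    }
}
```

In this sketch, a handler whose `handleException` does nothing would hide the mismatch entirely, which matches the description above of the exception being dropped until responses were actually handled.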

original-brownbear · 2019-02-04T08:09:58Z

Reopening since this is still being reported in #38264 (comment)

cbuescher · 2019-02-04T09:42:10Z

Muted again on master with 15510da.

original-brownbear · 2019-02-04T12:05:08Z

I tracked this down now; this is a real bug. The fix here is to do a refactoring similar to https://github.com/elastic/elasticsearch/compare/master...ywelsch:snapshot-refactored?expand=1#diff-a0853be4492c052f24917b5c1464003dR975 and remove the duplicate spots where we call endSnapshot(), so that the method is no longer called concurrently.
I'll try to code it up today.

cbuescher · 2019-02-04T14:46:38Z

Muted by 715e581 on master

…38368)

* The problem in #38226 is that in some corner cases multiple calls to `endSnapshot` were made concurrently, leading to non-deterministic behavior (`beginSnapshot` was triggering a repository finalization while one triggered by a `deleteSnapshot` was already in progress).
* Fixed by:
  * Making all `endSnapshot` calls originate from the cluster state being in a "completed" state (apart from the short-circuit on initializing an empty snapshot). This forced putting the failure string into `SnapshotsInProgress.Entry`.
  * Adding deduplication logic to `endSnapshot`.
* Also:
  * Streamlined the init behavior to work the same way (keep state on the `SnapshotsService` to decide which snapshot entries are stale).
* Closes #38226
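The deduplication idea mentioned above can be sketched roughly as follows. This is an illustration under assumptions, not the code from the fix: the class `EndSnapshotDedupSketch`, the field `endingSnapshots`, and the methods `onFinalized` and `finalizationCount` are hypothetical, and the repository finalization is replaced by a counter.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class EndSnapshotDedupSketch {

    // Snapshots whose finalization is currently in flight (hypothetical field).
    private final Set<String> endingSnapshots = ConcurrentHashMap.newKeySet();

    // Counts how many finalizations were actually started.
    private final AtomicInteger finalizations = new AtomicInteger();

    // Returns true if this call started the finalization, false if another
    // caller already owns it. Set.add is atomic on this set, so only one of
    // several concurrent callers for the same snapshot can win.
    boolean endSnapshot(String snapshotId) {
        if (!endingSnapshots.add(snapshotId)) {
            return false;
        }
        finalizations.incrementAndGet(); // stand-in for the repository finalization
        return true;
    }

    // Called once finalization has completed, so the id can be reused.
    void onFinalized(String snapshotId) {
        endingSnapshots.remove(snapshotId);
    }

    int finalizationCount() {
        return finalizations.get();
    }
}
```

With this shape, a second `endSnapshot("snap-1")` call made while the first is still in flight returns `false` and starts nothing, so a finalization triggered from two different code paths runs only once.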

dnhatn added the `:Distributed Coordination/Snapshot/Restore` and `>test-failure` labels Feb 1, 2019

dnhatn assigned tlrx and original-brownbear Feb 1, 2019

dnhatn added a commit to dnhatn/elasticsearch that referenced this issue Feb 1, 2019

AwaitsFix testAbortedSnapshotDuringInitDoesNotStart

ce1ac36

Tracked at elastic#38226

dnhatn mentioned this issue Feb 1, 2019

AwaitsFix testAbortedSnapshotDuringInitDoesNotStart #38227

Merged

dnhatn added a commit that referenced this issue Feb 1, 2019

AwaitsFix testAbortedSnapshotDuringInitDoesNotStart (#38227)

9c39dea

Tracked at #38226

dnhatn added a commit to dnhatn/elasticsearch that referenced this issue Feb 2, 2019

AwaitsFix testAbortedSnapshotDuringInitDoesNotStart (elastic#38227)

2bc7f16

Tracked at elastic#38226

dnhatn mentioned this issue Feb 2, 2019

AwaitsFix testAbortedSnapshotDuringInitDoesNotStart #38252

Merged

dnhatn added a commit that referenced this issue Feb 2, 2019

AwaitsFix testAbortedSnapshotDuringInitDoesNotStart (#38227)

1c845d6

Tracked at #38226

original-brownbear mentioned this issue Feb 2, 2019

Fix Incorrect Transport Response Handler Type #38264

Merged

original-brownbear closed this as completed in #38264 Feb 3, 2019

original-brownbear reopened this Feb 4, 2019

This was referenced Feb 4, 2019

Mute SharedClusterSnapshotRestoreIT#testAbortedSnapshotDuringInitDoes… #38303

Closed

Mute SharedClusterSnapshotRestoreIT#testAbortedSnapshotDuringInitDoesnt Start #38304

Merged

cbuescher mentioned this issue Feb 4, 2019

Mute DedicatedClusterSnapshotRestoreIT#testRestoreShrinkIndex #38330

Merged

original-brownbear mentioned this issue Feb 4, 2019

Fix Concurrent Snapshot Ending And Stabilize Snapshot Finalization #38368

Merged

original-brownbear closed this as completed in #38368 Feb 5, 2019

original-brownbear mentioned this issue Feb 6, 2019

[CI] SharedClusterSnapshotRestoreIT.testAbortedSnapshotDuringInitDoesNotStart Fails #38489

Closed
