Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

S3 snapshot fails unable to move the index file blob #33874

Closed
gregolsen opened this issue Sep 19, 2018 · 4 comments
Closed

S3 snapshot fails unable to move the index file blob #33874

gregolsen opened this issue Sep 19, 2018 · 4 comments
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs

Comments

@gregolsen
Copy link

Elasticsearch version: 6.3.0

Plugins installed: [discovery-ec2, repository-s3]

JVM version: java version "1.8.0_171"

OS version : Linux 4.14.51-60.38.amzn1.x86_64 #1 SMP Tue Jun 26 23:06:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

Occasionally cluster snapshot operation fails with Failed to finalize snapshot creation exception caused by AmazonS3Exception: The specified key does not exist. (see full stacktrace at the bottom). Resulted snapshot marked as partial. Exception is thrown in writeAtomic when moving a newly created temporary index file.


Temporary file itself is created in the same method by writeBlob method, which in turn calls writeBlob on S3BlobContainer. Note that before writing the blob blobExists is called to check whether it already exists.
public void writeBlob(String blobName, InputStream inputStream, long blobSize) throws IOException {
if (blobExists(blobName)) {
throw new FileAlreadyExistsException("Blob [" + blobName + "] already exists, cannot overwrite");
}

Most likely the problem is that S3 only guarantees eventual consistency for read-after-write in case "exists" request was issued before write.
According to S3 documentation (see "Amazon S3 Data Consistency Model
" section)
:

Amazon S3 provides read-after-write consistency for PUTS of new objects in your S3 bucket in all regions with one caveat. The caveat is that if you make a HEAD or GET request to the key name (to find if the object exists) before creating the object, Amazon S3 provides eventual consistency for read-after-write.

That's why move operation fails with The specified key does not exist – before writing the blob its existence is verified, which makes subsequent write operation to be eventually consistent and hence newly written index file is not immediately available.

As a potential solution BlobContainer interface could be extended with an additional method which writes the blob w/o verifying its existence. ChecksumBlobStoreFormat could use that method instead of writeBlob to make write operation immediately consistent. To avoid any potential name conflicts ChecksumBlobStoreFormat#tempBlobName could have some random part in it.

Provide logs (if relevant):

Full stacktrace
[2018-09-14T04:00:27,071][WARN ][o.e.s.SnapshotShardsService] [data.localdomain] [[index_name][6]][bucket_name:snapshot_2018-09-14t04:00:10z/vm0BgXHgSRuZad49bQ_OIA]
 failed to snapshot shard
org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: Failed to finalize snapshot creation [snapshot_2018-09-14t04:00:10z/vm0BgXHgSRuZad49bQ_OIA] with shard index [index-54]
        at org.elasticsearch.repositories.blobstore.BlobStoreRepository$Context.finalize(BlobStoreRepository.java:1015) ~[elasticsearch-6.3.0.jar:6.3.0]
        at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1255) ~[elasticsearch-6.3.0.jar:6.3.0]
        at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:813) ~[elasticsearch-6.3.0.jar:6.3.0]
        at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:410) ~[elasticsearch-6.3.0.jar:6.3.0]
        at org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:97) ~[elasticsearch-6.3.0.jar:6.3.0]
        at org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:354) [elasticsearch-6.3.0.jar:6.3.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:724) [elasticsearch-6.3.0.jar:6.3.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.3.0.jar:6.3.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_171]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_171]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
Caused by: java.io.IOException: com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: 599340CA5D293CCD; S3 Extended Request ID: BJID/TmaSF3GyAt8PYxin+3MQ2F6si83vFqfi3xXlZkNYotCWPBnc9UYsU8PYMKXQ8ffzK8Yluw=), S3 Extended Request ID: BJID/TmaSF3GyAt8PYxin+3MQ2F6si83vFqfi3xXlZkNYotCWPBnc9UYsU8PYMKXQ8ffzK8Yluw=
        at org.elasticsearch.repositories.s3.S3BlobContainer.move(S3BlobContainer.java:177) ~[?:?]
        at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.writeAtomic(ChecksumBlobStoreFormat.java:138) ~[elasticsearch-6.3.0.jar:6.3.0]
        at org.elasticsearch.repositories.blobstore.BlobStoreRepository$Context.finalize(BlobStoreRepository.java:986) ~[elasticsearch-6.3.0.jar:6.3.0]
        ... 10 more
        Suppressed: java.nio.file.NoSuchFileException: Blob [pending-index-54] does not exist
                at org.elasticsearch.repositories.s3.S3BlobContainer.deleteBlob(S3BlobContainer.java:116) ~[?:?]
                at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.writeAtomic(ChecksumBlobStoreFormat.java:142) ~[elasticsearch-6.3.0.jar:6.3.0]
                at org.elasticsearch.repositories.blobstore.BlobStoreRepository$Context.finalize(BlobStoreRepository.java:986) ~[elasticsearch-6.3.0.jar:6.3.0]
                at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1255) ~[elasticsearch-6.3.0.jar:6.3.0]
                at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:813) ~[elasticsearch-6.3.0.jar:6.3.0]
                at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:410) ~[elasticsearch-6.3.0.jar:6.3.0]
                at org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:97) ~[elasticsearch-6.3.0.jar:6.3.0]
                at org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:354) [elasticsearch-6.3.0.jar:6.3.0]
                at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:724) [elasticsearch-6.3.0.jar:6.3.0]
                at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.3.0.jar:6.3.0]
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_171]
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_171]
                at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: 599340CA5D293CCD; S3 Extended Request ID: BJID/TmaSF3GyAt8PYxin+3MQ2F6si83vFqfi3xXlZkNYotCWPBnc9UYsU8PYMKXQ8ffzK8Yluw=)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1639) ~[?:?]
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1304) ~[?:?]
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1056) ~[?:?]
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743) ~[?:?]
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717) ~[?:?]
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699) ~[?:?]
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667) ~[?:?]
        at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649) ~[?:?]
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513) ~[?:?]
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4247) ~[?:?]
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4194) ~[?:?]
        at com.amazonaws.services.s3.AmazonS3Client.copyObject(AmazonS3Client.java:1870) ~[?:?]
        at org.elasticsearch.repositories.s3.S3BlobContainer.lambda$move$6(S3BlobContainer.java:172) ~[?:?]
        at org.elasticsearch.repositories.s3.SocketAccess.lambda$doPrivilegedVoid$0(SocketAccess.java:57) ~[?:?]
        at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_171]
        at org.elasticsearch.repositories.s3.SocketAccess.doPrivilegedVoid(SocketAccess.java:56) ~[?:?]
        at org.elasticsearch.repositories.s3.S3BlobContainer.move(S3BlobContainer.java:171) ~[?:?]
        at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.writeAtomic(ChecksumBlobStoreFormat.java:138) ~[elasticsearch-6.3.0.jar:6.3.0]
@dnhatn dnhatn added the :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs label Sep 20, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@ywelsch
Copy link
Contributor

ywelsch commented Sep 20, 2018

@gregolsen this has been fixed in ES v6.4.0, see #31729 and #31128.

@ywelsch ywelsch closed this as completed Sep 20, 2018
@gregolsen
Copy link
Author

@ywelsch I'm embarrassed, thank you and sorry for taking your time on this one!

@ywelsch
Copy link
Contributor

ywelsch commented Sep 20, 2018

@gregolsen there's no reason to feel embarrassed. I prefer closing an issue that's already fixed to missing out on a bug report. Thanks for sharing your findings and analysis with us.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs
Projects
None yet
Development

No branches or pull requests

4 participants