Slow snapshots to S3 #24123

Closed
jjfalling opened this issue Apr 15, 2017 · 17 comments
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs

@jjfalling

Elasticsearch version: 2.4.2

Plugins installed: Cloud AWS / Repository S3

JVM version: OpenJDK 1.8

OS version: Debian 8

Description of the problem including expected versus actual behavior:

Snapshots to S3 are significantly slower than transferring files to S3 using the AWS CLI. Uploading a test file with the AWS CLI reaches ~230MB/s, while snapshots only average around 25MB/s. Both tests were performed on the same EC2 instance, which was in the same region as the S3 bucket. The test index was 10GB with 5 shards and force merged to one segment. The snapshot throttle settings were also set to 10GB.
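For reference, the throttle mentioned above is a repository-level setting; a minimal sketch of how such a test repository might be registered (the repository name, bucket, and localhost endpoint are placeholders, not taken from the report):

# Register an S3 repository with the 10GB throttle described in the test above.
curl -XPUT 'http://localhost:9200/_snapshot/s3_test_repo' -H 'Content-Type: application/json' -d '
{
  "type": "s3",
  "settings": {
    "bucket": "my-snapshot-bucket",
    "max_snapshot_bytes_per_sec": "10gb",
    "max_restore_bytes_per_sec": "10gb"
  }
}'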

I also tested ES 5.2.2 and found the snapshot was significantly faster, but still much slower than using the AWS CLI.

I've already talked to @tlrx about this issue and we agreed I should open an issue.

@jasontedor
Member

@tlrx Can you take a look?

@abeyad

abeyad commented Apr 29, 2017

I believe this will be fixed by #24403

@tlrx
Member

tlrx commented May 2, 2017

@jjfalling Do you think you could run your tests again, but on a version that includes @abeyad's patch?

@jjfalling
Author

Sure, I'll try to test it this week.

@jjfalling
Author

@tlrx I've done some light testing and the snapshots are faster, but still much slower than the AWS CLI. I need to do more testing but I'm not sure when I will have time.

@vrizopoulos

@jjfalling, any progress with this? S3 snapshot speed is still comparatively slow for both backup and restore on 5.6.

@tlrx
Member

tlrx commented Oct 31, 2017

@vrizopoulos Yes, we identified a slowdown caused by an inefficient buffer initialization in the S3 repository and I'll create a fix soon. I'm also investigating using a more recent version of the AWS SDK that should speed up uploads (#26993). I'll update this issue once I have more.
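For context, the buffering behaviour mentioned here is tied to the repository-s3 buffer_size and chunk_size settings, which control how uploads are buffered and split into multipart parts. A minimal sketch of setting them explicitly; the repository name, bucket, and values are illustrative, and the defaults differ between versions, so check the repository-s3 docs for your release:

# Re-register the repository with explicit buffer/chunk sizes (illustrative values).
curl -XPUT 'http://localhost:9200/_snapshot/s3_test_repo' -H 'Content-Type: application/json' -d '
{
  "type": "s3",
  "settings": {
    "bucket": "my-snapshot-bucket",
    "buffer_size": "100mb",
    "chunk_size": "1gb"
  }
}'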

@tlrx
Member

tlrx commented Nov 9, 2017

I created #27280 to reduce memory allocation when snapshotting to S3 with Elasticsearch 6.x, and I saw a performance improvement. I plan to backport the change down to 6.0 and 5.6.5.

Note that AWS enabled throttling by default in March 2016, so one has to be careful when comparing AWS CLI and repository-s3 performance: depending on the versions, one may be throttled and the other not (also, as you noted, snapshots are rate limited to 40MB/sec by default in Elasticsearch).

Also, please note that the AWS CLI and the S3 repository plugin are two different beasts: the former can push files using zero-copy sendfile system calls, while the latter is rate limited by default and does much more than just uploading files.
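The AWS CLI's own throughput also depends on its S3 transfer configuration (upload concurrency and multipart sizes), which is another reason the comparison is not apples-to-apples. A sketch of the relevant knobs, with purely illustrative values:

# The AWS CLI parallelizes multipart uploads; these settings (written to ~/.aws/config)
# strongly influence the throughput it achieves.
aws configure set default.s3.max_concurrent_requests 20
aws configure set default.s3.multipart_threshold 64MB
aws configure set default.s3.multipart_chunksize 16MB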

@vrizopoulos

@tlrx, Thanks for the updates and your work on this. Based on the AWS article you've linked, it's the number of retries that gets throttled, not necessarily the transfer rate itself?

Regarding the 40MB/s default limit, is that something that can be overridden when creating the snapshot repo using the "max_snapshot_bytes_per_sec" and "max_restore_bytes_per_sec" configuration keys?

Finally, is there perhaps a way to allow multiple restores to execute in parallel, which would give us better aggregate performance?

Thanks

@tlrx
Member

tlrx commented Jan 11, 2018

Based on the AWS article you've linked, it's the number of retries that gets throttled, not necessarily the transfer rate itself?

Yes, I linked to this article because @abeyad mentioned issue #24403, which involves many retries when the incompatible snapshot blob is missing.

Regarding the 40MB/s default limit, is that something that can be overridden when creating the snapshot repo using the "max_snapshot_bytes_per_sec" and "max_restore_bytes_per_sec" configuration keys?

Yes, you can set both of them to 0 or -1 to disable the rate limiter. But test this before changing your production settings.
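A minimal sketch of what that could look like, by re-registering the repository with updated settings; the repository name and bucket are placeholders, and, as noted above, this should be tried outside production first:

# Disable the snapshot/restore rate limiter on an existing repository (placeholder names).
curl -XPUT 'http://localhost:9200/_snapshot/s3_test_repo' -H 'Content-Type: application/json' -d '
{
  "type": "s3",
  "settings": {
    "bucket": "my-snapshot-bucket",
    "max_snapshot_bytes_per_sec": "0",
    "max_restore_bytes_per_sec": "0"
  }
}'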

Finally, is there perhaps a way to allow multiple restores to execute in parallel, which would give us better aggregate performance?

The current snapshot/restore feature does not allow multiple snapshots or restores to be executed at the same time. But within a given snapshot or restore, multiple shards are snapshotted/restored in parallel. If both index-a and index-b were snapshotted at the same time, then when restoring that snapshot both indices will be restored in parallel. Does that help?
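To benefit from that per-shard parallelism, a single restore request can name several indices. A minimal sketch, with hypothetical repository, snapshot, and index names:

# Restore two indices from the same snapshot in one request; their shards
# are restored in parallel (all names are placeholders).
curl -XPOST 'http://localhost:9200/_snapshot/s3_test_repo/snapshot_1/_restore' -H 'Content-Type: application/json' -d '
{
  "indices": "index-a,index-b"
}'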

Finally, I'd love to have more feedback from users about repository-s3 performance on versions 5.6.5/6.0.1/6.1.0 to see the impact of #27280.

@JD557

JD557 commented Jan 25, 2018

Finally, I'd love to have more feedback from users about repository-s3 performance on versions 5.6.5/6.0.1/6.1.0 to see the impact of #27280.

This was an issue that was affecting us on 5.6.4 and the upgrade to 6.1.0 seems to have helped a ton (snapshots are about 30% faster).

Here's a redacted output from _cat/snapshots:

id                   duration indices total_shards
hourly-2018-01-24-11    14.9m     715         3567
hourly-2018-01-24-12    16.1m     717         3577
hourly-2018-01-24-13    15.4m     717         3577
hourly-2018-01-24-14    16.8m     717         3577
------------------------ Update ------------------
hourly-2018-01-24-16    17.4m     717         3577
hourly-2018-01-24-17      11m     717         3577
hourly-2018-01-24-18    11.6m     717         3577
hourly-2018-01-24-19      11m     717         3577
hourly-2018-01-24-20    11.4m     717         3577
hourly-2018-01-24-21    11.6m     717         3577
hourly-2018-01-24-22    10.7m     717         3577
hourly-2018-01-24-23    10.6m     717         3577
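A table like the one above can be produced with the _cat/snapshots API, selecting only those columns (the repository name is a placeholder):

curl 'http://localhost:9200/_cat/snapshots/s3_test_repo?v&h=id,duration,indices,total_shards'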

@tlrx
Member

tlrx commented Jan 25, 2018

@JD557 Thanks for the feedback!

@eherot

eherot commented Feb 6, 2018

Is it possible that this might affect deletes as well? I'm running 5.6.3 and snapshot deletes are quite slow:

2018-02-05 06:36:12,600 INFO      Deleting snapshot applogs-2017.10.27...
2018-02-05 07:02:54,618 INFO      Deleting snapshot applogs-2017.10.28...
2018-02-05 07:33:38,949 INFO      Deleting snapshot applogs-2017.10.29...
2018-02-05 07:59:19,782 INFO      Deleting snapshot applogs-2017.10.30...
2018-02-05 08:27:13,283 INFO      Deleting snapshot applogs-2017.10.31...

I don't know exactly what version of ES was used to make these snapshots but I don't think it was < 5.x.
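For context, a log like the one above suggests one delete call per snapshot, issued sequentially; each corresponds to a single snapshot delete request along these lines (the repository name is a placeholder):

# Delete a single snapshot from the repository (repository name is hypothetical).
curl -XDELETE 'http://localhost:9200/_snapshot/s3_test_repo/applogs-2017.10.27'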

@tlrx
Member

tlrx commented Feb 7, 2018

@eherot Some factors can affect the performance of a deletion, but I can't tell from your previous comment whether the deletion took more time than usual on your platform or whether you simply noticed that the deletions were slow.

Is it possible that this might affect deletes as well? I'm running 5.6.3 and snapshot deletes are quite slow:

Many factors can slow down a snapshot deletion. The issue fixed by #27280 mostly impacts snapshot creation, but it can affect deletion too, as many files are updated during the deletion process.

Most of the time the deletion is slow because the repository contains a lot of snapshots (the index-N file at the root of the S3 bucket is a good indicator of what counts as a "lot of snapshots") and/or the snapshots contain a lot of indices and shards. Since snapshots are incremental, the deletion has to update the list of files to keep or delete for every shard involved in the snapshot, and this can also take some time.
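A quick way to inspect this, assuming AWS CLI access to the bucket (the bucket name is a placeholder; if the repository was registered with a base_path, list that prefix instead):

# List the repository root to find the current index-N generation file.
aws s3 ls s3://my-snapshot-bucket/ | grep index-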

@clintongormley added the :Distributed Coordination/Snapshot/Restore label (Anything directly related to the `_snapshot/*` APIs) and removed the :Plugin Repository S3 label on Feb 14, 2018
@JD557

JD557 commented Feb 20, 2018

Recently, I've noticed a huge slowdown in snapshots. After looking at the elasticsearch logs, I've found a lot of errors like this:

[2018-02-20T14:58:48,669][WARN ][o.e.r.s.S3Repository     ] [qMs-_ZD] failed to read index file [index-0]
com.fasterxml.jackson.core.JsonParseException: Duplicate field 'hourly-2018-02-16-04'
 at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@1e926bbd; line: -1, column: 1117]
	at com.fasterxml.jackson.core.json.JsonReadContext._checkDup(JsonReadContext.java:204) ~[jackson-core-2.8.10.jar:2.8.10]
	at com.fasterxml.jackson.core.json.JsonReadContext.setCurrentName(JsonReadContext.java:198) ~[jackson-core-2.8.10.jar:2.8.10]
	at com.fasterxml.jackson.dataformat.smile.SmileParser._handleFieldName(SmileParser.java:1552) ~[jackson-dataformat-smile-2.8.10.jar:2.8.10]
	at com.fasterxml.jackson.dataformat.smile.SmileParser.nextToken(SmileParser.java:588) ~[jackson-dataformat-smile-2.8.10.jar:2.8.10]
	at org.elasticsearch.common.xcontent.json.JsonXContentParser.nextToken(JsonXContentParser.java:52) ~[elasticsearch-6.1.0.jar:6.1.0]
	at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardSnapshots.fromXContent(BlobStoreIndexShardSnapshots.java:257) ~[elasticsearch-6.1.0.jar:6.1.0]
	at org.elasticsearch.repositories.blobstore.BlobStoreFormat.read(BlobStoreFormat.java:113) ~[elasticsearch-6.1.0.jar:6.1.0]
	at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.readBlob(ChecksumBlobStoreFormat.java:114) ~[elasticsearch-6.1.0.jar:6.1.0]
	at org.elasticsearch.repositories.blobstore.BlobStoreFormat.read(BlobStoreFormat.java:89) ~[elasticsearch-6.1.0.jar:6.1.0]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$Context.buildBlobStoreIndexShardSnapshots(BlobStoreRepository.java:1070) ~[elasticsearch-6.1.0.jar:6.1.0]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$Context.delete(BlobStoreRepository.java:927) ~[elasticsearch-6.1.0.jar:6.1.0]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.delete(BlobStoreRepository.java:877) ~[elasticsearch-6.1.0.jar:6.1.0]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.deleteSnapshot(BlobStoreRepository.java:390) ~[elasticsearch-6.1.0.jar:6.1.0]
	at org.elasticsearch.snapshots.SnapshotsService.lambda$deleteSnapshotFromRepository$8(SnapshotsService.java:1292) ~[elasticsearch-6.1.0.jar:6.1.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:568) [elasticsearch-6.1.0.jar:6.1.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]

Is it possible that the latest changes output incorrect JSON somewhere?

@tlrx
Member

tlrx commented Feb 20, 2018

Is it possible that the latest changes output incorrect JSON somewhere?

I'd say no, the latest changes didn't impact the output of this file. The exception is triggered by #22073, which added detection of duplicated fields in XContent (in your case, a SMILE file). Can you please send me the index-0 file (tanguy at elastic.co)? I'm curious to see what it contains... It should not contain sensitive data, only snapshot and index names.

@tlrx
Member

tlrx commented Mar 26, 2018

We talked about this today with @ywelsch and we're going to close this issue. Some improvements have been made (#27280), and we cannot expect the same performance as the AWS CLI because the snapshot/restore process does more than just uploading files. Anyway, feel free to add any comment on this issue, as we're always happy to have more feedback on performance changes between versions.

@tlrx closed this as completed on Mar 26, 2018