Enable Azure snapshot plugin to support taking snapshot into multiple storage accounts. #22709
Conversation
Thanks for the PR. Please could I ask you to sign the CLA before we review it?
Perhaps we need to be better about detecting throttling events and backing off instead of failing, and you also may need to expand your storage to increase the IOPS capacity, but I don't think we should make the (already complicated) repository settings here even more complicated. Introducing multiple accounts for a single repository raises all kinds of questions that I don't think we should be worrying about.
I'm not sure this will solve the problem either. What if all configured storage accounts become full - then what? We would have to rehash the blobs to different buckets once more storage accounts are added. I think it will become too difficult to maintain and will require a lot of logic to solve a problem that most don't have. The simple solution here would be to create a new repository located in a different storage account. It will add a bit of extra complexity for the user, who has to look into multiple repositories to find the snapshot they may be looking for, but in this case, I believe it's the right tradeoff.
Sorry I misunderstood regarding a throttling limit on the Azure storage accounts vs. a size limit. In any case, I think the complexity argument still holds (adding or removing a storage account would require rehashing all blobs to different accounts).
++
@JeffreyZZ Thank you for the PR, but we are going to decline it for now. I suggest looking into increasing your IOPS as I mentioned earlier, and I opened an issue to make our behavior on throttling better (so as not to fail a snapshot when throttled): #22728. And you are of course welcome to work on a PR for that issue!
Thanks for the quick response. I think the feature enabling Elasticsearch to write snapshots to and read them from multiple Azure storage accounts is very important for running big production clusters (50+ data nodes) on a public cloud such as Azure. Here I'd like to provide more details from the investigation we did before I started adding this feature to the plugin for our production cluster running on Azure.
Hi @JeffreyZZ, we've had a long internal discussion about this PR and the problem in general. We believe that this PR is not the right way to solve the problem because of the notion of bucketing blobs based on the number of accounts; if you change the number of accounts, you break everything (you can't restore anymore), and your future snapshots will be sending blobs to different accounts. The design is fundamentally flawed. Long term, we would like to rewrite snapshot-restore to use Lucene's recovery process instead of the BlobStore that we use today. With that rewrite in place, we could treat multiple repos as separate disks, and put one index on each "disk" (the same way we treat multiple local mount-points today). This would solve your issue in a much cleaner way. Obviously, that is a major rewrite and will not be happening anytime soon. In the meantime (and given the limitations of Azure) I'd suggest breaking your snapshots down by index (which could be sent to different accounts) so that you do not run into these issues with throttling.
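The bucketing concern described above can be made concrete with a small, hypothetical sketch (not the plugin's actual code): if blobs are assigned to accounts by hashing the blob name modulo the account count, then adding or removing one account silently reassigns most existing blobs, so previously written snapshots can no longer be located for restore.

```python
# Hypothetical illustration of the objection, not the plugin's real logic:
# bucketing blobs by hash(name) % number_of_accounts is unstable when the
# account count changes.
import hashlib

def bucket(blob_name, n_accounts):
    # Stable digest of the blob name, mapped onto one of n_accounts buckets.
    digest = hashlib.md5(blob_name.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_accounts

blobs = ["index-%d/snap.dat" % i for i in range(1000)]
# Count how many blobs change account when going from 3 to 4 accounts.
moved = sum(1 for b in blobs if bucket(b, 3) != bucket(b, 4))
print("%d of %d blobs map to a different account after adding one account"
      % (moved, len(blobs)))
```

With a uniform hash, roughly three quarters of the blobs end up in a different bucket after adding a fourth account, which is why the maintainers call the design fragile.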
Hi @clintongormley, thanks for the team's effort in evaluating this PR and for sharing your thoughts. I think this might be the right way to solve the throttling problem in the long run. By the way, the PR also includes a change that improves retries by using an exponential retry policy; see the code line below, for your reference. This should help improve retry performance.
client.getDefaultRequestOptions().setRetryPolicyFactory(new RetryExponentialRetry(1000 * 30, 7));
Thanks, Jeffrey
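For reference, the quoted Java line configures the Azure storage SDK's RetryExponentialRetry with a 30-second base delay (1000 * 30 ms) and 7 attempts. The sketch below is only an illustration of the doubling shape of exponential backoff; the SDK's actual policy also adds randomization and clamps the delay between minimum and maximum bounds.

```python
# Illustrative sketch of exponential backoff, assuming a base delay
# ("deltaBackoff") of 30s and 7 attempts, matching the quoted Java call.
# The real Azure SDK policy randomizes and caps these values.
def backoff_delays(delta_backoff_ms=30_000, max_attempts=7):
    # Delay doubles on each successive retry attempt.
    return [delta_backoff_ms * (2 ** attempt) for attempt in range(max_attempts)]

for attempt, delay in enumerate(backoff_delays(), start=1):
    print("attempt %d: wait %ds before retrying" % (attempt, delay // 1000))
```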
@JeffreyZZ I agree that it could be a nice separate PR to send. Wanna do it? I believe it should be available as a setting though.
The default Elasticsearch Azure snapshot plugin writes all snapshot data to a single Azure storage account. For a big Elasticsearch cluster with multiple TB of data, snapshots can fail because of the storage account's throttling limit. Here is an example of the snapshot failure error:
To address this problem, we extend the Elasticsearch Azure plugin with a feature that supports taking snapshots into, and restoring from, multiple storage accounts, so that no single storage account is overloaded. This way, Elasticsearch can write its snapshot data to multiple storage accounts evenly and in parallel.
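To illustrate the idea of spreading blobs evenly over the configured accounts, here is a minimal, hypothetical sketch (the actual plugin change is in Java; the account names below match the example configuration): each blob name is hashed to pick an account deterministically, so every node chooses the same account for the same blob without extra coordination.

```python
# Hypothetical sketch of deterministic blob-to-account assignment;
# ACCOUNTS mirrors the example elasticsearch.yml configuration below.
import hashlib

ACCOUNTS = ["my_account1", "my_account2", "my_account3"]

def account_for_blob(blob_name, accounts=ACCOUNTS):
    # A stable hash of the blob name, modulo the account count, spreads
    # blobs evenly and always maps a given blob to the same account.
    digest = hashlib.md5(blob_name.encode("utf-8")).hexdigest()
    return accounts[int(digest, 16) % len(accounts)]

print(account_for_blob("index-0/snap-1.dat"))
```

Writes to different accounts can then proceed in parallel, since each blob's destination is independent of all others.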
Here are the configuration and the commands to take and restore a snapshot:
elasticsearch.yml
cloud.azure.storage.my_account1.account: storageaccount1
cloud.azure.storage.my_account1.key: key1
cloud.azure.storage.my_account1.default: true
cloud.azure.storage.my_account2.account: storageaccount2
cloud.azure.storage.my_account2.key: key2
cloud.azure.storage.my_account3.account: storageaccount3
cloud.azure.storage.my_account3.key: key3
Commands
#1: define repository
PUT _snapshot/plugintest160921
{
"type": "azure",
"settings": {
"account": "my_account1,my_account2,my_account3",
"container": "plugintest160921"
}
}
#2: take snapshot
PUT _snapshot/plugintest160921/backup0921?wait_for_completion=true
{
}
#3: restore
POST _snapshot/plugintest160921/backup0921/_restore?wait_for_completion=true
{
"ignore_unavailable": "true",
"include_global_state": false
}
#4: define repository for secondary
PUT _snapshot/plugintest160921
{
"type": "azure",
"settings": {
"account": "my_account1,my_account2,my_account3",
"container": "plugintest160921",
"location_mode": "secondary_only"
}
}