[Remote Translog] High failover durations #7381

Closed
gbbafna opened this issue May 3, 2023 · 2 comments
Assignees
gbbafna
Labels
enhancement (Enhancement or improvement to existing feature or request), Storage:Durability (Issues and PRs related to the durability framework), v2.8.0 (Issues and PRs related to version v2.8.0)

Comments


gbbafna commented May 3, 2023

Is your feature request related to a problem? Please describe.

For trimming the remote translog, we rely on flush as well as remote segment store upload. Since flush happens rarely compared to remote segment store upload (which happens on every refresh), we always end up depending on flush to trim the translog.

This increases the failover time, as the newly promoted primary needs to download a lot of translog files.

Describe the solution you'd like

Don't rely on flush to trim the remote translog. Instead, use the sequence number safely persisted in the remote segment store.

Alternative

Change the behavior of refresh: call flush on every refresh.
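A rough sketch of the first option, for illustration only (the interface and method names below are assumptions, not the actual OpenSearch classes): on each refresh, trim remote translog generations up to the sequence number that the remote segment store has already durably persisted, instead of waiting for a flush.

```java
// Hypothetical sketch of refresh-time translog trimming; names are illustrative.

interface RemoteSegmentTracker {
    // Highest sequence number whose operations are safely captured in
    // segments uploaded to the remote segment store.
    long lastPersistedSequenceNumber();
}

interface RemoteTranslog {
    // Deletes remote translog generations whose operations are all at or
    // below the given sequence number.
    void trimGenerationsUpTo(long seqNo);
}

class RefreshTimeTranslogTrimmer {
    private final RemoteSegmentTracker segmentTracker;
    private final RemoteTranslog remoteTranslog;

    RefreshTimeTranslogTrimmer(RemoteSegmentTracker segmentTracker, RemoteTranslog remoteTranslog) {
        this.segmentTracker = segmentTracker;
        this.remoteTranslog = remoteTranslog;
    }

    // Invoked after a successful refresh + remote segment upload.
    void onRefresh() {
        long persistedSeqNo = segmentTracker.lastPersistedSequenceNumber();
        // Operations up to persistedSeqNo are recoverable from remote segments,
        // so the corresponding translog files are no longer needed for failover
        // and a newly promoted primary does not have to download them.
        remoteTranslog.trimGenerationsUpTo(persistedSeqNo);
    }
}
```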

gbbafna added the enhancement and Storage:Durability labels and removed the untriaged label on May 3, 2023
gbbafna self-assigned this on May 3, 2023
sachinpkale added the v2.8.0 label on May 3, 2023

linuxpi commented May 3, 2023

@gbbafna here is a sample exception. Ideally, we should root-cause why so many translog files were piling up for deletion; it indicates that flush only succeeded after a while.

```
[2023-05-02T08:00:34,355][ERROR][o.o.i.t.t.TranslogTransferManager] [node-1] Exception occurred while deleting translog for primaryTerm=1 files=[trans
java.io.IOException: Failed to delete blobs [[cluster-4/MYce-mgETOGu23bJrxMY7w/11/1/translog-550.ckp, cluster-4/MYce-mgETOGu23bJrxMY7w/11/1/translog-9
        at org.opensearch.repositories.s3.S3BlobContainer.doDeleteBlobs(S3BlobContainer.java:316) ~[?:?]
        at org.opensearch.repositories.s3.S3BlobContainer.deleteBlobsIgnoringIfNotExists(S3BlobContainer.java:250) ~[?:?]
        at org.opensearch.index.translog.transfer.BlobStoreTransferService.deleteBlobs(BlobStoreTransferService.java:80) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.
        at org.opensearch.index.translog.transfer.BlobStoreTransferService.lambda$deleteBlobsAsync$2(BlobStoreTransferService.java:87) [opensearch-3.0.0-SNAP
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNA
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
        at java.lang.Thread.run(Thread.java:1589) [?:?]
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: Slow
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1879) ~[?:?]
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1418) ~[?:?]
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1387) ~[?:?]
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1157) ~[?:?]
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:814) ~[?:?]
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:781) ~[?:?]
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:755) ~[?:?]
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:715) ~[?:?]
        at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:697) ~[?:?]
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:561) ~[?:?]
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:541) ~[?:?] 
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5456) ~[?:?]
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5403) ~[?:?]
        at com.amazonaws.services.s3.AmazonS3Client.deleteObjects(AmazonS3Client.java:2335) ~[?:?]
        at org.opensearch.repositories.s3.S3BlobContainer.lambda$doDeleteBlobs$8(S3BlobContainer.java:285) ~[?:?]
        at org.opensearch.repositories.s3.SocketAccess.lambda$doPrivilegedVoid$0(SocketAccess.java:79) ~[?:?]
        at java.security.AccessController.doPrivileged(AccessController.java:318) ~[?:?]
        at org.opensearch.repositories.s3.SocketAccess.doPrivilegedVoid(SocketAccess.java:78) ~[?:?]
        at org.opensearch.repositories.s3.S3BlobContainer.doDeleteBlobs(S3BlobContainer.java:277) ~[?:?]
        ... 7 more
        Suppressed: com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: Sl
                at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1879) ~[?:?]
                at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1418) ~[?:?]
                at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1387) ~[?:?]
                at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1157) ~[?:?]
                at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:814) ~[?:?]
                at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:781) ~[?:?]
                at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:755) ~[?:?]
                at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:715) ~[?:?]
                at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:697) ~[?:?]
                at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:561) ~[?:?]
                at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:541) ~[?:?]
                at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5456) ~[?:?]
                at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5403) ~[?:?]
                at com.amazonaws.services.s3.AmazonS3Client.deleteObjects(AmazonS3Client.java:2335) ~[?:?]
                at org.opensearch.repositories.s3.S3BlobContainer.lambda$doDeleteBlobs$8(S3BlobContainer.java:285) ~[?:?]
                at org.opensearch.repositories.s3.SocketAccess.lambda$doPrivilegedVoid$0(SocketAccess.java:79) ~[?:?]
                at java.security.AccessController.doPrivileged(AccessController.java:318) ~[?:?]
                at org.opensearch.repositories.s3.SocketAccess.doPrivilegedVoid(SocketAccess.java:78) ~[?:?]
                at org.opensearch.repositories.s3.S3BlobContainer.doDeleteBlobs(S3BlobContainer.java:277) ~[?:?]
                at org.opensearch.repositories.s3.S3BlobContainer.deleteBlobsIgnoringIfNotExists(S3BlobContainer.java:250) ~[?:?]
```


gbbafna commented May 8, 2023

@linuxpi: Yes, this is because we defer translog deletes to flush. With the proposed change, we will clean up the translog on every refresh, alleviating the sudden bursts of calls to the remote store.
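To illustrate why per-refresh cleanup smooths the remote delete traffic, here is a rough model (the numbers below are assumptions, not measured OpenSearch behavior) comparing the two trim policies:

```java
// Rough illustration (assumed numbers): translog generations roll over with
// refreshes, and remote deletes are issued either only at flush (current
// behavior) or on every refresh (proposed behavior).
public class TrimPolicyComparison {
    public static void main(String[] args) {
        int refreshesBetweenFlushes = 200; // assumption: flush is rare vs. refresh
        int filesPerGeneration = 2;        // e.g. a .tlog and a .ckp file per generation

        // Current: deletes deferred to flush arrive as one large burst, which is
        // the kind of spike that can trip the S3 "SlowDown" (503) error above.
        int burstAtFlush = refreshesBetweenFlushes * filesPerGeneration;

        // Proposed: trim on every refresh, so each trim only deletes the
        // generations persisted since the previous refresh.
        int perRefreshDeletes = filesPerGeneration;

        System.out.println("Deletes issued at flush (current): " + burstAtFlush);
        System.out.println("Deletes issued per refresh (proposed): " + perRefreshDeletes);
    }
}
```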
