Peer recovery may time out during recoverLocallyUpToGlobalCheckpoint #93542
Labels
>bug
:Distributed Indexing/Recovery
Anything around constructing a new shard, either from a local or a remote source.
Team:Distributed (Obsolete)
Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.
Comments
DaveCTurner added the >bug and :Distributed Indexing/Recovery labels on Feb 7, 2023
elasticsearchmachine added the Team:Distributed (Obsolete) label on Feb 7, 2023
Pinging @elastic/es-distributed (Team:Distributed)
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue on Feb 7, 2023
We do nontrivial amounts of work before we start a peer recovery, particularly recovering from the local translog up to its global checkpoint. Today the recovery monitor is running during this time, and will (repeatedly) fail the recovery if it takes more than 30 minutes to complete. With this commit we disable the recovery monitor until this local process has completed. Closes elastic#93542
DaveCTurner added a commit that referenced this issue on Feb 7, 2023
We do nontrivial amounts of work before we start a peer recovery, particularly recovering from the local translog up to its global checkpoint. Today the recovery monitor is running during this time, and will (repeatedly) fail the recovery if it takes more than 30 minutes to complete. With this commit we disable the recovery monitor until this local process has completed. Closes #93542
I reverted the PR that closed this due to test failures.
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue on Feb 7, 2023
DaveCTurner added a commit that referenced this issue on Feb 7, 2023
The recoverLocallyUpToGlobalCheckpoint step of peer recovery may take an extended period of time, but today we do not disable the recovery monitor during this step, so if it takes more than 30 minutes (by default) then the recovery will time out, fail, and retry repeatedly. We should disable the recovery monitor during this step.
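To illustrate the failure mode and the proposed fix, here is a minimal sketch of an inactivity monitor that can be suspended while a long local step runs. It is not the Elasticsearch implementation: the class `RecoveryMonitorSketch`, the `Releasable` interface, the methods `recordActivity`, `hasTimedOut` and `disable`, and the 45-minute duration of the local step are all hypothetical names and values chosen for illustration. Only the 30-minute default timeout comes from the description above.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

/** Hypothetical stand-in for a recovery activity monitor; not the real Elasticsearch class. */
final class RecoveryMonitorSketch {

    /** Handle that re-enables the monitor when closed (assumed name). */
    interface Releasable extends AutoCloseable {
        @Override
        void close();
    }

    private final LongSupplier clockMillis;
    private final long timeoutMillis;
    private long lastActivityMillis;
    private boolean enabled = true;

    RecoveryMonitorSketch(LongSupplier clockMillis, long timeoutMillis) {
        this.clockMillis = clockMillis;
        this.timeoutMillis = timeoutMillis;
        this.lastActivityMillis = clockMillis.getAsLong();
    }

    /** Records progress, which restarts the inactivity window. */
    synchronized void recordActivity() {
        lastActivityMillis = clockMillis.getAsLong();
    }

    /** True if the monitor is enabled and no activity was seen within the timeout. */
    synchronized boolean hasTimedOut() {
        return enabled && clockMillis.getAsLong() - lastActivityMillis > timeoutMillis;
    }

    /** Suspends the monitor; closing the returned handle re-enables it with a fresh window. */
    synchronized Releasable disable() {
        enabled = false;
        return () -> {
            synchronized (this) {
                enabled = true;
                lastActivityMillis = clockMillis.getAsLong();
            }
        };
    }

    /** Simulates a 45-minute local step and reports whether the monitor would fail the recovery. */
    static boolean timesOutDuringLocalStep(boolean disableDuringLocalStep) {
        long[] now = {0}; // fake clock so the example is deterministic
        RecoveryMonitorSketch monitor =
            new RecoveryMonitorSketch(() -> now[0], TimeUnit.MINUTES.toMillis(30));
        Releasable suspended = disableDuringLocalStep ? monitor.disable() : () -> {};
        try {
            now[0] += TimeUnit.MINUTES.toMillis(45); // the local step reports no activity
            return monitor.hasTimedOut();
        } finally {
            suspended.close();
        }
    }

    public static void main(String[] args) {
        System.out.println("monitor left running: timed out = " + timesOutDuringLocalStep(false));
        System.out.println("monitor disabled:     timed out = " + timesOutDuringLocalStep(true));
    }
}
```

In this sketch, re-enabling the monitor also resets the activity timestamp, so the time spent in the local step is not counted against the remote phase that follows. Whether the real fix behaves the same way is an assumption.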