Peer recovery may time out during recoverLocallyUpToGlobalCheckpoint #93542

Closed
DaveCTurner opened this issue Feb 7, 2023 · 2 comments · Fixed by #93543 or #93551
Labels
>bug
:Distributed Indexing/Recovery - Anything around constructing a new shard, either from a local or a remote source.
Team:Distributed (Obsolete) - Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.

Comments

@DaveCTurner
Contributor

The recoverLocallyUpToGlobalCheckpoint step of peer recovery may take an extended period of time, but today we do not disable the recovery monitor during this step. If the step takes more than 30 minutes (by default), the recovery will time out, fail, and retry repeatedly. We should disable the recovery monitor during this step.
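
To illustrate the idea, here is a minimal sketch of an inactivity watchdog that can be switched off while a long local step runs. It is not the Elasticsearch recovery monitor: the class `RecoveryWatchdog`, its methods, and the injected minute-based clock are all hypothetical names chosen for this example. Only the 30-minute default and the "disable during the local step" behaviour come from the description above.

```java
import java.util.function.LongSupplier;

/** Hypothetical inactivity watchdog; NOT the actual Elasticsearch recovery monitor. */
final class RecoveryWatchdog {
    private final long timeoutMinutes;
    private final LongSupplier clockMinutes; // injectable clock (assumption, for testability)
    private long lastActivity;
    private boolean enabled = true;

    RecoveryWatchdog(long timeoutMinutes, LongSupplier clockMinutes) {
        this.timeoutMinutes = timeoutMinutes;
        this.clockMinutes = clockMinutes;
        this.lastActivity = clockMinutes.getAsLong();
    }

    /** Record progress, e.g. when data arrives from the source node. */
    synchronized void markActive() {
        lastActivity = clockMinutes.getAsLong();
    }

    /** True only if the watchdog is enabled and no activity was seen within the timeout. */
    synchronized boolean hasTimedOut() {
        return enabled && clockMinutes.getAsLong() - lastActivity > timeoutMinutes;
    }

    /** Run a long local step with the watchdog disabled, then restart the inactivity clock. */
    void runUnmonitored(Runnable localStep) {
        synchronized (this) {
            enabled = false;
        }
        try {
            localStep.run();
        } finally {
            synchronized (this) {
                enabled = true;
                lastActivity = clockMinutes.getAsLong();
            }
        }
    }

    public static void main(String[] args) {
        long[] now = {0}; // simulated time in minutes
        RecoveryWatchdog watchdog = new RecoveryWatchdog(30, () -> now[0]);

        // Simulate a local translog replay that takes 45 minutes.
        watchdog.runUnmonitored(() -> now[0] += 45);
        System.out.println("timed out after local step: " + watchdog.hasTimedOut());

        // Once monitoring resumes, a genuine stall is still detected.
        now[0] += 31;
        System.out.println("timed out after 31 idle minutes: " + watchdog.hasTimedOut());
    }
}
```

The design choice in this sketch is to reset the last-activity timestamp when monitoring resumes, so that time spent in the unmonitored step does not count against the following remote phase.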

@DaveCTurner DaveCTurner added the >bug and :Distributed Indexing/Recovery labels Feb 7, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Distributed (Obsolete) label Feb 7, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Feb 7, 2023
We do nontrivial amounts of work before we start a peer recovery,
particularly recovering from the local translog up to its global
checkpoint. Today the recovery monitor is running during this time, and
will (repeatedly) fail the recovery if it takes more than 30 minutes to
complete. With this commit we disable the recovery monitor until this
local process has completed.

Closes elastic#93542
DaveCTurner added a commit that referenced this issue Feb 7, 2023
@DaveCTurner DaveCTurner reopened this Feb 7, 2023
@DaveCTurner
Contributor Author

I reverted the PR that closed this due to test failures.

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Feb 7, 2023
DaveCTurner added a commit that referenced this issue Feb 7, 2023