Peer recovery may time out during recoverLocallyUpToGlobalCheckpoint #93542

DaveCTurner · 2023-02-07T08:23:47Z

The recoverLocallyUpToGlobalCheckpoint step of peer recovery may take a long time, but today we do not disable the recovery monitor during this step. If the step takes more than 30 minutes (by default), the recovery times out, fails, and retries repeatedly. We should disable the recovery monitor during this step.


elasticsearchmachine · 2023-02-07T08:24:11Z

Pinging @elastic/es-distributed (Team:Distributed)

We do nontrivial amounts of work before we start a peer recovery, particularly recovering from the local translog up to its global checkpoint. Today the recovery monitor is running during this time, and will (repeatedly) fail the recovery if it takes more than 30 minutes to complete. With this commit we disable the recovery monitor until this local process has completed. Closes elastic#93542
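To illustrate the idea, here is a minimal sketch of an inactivity monitor that can be suspended while a long-running local step executes. It is not the Elasticsearch implementation: the class `RecoveryMonitorSketch`, the `Suspension` handle, and the methods `suspend()` and `hasTimedOut()` are hypothetical names invented for this example, and the clock is injected so the behaviour can be shown without waiting. Only the 30-minute default comes from the issue text.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical stand-in for a recovery inactivity monitor; not the Elasticsearch class. */
final class RecoveryMonitorSketch {

    /** Handle that re-enables the monitor when closed (hypothetical helper type). */
    interface Suspension extends AutoCloseable {
        @Override
        void close(); // narrowed so callers need not handle a checked exception
    }

    private final long timeoutMillis;
    private final LongSupplier clock;
    private long lastActivityMillis;
    private boolean enabled = true;

    RecoveryMonitorSketch(long timeoutMillis, LongSupplier clock) {
        this.timeoutMillis = timeoutMillis;
        this.clock = clock;
        this.lastActivityMillis = clock.getAsLong();
    }

    /** Stops timeout checks until the returned handle is closed. */
    synchronized Suspension suspend() {
        enabled = false;
        return () -> {
            synchronized (this) {
                // Restart the inactivity window so the suspended time is not counted.
                lastActivityMillis = clock.getAsLong();
                enabled = true;
            }
        };
    }

    /** True if the monitor is enabled and no activity was seen within the timeout. */
    synchronized boolean hasTimedOut() {
        return enabled && clock.getAsLong() - lastActivityMillis > timeoutMillis;
    }

    public static void main(String[] args) {
        AtomicLong now = new AtomicLong(0); // fake clock, in milliseconds
        long thirtyMinutes = 30 * 60 * 1000L; // default timeout mentioned in the issue
        RecoveryMonitorSketch monitor = new RecoveryMonitorSketch(thirtyMinutes, now::get);

        try (Suspension ignored = monitor.suspend()) {
            now.addAndGet(45 * 60 * 1000L); // simulate a 45-minute local translog replay
            System.out.println("timed out during local step: " + monitor.hasTimedOut());
        }
        System.out.println("timed out after local step: " + monitor.hasTimedOut());
    }
}
```

With the monitor suspended, the simulated 45-minute local step does not trip the timeout, and the inactivity window restarts once the step completes. Without the `suspend()` call, the same 45-minute step would exceed the 30-minute window and be reported as timed out.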

DaveCTurner · 2023-02-07T11:11:49Z

I reverted the PR that closed this due to test failures.

We do nontrivial amounts of work before we start a peer recovery, particularly recovering from the local translog up to its global checkpoint. Today the recovery monitor is running during this time, and will (repeatedly) fail the recovery if it takes more than 30 minutes to complete. With this commit we disable the recovery monitor until this local process has completed. Closes elastic#93542

DaveCTurner added the >bug and :Distributed Indexing/Recovery labels Feb 7, 2023

elasticsearchmachine added the Team:Distributed (Obsolete) label Feb 7, 2023

DaveCTurner mentioned this issue Feb 7, 2023

Disable recovery monitor before recovery start #93543

Merged

DaveCTurner mentioned this issue Feb 7, 2023

Do we need the RecoveryMonitor? #93544

Open

DaveCTurner closed this as completed in #93543 Feb 7, 2023

DaveCTurner reopened this Feb 7, 2023

DaveCTurner mentioned this issue Feb 7, 2023

Disable recovery monitor before recovery start #93551

Merged

DaveCTurner closed this as completed in #93551 Feb 7, 2023
