Add a configurable cooldown period between SLM operations #47520
Pinging @elastic/es-core-features (:Core/Features/ILM)
This is a special kind of thread pool executor that runs operations in a single-threaded manner, but with a configurable cooldown in between. The executor is always forced to have a single fixed thread with a configurable queue size. This special executor is not currently used, but is part of work for elastic#47520
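As a rough sketch of what such an executor could look like (an illustration under assumed names, not the actual implementation from #47522): a single worker thread drains a bounded queue and sleeps for the cooldown after each task, so the next queued operation cannot start before the cooldown has elapsed.

```java
import java.util.concurrent.*;

/**
 * Minimal sketch: a single-threaded executor that enforces a fixed cooldown
 * between the completion of one task and the start of the next.
 * Hypothetical names; not the code from #47522.
 */
public class CooldownExecutor {
    private final long cooldownMillis;
    private final ThreadPoolExecutor delegate;

    public CooldownExecutor(long cooldownMillis, int queueSize) {
        this.cooldownMillis = cooldownMillis;
        // single fixed thread with a bounded queue, as described above
        this.delegate = new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(queueSize));
    }

    public Future<?> submit(Runnable task) {
        return delegate.submit(() -> {
            try {
                task.run();
            } finally {
                try {
                    // block the single worker thread so the next queued task
                    // cannot start until the cooldown has elapsed
                    Thread.sleep(cooldownMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
    }

    public void shutdown() {
        delegate.shutdown();
    }
}
```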
This doesn't feel right to me, that our solution to "eventual consistency" problems is to wait a little bit, while we still have no guarantee that however long we wait will make what we're waiting on visible. I think @original-brownbear has been working on making us less susceptible to eventual consistency issues, so I really want to challenge the necessity of this. Since he's spent far more time thinking about this, I would solicit his input here.
@dakrone @jasontedor yea, I talked about this with @ywelsch this morning and I gotta admit I'm a little at fault for suggesting we move ahead with this. The situation is this: as it stands now we are unfortunately still susceptible to the S3 eventual consistency issue. The impact of it has been reduced step by step over the last year though. #46250 will reduce the impact to a single operation during snapshotting only, and a follow-up should be able to fix the eventual consistency issue for good.
That's true. Unfortunately though, currently waiting is what we do in operations to work around these issues. In practice the wait we're using turns out to be safe across a very large sample size. => I think what this comes down to is:
It would be my preference that we wait another minor release for the eventual consistency fixes to land rather than hack around this and build infrastructure (#47522) that will otherwise be unused.
I agree, I'd rather wait until this is handled on the snapshot/restore side rather than with infrastructure in SLM. It is actually not very simple to add to SLM; the dedicated #47522 threadpool is the "simplest" way I could think of to address it, and I'd be more than happy not to have to add it.
Given that we have agreed (I think?) to address this through consistency fixes in snapshot/delete itself, I'm going to close this as well. |
This is intended as a stop-gap solution/improvement to #38941 that prevents repo modifications without an intermittent master failover from causing inconsistent (outdated due to inconsistent listing of index-N blobs) `RepositoryData` to be written. Tracking the latest repository generation will move to the cluster state in a separate pull request. It is intended as a low-risk change to be backported as far as possible, motivated by the recently increased chance of #38941 causing trouble via SLM (see #47520). Closes #47834 Closes #49048
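As a rough illustration of the generation-tracking idea described above (class and method names here are hypothetical, not the actual `BlobStoreRepository` code): the repository remembers the highest generation it has observed and refuses to write `RepositoryData` based on an older one, instead of silently overwriting newer repository data.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Sketch of "track the latest known repository generation": a write based on
 * a generation older than the latest observed one fails fast rather than
 * producing outdated RepositoryData. Hypothetical names, for illustration only.
 */
class RepositoryGenerationTracker {
    private final AtomicLong latestKnownGeneration = new AtomicLong(-1L);

    /** Record a generation observed while listing index-N blobs. */
    void onGenerationSeen(long generation) {
        latestKnownGeneration.accumulateAndGet(generation, Math::max);
    }

    /** Fail fast if a write is attempted against a stale generation. */
    void ensureSafeToWrite(long expectedGeneration) {
        long latest = latestKnownGeneration.get();
        if (expectedGeneration < latest) {
            throw new IllegalStateException("expected generation [" + expectedGeneration
                + "] is older than the latest known generation [" + latest + "]");
        }
    }
}
```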
For some of our repository types (particularly S3), there can sometimes be issues due to the eventually consistent visibility of files that have been uploaded.
To handle this from SLM, we should add a configurable "cooldown" period in between operations (snapshot and delete).
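A rough sketch of what such a setting could look like, assuming Elasticsearch's `Setting#timeSetting` helper; the setting key, default value, and scope are assumptions for illustration, not the final design:

```java
import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.common.settings.Setting.Property;
import org.elasticsearch.common.unit.TimeValue;

/**
 * Hypothetical SLM cooldown setting; name and default are assumptions
 * used only to illustrate the proposal.
 */
public class SlmCooldownSettings {
    public static final Setting<TimeValue> COOLDOWN_PERIOD = Setting.timeSetting(
        "slm.operation_cooldown_period",   // hypothetical setting key
        TimeValue.timeValueMinutes(3),     // hypothetical default
        Property.NodeScope,
        Property.Dynamic);
}
```

The cooldown would then be applied between consecutive SLM operations (snapshot and delete) against the same repository.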