Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a configurable cooldown period between SLM operations #47520

Closed
dakrone opened this issue Oct 3, 2019 · 6 comments
Closed

Add a configurable cooldown period between SLM operations #47520

dakrone opened this issue Oct 3, 2019 · 6 comments
Assignees
Labels
:Data Management/ILM+SLM Index and Snapshot lifecycle management >enhancement

Comments

@dakrone
Copy link
Member

dakrone commented Oct 3, 2019

For some of our repository types (particularly S3), there can sometimes be issues due to eventual visibility of files that have been uploaded.

To handle this from SLM, we should add a configurable "cooldown" period in between operations (snapshot and delete).

@dakrone dakrone added >enhancement :Data Management/ILM+SLM Index and Snapshot lifecycle management labels Oct 3, 2019
@dakrone dakrone self-assigned this Oct 3, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-features (:Core/Features/ILM)

dakrone added a commit to dakrone/elasticsearch that referenced this issue Oct 3, 2019
This is a special kind of thread pool executor that runs operations in a
single-threaded manner, but with a configurable cooldown in between. The
executor is always forced to have a single fixed thread with a
configurable queue size.

This special executor is not currently used, but is part of work for elastic#47520
@jasontedor
Copy link
Member

This doesn't feel right to me, that our solution to "eventual consistency" problems is to wait a little bit, while we will still have no guarantees that how long we wait will make what we're waiting on visible. I think @original-brownbear has been working so that we aren't as susceptible to eventual consistency issues, so I really want to challenge the necessity of this. Since he's spent far more time thinking about this, I would solicit his input here.

@original-brownbear
Copy link
Member

@dakrone @jasontedor yea, I talked about this with @ywelsch this morning and I gotta admit I'm a little at fault for suggesting we move ahead with this.

The situation is this:

As it stands now we are still succeptible to S3 eventual consistency issue unfortunately. The impact of it has been reduced step by step over the last year though. #46250 will reduce the impact to a single operation during snapshotting only and a follow up should be able to fix the eventual consistency issue for good.
My assumption when suggesting we move the wait from operations into ES/SLM was that:

  1. It would be very simple to add that wait to SLM
  2. We'd be able to remove it transparently once it has become unnecessary

we will still have no guarantees that how long we wait will make what we're waiting on visible

That's true. Unfortunately though, currently waiting is what we do in operations to work around these issues. In practice the wait we're using turns out to be safe across a very large sample size.

=> I think what this comes down to is:
If SLM will schedule operations (create/delete) on the repo in rapid succession that could be a problem with S3 repositories. Do we want to fix this problem for users or can we wait another minor for the eventual consistency fixes to land?

@jasontedor
Copy link
Member

It would be my preference that we wait another minor release for the eventual consistency fixes to land than we hack around this, and building infrastructure that will otherwise be unused (#47522) for this.

@dakrone
Copy link
Member Author

dakrone commented Oct 4, 2019

I agree I'd rather wait until this was handled at the snapshot/restore side rather than with infra in SLM. It is actually not very simple to add to SLM, the dedicated #47522 threadpool is the "simplest" way I could think of to address it, I'd be more than happy not to have to add it.

@dakrone
Copy link
Member Author

dakrone commented Oct 4, 2019

Given that we have agreed (I think?) to address this through consistency fixes in snapshot/delete itself, I'm going to close this as well.

@dakrone dakrone closed this as completed Oct 4, 2019
original-brownbear added a commit that referenced this issue Nov 14, 2019
This is intended as a stop-gap solution/improvement to #38941 that
prevents repo modifications without an intermittent master failover
from causing inconsistent (outdated due to inconsistent listing of index-N blobs)
`RepositoryData` to be written.

Tracking the latest repository generation will move to the cluster state in a
separate pull request. This is intended as a low-risk change to be backported as
far as possible and motived by the recently increased chance of #38941 
causing trouble via SLM (see #47520).

Closes #47834
Closes #49048
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Nov 14, 2019
This is intended as a stop-gap solution/improvement to elastic#38941 that
prevents repo modifications without an intermittent master failover
from causing inconsistent (outdated due to inconsistent listing of index-N blobs)
`RepositoryData` to be written.

Tracking the latest repository generation will move to the cluster state in a
separate pull request. This is intended as a low-risk change to be backported as
far as possible and motived by the recently increased chance of elastic#38941
causing trouble via SLM (see elastic#47520).

Closes elastic#47834
Closes elastic#49048
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Nov 14, 2019
This is intended as a stop-gap solution/improvement to elastic#38941 that
prevents repo modifications without an intermittent master failover
from causing inconsistent (outdated due to inconsistent listing of index-N blobs)
`RepositoryData` to be written.

Tracking the latest repository generation will move to the cluster state in a
separate pull request. This is intended as a low-risk change to be backported as
far as possible and motived by the recently increased chance of elastic#38941
causing trouble via SLM (see elastic#47520).

Closes elastic#47834
Closes elastic#49048
original-brownbear added a commit that referenced this issue Nov 15, 2019
This is intended as a stop-gap solution/improvement to #38941 that
prevents repo modifications without an intermittent master failover
from causing inconsistent (outdated due to inconsistent listing of index-N blobs)
`RepositoryData` to be written.

Tracking the latest repository generation will move to the cluster state in a
separate pull request. This is intended as a low-risk change to be backported as
far as possible and motived by the recently increased chance of #38941
causing trouble via SLM (see #47520).

Closes #47834
Closes #49048
original-brownbear added a commit that referenced this issue Nov 15, 2019
This is intended as a stop-gap solution/improvement to #38941 that
prevents repo modifications without an intermittent master failover
from causing inconsistent (outdated due to inconsistent listing of index-N blobs)
`RepositoryData` to be written.

Tracking the latest repository generation will move to the cluster state in a
separate pull request. This is intended as a low-risk change to be backported as
far as possible and motived by the recently increased chance of #38941
causing trouble via SLM (see #47520).

Closes #47834
Closes #49048
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/ILM+SLM Index and Snapshot lifecycle management >enhancement
Projects
None yet
Development

No branches or pull requests

4 participants