Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add local session timeouts to leader node #37438

Merged
merged 10 commits into from
Jan 18, 2019

Conversation

Tim-Brooks
Copy link
Contributor

This is related to #35975. This commit adds timeout functionality to
the local session on a leader node. When a session is started, a timeout
is scheduled using a repeatable runnable. If the session is not accessed
in between two runs the session is closed. When the sssion is closed,
the repeating task is cancelled.

Additionally, this commit moves session uuid generation to the leader
cluster. And renames the PutCcrRestoreSessionRequest to
StartCcrRestoreSessionRequest to reflect that change.

This is related to elastic#35975. This commit adds timeout functionality to
the local session on a leader node. When a session is started, a timeout
is scheduled using a repeatable runnable. If the session is not accessed
in between two runs the session is closed. When the sssion is closed,
the repeating task is cancelled.
@Tim-Brooks Tim-Brooks added >non-issue v7.0.0 :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features v6.7.0 labels Jan 14, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@Tim-Brooks
Copy link
Contributor Author

This is POC for local leader session timeouts. It takes an approach of an "idle" timeout. Which means that the session will stay alive as long it is actively in use in between the timeout periods. Another alternative would be to have a total restore timeout on the leader side.

Second, is the question about configurability. Do we want to introduce a setting for this and what do we want the setting to be labelled as? If we going with total restore timeout I think that we definitely want a configurable setting.

Thoughts?

@ywelsch
Copy link
Contributor

ywelsch commented Jan 15, 2019

The "idle" timeout approach makes sense to me (it's the one we use for peer recovery as well).

Second, is the question about configurability. Do we want to introduce a setting for this and what do we want the setting to be labelled as?

yes. For recovery we have indices.recovery.recovery_activity_timeout. I wonder if we need a special setting for CCR or whether we can just reuse the same setting. This leaves the question of default value for this, though. For peer recoveries, the value of this setting is 30 minutes by default, which is way too high imo, and is probably only explained by the fact that peer recoveries used to work quite differently in the ES 2.x times, namely that the node itself (and not the master) was responsible for scheduling recoveries and making sure that not too many recoveries run at once, which required a larger timeout here because a recovery might need to wait for other recoveries to finish first before it got to execute.

Additionally, this commit moves session uuid generation to the leader cluster. And renames the PutCcrRestoreSessionRequest to StartCcrRestoreSessionRequest to reflect that change.

Can you undo that change, as it's unrelated to this PR, and I'm not even sure we should be doing that change.

@bleskes
Copy link
Contributor

bleskes commented Jan 15, 2019

+1 to the same handling as recoveries. i.e., - activity timeouts + settings.

@jasontedor
Copy link
Member

+1


public CcrRestoreSourceService(Settings settings) {
public CcrRestoreSourceService(Settings settings, ThreadPool threadPool) {
this(settings, threadPool, RecoverySettings.INDICES_RECOVERY_ACTIVITY_TIMEOUT_SETTING.get(settings));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might not have been clear from my initial comment, but I would prefer to go with a separate setting for now (e.g. ccr.recovery.recovery_activity_timeout), as the default on INDICES_RECOVERY_ACTIVITY_TIMEOUT_SETTING is absurdly high, and I want to investigate lowering this timeout in 7.0. We can still decide to leave the new setting undocumented for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a new setting


public CcrRestoreSourceService(Settings settings) {
public CcrRestoreSourceService(Settings settings, ThreadPool threadPool) {
this(settings, threadPool, RecoverySettings.INDICES_RECOVERY_ACTIVITY_TIMEOUT_SETTING.get(settings));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the new setting should also be dynamic, just as INDICES_RECOVERY_ACTIVITY_TIMEOUT_SETTING, and would best live in CCRSettings, following the same model as I've explained in #37449

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hooked it up to CcrSettings

@Tim-Brooks Tim-Brooks requested a review from ywelsch January 18, 2019 02:04
@Tim-Brooks
Copy link
Contributor Author

@ywelsch I made your suggested changes and merged in the CcrSettings work from rate limiting.

restoreSourceService = new CcrRestoreSourceService();
Settings settings = Settings.builder().put(NODE_NAME_SETTING.getKey(), "node").build();
taskQueue = new DeterministicTaskQueue(settings, random());
Set<Setting<?>> registeredSettings = new HashSet<>(Arrays.asList(CcrSettings.INDICES_RECOVERY_ACTIVITY_TIMEOUT_SETTING,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have a convenience method Sets.newHashSet(...) (does not save that many chars, just so you're aware)

assertFalse(taskQueue.hasDeferredTasks());

try (CcrRestoreSourceService.SessionReader reader = restoreSourceService.getSessionReader(sessionUUID)) {
fail("Should have timed out.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be written more concisely with expectThrows. Can you also assert something on the exception message?

@Tim-Brooks
Copy link
Contributor Author

run gradle build tests 1

1 similar comment
@Tim-Brooks
Copy link
Contributor Author

run gradle build tests 1

@Tim-Brooks
Copy link
Contributor Author

run gradle build tests 1

@Tim-Brooks Tim-Brooks merged commit cd41289 into elastic:master Jan 18, 2019
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jan 20, 2019
* elastic/master:
  Remove Watcher Account "unsecure" settings (elastic#36736)
  Add cache cleaning task for ML snapshot (elastic#37505)
  Update jdk used by the docker builds (elastic#37621)
  Remove an unused constant in PutMappingRequest.
  Update get users to allow unknown fields (elastic#37593)
  Do not add index event listener if CCR disabled (elastic#37432)
  Add local session timeouts to leader node (elastic#37438)
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jan 21, 2019
* elastic/master: (104 commits)
  Permission for restricted indices (elastic#37577)
  Remove Watcher Account "unsecure" settings (elastic#36736)
  Add cache cleaning task for ML snapshot (elastic#37505)
  Update jdk used by the docker builds (elastic#37621)
  Remove an unused constant in PutMappingRequest.
  Update get users to allow unknown fields (elastic#37593)
  Do not add index event listener if CCR disabled (elastic#37432)
  Add local session timeouts to leader node (elastic#37438)
  Add some deprecation optimizations (elastic#37597)
  refactor inner geogrid classes to own class files (elastic#37596)
  Remove obsolete deprecation checks (elastic#37510)
  ML: Add support for single bucket aggs in Datafeeds (elastic#37544)
  ML: creating ML State write alias and pointing writes there (elastic#37483)
  Deprecate types in the put mapping API. (elastic#37280)
  [ILM] Add unfollow action (elastic#36970)
  Packaging: Update marker used to allow ELASTIC_PASSWORD (elastic#37243)
  Fix setting openldap realm ssl config
  Document the need for JAVA11_HOME (elastic#37589)
  SQL: fix object extraction from sources (elastic#37502)
  Nit in settings.gradle for Eclipse
  ...
Tim-Brooks added a commit to Tim-Brooks/elasticsearch that referenced this pull request Jan 22, 2019
This is related to elastic#35975. This commit adds timeout functionality to
the local session on a leader node. When a session is started, a timeout
is scheduled using a repeatable runnable. If the session is not accessed
in between two runs the session is closed. When the sssion is closed,
the repeating task is cancelled.

Additionally, this commit moves session uuid generation to the leader
cluster. And renames the PutCcrRestoreSessionRequest to
StartCcrRestoreSessionRequest to reflect that change.
@Tim-Brooks Tim-Brooks deleted the add_local_leader_timeouts branch December 18, 2019 14:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/CCR Issues around the Cross Cluster State Replication features >non-issue v6.7.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants