Use system context for cluster state update tasks #31241

ywelsch · 2018-06-11T11:45:46Z

This PR makes it so that cluster state update tasks always run under the system context, only restoring the original context when the listener that was provided with the task is called. A notable exception is the clusterStatePublished(...) callback which will still run under system context, because it's defined on the executor-level, and not the task level, and only called once for the combined batch of tasks and can therefore not be uniquely identified with a task / thread context. I've looked at usages of this, and adapted the two problematic ones to use clusterStateProcessed instead.

Relates #30603

elasticmachine · 2018-06-11T11:45:47Z

Pinging @elastic/es-distributed

bleskes

LGTM. I left one comment about a potential other solution, to see if we like it better.

bleskes · 2018-06-11T12:03:51Z

server/src/main/java/org/elasticsearch/cluster/service/MasterService.java

+        final ThreadContext threadContext = threadPool.getThreadContext();
+        final Supplier<ThreadContext.StoredContext> supplier = threadContext.newRestorableContext(false);
+        try (ThreadContext.StoredContext ignore = threadContext.stashContext()) {
+            threadContext.markAsSystemContext();


instead of doing this, I wonder if we should invest in a SystemThreadContext that we give to the executor and it does is under the system context. Alternatively we can have a "no context" executor that simply doesn't set any context. @tvernum do you see any other usage for this?

I'm not sure if I understand this correctly. Would you have different executors based on whether they switch to the system context or stay in default context? Looking at the other usages of threadContext.markAsSystemContext();, all of them seem to use executors that cannot be categorized as system or non-system (i.e. generic threadpool, ...).

my suggestion referred to the entire pattern of generating a new context and then marking it as system and deal with on the executor level. This has two potential upsides:

You don't need to remember to do this everywhere you submit a task (not so important here as there's just one place).

We don't do all the "clean & capture and then restore in the thread pool" song and dance.

Since this is not a high throughput usage and we only have one place we submit tasks it doesn't have such a big impact on MasterService but might be useful for other things that use an internal thread pool executor.

bleskes · 2018-06-11T12:06:26Z

x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/job/JobManager.java

@@ -346,8 +346,8 @@ public void onFailure(String source, Exception e) {
                }

                @Override
-                public void clusterStatePublished(ClusterChangedEvent clusterChangedEvent) {


I wonder if ClusterStateUpdateTask should have a final empty method for this? I think it's always the wrong one in that contex.t

I like this idea, I've pushed 1b7eeb4 to implement this.

ywelsch · 2018-06-11T13:36:35Z

The doc tests (:docs:integTestRunner) are currently failing on this, as there's an assertion in RemoteClusterConnection that checks that SniffClusterStateResponseHandler.handleResponse is not called in a system context. I'm not sure why that assertion is there, or what's triggering the failure. Maybe @s1monw can help with this?

Finally, I'm not sure what the implications of this PR are for cluster state publication, where we cross node boundaries (which does not seem to transfer the system context, only reestablish it because of internal action?)

ywelsch · 2018-06-13T07:40:50Z

I've removed the assertion in RemoteClusterConnection, which no longer holds. The reason is that forkConnect is called from the cluster applier thread, which now runs under system context.
To validate this, I added an assertion in 7b92bdb and got the following stack trace:

[2018-06-13T09:35:44,461][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [node-0] fatal error in thread [elasticsearch[node-0][clusterApplierService#updateTask][T#1]], exiting
java.lang.AssertionError: context is a system context
        at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.forkConnect(RemoteClusterConnection.java:377) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.connect(RemoteClusterConnection.java:371) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.connect(RemoteClusterConnection.java:335) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.transport.RemoteClusterConnection.updateSeedNodes(RemoteClusterConnection.java:135) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.transport.RemoteClusterService.updateRemoteClusters(RemoteClusterService.java:148) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.transport.RemoteClusterService.updateRemoteCluster(RemoteClusterService.java:327) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.transport.RemoteClusterService.updateRemoteCluster(RemoteClusterService.java:314) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.common.settings.Setting$AffixSetting$1.lambda$getValue$0(Setting.java:626) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.common.settings.Setting$Updater.apply(Setting.java:957) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.common.settings.Setting$AffixSetting$1.apply(Setting.java:641) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.common.settings.Setting$AffixSetting$1.apply(Setting.java:611) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.common.settings.AbstractScopedSettings$SettingUpdater.lambda$updater$0(AbstractScopedSettings.java:410) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.common.settings.AbstractScopedSettings.applySettings(AbstractScopedSettings.java:170) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:472) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:430) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:160) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:625) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
        at java.lang.Thread.run(Thread.java:844) [?:?]

…-task-under-system-context

s1monw · 2018-06-13T12:55:22Z

@ywelsch I think you can remove the assertion it's not valid anymore. The new assertion also isn't

ywelsch · 2018-06-13T12:58:34Z

@s1monw yes, thanks, I only added that extra assertion to see what was causing the existing assertion to trip. I have removed those assertions in faf152c.

…-task-under-system-context

ywelsch · 2018-06-14T13:19:13Z

After removing the assertions in RemoteClusterConnection, I ran into an ML test failure (testLookbackWithoutPermissions). @davidkyle has looked into this and pushed a fix to this PR: 8b8e3e3. ML was previously relying on the behavior that the cluster state update task (the execute method) was run as the user that submitted the task (as it stored the headers of the originating request into the cluster state, in order to run the datafeeds with the permission of the last editing user). This was changed by this PR and the fix is therefore to explicitly pass the desired information to the task (see 8b8e3e3).

…-task-under-system-context

tvernum

LGTM

ywelsch · 2018-06-18T14:46:51Z

Thanks @bleskes @tvernum @davidkyle @s1monw

This commit makes it so that cluster state update tasks always run under the system context, only restoring the original context when the listener that was provided with the task is called. A notable exception is the clusterStatePublished(...) callback which will still run under system context, because it's defined on the executor-level, and not the task level, and only called once for the combined batch of tasks and can therefore not be uniquely identified with a task / thread context. Relates #30603

Relates to elastic#31408 and elastic#31241

* 6.x: Add get stored script and delete stored script to high level REST API Increasing skip version for failing test on 6.x Skip get_alias tests for 5.x (#31397) Fix defaults in GeoShapeFieldMapper output (#31302) Test: better error message on failure Mute DefaultShardsIT#testDefaultShards test Fix reference to XContentBuilder.string() (#31337) [DOCS] Adds monitoring breaking change (#31369) [DOCS] Adds security breaking change (#31375) [DOCS] Backports breaking change (#31373) RestAPI: Reject forcemerge requests with a body (#30792) Docs: Use the default distribution to test docs (#31251) Use system context for cluster state update tasks (#31241) [DOCS] Adds testing for security APIs (#31345) [DOCS] Removes ML item from release highlights [DOCS] Removes breaking change (#31376) REST high-level client: add validate query API (#31077) Move language analyzers from server to analysis-common module. (#31300) Expose lucene's RemoveDuplicatesTokenFilter (#31275) [Test] Fix :example-plugins:rest-handler on Windows Delete typos in SAML docs (#31199) Ensure we don't use a remote profile if cluster name matches (#31331) Test: Skip alias tests that failed all weekend [DOCS] Fix version in SQL JDBC Maven template [DOCS] Improve install and setup section for SQL JDBC Add ingest-attachment support for per document `indexed_chars` limit (#31352) SQL: Fix rest endpoint names in node stats (#31371) [DOCS] Fixes small issue in release notes Support for remote path in reindex api Closes #22913 [ML] Put ML filter API response should contain the filter (#31362) Remove trial status info from start trial doc (#31365) [DOCS] Added links in breaking changes pages [DOCS] Adds links to release notes and highlights Docs: Document changes in rest client QA: Fix tribe tests to use node selector REST Client: NodeSelector for node attributes (#31296) LLClient: Fix assertion on windows LLClient: Support host selection (#30523) Add QA project and fixture based test for discovery-ec2 plugin (#31107) [ML] Hold ML filter items in sorted set (#31338) [ML] Add description to ML filters (#31330)

* master: Add get stored script and delete stored script to high level REST API - post backport fix Add get stored script and delete stored script to high level REST API (#31355) Core: Combine Action and GenericAction (#31405) Fix reference to XContentBuilder.string() (#31337) Avoid sending duplicate remote failed shard requests (#31313) Fix defaults in GeoShapeFieldMapper output (#31302) RestAPI: Reject forcemerge requests with a body (#30792) Packaging: Remove windows bin files from the tar distribution (#30596) Docs: Use the default distribution to test docs (#31251) [DOCS] Adds testing for security APIs (#31345) Clarify that IP range data can be specified in CIDR notation. (#31374) Use system context for cluster state update tasks (#31241) Percentile/Ranks should return null instead of NaN when empty (#30460) REST high-level client: add validate query API (#31077) Move language analyzers from server to analysis-common module. (#31300) [Test] Fix :example-plugins:rest-handler on Windows Expose lucene's RemoveDuplicatesTokenFilter (#31275) Reload secure settings for plugins (#31383) Remove some cases in FieldTypeLookupTests that are no longer relevant. (#31381) Ensure we don't use a remote profile if cluster name matches (#31331) [TEST] Double write alias fault (#30942) [DOCS] Fix version in SQL JDBC Maven template [DOCS] Improve install and setup section for SQL JDBC SQL: Fix rest endpoint names in node stats (#31371) Support for remote path in reindex api - post backport fix Closes #22913 [ML] Put ML filter API response should contain the filter (#31362) Support for remote path in reindex api (#31290) Add byte array pooling to nio http transport (#31349) Remove trial status info from start trial doc (#31365) [DOCS] Adds links to release notes and highlights add is-write-index flag to aliases (#30942) Add rollover-creation-date setting to rolled over index (#31144) [ML] Hold ML filter items in sorted set (#31338) [Tests] Fix edge case in ScriptedMetricAggregatorTests (#31357)

#31241 changed the cluster state update tasks to run under system context. The context wrapping did not preserve response headers, though. This has led to a test failure on 6.x #31408, as the deprecation warnings were not carried back anymore to the caller when creating an index. This commit changes the restorable context supplier to preserve response headers.

Use system context for cluster state update tasks

7ec662f

ywelsch added >enhancement v7.0.0 :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v6.4.0 labels Jun 11, 2018

ywelsch requested review from bleskes and tvernum June 11, 2018 11:45

bleskes approved these changes Jun 11, 2018

View reviewed changes

Make clusterStatePublished method final in ClusterStateUpdateTask

1b7eeb4

ywelsch added 3 commits June 13, 2018 09:12

add assertion for debugging purposes

e205ad2

add more debug

7b92bdb

Remove assertions now that they no longer hold

faf152c

Merge remote-tracking branch 'elastic/master' into run-master-service…

44e8d90

…-task-under-system-context

ywelsch and others added 2 commits June 13, 2018 17:38

Merge remote-tracking branch 'elastic/master' into run-master-service…

d70d73a

…-task-under-system-context

Store headers for put/update datafeed

8b8e3e3

ywelsch added 6 commits June 14, 2018 15:22

Merge remote-tracking branch 'elastic/master' into run-master-service…

29a0552

…-task-under-system-context

checkstyle T_T

2a9c6a9

Merge remote-tracking branch 'elastic/master' into run-master-service…

6c22141

…-task-under-system-context

moar chckstyle

7c83153

T_T

9c036a5

O_o

f5ca4d9

tvernum approved these changes Jun 18, 2018

View reviewed changes

ywelsch merged commit 02a4ef3 into elastic:master Jun 18, 2018

ywelsch added a commit to ywelsch/elasticsearch that referenced this pull request Jun 19, 2018

Preserve response headers on index creation listener

87eadaf

Relates to elastic#31408 and elastic#31241

ywelsch mentioned this pull request Jun 19, 2018

Preserve response headers on cluster update task #31421

Merged

cbuescher mentioned this pull request Jun 19, 2018

CI: YAML test missing warning about delimited_payload_filter in 6.x #31422

Closed

jasontedor mentioned this pull request Jun 26, 2018

[CI] MultiClusterSearchWithSecurityYamlTestSuiteIT #31462

Closed

This was referenced Jun 26, 2018

Preserve thread context when connecting to remote cluster #31574

Merged

TemplateUpgradeService runs updates under existing ThreadContext #30603

Closed

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use system context for cluster state update tasks #31241

Use system context for cluster state update tasks #31241

ywelsch commented Jun 11, 2018

elasticmachine commented Jun 11, 2018

bleskes left a comment

bleskes Jun 11, 2018

ywelsch Jun 11, 2018

bleskes Jun 11, 2018

bleskes Jun 11, 2018

ywelsch Jun 11, 2018

ywelsch commented Jun 11, 2018

ywelsch commented Jun 13, 2018

s1monw commented Jun 13, 2018

ywelsch commented Jun 13, 2018

ywelsch commented Jun 14, 2018

tvernum left a comment

ywelsch commented Jun 18, 2018

Use system context for cluster state update tasks #31241

Use system context for cluster state update tasks #31241

Conversation

ywelsch commented Jun 11, 2018

elasticmachine commented Jun 11, 2018

bleskes left a comment

Choose a reason for hiding this comment

bleskes Jun 11, 2018

Choose a reason for hiding this comment

ywelsch Jun 11, 2018

Choose a reason for hiding this comment

bleskes Jun 11, 2018

Choose a reason for hiding this comment

bleskes Jun 11, 2018

Choose a reason for hiding this comment

ywelsch Jun 11, 2018

Choose a reason for hiding this comment

ywelsch commented Jun 11, 2018

ywelsch commented Jun 13, 2018

s1monw commented Jun 13, 2018

ywelsch commented Jun 13, 2018

ywelsch commented Jun 14, 2018

tvernum left a comment

Choose a reason for hiding this comment

ywelsch commented Jun 18, 2018