Add PersistentTasksClusterService::unassignPersistentTask method #37576

benwtrent · 2019-01-17T16:44:58Z

This adds a method that assigns a task to a null node with the given reason.

This is for cases where we don't want to END a task permanently, but do want to pause its execution until a later time.

The caller is assumed to have cleaned up their task and taken any appropriate action before calling this method.

Because executorNode == null, the service will attempt to re-assign the task. Consumers should override getAssignment when they extend PersistentTasksExecutor, so that they can check the cluster state and decide whether the task is ready to be re-assigned and started.
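The flow described above can be sketched in plain Java. This is a simplified, self-contained model and not the Elasticsearch implementation: `UnassignSketch`, its nested `Assignment` class, the in-memory task map, and the `readyForReassignment` flag are hypothetical stand-ins for the persistent tasks metadata in the cluster state and for a `getAssignment` override. The real method also checks the task's allocation id, which this model omits.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model only: every name here is a hypothetical stand-in, not an Elasticsearch class.
final class UnassignSketch {

    // An assignment with a null executorNode means "not running anywhere right now".
    static final class Assignment {
        final String executorNode;
        final String explanation;

        Assignment(String executorNode, String explanation) {
            this.executorNode = executorNode;
            this.explanation = explanation;
        }

        boolean isAssigned() {
            return executorNode != null;
        }
    }

    // Stand-in for the persistent tasks section of the cluster state: taskId -> assignment.
    private final Map<String, Assignment> tasks = new HashMap<>();

    // Stand-in for whatever condition a getAssignment override would read from cluster state.
    boolean readyForReassignment = false;

    void startTask(String taskId, String node) {
        tasks.put(taskId, new Assignment(node, "initial assignment"));
    }

    // Keeps the task registered but points it at a null node, recording the reason.
    void unassignTask(String taskId, String reason) {
        if (!tasks.containsKey(taskId)) {
            throw new IllegalArgumentException("no task with id [" + taskId + "]");
        }
        tasks.put(taskId, new Assignment(null, reason));
    }

    // Stand-in for the periodic recheck: unassigned tasks get a node once the executor allows it.
    void recheckAssignments(String candidateNode) {
        for (Map.Entry<String, Assignment> entry : tasks.entrySet()) {
            if (!entry.getValue().isAssigned() && readyForReassignment) {
                entry.setValue(new Assignment(candidateNode, "reassigned"));
            }
        }
    }

    Assignment assignmentOf(String taskId) {
        return tasks.get(taskId);
    }

    public static void main(String[] args) {
        UnassignSketch service = new UnassignSketch();
        service.startTask("task-1", "node-a");
        service.unassignTask("task-1", "paused for maintenance");
        service.recheckAssignments("node-b"); // not ready yet, so the task stays unassigned
        System.out.println(service.assignmentOf("task-1").isAssigned()); // prints false
        service.readyForReassignment = true;
        service.recheckAssignments("node-b"); // now the task is picked up again
        System.out.println(service.assignmentOf("task-1").executorNode); // prints node-b
    }
}
```

The point of the model is that unassigning never removes the task from the map; only the assignment changes, which is what lets a later recheck start it again.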

elasticmachine · 2019-01-17T16:46:35Z

Pinging @elastic/ml-core

elasticmachine · 2019-01-17T16:56:09Z

Pinging @elastic/es-distributed

benwtrent · 2019-01-17T17:33:34Z

run gradle build tests 2

benwtrent · 2019-01-17T19:22:07Z

run gradle build tests 2

benwtrent · 2019-01-17T21:12:27Z

This code as it is now will not work.

elasticsearch/server/src/main/java/org/elasticsearch/persistent/PersistentTasksNodeService.java

Lines 131 to 137 in 0227260

    
    } else {
        // task is running locally, but master doesn't know about it - that means that the persistent task was removed
        // cancel the task without notifying master
        logger.trace("Found unregistered persistent task [{}] with id [{}] and allocation id [{}] - cancelling",
                task.getAction(), task.getPersistentTaskId(), task.getAllocationId());
        cancelTask(id);
    }

This branch is executed, which is to be expected since the task is still running locally on the node and is still in its state. Simply removing from the master node state does not address the issue of somehow removing it from the executing node state without cancelling the task completely.

Will reopen this PR when I can figure this out.

benwtrent · 2019-01-18T16:00:17Z

Adding an integration test that verifies the desired behavior of unallocating a task. All seems to check out from what I can tell.

Feedback is welcome and appreciated :)

droberts195 · 2019-01-18T16:42:13Z

"Simply removing from the master node state does not address the issue of somehow removing it from the executing node state without cancelling the task completely"

For the benefit of other reviewers, this earlier comment is not true for persistent tasks in general. ML jobs have an extra complication: when a job is closed gracefully, a separate action tells the local task to run its shutdown steps, and the local task then tells the persistent tasks service when it has fully finished. So we need to add some extra logic in ML to account for this complexity, but we'll do this in a follow-up PR that doesn't touch the core code.

tlrx

This looks nice, I left a bunch of minor comments

tlrx · 2019-01-21T14:10:11Z

server/src/main/java/org/elasticsearch/persistent/PersistentTasksClusterService.java

+                PersistentTasksCustomMetaData.Builder tasksInProgress = builder(currentState);
+                if (tasksInProgress.hasTask(taskId, taskAllocationId)) {
+                    logger.trace("Unassigning task {} with allocation id {}", taskId, taskAllocationId);
+                    return update(currentState, tasksInProgress.reassignTask(taskId, new Assignment(null, reason)));


Can you centralize the instantiation of the unassigned Assignment in a static method and replace the usages in the class?
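A minimal sketch of what this suggestion could look like, written against a stand-in class rather than the real `Assignment`. The names `AssignmentSketch` and `unassigned` are hypothetical and chosen only for illustration; the PR may have used a different name and location for the factory.

```java
// Hypothetical stand-in for the Assignment class; not the Elasticsearch type.
final class AssignmentSketch {
    final String executorNode;
    final String explanation;

    AssignmentSketch(String executorNode, String explanation) {
        this.executorNode = executorNode;
        this.explanation = explanation;
    }

    // Single place that encodes "unassigned" as a null executor node,
    // so call sites no longer repeat new AssignmentSketch(null, reason).
    static AssignmentSketch unassigned(String reason) {
        return new AssignmentSketch(null, reason);
    }

    boolean isAssigned() {
        return executorNode != null;
    }

    public static void main(String[] args) {
        AssignmentSketch paused = AssignmentSketch.unassigned("paused by caller");
        System.out.println(paused.isAssigned()); // prints false
        System.out.println(paused.explanation);  // prints paused by caller
    }
}
```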

tlrx · 2019-01-21T14:11:08Z

server/src/main/java/org/elasticsearch/persistent/PersistentTasksClusterService.java

+                    logger.trace("Unassigning task {} with allocation id {}", taskId, taskAllocationId);
+                    return update(currentState, tasksInProgress.reassignTask(taskId, new Assignment(null, reason)));
+                } else {
+                    if (tasksInProgress.hasTask(taskId)) {


I'm wondering if the log traces are really useful, since the ResourceNotFoundException should appear in the log anyway?

tlrx · 2019-01-21T14:18:11Z

server/src/test/java/org/elasticsearch/persistent/PersistentTasksExecutorIT.java

+        PersistentTasksClusterService persistentTasksClusterService =
+            internalCluster().getInstance(PersistentTasksClusterService.class, internalCluster().getMasterName());
+        // Speed up rechecks to a rate that is quicker than what settings would allow
+        persistentTasksClusterService.setRecheckInterval(TimeValue.timeValueMillis(1));


Is there a risk in setting it to such a low value?

droberts195 · 2019-01-22

There is already another test that sets 1ms for the recheck interval. I agree 1ms would be a completely inappropriate setting for production, but that's why it's being set via a method that end users cannot call. I think if a test fails because the interval is low then it will be exposing a bug that could happen with a higher setting, just much less frequently. During this test it's true that the master node will be doing a lot of work iterating through the persistent tasks list, but it won't be doing the other work that a production master node would be doing, so a modern CPU should be able to cope.

tlrx · 2019-01-23T08:26:12Z

Thanks for the clarification. I just wanted to be sure that this value couldn't overwhelm any thread pool and cause other issues.

tlrx · 2019-01-21T14:19:10Z

server/src/test/java/org/elasticsearch/persistent/PersistentTasksExecutorIT.java

+
+            // Verify that the task is STILL in internal cluster state
+            assertThat(((PersistentTasksCustomMetaData) internalCluster().clusterService().state().getMetaData()
+                .custom(PersistentTasksCustomMetaData.TYPE)).tasks(), hasSize(1));


I think we should also check the taskId here

tlrx · 2019-01-21T14:19:20Z

server/src/test/java/org/elasticsearch/persistent/PersistentTasksExecutorIT.java

+
+        // Assert that we still have it in master state
+        assertThat(((PersistentTasksCustomMetaData) internalCluster().clusterService().state().getMetaData()
+            .custom(PersistentTasksCustomMetaData.TYPE)).tasks(), hasSize(1));


And this could be in a static private helper method
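The two suggestions above (check the task id, and move the check into a helper) might combine into something like the following. This is a plain-Java sketch under stated assumptions: the real test would use Hamcrest matchers against `PersistentTasksCustomMetaData`, whereas here a hypothetical `Map` from task id to executor node stands in for the cluster state, and `TaskAssertionsSketch` is an illustrative name.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical helper; the map stands in for the persistent tasks held in cluster state.
final class TaskAssertionsSketch {

    // Fails unless the state holds exactly one task and that task has the expected id.
    static void assertClusterStateHasTask(Map<String, String> tasksById, String taskId) {
        if (tasksById.size() != 1) {
            throw new AssertionError("expected exactly one task but found " + tasksById.size());
        }
        if (!tasksById.containsKey(taskId)) {
            throw new AssertionError("expected task [" + taskId + "] but found " + tasksById.keySet());
        }
    }

    public static void main(String[] args) {
        // A null executor node models a task that is registered but currently unassigned.
        Map<String, String> state = Collections.singletonMap("task-1", null);
        assertClusterStateHasTask(state, "task-1");
        System.out.println("task-1 is still registered");
    }
}
```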

benwtrent · 2019-01-22T16:12:12Z

@tlrx PR updated, let me know what you think

benwtrent · 2019-01-22T16:50:43Z

run elasticsearch-ci/1

benwtrent · 2019-01-22T16:50:54Z

run elasticsearch-ci/2

benwtrent · 2019-01-22T17:34:51Z

run elasticsearch-ci/1

benwtrent · 2019-01-22T19:03:58Z

run elasticsearch-ci/2

benwtrent · 2019-01-22T20:52:40Z

run elasticsearch-ci/2

tlrx

LGTM - I left very minor comments but they don't require an extra review

tlrx · 2019-01-23T08:16:24Z

server/src/test/java/org/elasticsearch/persistent/PersistentTasksExecutorIT.java

+        });
+    }
+
+    private static void internalClusterHasSingleTask(String taskId) {


nit: can you rename this to assertTaskExists() or assertClusterStateHasTask()?

tlrx · 2019-01-23T08:19:52Z

server/src/test/java/org/elasticsearch/persistent/PersistentTasksExecutorIT.java

+        // Verify it starts again
+        waitForTaskToStart();
+
+        // Assert that we still have it in master state


Nit: by master state, I'd expect the internalClusterHasSingleTask() method to check the cluster state on the master node, but in fact it checks using a random cluster service. So maybe just remove this comment?

tlrx · 2019-01-23T08:26:12Z

server/src/test/java/org/elasticsearch/persistent/PersistentTasksExecutorIT.java

+        PersistentTasksClusterService persistentTasksClusterService =
+            internalCluster().getInstance(PersistentTasksClusterService.class, internalCluster().getMasterName());
+        // Speed up rechecks to a rate that is quicker than what settings would allow
+        persistentTasksClusterService.setRecheckInterval(TimeValue.timeValueMillis(1));


Thanks for the clarification. I just wanted to be sure that this value couldn't overwhelm any thread pool and cause other issues.

* Add PersistentTasksClusterService::unassignPersistentTask method
* adding cancellation test
* Adding integration test for unallocating tasks from a node
* Addressing review comments
* adressing minor PR comments

Add PersistentTasksClusterService::unassignPersistentTask method

6340be8

benwtrent added >non-issue v7.0.0 v6.7.0 :ml Machine learning labels Jan 17, 2019

droberts195 added the :Distributed Coordination/Task Management Issues for anything around the Tasks API - both persistent and node level. label Jan 17, 2019

benwtrent closed this Jan 17, 2019

benwtrent added 2 commits January 17, 2019 15:23

adding cancellation test

11af7df

Adding integration test for unallocating tasks from a node

5cd2f9d

benwtrent reopened this Jan 18, 2019

droberts195 requested a review from tlrx January 21, 2019 11:14

tlrx requested changes Jan 21, 2019

Addressing review comments

df24e18

tlrx approved these changes Jan 23, 2019

benwtrent added 2 commits January 23, 2019 07:52

Merge branch 'master' into feature/adding-unassign-task-method

5fcf03a

adressing minor PR comments

7756bc7

benwtrent merged commit 1c2ae91 into elastic:master Jan 23, 2019

benwtrent deleted the feature/adding-unassign-task-method branch January 23, 2019 17:48

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
