Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add PersistentTasksClusterService::unassignPersistentTask method #37576

Merged

Conversation

benwtrent
Copy link
Member

@benwtrent benwtrent commented Jan 17, 2019

This adds a method that assigns a task to a null node with the given reason.

The purpose that his serves is when we don't want to END a task permanently, but we do want to temporarily pause the execution of a task until a later time.

The caller of this method is assumed to have cleaned up their task and takes appropriate action before calling this.

The service, because executorNode == null, will attempt to re-assign the task. The consumer should override getAssignment when they extend PersistentTasksExecutor. This will allow them to check the cluster state to determine if they are ready for re-assignment and starting the task.

@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core

@droberts195 droberts195 added the :Distributed Coordination/Task Management Issues for anything around the Tasks API - both persistent and node level. label Jan 17, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@benwtrent
Copy link
Member Author

run gradle build tests 2

1 similar comment
@benwtrent
Copy link
Member Author

run gradle build tests 2

@benwtrent
Copy link
Member Author

This code as it is now will not work.

} else {
// task is running locally, but master doesn't know about it - that means that the persistent task was removed
// cancel the task without notifying master
logger.trace("Found unregistered persistent task [{}] with id [{}] and allocation id [{}] - cancelling",
task.getAction(), task.getPersistentTaskId(), task.getAllocationId());
cancelTask(id);
}

Is executed, which should be expected since the task is still running locally on the node and in its state. Simply removing from the master node state does not address the issue of somehow removing it from the executing node state without cancelling the task completely.

Will reopen this PR when I can figure this out.

@benwtrent benwtrent closed this Jan 17, 2019
@benwtrent benwtrent reopened this Jan 18, 2019
@benwtrent
Copy link
Member Author

Adding an integration test that verifies the desired behavior of unallocating a task. All seems to check out from what I can tell.

Feedback is welcome and appreciated :)

@droberts195
Copy link
Contributor

Simply removing from the master node state does not address the issue of somehow removing it from the executing node state without cancelling the task completely

For the benefit of other reviewers, this earlier comment is not true for persistent tasks in general. ML jobs have an extra complication in that when a job is closed gracefully a separate action tells the local task to run its shutdown steps and then the local task tells the persistent tasks service when it's fully finished. So we need to add some extra logic in ML to account for this complexity, but we'll do this as a followup PR that doesn't touch the core code.

@droberts195 droberts195 requested a review from tlrx January 21, 2019 11:14
Copy link
Member

@tlrx tlrx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks nice, I left a bunch of minor comments

PersistentTasksCustomMetaData.Builder tasksInProgress = builder(currentState);
if (tasksInProgress.hasTask(taskId, taskAllocationId)) {
logger.trace("Unassigning task {} with allocation id {}", taskId, taskAllocationId);
return update(currentState, tasksInProgress.reassignTask(taskId, new Assignment(null, reason)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you centralize the instanciation of unassigned Assigment in a static method and replaces the usages in the class?

logger.trace("Unassigning task {} with allocation id {}", taskId, taskAllocationId);
return update(currentState, tasksInProgress.reassignTask(taskId, new Assignment(null, reason)));
} else {
if (tasksInProgress.hasTask(taskId)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if the log traces are really useful, since the ResourceNotFoundException should appear in the log anyway?

PersistentTasksClusterService persistentTasksClusterService =
internalCluster().getInstance(PersistentTasksClusterService.class, internalCluster().getMasterName());
// Speed up rechecks to a rate that is quicker than what settings would allow
persistentTasksClusterService.setRecheckInterval(TimeValue.timeValueMillis(1));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a risk to set it to a such low value?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is already another test that sets 1ms for the recheck interval. I agree 1ms would be a completely inappropriate setting for production, but that's why it's being set via a method that end users cannot call. I think if a test fails because the interval is low then it will be exposing a bug that could happen with a higher setting, just much less frequently. During this test it's true that the master node will be doing a lot of work iterating through the persistent tasks list, but it won't be doing the other work that a production master node would be doing, so a modern CPU should be able to cope.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the precision. I just wanted to be sure that this value couldn't overhelm any thread pool and cause other issues.


// Verify that the task is STILL in internal cluster state
assertThat(((PersistentTasksCustomMetaData) internalCluster().clusterService().state().getMetaData()
.custom(PersistentTasksCustomMetaData.TYPE)).tasks(), hasSize(1));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should also check the taskId here


// Assert that we still have it in master state
assertThat(((PersistentTasksCustomMetaData) internalCluster().clusterService().state().getMetaData()
.custom(PersistentTasksCustomMetaData.TYPE)).tasks(), hasSize(1));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this could be in a static private helper method

@benwtrent
Copy link
Member Author

@tlrx PR updated, let me know what you think

@benwtrent
Copy link
Member Author

run elasticsearch-ci/1

@benwtrent
Copy link
Member Author

run elasticsearch-ci/2

@benwtrent
Copy link
Member Author

run elasticsearch-ci/1

@benwtrent
Copy link
Member Author

run elasticsearch-ci/2

1 similar comment
@benwtrent
Copy link
Member Author

run elasticsearch-ci/2

Copy link
Member

@tlrx tlrx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - I left very minor comments but they don't require an extra review

});
}

private static void internalClusterHasSingleTask(String taskId) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can you renamed to assertTaskExists() or assertClusterStateHasTask()

// Verify it starts again
waitForTaskToStart();

// Assert that we still have it in master state
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: by master state, I'd expect the internalClusterHasSingleTask() method to check the cluster state on the master node, but in fact it checks using a random cluster service. So maybe just remove this comment?

PersistentTasksClusterService persistentTasksClusterService =
internalCluster().getInstance(PersistentTasksClusterService.class, internalCluster().getMasterName());
// Speed up rechecks to a rate that is quicker than what settings would allow
persistentTasksClusterService.setRecheckInterval(TimeValue.timeValueMillis(1));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the precision. I just wanted to be sure that this value couldn't overhelm any thread pool and cause other issues.

@benwtrent benwtrent merged commit 1c2ae91 into elastic:master Jan 23, 2019
@benwtrent benwtrent deleted the feature/adding-unassign-task-method branch January 23, 2019 17:48
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jan 23, 2019
* master:
  Liberalize StreamOutput#writeStringList (elastic#37768)
  Add PersistentTasksClusterService::unassignPersistentTask method (elastic#37576)
  Tests: disable testRandomGeoCollectionQuery on tiny polygons (elastic#37579)
  Use ILM for Watcher history deletion (elastic#37443)
  Make sure PutMappingRequest accepts content types other than JSON. (elastic#37720)
  Retry ILM steps that fail due to SnapshotInProgressException (elastic#37624)
  Use disassociate in preference to deassociate (elastic#37704)
  Delete Redundant RoutingServiceTests (elastic#37750)
  Always return metadata version if metadata is requested (elastic#37674)
benwtrent added a commit that referenced this pull request Jan 23, 2019
)

* Add PersistentTasksClusterService::unassignPersistentTask method

* adding cancellation test

* Adding integration test for unallocating tasks from a node

* Addressing review comments

* adressing minor PR comments
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jan 24, 2019
* elastic/master: (85 commits)
  Use explicit version for build-tools in example plugin integ tests (elastic#37792)
  Change `rational` to `saturation` in script_score (elastic#37766)
  Deprecate types in get field mapping API (elastic#37667)
  Add ability to listen to group of affix settings (elastic#37679)
  Ensure changes requests return the latest mapping version (elastic#37633)
  Make Minio Setup more Reliable (elastic#37747)
  Liberalize StreamOutput#writeStringList (elastic#37768)
  Add PersistentTasksClusterService::unassignPersistentTask method (elastic#37576)
  Tests: disable testRandomGeoCollectionQuery on tiny polygons (elastic#37579)
  Use ILM for Watcher history deletion (elastic#37443)
  Make sure PutMappingRequest accepts content types other than JSON. (elastic#37720)
  Retry ILM steps that fail due to SnapshotInProgressException (elastic#37624)
  Use disassociate in preference to deassociate (elastic#37704)
  Delete Redundant RoutingServiceTests (elastic#37750)
  Always return metadata version if metadata is requested (elastic#37674)
  [TEST] Mute MlMappingsUpgradeIT testMappingsUpgrade
  Streamline skip_unavailable handling (elastic#37672)
  Only bootstrap and elect node in current voting configuration (elastic#37712)
  Ensure either success or failure path for SearchOperationListener is called (elastic#37467)
  Target only specific index in update settings test
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Task Management Issues for anything around the Tasks API - both persistent and node level. :ml Machine learning >non-issue v6.7.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants