Add PersistentTasksClusterService::unassignPersistentTask method #37576
Conversation
Pinging @elastic/ml-core
Pinging @elastic/es-distributed
run gradle build tests 2
This code as it is now will not work. The code in elasticsearch/server/src/main/java/org/elasticsearch/persistent/PersistentTasksNodeService.java (lines 131 to 137 in 0227260) is executed, which should be expected since the task is still running locally on the node and is in its state. Simply removing the task from the master node's state does not address the issue of somehow removing it from the executing node's state without cancelling the task completely. I will reopen this PR when I can figure this out.
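To illustrate the problem being described, here is a deliberately simplified toy model of that node-side reconciliation (illustrative names only, not the actual PersistentTasksNodeService code): any local task that has disappeared from, or been reassigned in, cluster state gets cancelled outright, so merely deleting the task from the master's state cannot express "pause now, resume later".

import java.util.List;
import java.util.Map;

// Toy model only -- not Elasticsearch code.
class NodeTaskReconciler {
    record LocalTask(String taskId, long allocationId) {}
    record StateTask(String taskId, long allocationId) {}

    void reconcile(List<LocalTask> running, Map<String, StateTask> clusterStateTasks) {
        for (LocalTask task : running) {
            StateTask inState = clusterStateTasks.get(task.taskId());
            if (inState == null || inState.allocationId() != task.allocationId()) {
                // Removed or reassigned in cluster state: the node's only
                // reaction is full cancellation, hence the need for an explicit
                // unassignment (null node) rather than removal from state.
                cancel(task);
            }
        }
    }

    private void cancel(LocalTask task) {
        // cancel the local task completely
    }
}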
Adding an integration test that verifies the desired behavior of unassigning a task. All seems to check out from what I can tell. Feedback is welcome and appreciated :)
For the benefit of other reviewers, this earlier comment is not true for persistent tasks in general. ML jobs have an extra complication in that when a job is closed gracefully a separate action tells the local task to run its shutdown steps, and then the local task tells the persistent tasks service when it's fully finished. So we need to add some extra logic in ML to account for this complexity, but we'll do this as a follow-up PR that doesn't touch the core code.
This looks nice, I left a bunch of minor comments
PersistentTasksCustomMetaData.Builder tasksInProgress = builder(currentState);
if (tasksInProgress.hasTask(taskId, taskAllocationId)) {
    logger.trace("Unassigning task {} with allocation id {}", taskId, taskAllocationId);
    return update(currentState, tasksInProgress.reassignTask(taskId, new Assignment(null, reason)));
Can you centralize the instantiation of the unassigned Assignment in a static method and replace the usages in the class?
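A minimal sketch of that suggestion (the factory method name is hypothetical, not from this PR):

// Hypothetical factory so the "null executor node" convention lives in one place.
private static Assignment unassignedAssignment(String reason) {
    return new Assignment(null, reason);
}

// Call sites would then become:
return update(currentState, tasksInProgress.reassignTask(taskId, unassignedAssignment(reason)));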
logger.trace("Unassigning task {} with allocation id {}", taskId, taskAllocationId); | ||
return update(currentState, tasksInProgress.reassignTask(taskId, new Assignment(null, reason))); | ||
} else { | ||
if (tasksInProgress.hasTask(taskId)) { |
I'm wondering if the log traces are really useful, since the ResourceNotFoundException should appear in the log anyway?
PersistentTasksClusterService persistentTasksClusterService =
    internalCluster().getInstance(PersistentTasksClusterService.class, internalCluster().getMasterName());
// Speed up rechecks to a rate that is quicker than what settings would allow
persistentTasksClusterService.setRecheckInterval(TimeValue.timeValueMillis(1));
Is there a risk in setting it to such a low value?
There is already another test that sets 1ms for the recheck interval. I agree 1ms would be a completely inappropriate setting for production, but that's why it's being set via a method that end users cannot call. I think if a test fails because the interval is low then it will be exposing a bug that could happen with a higher setting, just much less frequently. During this test it's true that the master node will be doing a lot of work iterating through the persistent tasks list, but it won't be doing the other work that a production master node would be doing, so a modern CPU should be able to cope.
Thanks for the clarification. I just wanted to be sure that this value couldn't overwhelm any thread pool and cause other issues.
// Verify that the task is STILL in internal cluster state
assertThat(((PersistentTasksCustomMetaData) internalCluster().clusterService().state().getMetaData()
    .custom(PersistentTasksCustomMetaData.TYPE)).tasks(), hasSize(1));
I think we should also check the taskId here
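For example, something along these lines (a sketch, assuming the test keeps the started task's id in a taskId variable):

// Sketch: assert on the task id as well, not just the collection size.
PersistentTasksCustomMetaData tasks = internalCluster().clusterService().state()
    .getMetaData().custom(PersistentTasksCustomMetaData.TYPE);
assertThat(tasks.tasks(), hasSize(1));
assertThat(tasks.getTask(taskId), notNullValue());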
// Assert that we still have it in master state
assertThat(((PersistentTasksCustomMetaData) internalCluster().clusterService().state().getMetaData()
    .custom(PersistentTasksCustomMetaData.TYPE)).tasks(), hasSize(1));
Same here
And this could be in a private static helper method.
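A sketch of the suggested helper, which appears to be what became internalClusterHasSingleTask later in the thread (the body here is a guess based on the assertions above; the PR's actual implementation may differ):

// Hypothetical body consolidating the repeated cluster-state assertion.
private static void internalClusterHasSingleTask(String taskId) {
    PersistentTasksCustomMetaData tasks = internalCluster().clusterService().state()
        .getMetaData().custom(PersistentTasksCustomMetaData.TYPE);
    assertThat(tasks.tasks(), hasSize(1));
    assertThat(tasks.tasks().iterator().next().getId(), equalTo(taskId));
}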
@tlrx PR updated, let me know what you think
run elasticsearch-ci/1
run elasticsearch-ci/2
run elasticsearch-ci/1
run elasticsearch-ci/2
LGTM - I left very minor comments but they don't require an extra review
});
}

private static void internalClusterHasSingleTask(String taskId) {
nit: can you rename this to assertTaskExists() or assertClusterStateHasTask()?
// Verify it starts again
waitForTaskToStart();

// Assert that we still have it in master state
Nit: by master state, I'd expect the internalClusterHasSingleTask() method to check the cluster state on the master node, but in fact it checks using a random cluster service. So maybe just remove this comment?
* master:
  Liberalize StreamOutput#writeStringList (elastic#37768)
  Add PersistentTasksClusterService::unassignPersistentTask method (elastic#37576)
  Tests: disable testRandomGeoCollectionQuery on tiny polygons (elastic#37579)
  Use ILM for Watcher history deletion (elastic#37443)
  Make sure PutMappingRequest accepts content types other than JSON. (elastic#37720)
  Retry ILM steps that fail due to SnapshotInProgressException (elastic#37624)
  Use disassociate in preference to deassociate (elastic#37704)
  Delete Redundant RoutingServiceTests (elastic#37750)
  Always return metadata version if metadata is requested (elastic#37674)
* elastic/master: (85 commits)
  Use explicit version for build-tools in example plugin integ tests (elastic#37792)
  Change `rational` to `saturation` in script_score (elastic#37766)
  Deprecate types in get field mapping API (elastic#37667)
  Add ability to listen to group of affix settings (elastic#37679)
  Ensure changes requests return the latest mapping version (elastic#37633)
  Make Minio Setup more Reliable (elastic#37747)
  Liberalize StreamOutput#writeStringList (elastic#37768)
  Add PersistentTasksClusterService::unassignPersistentTask method (elastic#37576)
  Tests: disable testRandomGeoCollectionQuery on tiny polygons (elastic#37579)
  Use ILM for Watcher history deletion (elastic#37443)
  Make sure PutMappingRequest accepts content types other than JSON. (elastic#37720)
  Retry ILM steps that fail due to SnapshotInProgressException (elastic#37624)
  Use disassociate in preference to deassociate (elastic#37704)
  Delete Redundant RoutingServiceTests (elastic#37750)
  Always return metadata version if metadata is requested (elastic#37674)
  [TEST] Mute MlMappingsUpgradeIT testMappingsUpgrade
  Streamline skip_unavailable handling (elastic#37672)
  Only bootstrap and elect node in current voting configuration (elastic#37712)
  Ensure either success or failure path for SearchOperationListener is called (elastic#37467)
  Target only specific index in update settings test
  ...
This adds a method that assigns a task to a null node with the given reason. The purpose this serves is for when we don't want to END a task permanently, but we do want to temporarily pause the execution of the task until a later time. The caller of this method is assumed to have cleaned up its task and taken appropriate action before calling it. The service, because executorNode == null, will attempt to re-assign the task. The consumer should override getAssignment when they extend PersistentTasksExecutor. This will allow them to check the cluster state to determine whether they are ready for the task to be re-assigned and started.
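As a hedged sketch of that consumer-side pattern (the executor subclass, params type, and readiness check below are all hypothetical, not code from this PR):

// Hypothetical executor that keeps its task unassigned until the cluster is
// ready for it again; nodeOperation() and other required members are omitted.
public class MyTaskExecutor extends PersistentTasksExecutor<MyTaskParams> {

    public MyTaskExecutor(String taskName, String executorName) {
        super(taskName, executorName);
    }

    @Override
    public Assignment getAssignment(MyTaskParams params, ClusterState clusterState) {
        if (readyToRun(clusterState) == false) {
            // Returning a null executor node leaves the task paused; the
            // service will call getAssignment again on later cluster changes.
            return new Assignment(null, "waiting for the cluster to be ready");
        }
        return super.getAssignment(params, clusterState); // default node selection
    }

    // Illustrative readiness check; a real executor would inspect whatever
    // cluster state its task depends on.
    private boolean readyToRun(ClusterState clusterState) {
        return clusterState.nodes().getDataNodes().isEmpty() == false;
    }
}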