[Response Ops] Remove ephemeral tasks from task manager plugin (#201313)
## Summary

Resolves: #151463

Removes all references to ephemeral tasks from the task manager plugin,
as well as the associated unit and E2E tests, while maintaining backwards
compatibility: the `xpack.task_manager.ephemeral_tasks` flag becomes a
no-op if set. This PR depends on the PR that removes ephemeral task
support from the alerting and actions plugins
(#197421), so it should be merged after that PR.

Deprecates the following configuration settings:

- `xpack.task_manager.ephemeral_tasks.enabled`
- `xpack.task_manager.ephemeral_tasks.request_capacity`

Users don't have to change anything on their end if they don't wish to.
With this deprecation, if the above settings are defined, Kibana simply
ignores them.
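
For example, a legacy `kibana.yml` that still sets these keys (the values shown here are illustrative) will start Kibana normally; the settings are simply ignored:

```yaml
# Deprecated as of this PR: accepted for backwards compatibility, but a no-op.
xpack.task_manager.ephemeral_tasks.enabled: true
xpack.task_manager.ephemeral_tasks.request_capacity: 20
```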

### Checklist
- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
JiaweiWu authored Dec 13, 2024
1 parent 07a6902 commit 5a9129e
Showing 51 changed files with 47 additions and 3,123 deletions.
10 changes: 0 additions & 10 deletions docs/settings/task-manager-settings.asciidoc
@@ -33,16 +33,6 @@ This flag will enable automatic warn and error logging if task manager self dete
`xpack.task_manager.monitored_stats_health_verbose_log.warn_delayed_task_start_in_seconds`::
The amount of seconds we allow a task to delay before printing a warning server log. Defaults to 60.

`xpack.task_manager.ephemeral_tasks.enabled`::
deprecated:[8.8.0]
Enables a technical preview feature that executes a limited (and configurable) number of actions in the same task as the alert which triggered them.
These action tasks will reduce the latency of the time it takes an action to run after it's triggered, but are not persisted as SavedObjects.
These non-persisted action tasks have a risk that they won't be run at all if the Kibana instance running them exits unexpectedly. Defaults to false.

`xpack.task_manager.ephemeral_tasks.request_capacity`::
deprecated:[8.8.0]
Sets the size of the ephemeral queue defined above. Defaults to 10.

`xpack.task_manager.event_loop_delay.monitor`::
Enables event loop delay monitoring, which will log a warning when a task causes an event loop delay which exceeds the `warn_threshold` setting. Defaults to true.

@@ -181,7 +181,6 @@ The API returns the following:
"persistence": {
"recurring": 88,
"non_recurring": 4,
"ephemeral": 8
},
"result_frequency_percent_as_number": {
"alerting:.index-threshold": {
@@ -608,25 +607,22 @@ Resolving that would require deeper investigation into the {kib} Server Log, whe

[[task-manager-theory-spikes-in-non-recurring-tasks]]
*Theory*:
Spikes in non-recurring and ephemeral tasks are consuming a high percentage of the available capacity
Spikes in non-recurring tasks are consuming a high percentage of the available capacity

*Diagnosis*:
Task Manager uses ad-hoc non-recurring tasks to load balance operations across multiple {kib} instances.
Additionally, {kib} can use Task Manager to allocate resources for expensive operations by executing an ephemeral task. Ephemeral tasks are identical in operation to non-recurring tasks, but are not persisted and cannot be load balanced across {kib} instances.

Evaluating the preceding health stats, you see the following output under `stats.runtime.value.execution.persistence`:

[source,json]
--------------------------------------------------
{
"recurring": 88, # <1>
"non_recurring": 4, # <2>
"ephemeral": 8 # <3>
"non_recurring": 12, # <2>
},
--------------------------------------------------
<1> 88% of executed tasks are recurring tasks
<2> 4% of executed tasks are non-recurring tasks
<3> 8% of executed tasks are ephemeral tasks
<2> 12% of executed tasks are non-recurring tasks

You can infer from these stats that the majority of executions consist of recurring tasks at 88%.
You can use the `execution.persistence` stats to evaluate the ratio of consumed capacity, but on their own, you should not make assumptions about the sufficiency of the available capacity.
@@ -645,23 +641,21 @@ To assess the capacity, you should evaluate these stats against the `load` under
}
--------------------------------------------------

You can infer from these stats that it is very unusual for Task Manager to run out of capacity, so the capacity is likely sufficient to handle the amount of non-recurring and ephemeral tasks.
You can infer from these stats that it is very unusual for Task Manager to run out of capacity, so the capacity is likely sufficient to handle the amount of non-recurring tasks.

Suppose you have an alternate scenario, where you see the following output under `stats.runtime.value.execution.persistence`:

[source,json]
--------------------------------------------------
{
"recurring": 60, # <1>
"non_recurring": 30, # <2>
"ephemeral": 10 # <3>
"non_recurring": 40, # <2>
},
--------------------------------------------------
<1> 60% of executed tasks are recurring tasks
<2> 30% of executed tasks are non-recurring tasks
<3> 10% of executed tasks are ephemeral tasks
<2> 40% of executed tasks are non-recurring tasks

You can infer from these stats that even though most executions are recurring tasks, a substantial percentage of executions are non-recurring and ephemeral tasks at 40%.
You can infer from these stats that even though most executions are recurring tasks, a substantial percentage of executions are non-recurring tasks at 40%.

Evaluating the `load` under `stats.runtime.value`, you see the following:

@@ -678,9 +672,9 @@ Evaluating the `load` under `stats.runtime.value`, you see the following:
--------------------------------------------------

You can infer from these stats that it is quite common for this {kib} instance to run out of capacity.
Given the high rate of non-recurring and ephemeral tasks, it would be reasonable to assess that there is insufficient capacity in the {kib} cluster to handle the amount of tasks.
Given the high rate of non-recurring tasks, it would be reasonable to assess that there is insufficient capacity in the {kib} cluster to handle the amount of tasks.

Keep in mind that these stats give you a glimpse at a moment in time, and even though there has been insufficient capacity in recent minutes, this might not be true in other times where fewer non-recurring or ephemeral tasks are used. We recommend tracking these stats over time and identifying the source of these tasks before making sweeping changes to your infrastructure.
Keep in mind that these stats give you a glimpse at a moment in time, and even though there has been insufficient capacity in recent minutes, this might not be true in other times where fewer non-recurring tasks are used. We recommend tracking these stats over time and identifying the source of these tasks before making sweeping changes to your infrastructure.
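
The evaluation described above — reading the `execution.persistence` split together with the `load` percentiles — can be sketched as a small helper. This is purely illustrative and not part of Kibana; the threshold values and the sample stats are assumptions chosen to mirror the two scenarios in the text:

```typescript
// Shape of the relevant slices of the Task Manager health API response.
interface PersistenceStats {
  recurring: number; // % of executed tasks that are recurring
  non_recurring: number; // % of executed tasks that are non-recurring
}

interface LoadStats {
  p50: number;
  p90: number;
  p95: number;
  p99: number; // capacity utilization percentiles, 0-100
}

// Heuristic (thresholds are illustrative only): flag a likely capacity
// problem when non-recurring tasks make up a large share of executions
// AND the instance frequently runs at full capacity.
function capacityLikelyInsufficient(
  persistence: PersistenceStats,
  load: LoadStats
): boolean {
  const highNonRecurring = persistence.non_recurring >= 30;
  const frequentlySaturated = load.p90 >= 100;
  return highNonRecurring && frequentlySaturated;
}

// First scenario: mostly recurring tasks, rarely saturated.
console.log(
  capacityLikelyInsufficient(
    { recurring: 88, non_recurring: 12 },
    { p50: 40, p90: 75, p95: 80, p99: 100 }
  )
); // false

// Second scenario: 40% non-recurring and p90 load at 100%.
console.log(
  capacityLikelyInsufficient(
    { recurring: 60, non_recurring: 40 },
    { p50: 80, p90: 100, p95: 100, p99: 100 }
  )
); // true
```

As the text notes, such a check is only a snapshot; in practice you would track these stats over time before acting on them.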

[[task-manager-health-evaluate-the-workload]]
===== Evaluate the Workload
79 changes: 0 additions & 79 deletions x-pack/plugins/alerting/server/task_runner/task_runner.test.ts
@@ -2492,85 +2492,6 @@ describe('Task Runner', () => {
expect(mockUsageCounter.incrementCounter).not.toHaveBeenCalled();
});

test('successfully executes the task with ephemeral tasks enabled', async () => {
const taskRunner = new TaskRunner({
ruleType,
internalSavedObjectsRepository,
taskInstance: {
...mockedTaskInstance,
state: {
...mockedTaskInstance.state,
previousStartedAt: new Date(Date.now() - 5 * 60 * 1000).toISOString(),
},
},
context: {
...taskRunnerFactoryInitializerParams,
},
inMemoryMetrics,
});
expect(AlertingEventLogger).toHaveBeenCalled();

mockGetAlertFromRaw.mockReturnValue(mockedRuleTypeSavedObject as Rule);
encryptedSavedObjectsClient.getDecryptedAsInternalUser.mockResolvedValue(mockedRawRuleSO);
const runnerResult = await taskRunner.run();
expect(runnerResult).toEqual(generateRunnerResult({ state: true, history: [true] }));
expect(ruleType.executor).toHaveBeenCalledTimes(1);
const call = ruleType.executor.mock.calls[0][0];
expect(call.params).toEqual({ bar: true });
expect(call.startedAt).toEqual(new Date(DATE_1970));
expect(call.previousStartedAt).toEqual(new Date(DATE_1970_5_MIN));
expect(call.state).toEqual({});
expect(call.rule).not.toBe(null);
expect(call.rule.id).toBe('1');
expect(call.rule.name).toBe(RULE_NAME);
expect(call.rule.tags).toEqual(['rule-', '-tags']);
expect(call.rule.consumer).toBe('bar');
expect(call.rule.enabled).toBe(true);
expect(call.rule.schedule).toEqual({ interval: '10s' });
expect(call.rule.createdBy).toBe('rule-creator');
expect(call.rule.updatedBy).toBe('rule-updater');
expect(call.rule.createdAt).toBe(mockDate);
expect(call.rule.updatedAt).toBe(mockDate);
expect(call.rule.notifyWhen).toBe('onActiveAlert');
expect(call.rule.throttle).toBe(null);
expect(call.rule.producer).toBe('alerts');
expect(call.rule.ruleTypeId).toBe('test');
expect(call.rule.ruleTypeName).toBe('My test rule');
expect(call.rule.actions).toEqual(RULE_ACTIONS);
expect(call.services.alertFactory.create).toBeTruthy();
expect(call.services.scopedClusterClient).toBeTruthy();
expect(call.services).toBeTruthy();

expect(logger.debug).toHaveBeenCalledTimes(5);
expect(logger.debug).nthCalledWith(1, 'executing rule test:1 at 1970-01-01T00:00:00.000Z', {
tags: ['1', 'test'],
});
expect(logger.debug).nthCalledWith(
2,
'deprecated ruleRunStatus for test:1: {"lastExecutionDate":"1970-01-01T00:00:00.000Z","status":"ok"}',
{ tags: ['1', 'test'] }
);
expect(logger.debug).nthCalledWith(
3,
'ruleRunStatus for test:1: {"outcome":"succeeded","outcomeOrder":0,"outcomeMsg":null,"warning":null,"alertsCount":{"active":0,"new":0,"recovered":0,"ignored":0}}',
{ tags: ['1', 'test'] }
);
expect(logger.debug).nthCalledWith(
4,
'ruleRunMetrics for test:1: {"numSearches":3,"totalSearchDurationMs":23423,"esSearchDurationMs":33,"numberOfTriggeredActions":0,"numberOfGeneratedActions":0,"numberOfActiveAlerts":0,"numberOfRecoveredAlerts":0,"numberOfNewAlerts":0,"numberOfDelayedAlerts":0,"hasReachedAlertLimit":false,"hasReachedQueuedActionsLimit":false,"triggeredActionsStatus":"complete"}',
{ tags: ['1', 'test'] }
);

testAlertingEventLogCalls({
status: 'ok',
});

expect(elasticsearchService.client.asInternalUser.update).toHaveBeenCalledWith(
...generateRuleUpdateParams({})
);
expect(mockUsageCounter.incrementCounter).not.toHaveBeenCalled();
});

test('successfully stores successful runs', async () => {
const taskRunner = new TaskRunner({
ruleType,
23 changes: 11 additions & 12 deletions x-pack/plugins/task_manager/server/config.test.ts
@@ -19,10 +19,6 @@ describe('config validation', () => {
"active_nodes_lookback": "30s",
"interval": 10000,
},
"ephemeral_tasks": Object {
"enabled": false,
"request_capacity": 10,
},
"event_loop_delay": Object {
"monitor": true,
"warn_threshold": 5000,
@@ -82,10 +78,6 @@ describe('config validation', () => {
"active_nodes_lookback": "30s",
"interval": 10000,
},
"ephemeral_tasks": Object {
"enabled": false,
"request_capacity": 10,
},
"event_loop_delay": Object {
"monitor": true,
"warn_threshold": 5000,
@@ -143,10 +135,6 @@ describe('config validation', () => {
"active_nodes_lookback": "30s",
"interval": 10000,
},
"ephemeral_tasks": Object {
"enabled": false,
"request_capacity": 10,
},
"event_loop_delay": Object {
"monitor": true,
"warn_threshold": 5000,
@@ -296,4 +284,15 @@ describe('config validation', () => {
`"[discovery.active_nodes_lookback]: active node lookback duration cannot exceed five minutes"`
);
});

test('should not throw if ephemeral_tasks is defined', () => {
const config: Record<string, unknown> = {
ephemeral_tasks: {
enabled: true,
request_capacity: 20,
},
};

expect(() => configSchema.validate(config)).not.toThrow();
});
});
13 changes: 2 additions & 11 deletions x-pack/plugins/task_manager/server/config.ts
@@ -16,7 +16,6 @@ export const DEFAULT_MAX_WORKERS = 10;
export const DEFAULT_POLL_INTERVAL = 3000;
export const MGET_DEFAULT_POLL_INTERVAL = 500;
export const DEFAULT_VERSION_CONFLICT_THRESHOLD = 80;
export const DEFAULT_MAX_EPHEMERAL_REQUEST_CAPACITY = MAX_WORKERS_LIMIT;

// Monitoring Constants
// ===================
@@ -101,16 +100,8 @@ export const configSchema = schema.object(
max: MAX_DISCOVERY_INTERVAL_MS,
}),
}),
ephemeral_tasks: schema.object({
enabled: schema.boolean({ defaultValue: false }),
/* How many requests can Task Manager buffer before it rejects new requests. */
request_capacity: schema.number({
// a nice round contrived number, feel free to change as we learn how it behaves
defaultValue: 10,
min: 1,
max: DEFAULT_MAX_EPHEMERAL_REQUEST_CAPACITY,
}),
}),
/* Allows for old kibana config to start kibana without crashing since ephemeral tasks are deprecated*/
ephemeral_tasks: schema.maybe(schema.any()),
event_loop_delay: eventLoopDelaySchema,
kibanas_per_partition: schema.number({
defaultValue: DEFAULT_KIBANAS_PER_PARTITION,
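
The accept-and-ignore pattern used in the config schema above can be sketched without `@kbn/config-schema` as a plain function. This is a minimal illustration (hypothetical helper, not Kibana's actual implementation): a deprecated key passes validation but is dropped from the resulting config, so the rest of the plugin never sees it:

```typescript
// Keys that are accepted for backwards compatibility but have no effect.
const DEPRECATED_KEYS: string[] = ['ephemeral_tasks'];

// Returns a copy of the raw config with deprecated keys stripped out,
// so an old kibana.yml that still sets them does not crash at startup.
function validateConfig(raw: Record<string, unknown>): Record<string, unknown> {
  const config = { ...raw };
  for (const key of DEPRECATED_KEYS) {
    if (key in config) {
      // A real implementation would also emit a deprecation warning here.
      delete config[key];
    }
  }
  return config;
}

// Old configs that set ephemeral_tasks are accepted and silently ignored.
const result = validateConfig({
  max_attempts: 3,
  ephemeral_tasks: { enabled: true, request_capacity: 20 },
});
console.log(result); // { max_attempts: 3 }
```

In the actual PR, the same effect is achieved by declaring the key as `schema.maybe(schema.any())` so validation accepts it while the plugin simply never reads it.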

This file was deleted.
