Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle cluster_block_exception during reindexing the TM index #201297

Merged

Conversation

ersin-erdal
Copy link
Contributor

@ersin-erdal ersin-erdal commented Nov 21, 2024

Resolves: https://github.com/elastic/response-ops-team/issues/249

This PR increases task claiming interval in case of cluster_block_exception to avoid generating too many error during TM index reindexing.

To verify:

  • Run your local Kibana,
  • Create a user with kibana_system and kibana_admin roles
  • Logout and login with your new user
  • Use below request to put a write block on TM index.
    PUT /.kibana_task_manager_9.0.0_001/_block/write
  • Observe the error messages and their occurring interval on your terminal.
  • Use below request on the Kibana console to halt write block.
PUT /.kibana_task_manager_9.0.0_001/_settings
{
  "index": {
    "blocks.write": false
  }
}


const FLUSH_MARKER = Symbol('flush');
export const ADJUST_THROUGHPUT_INTERVAL = 10 * 1000;
export const PREFERRED_MAX_POLL_INTERVAL = 60 * 1000;
export const INTERVAL_AFTER_BLOCK_EXCEPTION = 61 * 1000;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make it 1 sec longer than the max limit, so I can check the previousPollInterval on error flush and set the interval back to default.

return event.tag === 'emit';
}

function incementErrorCount(count: number) {
function incrementOrEmitErrorCount(count: number, isBlockException: boolean) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to emit the error event as soon as possible in case of ClusterBlockException

@pmuellr
Copy link
Member

pmuellr commented Nov 22, 2024

Haven't reviewed the code yet, but I did take it for a spin.

Notes:

  • if the write block is still on and Kibana is restarted, messages like this are logged: Task ML:saved-objects-sync-task: Error running task: ML:saved-objects-sync-task, index [.kibana_task_manager_9.0.0_001] blocked by: [FORBIDDEN/8/index write (api)];: cluster_block_exception
    Guessing this is probably ok, but why would we be trying to write a task, that presumably already exists? Is that the way "ensureScheduled" (or whatever) works w/TM? Not clear if it's all the tasks or just some. Not sure it's worth doing anything about this, if anything it's a great signal that the TM index is write-blocked :-)

  • when using the update-by-query claimer, there's a long, filled-with-JSON error logged every 3s: Failed to poll for work: { big JSON wad here }. Seems like we should try to not log that every 3s, but perhaps the # of folks using that claimer, by the time we're in version 8.last, will be almost or literally none.

Other than that, seems to work as described. Looks like it's logging the Discovery service message ~1/minute, and then you can see errors updating task claims, etc, as expected. When the block is removed, everything comes back to normal.

@ersin-erdal
Copy link
Contributor Author

  • if the write block is still on and Kibana is restarted, messages like this are logged: Task ML:saved-objects-sync-task: Error running task: ML:saved-objects-sync-task, index [.kibana_task_manager_9.0.0_001] blocked by: [FORBIDDEN/8/index write (api)];: cluster_block_exception
    Guessing this is probably ok, but why would we be trying to write a task, that presumably already exists? Is that the way "ensureScheduled" (or whatever) works w/TM? Not clear if it's all the tasks or just some. Not sure it's worth doing anything about this, if anything it's a great signal that the TM index is write-blocked :-)

Yes, I also think that it is ok, because there should not be a write-block during plugin start. Upgrade assistant can be used in an already running Kibana.

when using the update-by-query claimer, there's a long, filled-with-JSON error logged every 3s: Failed to poll for work: { big JSON wad here }. Seems like we should try to not log that every 3s, but perhaps the # of folks using that claimer, by the time we're in version 8.last, will be almost or literally none.

I don't think that there will be users opting in for update-by-query strategy but I will check that scenario as well.

@ersin-erdal ersin-erdal added Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) release_note:skip Skip the PR/issue when compiling release notes backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) labels Dec 4, 2024
@ersin-erdal ersin-erdal marked this pull request as ready for review December 4, 2024 16:38
@ersin-erdal ersin-erdal requested a review from a team as a code owner December 4, 2024 16:38
@elasticmachine
Copy link
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@ersin-erdal ersin-erdal force-pushed the 249-handling-tm-reindexing-errors branch from 2cde300 to c2da309 Compare December 5, 2024 14:25
);
} else {
this.logger.error(
`Kibana Discovery Service couldn't update this node's last_seen timestamp. id: ${this.currentNode}, last_seen: ${lastSeen}, error:${e.message}`
);
}
if (isClusterBlockException(e)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this check needs to move up, otherwise the log always says the retryInterval is 10000 ms even if it's actually 60,000

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah good point, i didn't check that. Fixed it, thanks.

@ymao1
Copy link
Contributor

ymao1 commented Dec 5, 2024

I'm seeing the poll interval flip flop between 500 and 61,000 (I changed the debug log to an info)

[2024-12-05T13:14:18.719-05:00][ERROR][plugins.taskManager] Kibana Discovery Service couldn't be started and will be retried in 10000ms, error:index [.kibana_task_manager_9.0.0_001] blocked by: [FORBIDDEN/8/index write (api)];: cluster_block_exception
	Root causes:
		cluster_block_exception: index [.kibana_task_manager_9.0.0_001] blocked by: [FORBIDDEN/8/index write (api)];
[2024-12-05T13:14:21.142-05:00][WARN ][plugins.taskManager] Background task node "5b2de169-2785-441b-ae8c-186a1936b17d" has no assigned partitions, claiming against all partitions
[2024-12-05T13:14:21.162-05:00][INFO ][plugins.taskManager] Poll interval configuration changing from 500 to 61000 after seeing 1 "too many request" and/or "execute [inline] script" and/or "cluster_block_exception" error(s).
[2024-12-05T13:14:21.162-05:00][WARN ][plugins.taskManager] Poll interval configuration is temporarily increased after Elasticsearch returned 1 "too many request" and/or "execute [inline] script" and/or "cluster_block_exception" error(s).
[2024-12-05T13:14:21.162-05:00][WARN ][plugins.taskManager] Capacity configuration is temporarily reduced after Elasticsearch returned 1 "too many request" and/or "execute [inline] script" error(s).
[2024-12-05T13:14:21.162-05:00][INFO ][plugins.taskManager] Poll interval configuration changing from 500 to 61000 after seeing 1 "too many request" and/or "execute [inline] script" and/or "cluster_block_exception" error(s).
[2024-12-05T13:14:21.162-05:00][WARN ][plugins.taskManager] Poll interval configuration is temporarily increased after Elasticsearch returned 1 "too many request" and/or "execute [inline] script" and/or "cluster_block_exception" error(s).
[2024-12-05T13:14:21.162-05:00][WARN ][plugins.taskManager] Capacity configuration is temporarily reduced after Elasticsearch returned 1 "too many request" and/or "execute [inline] script" error(s).
[2024-12-05T13:14:21.163-05:00][INFO ][plugins.taskManager] Poll interval configuration changing from 500 to 61000 after seeing 1 "too many request" and/or "execute [inline] script" and/or "cluster_block_exception" error(s).
[2024-12-05T13:14:21.163-05:00][WARN ][plugins.taskManager] Poll interval configuration is temporarily increased after Elasticsearch returned 1 "too many request" and/or "execute [inline] script" and/or "cluster_block_exception" error(s).

and then 1 minute later

[2024-12-05T13:14:38.689-05:00][INFO ][plugins.taskManager] Poll interval configuration changing from 61000 to 500 after seeing 0 "too many request" and/or "execute [inline] script" and/or "cluster_block_exception" error(s).
[2024-12-05T13:14:38.704-05:00][INFO ][plugins.taskManager] Poll interval configuration changing from 61000 to 500 after seeing 0 "too many request" and/or "execute [inline] script" and/or "cluster_block_exception" error(s).
[2024-12-05T13:14:38.705-05:00][WARN ][plugins.taskManager] Task Manager is unhealthy, the assumedRequiredThroughputPerMinutePerKibana (NaN) >= capacityPerMinutePerKibana (600)
[2024-12-05T13:14:40.071-05:00][INFO ][plugins.taskManager] Poll interval configuration changing from 61000 to 500 after seeing 0 "too many request" and/or "execute [inline] script" and/or "cluster_block_exception" error(s).
[2024-12-05T13:14:48.710-05:00][WARN ][plugins.taskManager] Task Manager is unhealthy, the assumedRequiredThroughputPerMinutePerKibana (NaN) >= capacityPerMinutePerKibana (840)

Should we also look at the task manager capacity calculation log? It's calculating NaN so that is being logged every 10 seconds it looks like

@ersin-erdal
Copy link
Contributor Author

ersin-erdal commented Dec 5, 2024

I'm seeing the poll interval flip flop between 500 and 61,000 (I changed the debug log to an info)

Actually it is not flapping, It sets the interval to 61000 and schedules the tasks with it.
Then errorReset interval kicks in (in 10 sec) and sets the interval back to 500.
But when the next cycle gets run in 1m, it sets the interval back to 61000 if the cluster_block_exception is still firing.

Should we also look at the task manager capacity calculation log? It's calculating NaN so that is being logged every 10 seconds it looks like

I set the capacity to previousCapacity in case of cluster_block_exception but not sure if this is correct. WDYT?

@ymao1
Copy link
Contributor

ymao1 commented Dec 6, 2024

Actually it is not flapping, It sets the interval to 61000 and schedules the tasks with it. Then errorReset interval kicks in (in 10 sec) and sets the interval back to 500. But when the next cycle gets run in 1m, it sets the interval back to 61000 if the cluster_block_exception is still firing.

I see, so the poll interval does get reset back to 500 but since it is already set to 1m it won't actually poll until the next minute. Is there any way we can fix these logs to be less misleading?

@ersin-erdal
Copy link
Contributor Author

ersin-erdal commented Dec 9, 2024

I see, so the poll interval does get reset back to 500 but since it is already set to 1m it won't actually poll until the next minute. Is there any way we can fix these logs to be less misleading?

I think I managed to hide that message, have just pushed the change

Copy link
Contributor

@ymao1 ymao1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Verified the poll interval increases when cluster block exception seen and reverts when it is no longer seen.

@elasticmachine
Copy link
Contributor

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

@ersin-erdal ersin-erdal merged commit 7aa80ce into elastic:main Dec 10, 2024
8 checks passed
@ersin-erdal ersin-erdal deleted the 249-handling-tm-reindexing-errors branch December 10, 2024 15:17
@kibanamachine
Copy link
Contributor

Starting backport for target branches: 8.x

https://github.com/elastic/kibana/actions/runs/12259162994

kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Dec 10, 2024
…c#201297)

Resolves: elastic/response-ops-team#249

This PR increases task claiming interval in case of
`cluster_block_exception` to avoid generating too many error during TM
index reindexing.

## To verify:

- Run your local Kibana,
- Create a user with `kibana_system` and `kibana_admin` roles
- Logout and login with your new user
- Use below request to put a write block on TM index.
   `PUT /.kibana_task_manager_9.0.0_001/_block/write`
- Observe the error messages and their occurring interval on your
terminal.
- Use below request on the Kibana console to halt write block.
```
PUT /.kibana_task_manager_9.0.0_001/_settings
{
  "index": {
    "blocks.write": false
  }
}
```

(cherry picked from commit 7aa80ce)
@kibanamachine
Copy link
Contributor

💚 All backports created successfully

Status Branch Result
8.x

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

kibanamachine added a commit that referenced this pull request Dec 10, 2024
…201297) (#203609)

# Backport

This will backport the following commits from `main` to `8.x`:
- [Handle cluster_block_exception during reindexing the TM index
(#201297)](#201297)

<!--- Backport version: 9.4.3 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Ersin
Erdal","email":"[email protected]"},"sourceCommit":{"committedDate":"2024-12-10T15:17:27Z","message":"Handle
cluster_block_exception during reindexing the TM index
(#201297)\n\nResolves:
https://github.com/elastic/response-ops-team/issues/249\r\n\r\nThis PR
increases task claiming interval in case of\r\n`cluster_block_exception`
to avoid generating too many error during TM\r\nindex
reindexing.\r\n\r\n## To verify:\r\n\r\n- Run your local Kibana,\r\n-
Create a user with `kibana_system` and `kibana_admin` roles\r\n- Logout
and login with your new user\r\n- Use below request to put a write block
on TM index.\r\n `PUT /.kibana_task_manager_9.0.0_001/_block/write`\r\n-
Observe the error messages and their occurring interval on
your\r\nterminal.\r\n- Use below request on the Kibana console to halt
write block.\r\n```\r\nPUT
/.kibana_task_manager_9.0.0_001/_settings\r\n{\r\n \"index\": {\r\n
\"blocks.write\": false\r\n
}\r\n}\r\n```","sha":"7aa80ce53027df7ac0e5fc01d206ef38ac3f9575","branchLabelMapping":{"^v9.0.0$":"main","^v8.18.0$":"8.x","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:skip","Team:ResponseOps","v9.0.0","backport:prev-minor"],"title":"Handle
cluster_block_exception during reindexing the TM
index","number":201297,"url":"https://github.com/elastic/kibana/pull/201297","mergeCommit":{"message":"Handle
cluster_block_exception during reindexing the TM index
(#201297)\n\nResolves:
https://github.com/elastic/response-ops-team/issues/249\r\n\r\nThis PR
increases task claiming interval in case of\r\n`cluster_block_exception`
to avoid generating too many error during TM\r\nindex
reindexing.\r\n\r\n## To verify:\r\n\r\n- Run your local Kibana,\r\n-
Create a user with `kibana_system` and `kibana_admin` roles\r\n- Logout
and login with your new user\r\n- Use below request to put a write block
on TM index.\r\n `PUT /.kibana_task_manager_9.0.0_001/_block/write`\r\n-
Observe the error messages and their occurring interval on
your\r\nterminal.\r\n- Use below request on the Kibana console to halt
write block.\r\n```\r\nPUT
/.kibana_task_manager_9.0.0_001/_settings\r\n{\r\n \"index\": {\r\n
\"blocks.write\": false\r\n
}\r\n}\r\n```","sha":"7aa80ce53027df7ac0e5fc01d206ef38ac3f9575"}},"sourceBranch":"main","suggestedTargetBranches":[],"targetPullRequestStates":[{"branch":"main","label":"v9.0.0","branchLabelMappingKey":"^v9.0.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/201297","number":201297,"mergeCommit":{"message":"Handle
cluster_block_exception during reindexing the TM index
(#201297)\n\nResolves:
https://github.com/elastic/response-ops-team/issues/249\r\n\r\nThis PR
increases task claiming interval in case of\r\n`cluster_block_exception`
to avoid generating too many error during TM\r\nindex
reindexing.\r\n\r\n## To verify:\r\n\r\n- Run your local Kibana,\r\n-
Create a user with `kibana_system` and `kibana_admin` roles\r\n- Logout
and login with your new user\r\n- Use below request to put a write block
on TM index.\r\n `PUT /.kibana_task_manager_9.0.0_001/_block/write`\r\n-
Observe the error messages and their occurring interval on
your\r\nterminal.\r\n- Use below request on the Kibana console to halt
write block.\r\n```\r\nPUT
/.kibana_task_manager_9.0.0_001/_settings\r\n{\r\n \"index\": {\r\n
\"blocks.write\": false\r\n
}\r\n}\r\n```","sha":"7aa80ce53027df7ac0e5fc01d206ef38ac3f9575"}}]}]
BACKPORT-->

Co-authored-by: Ersin Erdal <[email protected]>
CAWilson94 pushed a commit to CAWilson94/kibana that referenced this pull request Dec 12, 2024
…c#201297)

Resolves: elastic/response-ops-team#249

This PR increases task claiming interval in case of
`cluster_block_exception` to avoid generating too many error during TM
index reindexing.

## To verify:

- Run your local Kibana,
- Create a user with `kibana_system` and `kibana_admin` roles
- Logout and login with your new user
- Use below request to put a write block on TM index.
   `PUT /.kibana_task_manager_9.0.0_001/_block/write`
- Observe the error messages and their occurring interval on your
terminal.
- Use below request on the Kibana console to halt write block.
```
PUT /.kibana_task_manager_9.0.0_001/_settings
{
  "index": {
    "blocks.write": false
  }
}
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.18.0 v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants