Reactively disable Task Manager lifecycle when core services become unavailable #81779

Merged: gmmorris merged 10 commits into elastic:master on Oct 29, 2020


@gmmorris gmmorris commented Oct 27, 2020

Summary

Closes #75501
Closes #48785
Closes #47607
Closes #46670

Plugs the Task Manager polling lifecycle into the Kibana Services Status streams so that we reactively start and stop polling whenever the Elasticsearch or SavedObjects service switches between available and unavailable.

This prevents Task Manager from polling while either of these services is in an unavailable state.
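
To illustrate the idea, here is a minimal, dependency-free sketch. It is not the code in this PR: `ServiceLevel`, `StatusStream` and `PollingLifecycle` are hypothetical names, and the hand-rolled stream only stands in for the real status streams. It assumes polling should run only while every watched service reports `available`.

```typescript
// Hypothetical names: ServiceLevel, StatusStream and PollingLifecycle are
// illustrative only and are not Kibana's actual types.
type ServiceLevel = 'available' | 'unavailable';

// Minimal stand-in for an observable that replays its latest value.
class StatusStream {
  private listeners: Array<(level: ServiceLevel) => void> = [];

  constructor(private current: ServiceLevel) {}

  subscribe(listener: (level: ServiceLevel) => void): void {
    this.listeners.push(listener);
    listener(this.current); // new subscribers immediately get the latest value
  }

  next(level: ServiceLevel): void {
    this.current = level;
    this.listeners.forEach((listener) => listener(level));
  }
}

class PollingLifecycle {
  isPolling = false;
  transitions: string[] = [];
  private levels: Record<string, ServiceLevel> = {};

  constructor(services: Record<string, StatusStream>) {
    const names = Object.keys(services);
    // Treat every service as unavailable until its stream says otherwise,
    // so polling cannot start before all services have reported in.
    names.forEach((name) => (this.levels[name] = 'unavailable'));
    names.forEach((name) =>
      services[name].subscribe((level) => {
        this.levels[name] = level;
        this.reconcile();
      })
    );
  }

  // Start polling when all services are available; stop as soon as one is not.
  private reconcile(): void {
    const allAvailable = Object.keys(this.levels).every(
      (name) => this.levels[name] === 'available'
    );
    if (allAvailable && !this.isPolling) {
      this.isPolling = true;
      this.transitions.push('start');
    } else if (!allAvailable && this.isPolling) {
      this.isPolling = false;
      this.transitions.push('stop');
    }
  }
}

const elasticsearch = new StatusStream('available');
const savedObjects = new StatusStream('available');
const lifecycle = new PollingLifecycle({ elasticsearch, savedObjects });

elasticsearch.next('unavailable'); // polling stops
elasticsearch.next('available'); // polling resumes
console.log(lifecycle.transitions.join(','));
```

In this sketch the lifecycle only records `start` and `stop` transitions; a real poller would create or tear down its polling interval at those two points.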

We do not address the potential breaking behaviour described in the original issue, as it is not a supported case per the Platform team's guidelines (specifically, we do not support loading a fresh snapshot Elasticsearch without restarting Kibana). That case should be addressed by #81790, though.

I couldn't figure out how to test this in an automated e2e test, but unit tests cover the internal behaviour, and I've tested it locally by forcing these services to fail randomly and watching TM recover.

* master: (87 commits)
  [Actions] Adding `hasAuth` to Webhook Configuration to avoid confusing UX (elastic#81778)
  [i18n] add get_kibana_translation_paths tests (elastic#81624)
  [UX] Fix search term reset from url (elastic#81654)
  [Lens] Improved range formatter (elastic#80132)
  [Resolver] `SideEffectContext` changes, remove `@testing-library` uses (elastic#81077)
  [Time to Visualize] Update Library Text with Call to Action (elastic#81667)
  [docs] Resolving failed Kibana upgrade migrations (elastic#80999)
  [ftr/menuToggle] provide helper for enhanced menu toggle handling (elastic#81709)
  [Task Manager] adds basic observability into Task Manager's runtime operations (elastic#77868)
  Doc changes for stack management and grouped feature privileges (elastic#80486)
  Added functional test for alerts list filters by status, alert type and action type. Did a code refactoring and splitting for alerts tests. (elastic#81422)
  [Security Solution][Endpoint][Admin] Malware Protections Notify User Version (elastic#81415)
  Revert "[Actions] Adding `hasAuth` to Webhook Configuration to avoid confusing UX (elastic#81390)"
  [Maps] Use default format when proxying EMS-files (elastic#79760)
  [Discover] Deangularize context.html (elastic#81442)
  Use the displayName property in default editor (elastic#73311)
  adds bug label to Bug report for Security Solution template (elastic#81643)
  [ML] Transforms: Remove index field limitation for custom query. (elastic#81467)
  [Actions] Adding `hasAuth` to Webhook Configuration to avoid confusing UX (elastic#81390)
  [Task Manager] Mark task as failed if maxAttempts has been met. (elastic#80681)
  ...
@gmmorris
Contributor Author

@elasticmachine merge upstream

kibanamachine and others added 4 commits October 28, 2020 05:51
…kibana into task-manager/lost-connectivity

* 'task-manager/lost-connectivity' of github.com:gmmorris/kibana:
  skips overview tests (elastic#81877)
  [Security Solution][Case] Fix connector's labeling (elastic#81824)
  [Maps] Fix EMS test (elastic#81856)
  [Security Solutions][Detections] - Fix bug, last response not showing for disabled rules (elastic#81783)
  skip flaky suite (elastic#81853)
  Add tsconfig for url_forwarding (elastic#81177)
  skip flaky suite (elastic#81844)
  check for server enabled (elastic#81818)
  [Seurity Solution][Case] Create case plugin client (elastic#81018)
  [Security Solutions][Detection Engine] Changes wording for threat matches and rules (elastic#81334)
  [Security Solution] critical pref bug with browser fields reducer
@gmmorris gmmorris changed the title from "plugged Task Manager lifecycle into status reactively" to "Reactively disable Task Manager lifecycle when core services become unavailable" Oct 28, 2020
@gmmorris gmmorris marked this pull request as ready for review October 28, 2020 12:21
@gmmorris gmmorris requested a review from a team as a code owner October 28, 2020 12:21
@gmmorris gmmorris added the Feature:Task Manager, release_note:fix, Team:ResponseOps, v7.11.0, and v8.0.0 labels Oct 28, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@kibanamachine
Contributor

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@ymao1
Contributor

ymao1 commented Oct 28, 2020

On both master and this branch, I started ES, then Kibana, and then killed ES. On this branch, I saw that the `[error][plugins][taskManager][taskManager] Failed to poll for work: Error: No Living connections` messages no longer fill up the logs like they do on master. On both branches, though, when I start ES back up and it is ready, I see a ton of logs for `[error][plugins][taskManager] [WorkloadAggregator]: Error: Invalid workload: {"took":0,"timed_out":false,"_shards":{"total":0,"successful":0,"skipped":0,"failed":0},"hits":{"total":{"value":0,"relation":"eq"},"max_score":0,"hits":[]}}`. Is that something that should be addressed by this PR, or is this a different issue?

@gmmorris
Contributor Author

> On both master and this branch, I started ES, then Kibana, and then killed ES. On this branch, I saw that the `[error][plugins][taskManager][taskManager] Failed to poll for work: Error: No Living connections` messages no longer fill up the logs like they do on master. On both branches, though, when I start ES back up and it is ready, I see a ton of logs for `[error][plugins][taskManager] [WorkloadAggregator]: Error: Invalid workload: {"took":0,"timed_out":false,"_shards":{"total":0,"successful":0,"skipped":0,"failed":0},"hits":{"total":{"value":0,"relation":"eq"},"max_score":0,"hits":[]}}`. Is that something that should be addressed by this PR, or is this a different issue?

Thanks @ymao1, are you sure you have the latest version? I thought I pushed a fix for that. :)

@ymao1
Contributor

ymao1 commented Oct 28, 2020

> Thanks @ymao1, are you sure you have the latest version? I thought I pushed a fix for that. :)

Hmm... I just tried it again. I deleted the local branch and ran `git pr 81779`, which should have gotten me the latest, and I'm still seeing the same thing.

Edit: I just checked and I'm seeing the changes from your latest commit in the files, so I think I have the latest.

Second edit: user error on my part. I was starting up a brand new ES every time.

Contributor

@ymao1 ymao1 left a comment

LGTM!

@gmmorris
Contributor Author

> Second edit: user error on my part. I was starting up a brand new ES every time.

Totally reasonable mistake to make - I was doing the same until Platform informed me we don't support that case.

Member

@pmuellr pmuellr left a comment

LGTM

@gmmorris gmmorris merged commit 66d79ea into elastic:master Oct 29, 2020
gmmorris added a commit to gmmorris/kibana that referenced this pull request Oct 29, 2020
…navailable (elastic#81779)

Plugs the Task Manager polling lifecycle into the Kibana Services Status streams in order to ensure we reactively start and stop polling whenever the Elasticsearch or SavedObjects service switch between `available` and `unavailable`.

This will prevent Task Manager from polling whenever these services switch to an `unavailable` state.
gmmorris added a commit that referenced this pull request Oct 29, 2020
…navailable (#81779) (#81991)

5 participants