[Reporting/Screenshots] Handle page setup errors and capture the page, don't fail the job #58683

Merged

@tsullivan tsullivan commented Feb 27, 2020

Closes #57008
Closes #44305

Release note: Kibana Reporting can now tolerate timeouts when trying to read visualization information on the Kibana page, and will capture a screenshot of the page even when such errors happen. If a timeout error happens on a Reporting job, the errors will become part of the job document and will be shown in the listing of Reporting jobs in Management > Kibana > Reporting.

Opening the Kibana URL for screenshot capture and reading the Kibana visualization element info from the page involve many moving parts, and an error anywhere in the flow could fail the entire reporting job. This PR changes that by breaking up the flow into setup steps and a capture step. If there are errors in the setup steps, they are logged and added to the Reporting job metadata, but the capture step is still allowed to continue.
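
A minimal sketch of the flow described above, not the code in this PR: each setup step runs inside its own try/catch, failures are logged and collected as warnings, and the capture step runs regardless. All names here (`SetupStep`, `JobResult`, `runJob`) are hypothetical, and letting later setup steps run after an earlier one fails is an assumption of the sketch.

```typescript
// Hypothetical types for illustration; these are not the names used in the PR.
type SetupStep = { name: string; run: () => Promise<void> };

interface JobResult {
  screenshot: string;
  warnings: string[]; // setup errors that would be recorded in the job metadata
}

// Run every setup step, recording failures instead of rethrowing them,
// then run the capture step regardless of how setup went.
async function runJob(
  setupSteps: SetupStep[],
  capture: () => Promise<string>,
  log: (message: string) => void = () => {}
): Promise<JobResult> {
  const warnings: string[] = [];

  for (const step of setupSteps) {
    try {
      await step.run();
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      const message = `${step.name}: ${reason}`;
      log(message);
      warnings.push(message);
    }
  }

  // A failure in the capture step itself still fails the job.
  const screenshot = await capture();
  return { screenshot, warnings };
}
```

With this shape, a job whose `waitForElements` step times out would still return a screenshot, plus one warning describing the timeout.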

New configuration settings for Reporting

xpack.reporting.capture.timeouts:
  openUrl: 30000
  waitForElements: 30000
  renderComplete: 30000

Each of these settings allows the user to control how much time to allow for each of the 3 main parts of setup$.

  • openUrl: the time allowed for Reporting to wait for the initial data of the Kibana page in the browser (time to resolve the .application selector)
  • waitForElements: the time allowed for Reporting to wait for the visualization panels to load (time to resolve the data-shared-items-container selector)
  • renderComplete: the time allowed for Reporting to wait for each visualization to signal that rendering is done

Note that xpack.reporting.queue.timeout still exists. It represents the overall time that the Reporting job is allowed to take, and it still has a default of 120 seconds. None of the new timeout settings should exceed the xpack.reporting.queue.timeout setting.
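
The constraint in the note above can be checked in a few lines. This is a hedged sketch, not Kibana's config validation: `CaptureTimeouts` mirrors the YAML keys introduced in this PR, `findExcessiveTimeouts` is a hypothetical helper, and 120000 ms stands in for the 120-second queue timeout default.

```typescript
// Mirrors the keys under xpack.reporting.capture.timeouts (values in milliseconds).
interface CaptureTimeouts {
  openUrl: number;
  waitForElements: number;
  renderComplete: number;
}

// Hypothetical helper: returns the names of the settings that exceed the queue timeout.
function findExcessiveTimeouts(timeouts: CaptureTimeouts, queueTimeoutMs: number): string[] {
  return (Object.keys(timeouts) as Array<keyof CaptureTimeouts>).filter(
    (key) => timeouts[key] > queueTimeoutMs
  );
}

const defaults: CaptureTimeouts = { openUrl: 30000, waitForElements: 30000, renderComplete: 30000 };
const queueTimeoutMs = 120000; // 120 seconds, the default mentioned above

// With the defaults, no setting exceeds the queue timeout.
console.log(findExcessiveTimeouts(defaults, queueTimeoutMs));

// Raising renderComplete to 180 seconds would flag that one setting.
console.log(findExcessiveTimeouts({ ...defaults, renderComplete: 180000 }, queueTimeoutMs));
```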

Other changes in this PR:

  • Retires the scanPage module. It had the job of waiting for visualization selectors while also checking the page for errors. The latter is a troubleshooting step that is no longer needed with this PR.
  • Removes some obsolete troubleshooting tasks, such as logging page text when the app selector could not be found
  • Wraps a lot of log messages in i18n.translate

If a Reporting job completed with "warnings" it gets a new treatment on the Management > Reporting job listing page:
image
image

However, there will still be a download available:
image

Why does the download show an empty Kibana page? Because when the page loaded initially, there was a toast message error saying "Saved Object not found". That toast message dismissed itself after a while, but before Reporting captured the page.

@tsullivan tsullivan force-pushed the reporting/screenshot-always-capture branch 6 times, most recently from 4026286 to 4b2a977 Compare February 27, 2020 22:00
@tsullivan tsullivan changed the title [Reporting/Screenshots] Handle page setup errors and capture the page, even if errors [Reporting/Screenshots] Handle page setup errors and capture the page, don't fail the job Feb 27, 2020
@tsullivan tsullivan force-pushed the reporting/screenshot-always-capture branch 2 times, most recently from 14bcff1 to c96095d Compare February 27, 2020 22:58
    .default(30000),
  renderComplete: Joi.number()
    .integer()
    .default(30000),
Member Author

Note: These get passed to Puppeteer waitForSelector calls. Puppeteer's default for these is 30 seconds internally.

Contributor

@joelgriffith joelgriffith Feb 28, 2020

I think we do call https://pptr.dev/#?product=Puppeteer&version=v2.1.1&show=api-pagesetdefaulttimeouttimeout and pass in the overall job timeout, so we might be overriding the default timeouts for these APIs.

Member Author

++ If we're passing the timeout to each of the calls, then setting page.setDefaultTimeout doesn't make a lot of sense.

I don't think we should use the global job timeout, which is the "queue timeout", because once that timeout hits there is no buffer room to recover a completed screenshot for the job. Hitting the "queue timeout" should be avoided as much as possible.

We need to think about this... Should the page.setDefaultTimeout be the max of these 3 new timeouts?

At any rate, server/browsers/chromium/driver_factory/index.ts#L120 is no longer valid in this PR

Member Author

@tsullivan tsullivan Mar 5, 2020

Contributor

Yup, this makes sense:

-        // All navigation methods default to 30 seconds,
-        // which can cause the job to fail even if we bump timeouts in
-        // the config. Help alleviate errors like
+        // All waitFor methods have their own timeout config
+        page.setDefaultTimeout(this.captureConfig.timeouts.openUrl);

Contributor

Individual methods can still override this, but I like setting it to the openUrl timeout, which is where I think the bulk of the operation is for PDF reports.
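
To illustrate the arrangement discussed in this thread, here is a rough sketch: the page-level default is set from `openUrl`, and each wait passes its own timeout, which takes precedence over the default in Puppeteer. `PageLike` is a hypothetical stand-in covering only the two Puppeteer `Page` methods used here (`setDefaultTimeout` and `waitForSelector`), so the snippet runs without a browser. `setupPage` is also hypothetical, and writing the second selector as an attribute selector is an assumption.

```typescript
// Hypothetical subset of Puppeteer's Page API, so the sketch runs without a browser.
interface PageLike {
  setDefaultTimeout(timeout: number): void;
  waitForSelector(selector: string, options?: { timeout?: number }): Promise<unknown>;
}

// Mirrors the keys under xpack.reporting.capture.timeouts (values in milliseconds).
interface CaptureTimeouts {
  openUrl: number;
  waitForElements: number;
  renderComplete: number;
}

// Hypothetical helper: the page-wide default comes from openUrl,
// and each wait passes its own timeout explicitly.
async function setupPage(page: PageLike, timeouts: CaptureTimeouts): Promise<void> {
  page.setDefaultTimeout(timeouts.openUrl);

  await page.waitForSelector(".application", { timeout: timeouts.openUrl });

  // The attribute-selector form below is an assumption.
  await page.waitForSelector("[data-shared-items-container]", {
    timeout: timeouts.waitForElements,
  });
}
```

Because a real Puppeteer `Page` has both methods, it could be passed wherever `PageLike` is expected.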

@tsullivan
Member Author

Here is a test plan on how this PR will affect the different ways that jobs have been known to fail:

  1. Test using invalid hostname for xpack.reporting.kibanaServer.hostname
  2. Test invalid auth (stuck on login page)
  3. Deleted Canvas workpad
  4. Deleted visualization in a Dashboard
  5. Deleted saved object for visualization report
  6. Search on a dashboard is taking a long time

// know how many items to expect since gridster incrementally adds panels
// we have to use this hint to wait for all of them
await browser.waitForSelector(
`${renderCompleteSelector},[${itemsCountAttribute}]`,
Member Author

this block of code was moved over from the scanPage function

Contributor

@joelgriffith joelgriffith left a comment

This is great, nice logging additions + try/catch blocks. Think this will help us a ton 👍

@tsullivan tsullivan force-pushed the reporting/screenshot-always-capture branch from 2ff57cd to b41acab Compare February 28, 2020 16:46
@tsullivan tsullivan marked this pull request as ready for review February 28, 2020 23:17
@tsullivan tsullivan added (Deprecated) Feature:Reporting Use Reporting:Screenshot, Reporting:CSV, or Reporting:Framework instead Team:Reporting Services labels Mar 4, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-reporting-services (Team:Reporting Services)

@tsullivan
Member Author

@elasticmachine merge upstream

Contributor

@joelgriffith joelgriffith left a comment

Rest of the changes LGTM

@kibanamachine
Contributor

💚 Build Succeeded

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@tsullivan tsullivan merged commit 893d8da into elastic:master Mar 6, 2020
@tsullivan tsullivan deleted the reporting/screenshot-always-capture branch March 6, 2020 05:28
tsullivan added a commit to tsullivan/kibana that referenced this pull request Mar 6, 2020
…, don't fail the job (elastic#58683)

* [Reporting] Handle error if intercepted request could not be continued

* [Reporting/Screenshots] Handle page setup errors and capture the page with errors shown

* show warnings in UI

* i18n todos

* Cleanup an old troubleshooting task

* set the default for all new timeout settings to 30 seconds

* fix some tests

* update error strings

* Cleanup 2

* fix tests 2

* polish the job info map status items

* More error message updating

* Log the error that was caught

* Oops fix ts

* add documentation

* fix i18n

* fix mocha test

* use the openUrl timeout as the default for navigation

* fix comment

Co-authored-by: Elastic Machine <[email protected]>
jloleysens added a commit to jloleysens/kibana that referenced this pull request Mar 6, 2020
…x-closed-index

* 'master' of github.com:elastic/kibana: (32 commits)
  [ML] Use Kibana's HttpHandler for HTTP requests (elastic#59320)
  [APM] Create settings page to manage Custom Links (elastic#57788)
  [Upgrade Assistant] Server-side batch reindexing (elastic#58598)
  completes navigation test (elastic#59141)
  [SIEM] Fixes dragging entries to the Timeline while data is loading may trigger a partial page reload (elastic#59476)
  [Reporting/Screenshots] Handle page setup errors and capture the page, don't fail the job (elastic#58683)
  [SIEM] [CASES] API with io-ts validation (elastic#59265)
  Use camelCase rather than snakeCase for plugin name (elastic#59461)
  [Maps] top term percentage field property (elastic#59386)
  Add custom action to registry and show actions list in siem (elastic#58395)
  [Search service] Add enhanced ES search strategy (elastic#59224)
  [Logs UI] Speed up stream rendering using memoization (elastic#59163)
  expand max-old-space-size for xpack jest tests (elastic#59455)
  Added possibility to embed connectors create and edit flyouts (elastic#58514)
  Revert "Temporarily disabling PR project mappings (elastic#59485)" (elastic#59491)
  Temporarily disabling PR project mappings (elastic#59485)
  [Endpoint] Fix alert list functional test error (elastic#59357)
  Rename status_page to statusPage (elastic#59186)
  Fix visual baseline job (elastic#59348)
  Extended AlertContextValue with metadata optional property (elastic#59391)
  ...

# Conflicts:
#	x-pack/plugins/upgrade_assistant/common/types.ts
#	x-pack/plugins/upgrade_assistant/server/lib/reindexing/reindex_actions.ts
#	x-pack/plugins/upgrade_assistant/server/lib/reindexing/reindex_service.test.ts
#	x-pack/plugins/upgrade_assistant/server/lib/reindexing/reindex_service.ts
#	x-pack/plugins/upgrade_assistant/server/routes/reindex_indices/reindex_indices.test.ts
#	x-pack/plugins/upgrade_assistant/server/routes/reindex_indices/reindex_indices.ts
tsullivan added a commit that referenced this pull request Mar 6, 2020
…, don't fail the job (#58683) (#59519)

Labels
(Deprecated) Feature:Reporting Use Reporting:Screenshot, Reporting:CSV, or Reporting:Framework instead release_note:enhancement v7.7.0 v8.0.0