[ML] Improving parsing of large uploaded files #62970

jgowdyelastic · 2020-04-08T16:08:01Z

The file is no longer read in one go. Instead, it is read as an ArrayBuffer, and only the first 5MB is decoded and stored for sending to the find_file_structure endpoint.
At the point of import, the data is split into 100MB chunks, which are processed into ndjson docs.
When dividing the data, there is a good chance a partial line will be left at the end of each chunk. The length of this partial line is measured, and the start of the next chunk is rolled back to include it.
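
For readers who want to see the chunk-boundary idea in isolation, here is a minimal sketch. It is not the code from this PR: `splitOnLineBoundaries`, `CHUNK_SIZE` and `NEWLINE` are hypothetical names, and the behaviour for a single line longer than one chunk (a hard cut) is an assumption made only to keep the example terminating.

```typescript
// Hypothetical constants; 100MB mirrors the chunk size mentioned in the description.
const CHUNK_SIZE = 100 * 1024 * 1024;
const NEWLINE = 0x0a; // byte value of '\n'

// Splits raw file bytes into chunks of at most `chunkSize` bytes, ending each
// chunk (except possibly the last) just after a newline so no line is split.
function splitOnLineBoundaries(data: Uint8Array, chunkSize: number = CHUNK_SIZE): Uint8Array[] {
  if (chunkSize <= 0) {
    throw new Error('chunkSize must be positive');
  }
  const chunks: Uint8Array[] = [];
  let start = 0;
  while (start < data.length) {
    let end = Math.min(start + chunkSize, data.length);
    if (end < data.length) {
      // Search backwards from the tentative end for the last newline.
      const lastNewline = data.lastIndexOf(NEWLINE, end - 1);
      if (lastNewline >= start) {
        // Roll the boundary back so the partial line goes to the next chunk.
        end = lastNewline + 1;
      }
      // Assumption: if the chunk contains no newline at all, keep the hard cut.
    }
    // subarray creates a view, so no bytes are copied.
    chunks.push(data.subarray(start, end));
    start = end;
  }
  return chunks;
}
```

Cutting on the newline byte is safe before decoding because `0x0a` never occurs inside a multi-byte UTF-8 sequence, so each chunk can be decoded on its own.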

With this change I managed to import a 1.43GB CSV file locally, which took 11 minutes.
Because of this, I've increased the absolute max file size to 1GB.

Further improvements can be made to reduce the browser memory footprint. Currently the whole file is still held in memory before import; it is only broken up into parts.
A better way would be to process the file in chunks while the upload is happening, removing the need to process the whole file before beginning the upload, but that would involve a larger architectural change.
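
As a rough illustration of that future direction (not something this PR implements), the sketch below reads one slice of the file at a time and hands each line-aligned chunk to a callback before reading the next. `SliceableSource`, `forEachLineAlignedChunk` and `onChunk` are hypothetical names; the interface is a structural stand-in that the browser's `File` and `Blob` objects satisfy through `size`, `slice()` and `arrayBuffer()`.

```typescript
// Minimal structural type; browser File and Blob objects match it.
interface SliceableSource {
  readonly size: number;
  slice(start?: number, end?: number): { arrayBuffer(): Promise<ArrayBuffer> };
}

const NEWLINE_BYTE = 0x0a; // byte value of '\n'

// Hypothetical helper: reads `source` one slice at a time and passes each
// line-aligned chunk to `onChunk`, so only one chunk is held in memory.
async function forEachLineAlignedChunk(
  source: SliceableSource,
  chunkSize: number,
  onChunk: (chunk: Uint8Array) => Promise<void> | void
): Promise<void> {
  if (chunkSize <= 0) {
    throw new Error('chunkSize must be positive');
  }
  let start = 0;
  while (start < source.size) {
    const end = Math.min(start + chunkSize, source.size);
    const bytes = new Uint8Array(await source.slice(start, end).arrayBuffer());
    if (bytes.length === 0) {
      break; // defensive: avoid looping forever on an empty read
    }
    let usable = bytes.length;
    if (end < source.size) {
      const lastNewline = bytes.lastIndexOf(NEWLINE_BYTE);
      if (lastNewline !== -1) {
        // Leave the partial line for the next read by not consuming it.
        usable = lastNewline + 1;
      }
    }
    // In a real import, onChunk might parse and upload the chunk (assumption).
    await onChunk(bytes.subarray(0, usable));
    start += usable;
  }
}
```

Because the next read starts where the last complete line ended, the partial line is re-read rather than carried over in a buffer, which keeps the loop simple at the cost of re-reading a few bytes per chunk.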

Checklist

Delete any items that are not applicable to this PR.

Any text added follows EUI's writing guidelines, uses sentence case text and includes i18n support
Documentation was added for features that require explanation or tutorials
Unit or functional tests were updated or added to match the most common scenarios
This was checked for keyboard-only and screenreader accessibility
This renders correctly on smaller devices using a responsive layout. (You can test this in your browser)
This was checked for cross-browser compatibility, including a check against IE11

For maintainers

This was checked for breaking API changes and was labeled appropriately

elasticmachine · 2020-04-08T16:43:43Z

Pinging @elastic/ml-ui (:ml)

jgowdyelastic · 2020-04-08T17:03:30Z

cc @droberts195

peteharverson

Tested and LGTM. One minor comment.

x-pack/plugins/ml/common/constants/file_datavisualizer.ts

alvarezmelissa87

LGTM ⚡

jgowdyelastic · 2020-04-14T12:02:39Z

@elasticmachine merge upstream

jgowdyelastic · 2020-04-14T15:22:11Z

@elasticmachine merge upstream

kibanamachine · 2020-04-14T17:10:55Z

💚 Build Succeeded

continuous-integration/kibana-ci/pull-request
Commit: c93a51c

History

💚 Build #39852 succeeded 8781a69
💚 Build #39655 succeeded 46505bd
💚 Build #39639 succeeded 8dbd6c6

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

* [ML] Improving parsing of large uploaded files * small clean up * increasing max to 1GB * adding comments Co-authored-by: Elastic Machine <[email protected]>

jgowdyelastic added 3 commits April 8, 2020 16:44

[ML] Improving parsing of large uploaded files

a89359f

small clean up

8dbd6c6

increasing max to 1GB

46505bd

jgowdyelastic requested review from walterra, peteharverson, darnautov and alvarezmelissa87 April 8, 2020 16:43

jgowdyelastic self-assigned this Apr 8, 2020

jgowdyelastic added the :ml label Apr 8, 2020

jgowdyelastic added the Feature:File and Index Data Viz, release_note:enhancement, review, v7.8.0 and v8.0.0 labels Apr 8, 2020

jgowdyelastic marked this pull request as ready for review April 8, 2020 17:03

jgowdyelastic requested a review from a team as a code owner April 8, 2020 17:03

peteharverson approved these changes Apr 9, 2020

x-pack/plugins/ml/common/constants/file_datavisualizer.ts (outdated, resolved)

adding comments

8781a69

alvarezmelissa87 approved these changes Apr 9, 2020

Merge branch 'master' into improving-parsing-of-large-uploaded-files

ce9c360

Merge branch 'master' into improving-parsing-of-large-uploaded-files

c93a51c

jgowdyelastic merged commit 2b4c300 into elastic:master Apr 14, 2020

jgowdyelastic mentioned this pull request Apr 14, 2020

[7.x] [ML] Improving parsing of large uploaded files (#62970) #63500

Merged

jgowdyelastic deleted the improving-parsing-of-large-uploaded-files branch April 14, 2020 17:59

jgowdyelastic mentioned this pull request Apr 14, 2020

[ML] Changing file data visualizer max upload setting to string #63502

Merged

1 task

jgowdyelastic mentioned this pull request Apr 21, 2020

[DOCS] Add file size setting for Data Visualizer #64006

Merged

2 tasks
