Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Improving parsing of large uploaded files #62970

Conversation

jgowdyelastic
Copy link
Member

@jgowdyelastic jgowdyelastic commented Apr 8, 2020

The data from the file is now no longer read in one go, instead it is read as an ArrayBuffer with the first 5MBs decoded and stored for sending to the find_file_structure endpoint.
At the point of import, the data is chopped up into 100MB chunks for processing into ndjson docs.
When dividing the data, there is a good chance a partial line will be left at the end of each chunk. The length of this partial line is measured and the start of the next chunk is rolled back to include it.

With this change I've managed to import a 1.43GB CSV file locally which took 11mins.
Because of this, I've increased the absolute max file size to be 1GB.

Further improvements can be made to reduce the browser memory footprint. Currently the whole file is still stored in memory before import, only now it's broken up into parts.
A better way would be to only process the file in chunks while the uploading is happening, removing the need to process the file entirely before beginning the upload. but that will involve a larger architectural change.

Checklist

Delete any items that are not applicable to this PR.

For maintainers

@elasticmachine
Copy link
Contributor

Pinging @elastic/ml-ui (:ml)

@jgowdyelastic jgowdyelastic marked this pull request as ready for review April 8, 2020 17:03
@jgowdyelastic jgowdyelastic requested a review from a team as a code owner April 8, 2020 17:03
@jgowdyelastic
Copy link
Member Author

cc @droberts195

Copy link
Contributor

@peteharverson peteharverson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested and LGTM. One minor comment.

x-pack/plugins/ml/common/constants/file_datavisualizer.ts Outdated Show resolved Hide resolved
Copy link
Contributor

@alvarezmelissa87 alvarezmelissa87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM ⚡

@jgowdyelastic
Copy link
Member Author

@elasticmachine merge upstream

@jgowdyelastic
Copy link
Member Author

@elasticmachine merge upstream

@kibanamachine
Copy link
Contributor

💚 Build Succeeded

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@jgowdyelastic jgowdyelastic merged commit 2b4c300 into elastic:master Apr 14, 2020
jgowdyelastic added a commit to jgowdyelastic/kibana that referenced this pull request Apr 14, 2020
* [ML] Improving parsing of large uploaded files

* small clean up

* increasing max to 1GB

* adding comments

Co-authored-by: Elastic Machine <[email protected]>
@jgowdyelastic jgowdyelastic deleted the improving-parsing-of-large-uploaded-files branch April 14, 2020 17:59
jgowdyelastic added a commit that referenced this pull request Apr 15, 2020
* [ML] Improving parsing of large uploaded files

* small clean up

* increasing max to 1GB

* adding comments

Co-authored-by: Elastic Machine <[email protected]>

Co-authored-by: Elastic Machine <[email protected]>
gmmorris added a commit to gmmorris/kibana that referenced this pull request Apr 15, 2020
* master: (29 commits)
  Add test:jest_integration npm script (elastic#62938)
  [data.search.aggs] Remove service getters from agg types (AggConfig part) (elastic#62548)
  [Discover] Fix broken setting of bucketInterval (elastic#62939)
  Disable adding conditions when in alert management context. (elastic#63514)
  [Alerting] fixes to allow pre-configured actions to be executed (elastic#63432)
  adding useMemo (elastic#63504)
  [Maps] fix double fetch when filter pill is added (elastic#63024)
  [Lens] Fix missing formatting bug in "break down by" (elastic#63288)
  [SIEM] [Cases] Removed double pasted line (elastic#63507)
  [Reporting] Improve functional test steps (elastic#63259)
  [SIEM][CASE] Tests for server's configuration API (elastic#63099)
  [SIEM] [Cases] Case container unit tests (elastic#63376)
  [ML] Improving parsing of large uploaded files (elastic#62970)
  [ML] Listing global calendars on the job management page (elastic#63124)
  [Ingest][Endpoint] Add Ingest rest api response types for use in Endpoint (elastic#63373)
  Add help text to form fields (elastic#63165)
  [ML] Converts utils Mocha tests to Jest (elastic#63132)
  [Metrics UI] Refactor With* containers to hooks (elastic#59503)
  [NP] Migrate logstash server side code to NP (elastic#63135)
  Clicking cancel in saved query save modal doesn't close it (elastic#62774)
  ...
wayneseymour pushed a commit that referenced this pull request Apr 15, 2020
* [ML] Improving parsing of large uploaded files

* small clean up

* increasing max to 1GB

* adding comments

Co-authored-by: Elastic Machine <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants