[ML] Improving parsing of large uploaded files #62970
Conversation
Pinging @elastic/ml-ui (:ml)
cc @droberts195
Tested and LGTM. One minor comment.
LGTM ⚡
@elasticmachine merge upstream
@elasticmachine merge upstream
💚 Build Succeeded
* [ML] Improving parsing of large uploaded files
* small clean up
* increasing max to 1GB
* adding comments

Co-authored-by: Elastic Machine <[email protected]>
* master: (29 commits)
  Add test:jest_integration npm script (elastic#62938)
  [data.search.aggs] Remove service getters from agg types (AggConfig part) (elastic#62548)
  [Discover] Fix broken setting of bucketInterval (elastic#62939)
  Disable adding conditions when in alert management context. (elastic#63514)
  [Alerting] fixes to allow pre-configured actions to be executed (elastic#63432)
  adding useMemo (elastic#63504)
  [Maps] fix double fetch when filter pill is added (elastic#63024)
  [Lens] Fix missing formatting bug in "break down by" (elastic#63288)
  [SIEM] [Cases] Removed double pasted line (elastic#63507)
  [Reporting] Improve functional test steps (elastic#63259)
  [SIEM][CASE] Tests for server's configuration API (elastic#63099)
  [SIEM] [Cases] Case container unit tests (elastic#63376)
  [ML] Improving parsing of large uploaded files (elastic#62970)
  [ML] Listing global calendars on the job management page (elastic#63124)
  [Ingest][Endpoint] Add Ingest rest api response types for use in Endpoint (elastic#63373)
  Add help text to form fields (elastic#63165)
  [ML] Converts utils Mocha tests to Jest (elastic#63132)
  [Metrics UI] Refactor With* containers to hooks (elastic#59503)
  [NP] Migrate logstash server side code to NP (elastic#63135)
  Clicking cancel in saved query save modal doesn't close it (elastic#62774)
  ...
The data from the file is no longer read in one go; instead it is read as an ArrayBuffer, with the first 5MB decoded and stored for sending to the `find_file_structure` endpoint. At the point of import, the data is chopped up into 100MB chunks for processing into ndjson docs.
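For illustration, here is a minimal sketch of that first step, reading the uploaded file as binary and decoding only a 5MB preview for the structure-analysis request. This is not the actual Kibana implementation; the function name `readFileForAnalysis` and its shape are assumptions.

```ts
const PREVIEW_SIZE = 5 * 1024 * 1024; // first 5MB sent to find_file_structure

async function readFileForAnalysis(file: File): Promise<{ data: ArrayBuffer; preview: string }> {
  // Read the whole file as binary; nothing is decoded to text up front.
  const data = await file.arrayBuffer();
  // Decode only the first 5MB for the structure-analysis endpoint.
  const preview = new TextDecoder('utf-8').decode(data.slice(0, PREVIEW_SIZE));
  return { data, preview };
}
```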
When dividing the data, there is a good chance a partial line will be left at the end of each chunk. The length of this partial line is measured and the start of the next chunk is rolled back to include it.
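A hedged sketch of that boundary handling, assuming the chunking operates on already-decoded text (the generator name and exact chunk handling are illustrative, not Kibana's code): each chunk boundary is pulled back to the last newline, so the partial line is carried into the start of the next chunk instead of being split.

```ts
const CHUNK_SIZE = 100 * 1024 * 1024; // 100MB per processing chunk

function* chunkWithLineRollback(text: string): Generator<string> {
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + CHUNK_SIZE, text.length);
    if (end < text.length) {
      // Roll the boundary back to the last newline so no line is split;
      // the trailing partial line then begins the next chunk instead.
      const lastNewline = text.lastIndexOf('\n', end);
      if (lastNewline > start) {
        end = lastNewline + 1; // keep the newline with the current chunk
      }
    }
    yield text.slice(start, end);
    start = end;
  }
}
```

If a single line were ever longer than the chunk size, the rollback would find no newline after `start` and the line would still be split; the guard simply falls back to a hard cut in that edge case.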
With this change I've managed to import a 1.43GB CSV file locally, which took 11 minutes. Because of this, I've increased the absolute max file size to 1GB.
Further improvements can be made to reduce the browser memory footprint. Currently the whole file is still stored in memory before import; it is just now broken up into parts. A better approach would be to process the file in chunks while the upload is happening, removing the need to hold and process the entire file before beginning the upload, but that will involve a larger architectural change.