Parallel NDJSON file reading #8502
Comments
This should be simpler than CSV, as NDJSON does not typically permit unescaped newline characters, so it should just be a case of finding the next newline.
I think this is a medium-difficulty task for a new contributor, as the pattern exists and there are tests (e.g. see #8505).
@alamb I wouldn't mind digging into this one if it's still open.
I just filed it and I don't know of anyone else working on it. Thanks @JacobOgle
Is it still available? I would love to take it :)
@kassoudkt feel free! I've been a bit tied up lately, so if you're free, go for it!
@tustvold @alamb In order for NDJSON to be split "correctly" (and not in the middle of a JSON object), does the FileGroupPartitioner need a new method to split on newlines? Would this be a reasonable approach? Thanks for helping out.
The way it typically works is that the split is based on file size, but the reader is set up such that one of the bounds includes the current partial row and the other does not. For example, the reader starts at the NEXT newline (with a special case for the first row) and stops when it reaches the end of a line AND the byte position exceeds the end limit. CSV (and Parquet) behave similarly. This all avoids the planner needing to perform IO, which is pretty important.
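A minimal sketch of that boundary rule (illustrative only, not the actual DataFusion code; `next_newline` is a hypothetical helper that returns the offset just past the first `\n` at or after a position, or the file length if there is none):

```rust
use std::ops::Range;

/// Adjust a size-based partition `range` of an NDJSON file so each partition
/// owns whole lines: skip the partial line at the start (unless the partition
/// starts at offset 0) and read past the nominal end to the end of that line.
fn owned_line_range(range: Range<usize>, next_newline: impl Fn(usize) -> usize) -> Range<usize> {
    // The first partition keeps the first row; every other partition skips the
    // partial line it starts in, because the previous partition reads it.
    let start = if range.start == 0 {
        0
    } else {
        next_newline(range.start)
    };
    // Read until the end of the line that straddles the nominal end offset.
    let end = next_newline(range.end);
    start..end
}
```

With this rule, adjacent partitions cover every line exactly once without the planner touching the file: partition `i` stops exactly where partition `i + 1` starts.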
I just realized that I forgot the IO part. Now I understand the approach better -- thanks for the explanation.
Indeed -- I think the relevant code that finds the next bounds is https://github.com/apache/arrow-datafusion/blob/a1e959d87a66da7060bd005b1993b824c0683a63/datafusion/core/src/datasource/physical_plan/csv.rs#L411-L450
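For illustration, a newline-finding helper along those lines could look roughly like the sketch below (hypothetical code, not the linked implementation; `get_range` signatures may differ between `object_store` versions):

```rust
use object_store::{path::Path, ObjectStore};

/// Return the offset of the first b'\n' in `[start, end)` of the object at
/// `location`, or `end` if that slice contains no newline, by fetching only
/// that byte range from the object store.
async fn find_next_newline(
    store: &dyn ObjectStore,
    location: &Path,
    start: usize,
    end: usize,
) -> object_store::Result<usize> {
    let bytes = store.get_range(location, start..end).await?;
    Ok(bytes
        .iter()
        .position(|&b| b == b'\n')
        .map(|pos| start + pos)
        .unwrap_or(end))
}
```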
@alamb thanks for the pointers. I already implemented a working solution; however, I need to do some refactoring (if my kids let me :P).
`arrow-datafusion/datafusion/core/src/datasource/physical_plan/mod.rs` sounds like a good idea to me
@alamb However, I am not sure how to properly benchmark the solution (as stated in the PR), and perhaps some more tests are needed? I am looking forward to your feedback.
I think a good test would be to find a largish JSON input file and show some benchmark reading numbers. I don't know of any existing benchmarks we have for reading large JSON files. Maybe we could add a benchmark for reading from large JSON (and CSV?) files in https://github.com/apache/arrow-datafusion/tree/main/benchmarks#datafusion-benchmarks -- something like a simple full-scan query that would measure the speed of parsing a large JSON file.
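As a hedged sketch, such a benchmark could be as small as timing a full scan through `SessionContext` (the file path `large.ndjson` and table name `json_test` are illustrative, and this is not an existing benchmark in the repository):

```rust
use std::time::Instant;

use datafusion::error::Result;
use datafusion::prelude::{NdJsonReadOptions, SessionContext};

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Register a large newline-delimited JSON file as a table.
    ctx.register_json("json_test", "large.ndjson", NdJsonReadOptions::default())
        .await?;

    // COUNT(*) forces a full scan but does little other work, so the elapsed
    // time is dominated by reading and parsing the JSON.
    let start = Instant::now();
    let _batches = ctx
        .sql("SELECT COUNT(*) FROM json_test")
        .await?
        .collect()
        .await?;
    println!("count(*) over large.ndjson took {:?}", start.elapsed());
    Ok(())
}
```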
I did some basic benchmarking. Methodology:
Results:
When applying a filter, […]. However, when simply running […]. I think this issue relates to: #6983
A good test of just the speed of the parsing might be something like `select count(*) from json_test;` or `select count(*) from json_test where a > 5;`. That will minimize most of the actual query work other than parsing and won't need to try and format / carry through the other columns.
...the updated results with different queries:
|
* added basic test
* added `fn repartitioned`
* added basic version of `FileOpener`
* refactor: extract `calculate_range`
* refactor: handle `GetResultPayload::Stream`
* refactor: extract common functions to `mod.rs`
* refactor: use common functions
* added docs
* added test
* clippy
* fix: `test_chunked_json`
* fix: sqllogictest
* delete imports
* update docs
Is your feature request related to a problem or challenge?
DataFusion can now automatically read CSV and parquet files in parallel (see #6325 for CSV)
It would be great to do the same for "NDJSON" files -- namely files that have multiple JSON objects placed one after the other.
Describe the solution you'd like
Basically, implement what is described in #6325 for JSON -- read a single large NDJSON file (newline-delimited JSON) in parallel
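As a minimal sketch of how this would be exercised once implemented, assuming it follows the CSV approach from #6325 (the `with_target_partitions` / `with_repartition_file_scans` settings are the existing knobs for repartitioned file scans; the file path is illustrative):

```rust
use datafusion::error::Result;
use datafusion::prelude::{NdJsonReadOptions, SessionConfig, SessionContext};

#[tokio::main]
async fn main() -> Result<()> {
    // Allow file scans to be split across multiple target partitions.
    let config = SessionConfig::new()
        .with_target_partitions(8)
        .with_repartition_file_scans(true);
    let ctx = SessionContext::new_with_config(config);

    // A single large NDJSON file; with this feature the scan would be divided
    // into byte ranges aligned to newlines and read by several partitions.
    ctx.register_json("t", "large.ndjson", NdJsonReadOptions::default())
        .await?;
    ctx.sql("SELECT COUNT(*) FROM t").await?.show().await?;
    Ok(())
}
```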
Describe alternatives you've considered
Some research may be required -- I am not sure if finding record boundaries is feasible
Additional context
I found this while writing tests for #8451