Closes #8502: Parallel NDJSON file reading #8659
Conversation
Thank you for this @marvinlanhenke - I plan to review it tomorrow.
Thank you @marvinlanhenke -- this PR was a pleasure to read and well tested. 👏
The only thing I think is needed is to resolve the merge conflicts and update the test comments.
Thank you again for the contribution
----RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
------JsonExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/json_table/1.json]]}, projection=[column1]

----JsonExec: file_groups={4 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/json_table/1.json:0..18], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/json_table/1.json:18..36], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/json_table/1.json:36..54], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/json_table/1.json:54..70]]}, projection=[column1]
🎉 -- can you also update the comment in this file to reflect the fact that it now reads the file in parallel 🥳 🦜
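For anyone reproducing the plans above outside sqllogictest: the splitting is driven by the file-scan repartitioning options. A minimal sketch of enabling them programmatically; the builder names are the standard `SessionConfig` ones, and the tiny min-size value is only there so a ~70-byte test file actually gets split:

```rust
use datafusion::prelude::{SessionConfig, SessionContext};

fn main() {
    // Split file scans into up to 4 byte-range partitions, and allow
    // repartitioning even for very small files (min size is in bytes).
    let config = SessionConfig::new()
        .with_target_partitions(4)
        .with_repartition_file_scans(true)
        .with_repartition_file_min_size(10);
    let _ctx = SessionContext::new_with_config(config);
}
```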
let calculated_range = calculate_range(&file_meta, &store).await?;

let range = match calculated_range {
This is a really nice refactoring
/// Else `file_meta.range` is `Some(FileRange{start, end})`, which corresponds to the byte range [start, end) within the file.
///
/// Note: `start` or `end` might be in the middle of some lines. In such cases, the following rules
/// are applied to determine which lines to read:
👍
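A minimal in-memory sketch of that boundary rule, assuming the convention that a line straddling a range boundary is read by the partition whose range contains the line's first byte; the real `calculate_range` / `find_first_newline` helpers are async and work against the object store rather than a byte slice:

```rust
/// Hypothetical, simplified stand-in for the async range helpers in this PR:
/// snap a byte range [start, end) to newline boundaries.
fn resolve_line_range(data: &[u8], start: usize, end: usize) -> (usize, usize) {
    // If `start` falls mid-line, skip past the next newline; the partial
    // line is finished by whichever partition owns the preceding range.
    let new_start = if start == 0 {
        0
    } else {
        match data[start..].iter().position(|&b| b == b'\n') {
            Some(pos) => start + pos + 1,
            None => data.len(),
        }
    };
    // If `end` falls mid-line, extend through the end of that line so this
    // partition finishes the line it started.
    let new_end = match data[end..].iter().position(|&b| b == b'\n') {
        Some(pos) => end + pos + 1,
        None => data.len(),
    };
    (new_start, new_end)
}
```

Because adjacent ranges share `end == start` and both ends snap to the same newline, ranges like 0..18, 18..36, ... over the 70-byte test file above tile the file so every line is read exactly once.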
It could potentially help to link to the CsvOpener documentation too (which has an example)
@@ -441,4 +446,94 @@ mod tests {
            .collect::<Vec<_>>();
        assert_eq!(vec!["a: Int64", "b: Float64", "c: Boolean"], fields);
    }

    async fn count_num_partitions(ctx: &SessionContext, query: &str) -> Result<usize> {
Another potential way to figure out the file groups would be to make the physical plan and then walk it to find the `JsonExec` and its number of partitions.
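A sketch of that plan-walking approach, assuming the scan node is `NdJsonExec` (rendered as `JsonExec` in EXPLAIN output) and the `children()` / `output_partitioning()` APIs of this era, which may differ slightly across DataFusion versions:

```rust
use std::sync::Arc;
use datafusion::datasource::physical_plan::NdJsonExec;
use datafusion::error::Result;
use datafusion::physical_plan::ExecutionPlan;
use datafusion::prelude::SessionContext;

/// Walk the plan tree looking for the JSON scan and report how many
/// partitions it produces (hypothetical helper, not part of this PR).
fn json_scan_partitions(plan: &Arc<dyn ExecutionPlan>) -> Option<usize> {
    if let Some(exec) = plan.as_any().downcast_ref::<NdJsonExec>() {
        return Some(exec.output_partitioning().partition_count());
    }
    plan.children().iter().find_map(json_scan_partitions)
}

async fn count_json_scan_partitions(
    ctx: &SessionContext,
    query: &str,
) -> Result<Option<usize>> {
    let plan = ctx.sql(query).await?.create_physical_plan().await?;
    Ok(json_scan_partitions(&plan))
}
```

Compared with parsing EXPLAIN text, this keeps the test robust to formatting changes, at the cost of coupling it to the concrete exec type.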
@alamb I resolved the merge conflicts and updated the docs according to your comments.
Epic work @marvinlanhenke -- thank you very much!
... Some epic Ctrl+C/Ctrl+V combos from the CSV implementation @alamb 😂 but thanks again for the kind review. I am hoping to improve this implementation next week by reducing the number of GET requests on the object store down to one. However, the stream handling is more complicated than I wished for...
Well, I think being able to extract and reuse code is far harder than just copy/paste/modify which you could have done ;)
Yeah -- I agree -- it might need to get wrapped up into its own state machine / stream impl 🤔
Which issue does this PR close?
Closes #8502.
Rationale for this change
As stated in the issue:
What changes are included in this PR?
Basically the same approach as in #6325. However, I refactored and extracted some of the common functions, like `find_first_newline` (which is used by `calculate_range`), so they can be used by both the CSV and JSON implementations.
Are these changes tested?
Yes, basic tests are included. However, I was not sure about benchmarking the changes (since `benchmarks/bench.sh` does not provide a JSON dataset?).
Are there any user-facing changes?
No.