
Closes #8502: Parallel NDJSON file reading #8659

Merged: 15 commits into apache:main, Dec 31, 2023

Conversation

marvinlanhenke (Contributor)

Which issue does this PR close?

Closes #8502.

Rationale for this change

As stated in the issue:

DataFusion can now automatically read CSV and parquet files in parallel (see #6325 for CSV)
It would be great to do the same for "NDJSON" files -- namely files that have multiple JSON objects placed one after the other.

What changes are included in this PR?

Basically the same approach as in #6325.
However, I refactored and extracted some of the common functions, such as find_first_newline (used by calculate_range), so they can be shared by both the CSV and JSON implementations.
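
For context, here is a minimal synchronous sketch of what such shared helpers can look like. The names find_first_newline and calculate_range come from the PR, but the bodies below are illustrative only; the merged code performs these steps asynchronously against the object store.

/// Sketch: offset of the byte just past the first '\n' at or after `start`,
/// or `None` if the buffer contains no further newline.
fn find_first_newline(buf: &[u8], start: usize) -> Option<usize> {
    buf[start..].iter().position(|&b| b == b'\n').map(|i| start + i + 1)
}

/// Sketch: adjust an assigned byte range so it covers whole lines. A
/// partition that does not begin at byte 0 skips its leading partial line
/// (the previous partition reads it), and every partition reads through
/// the newline that terminates the line in progress at its assigned end.
fn calculate_range(buf: &[u8], assigned: std::ops::Range<usize>) -> std::ops::Range<usize> {
    let start = if assigned.start == 0 {
        0
    } else {
        find_first_newline(buf, assigned.start).unwrap_or(buf.len())
    };
    let end = find_first_newline(buf, assigned.end).unwrap_or(buf.len());
    start..end
}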

Are these changes tested?

Yes, basic tests are included.
However, I was not sure how to benchmark the changes, since benchmarks/bench.sh does not provide a JSON dataset.

Are there any user-facing changes?

No.

github-actions bot added the core (Core DataFusion crate) and sqllogictest (SQL Logic Tests (.slt)) labels on Dec 26, 2023
alamb (Contributor) commented Dec 28, 2023

Thank you for this @marvinlanhenke - I plan to review it tomorrow.

cc @devinjdangelo

marvinlanhenke (Contributor, PR author)

...just as a side note, and as originally stated in #6801 and #6922: this implementation is suboptimal, since it performs multiple get operations on the object store. This will be handled in a follow-up issue/PR, which I will most likely tackle next week.
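
To make the concern concrete, here is a loose sketch of the pattern being described. probe_boundary is a hypothetical helper; object_store's get_range is a real method, though its exact signature varies across crate versions.

use object_store::{path::Path, ObjectStore};

// Loose sketch of the cost: locating a line boundary with its own ranged
// GET means each partition can issue several requests per file. The
// follow-up idea is to collapse these into a single GET whose byte stream
// is split on newlines as it arrives.
async fn probe_boundary(
    store: &dyn ObjectStore,
    location: &Path,
    start: usize,
) -> object_store::Result<usize> {
    // One ranged request just to find the first '\n' at or after `start`
    // (a real implementation must clamp the range to the object's size).
    let chunk = store.get_range(location, start..start + 4096).await?;
    let newline = chunk.iter().position(|&b| b == b'\n');
    Ok(start + newline.map(|i| i + 1).unwrap_or(chunk.len()))
}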

alamb (Contributor) left a review:

Thank you @marvinlanhenke -- this PR was a pleasure to read and well tested. 👏

The only thing I think is needed is to resolve the merge conflicts and update the test comments.

Thank you again for the contribution

----RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
------JsonExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/json_table/1.json]]}, projection=[column1]

----JsonExec: file_groups={4 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/json_table/1.json:0..18], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/json_table/1.json:18..36], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/json_table/1.json:36..54], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/json_table/1.json:54..70]]}, projection=[column1]
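
The interesting part is the second plan: instead of a single-partition scan followed by RoundRobinBatch repartitioning, the 70-byte file is split into four contiguous byte ranges up front. The split itself is plain ceiling division of the file size, roughly like this sketch (not the merged code):

use std::ops::Range;

/// Sketch: split `file_size` bytes into `n` contiguous ranges by ceiling
/// division, e.g. 70 bytes over 4 partitions gives 0..18, 18..36, 36..54,
/// 54..70; ranges are re-aligned to line boundaries at read time.
fn split_ranges(file_size: usize, n: usize) -> Vec<Range<usize>> {
    let chunk = (file_size + n - 1) / n;
    (0..n)
        .map(|i| (i * chunk).min(file_size)..((i + 1) * chunk).min(file_size))
        .collect()
}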
alamb (Contributor):
🎉 -- can you also update the comment in this file to reflect the fact that it now reads the file in parallel 🥳 🦜


let calculated_range = calculate_range(&file_meta, &store).await?;

let range = match calculated_range {
alamb (Contributor):
This is a really nice refactoring

/// Else `file_meta.range` is `Some(FileRange{start, end})`, which corresponds to the byte range [start, end) within the file.
///
/// Note: `start` or `end` might be in the middle of some lines. In such cases, the following rules
/// are applied to determine which lines to read:
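
(To make those rules concrete with the 70-byte file above, under the convention sketched earlier: the partition assigned bytes 18..36 skips forward from byte 18 to the first byte past the next newline, and reads beyond byte 36 through the newline that ends the line in progress there, so every line is read by exactly one partition.)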
alamb (Contributor):
👍

alamb (Contributor):
It could potentially help to link to the CsvOpener documentation too (which has an example)

@@ -441,4 +446,94 @@ mod tests {
.collect::<Vec<_>>();
assert_eq!(vec!["a: Int64", "b: Float64", "c: Boolean"], fields);
}

async fn count_num_partitions(ctx: &SessionContext, query: &str) -> Result<usize> {
alamb (Contributor):
Another potential way to figure out the file groups would be to make the physical plan and then walk it to find the JsonExec and its number of partitions
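
For reference, a sketch of that alternative, assuming the DataFusion APIs of this era (NdJsonExec is the operator that displays as JsonExec; the helper names here are made up):

use std::sync::Arc;
use datafusion::datasource::physical_plan::NdJsonExec;
use datafusion::error::Result;
use datafusion::physical_plan::ExecutionPlan;
use datafusion::prelude::SessionContext;

/// Sketch: build the physical plan and walk it for the JSON scan's
/// partition count, instead of parsing EXPLAIN output.
async fn json_scan_partitions(ctx: &SessionContext, query: &str) -> Result<Option<usize>> {
    let plan = ctx.sql(query).await?.create_physical_plan().await?;
    Ok(find_json_exec(&plan))
}

fn find_json_exec(plan: &Arc<dyn ExecutionPlan>) -> Option<usize> {
    if let Some(exec) = plan.as_any().downcast_ref::<NdJsonExec>() {
        return Some(exec.output_partitioning().partition_count());
    }
    plan.children().into_iter().find_map(|child| find_json_exec(&child))
}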

marvinlanhenke (Contributor, PR author)

@alamb Thank you so much for the review.

I resolved the merge conflicts and updated the docs according to your comments.

alamb (Contributor) commented Dec 31, 2023

Epic work @marvinlanhenke -- thank you very much!

alamb merged commit 03bd9b4 into apache:main on Dec 31, 2023
22 checks passed
marvinlanhenke (Contributor, PR author)

... Some epic ctrl+c ctrl+v combos from the CSV implementation @alamb 😂 but thanks again for the kind review.

I am hoping to improve this implementation next week by reducing the number of GET requests to the object store down to one. However, the stream handling is more complicated than I had hoped...

alamb (Contributor) commented Jan 1, 2024

... Some epic ctrl+c ctrl+v combos from the CSV implementation @alamb 😂 but thanks again for the kind review.

Well, I think being able to extract and reuse code is far harder than just copy/paste/modify, which you could have done ;)

However, the stream handling is more complicated than I had hoped...

Yeah -- I agree -- it might need to get wrapped up into its own state machine / stream impl 🤔
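
For the curious, here is the rough shape such a stream impl might take (LineAligned is a hypothetical type, not anything in DataFusion): buffer raw chunks from the single GET and only hand line-aligned byte runs onward.

use bytes::{Bytes, BytesMut};

// Hypothetical sketch of the state the stream impl would carry: raw bytes
// from the single GET accumulate here, and only line-aligned runs are
// emitted; a trailing partial line waits for the next chunk (or EOF).
struct LineAligned {
    buf: BytesMut,
}

impl LineAligned {
    fn new() -> Self {
        Self { buf: BytesMut::new() }
    }

    /// Push one raw chunk; return everything up to and including the last
    /// '\n' buffered so far, or None if no complete line is available yet.
    fn push(&mut self, chunk: &[u8]) -> Option<Bytes> {
        self.buf.extend_from_slice(chunk);
        let cut = self.buf.iter().rposition(|&b| b == b'\n')? + 1;
        Some(self.buf.split_to(cut).freeze())
    }

    /// At end of stream, whatever remains is the final unterminated line.
    fn finish(self) -> Option<Bytes> {
        (!self.buf.is_empty()).then(|| self.buf.freeze())
    }
}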
