Parallel NDJSON file reading #8502
Comments
This should be simpler than CSV, as NDJSON does not typically permit unescaped newline characters, so it should just be a case of finding the next newline.
I think this is a medium-difficulty task for a new contributor, as the pattern exists and there are tests (e.g. see #8505).
@alamb I wouldn't mind digging into this one if it's still open.
I just filed it and I don't know of anyone else working on it. Thanks @JacobOgle
Is it still available? I would love to take it :)
@kassoudkt feel free! I've been a bit tied up lately, so if you're free, go for it!
@tustvold @alamb In order for NDJSON to be split "correctly" (and not in the middle of a JSON object), does the FileGroupPartitioner need a new method to split on newlines? Would this be a reasonable approach? Thanks for helping out.
The way it typically works is that the split is based on file size, but the reader is set up such that one of the bounds includes the current partial row and the other does not. For example, the reader starts at the NEXT newline (with a special case for the first row) and stops when it reaches the end of a line AND the byte position exceeds the end limit. CSV (and Parquet) behave similarly. This all avoids the planner needing to perform IO, which is pretty important.
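A minimal sketch of that boundary rule (illustrative only, not the actual DataFusion code; `next_newline` is a hypothetical helper that returns the offset just past the first `\n` at or after a position, or the file length if there is none):

```rust
use std::ops::Range;

/// Adjust a size-based partition `range` of an NDJSON file so each partition
/// owns whole lines: skip the partial line at the start (unless the partition
/// starts at offset 0) and read past the nominal end to the end of that line.
fn owned_line_range(range: Range<usize>, next_newline: impl Fn(usize) -> usize) -> Range<usize> {
    // The first partition keeps the first row; every other partition skips the
    // partial line it starts in, because the previous partition reads it.
    let start = if range.start == 0 {
        0
    } else {
        next_newline(range.start)
    };
    // Read until the end of the line that straddles the nominal end offset.
    let end = next_newline(range.end);
    start..end
}
```

With this rule, adjacent partitions cover every line exactly once without the planner touching the file: partition `i` stops exactly where partition `i + 1` starts.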
I just realized that I forgot the IO part. Now I understand the approach better -- thanks for the explanation.
Indeed -- I think the relevant code that finds the next bounds is https://github.com/apache/arrow-datafusion/blob/a1e959d87a66da7060bd005b1993b824c0683a63/datafusion/core/src/datasource/physical_plan/csv.rs#L411-L450
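For illustration, a newline-finding helper along those lines could look roughly like the sketch below (hypothetical code, not the linked implementation; `get_range` signatures may differ between `object_store` versions):

```rust
use object_store::{path::Path, ObjectStore};

/// Return the offset of the first b'\n' in `[start, end)` of the object at
/// `location`, or `end` if that slice contains no newline, by fetching only
/// that byte range from the object store.
async fn find_next_newline(
    store: &dyn ObjectStore,
    location: &Path,
    start: usize,
    end: usize,
) -> object_store::Result<usize> {
    let bytes = store.get_range(location, start..end).await?;
    Ok(bytes
        .iter()
        .position(|&b| b == b'\n')
        .map(|pos| start + pos)
        .unwrap_or(end))
}
```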
@alamb thanks for the pointers. I already implemented a working solution; however, I need to do some refactoring (if my kids let me :P).
`arrow-datafusion/datafusion/core/src/datasource/physical_plan/mod.rs` sounds like a good idea to me
@alamb However, I am not sure how to properly benchmark the solution (as stated in the PR), and perhaps some more tests are needed? I am looking forward to your feedback.
I think a good test would be to find a largish JSON input file and show some benchmark reading numbers. I don't know of any existing benchmarks we have for reading large JSON files. Maybe we could add a benchmark for reading from large JSON (and CSV?) files in https://github.com/apache/arrow-datafusion/tree/main/benchmarks#datafusion-benchmarks -- something like a simple full-scan query that would measure the speed of parsing a large JSON file.
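As a hedged sketch, such a benchmark could be as small as timing a full scan through `SessionContext` (the file path `large.ndjson` and table name `json_test` are illustrative, and this is not an existing benchmark in the repository):

```rust
use std::time::Instant;

use datafusion::error::Result;
use datafusion::prelude::{NdJsonReadOptions, SessionContext};

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Register a large newline-delimited JSON file as a table.
    ctx.register_json("json_test", "large.ndjson", NdJsonReadOptions::default())
        .await?;

    // COUNT(*) forces a full scan but does little other work, so the elapsed
    // time is dominated by reading and parsing the JSON.
    let start = Instant::now();
    let _batches = ctx
        .sql("SELECT COUNT(*) FROM json_test")
        .await?
        .collect()
        .await?;
    println!("count(*) over large.ndjson took {:?}", start.elapsed());
    Ok(())
}
```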
I did some basic benchmarking. Methodology:
Results:
When applying a filter, […]. However, when simply running […]. I think this issue relates to: #6983
A good test of just the speed of the parsing might be something like `select count(*) from json_test;` or `select count(*) from json_test where a > 5;`. That will minimize most of the actual query work other than parsing and won't need to try and format / carry through the other columns.
...the updated results with different queries:
|
* added basic test
* added `fn repartitioned`
* added basic version of `FileOpener`
* refactor: extract `calculate_range`
* refactor: handle `GetResultPayload::Stream`
* refactor: extract common functions to `mod.rs`
* refactor: use common functions
* added docs
* added test
* clippy
* fix: `test_chunked_json`
* fix: sqllogictest
* delete imports
* update docs
Is your feature request related to a problem or challenge?
DataFusion can now automatically read CSV and parquet files in parallel (see #6325 for CSV)
It would be great to do the same for "NDJSON" files -- namely files that have multiple JSON objects placed one after the other.
Describe the solution you'd like
Basically, implement what is described in #6325 for JSON -- read a single large NDJSON file (newline-delimited JSON) in parallel
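As a minimal sketch of how this would be exercised once implemented, assuming it follows the CSV approach from #6325 (the `with_target_partitions` / `with_repartition_file_scans` settings are the existing knobs for repartitioned file scans; the file path is illustrative):

```rust
use datafusion::error::Result;
use datafusion::prelude::{NdJsonReadOptions, SessionConfig, SessionContext};

#[tokio::main]
async fn main() -> Result<()> {
    // Allow file scans to be split across multiple target partitions.
    let config = SessionConfig::new()
        .with_target_partitions(8)
        .with_repartition_file_scans(true);
    let ctx = SessionContext::new_with_config(config);

    // A single large NDJSON file; with this feature the scan would be divided
    // into byte ranges aligned to newlines and read by several partitions.
    ctx.register_json("t", "large.ndjson", NdJsonReadOptions::default())
        .await?;
    ctx.sql("SELECT COUNT(*) FROM t").await?.show().await?;
    Ok(())
}
```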
Describe alternatives you've considered
Some research may be required -- I am not sure if finding record boundaries is feasible
Additional context
I found this while writing tests for #8451