-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support newlines_in_values
CSV option
#11533
Conversation
This significantly simplifies the UX when dealing with large CSV files that must support newlines in (quoted) values. By default, large CSV files will be repartitioned into multiple parallel range scans. This is great for performance in the common case but when large CSVs contain newlines in values the parallel scan will fail due to splitting on newlines within quotes rather than actual line terminators. With the current implementation, this behaviour can be controlled by the session-level `datafusion.optimizer.repartition_file_scans` and `datafusion.optimizer.repartition_file_min_size` settings. This commit introduces a `newlines_in_values` option to `CsvOptions` and plumbs it through to `CsvExec`, which includes it in the test for whether parallel execution is supported. This provides a convenient and searchable way to disable file scan repartitioning on a per-CSV basis. BREAKING CHANGE: This adds new public fields to types with all public fields, which is a breaking change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @connec -- this PR looks very nice to me (and a very nice first contribution🏅)
I think the only thing missing is an end to end test showing that with this change DataFusion can actually read CSV files with newlines in them
Here are the instructions for such end to end tets https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest
Maybe you could add an example data file in https://github.com/apache/datafusion/tree/main/datafusion/core/tests/data
And the write a test that defined the external table with the appropriate options and read from it
Ideally you should be able to follow the model of one of the existing test files, perhaps
datafusion/datafusion/sqllogictest/test_files/csv_files.slt
Lines 18 to 27 in 382bf4f
# create_external_table_with_quote_escape | |
statement ok | |
CREATE EXTERNAL TABLE csv_with_quote ( | |
c1 VARCHAR, | |
c2 VARCHAR | |
) STORED AS CSV | |
LOCATION '../core/tests/data/quote.csv' | |
OPTIONS ('format.quote' '~', | |
'format.delimiter' ',', | |
'format.has_header' 'true'); |
cc @2010YOUY01
datafusion/common/src/config.rs
Outdated
@@ -1665,6 +1670,14 @@ impl CsvOptions { | |||
self | |||
} | |||
|
|||
/// Set true to ensure that newlines in (quoted) values are supported. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Set true to ensure that newlines in (quoted) values are supported. | |
/// Set true to support newlines in (quoted) values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The phrasing was a bit awkward initially as I was thinking of the flag as "ensuring support", rather than directly "supporting", since newlines in values are already supported if the file is below the datafusion.optimizer.repartition_file_min_size
.
I'm not sure if it's worth conveying that detail through these docs, or else to document this as providing support and treat the fact that newlines in values will "just work" for smaller files as an implementation detail that might change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've expanded and normalised the documentation. I've gone for documenting the flag as enabling support for newlines in values, with the fact that newlines in values might work without it left as an implementation-defined detail.
@@ -233,6 +233,14 @@ impl CsvFormat { | |||
self | |||
} | |||
|
|||
/// Set true to ensure that newlines in (quoted) values are supported. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Set true to ensure that newlines in (quoted) values are supported. | |
/// Set true to support newlines in (quoted) values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the new feature support! It is a great UX improvement given 'newlines in quotes' is defined in CSV standard
I also find this PR's design is good and very easy to read, I think it's good to go with extra integration tests for CSVs with newlines_in_values
fields as @alamb mentioned.
datafusion/common/src/config.rs
Outdated
/// Default value for `format.newlines_in_values` for `CREATE EXTERNAL TABLE` | ||
// if not specified explicitly in the statement. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need more descriptive comment help for users without context to understand this config option, which should include 1. It's CSV specific 2. Can be overridden by the same config field in CsvOptions
3. Its behavior as stated in CsvOptions
's comment.
And update in https://github.com/apache/datafusion/blob/main/docs/source/user-guide/configs.md
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've expanded the documentation. I forgot about the user guide so I'll do that now as well.
I'm trying to provide more context for the rationale for this PR's implementation: Currently, DataFusion's parallel CSV scan will first split the file range evenly, and adjust the range border by probing So this PR's approach (explicitly adding one |
Thanks both for the rapid feedback. I'll take a look and update the PR. |
I've made some documentation updates and added/fixed sqllogictests. There's the new clippy issue around having too many arguments for |
I think we should suppress the clippy error for now in this PR and we can file a ticket to improve things as a follow on PR I would personally suggest a builder pattern like let csv_exec = CsvExec::builder()
.with_has_header(true)
.with_newlines_in_values(true)
...
.build()? Following the model of https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/parquet/struct.ParquetExecBuilder.html |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @connec -- this looks good to me 👍
I think once we have the CI passing this PR is good to go from my perspective.
@@ -0,0 +1,13 @@ | |||
id,message | |||
1,"hello |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
I think that should be the fixes needed for CI. |
I filed #11565 to track improving the clippy lint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again @connec and @2010YOUY01 👌
🤔 the slt test seems to be failing on windows (likely due to newline nonsense). https://github.com/apache/datafusion/actions/runs/10014892577/job/27695257736 We should probably figure out a way to not run that test on windows. Gah! Is this something you have time to look at @connec ? If not, I think we could comment out the slt test to get this PR in and then sort out the testing as a follow on PR. Note to reduce the feedback loop cycle time, I merged this branch to main and pushed the commit (which I think means the CI should automatically run now on new commits). Once we have merged this PR I think your subsequent PRs will also have the CI run automatically. |
I'd guess it might be related to there being a Edit: ran out of time today, but I can continue tomorrow. |
This is a bit of a stab in the dark, but it might fix multiline tests on Windows.
Unfortunately it seems like the workflows still need approval @alamb.
|
The default git behaviour of converting line endings for checked out files causes the `csv_files.slt` test to fail when testing `newlines_in_values`. This appears to be due to the quoted newlines being converted to CRLF, which are not then normalised when the CSV is read. Assuming that the sqllogictests do normalise line endings in the expected output, this could then lead to a "spurious" diff from the actual output.
The issue is caused by the line endings in the For now, I've added a |
🚀 |
* feat!: support `newlines_in_values` CSV option This significantly simplifies the UX when dealing with large CSV files that must support newlines in (quoted) values. By default, large CSV files will be repartitioned into multiple parallel range scans. This is great for performance in the common case but when large CSVs contain newlines in values the parallel scan will fail due to splitting on newlines within quotes rather than actual line terminators. With the current implementation, this behaviour can be controlled by the session-level `datafusion.optimizer.repartition_file_scans` and `datafusion.optimizer.repartition_file_min_size` settings. This commit introduces a `newlines_in_values` option to `CsvOptions` and plumbs it through to `CsvExec`, which includes it in the test for whether parallel execution is supported. This provides a convenient and searchable way to disable file scan repartitioning on a per-CSV basis. BREAKING CHANGE: This adds new public fields to types with all public fields, which is a breaking change. * docs: normalise `newlines_in_values` documentation * test: add/fix sqllogictests for `newlines_in_values` * docs: document `datafusion.catalog.newlines_in_values` * fix: typo in config.md * chore: suppress lint on too many arguments for `CsvExec::new` * fix: always checkout `*.slt` with LF line endings This is a bit of a stab in the dark, but it might fix multiline tests on Windows. * fix: always checkout `newlines_in_values.csv` with `LF` line endings The default git behaviour of converting line endings for checked out files causes the `csv_files.slt` test to fail when testing `newlines_in_values`. This appears to be due to the quoted newlines being converted to CRLF, which are not then normalised when the CSV is read. Assuming that the sqllogictests do normalise line endings in the expected output, this could then lead to a "spurious" diff from the actual output. --------- Co-authored-by: Andrew Lamb <[email protected]>
* feat!: support `newlines_in_values` CSV option This significantly simplifies the UX when dealing with large CSV files that must support newlines in (quoted) values. By default, large CSV files will be repartitioned into multiple parallel range scans. This is great for performance in the common case but when large CSVs contain newlines in values the parallel scan will fail due to splitting on newlines within quotes rather than actual line terminators. With the current implementation, this behaviour can be controlled by the session-level `datafusion.optimizer.repartition_file_scans` and `datafusion.optimizer.repartition_file_min_size` settings. This commit introduces a `newlines_in_values` option to `CsvOptions` and plumbs it through to `CsvExec`, which includes it in the test for whether parallel execution is supported. This provides a convenient and searchable way to disable file scan repartitioning on a per-CSV basis. BREAKING CHANGE: This adds new public fields to types with all public fields, which is a breaking change. * docs: normalise `newlines_in_values` documentation * test: add/fix sqllogictests for `newlines_in_values` * docs: document `datafusion.catalog.newlines_in_values` * fix: typo in config.md * chore: suppress lint on too many arguments for `CsvExec::new` * fix: always checkout `*.slt` with LF line endings This is a bit of a stab in the dark, but it might fix multiline tests on Windows. * fix: always checkout `newlines_in_values.csv` with `LF` line endings The default git behaviour of converting line endings for checked out files causes the `csv_files.slt` test to fail when testing `newlines_in_values`. This appears to be due to the quoted newlines being converted to CRLF, which are not then normalised when the CSV is read. Assuming that the sqllogictests do normalise line endings in the expected output, this could then lead to a "spurious" diff from the actual output. --------- Co-authored-by: Andrew Lamb <[email protected]>
Which issue does this PR close?
Closes #11472.
Rationale for this change
This significantly simplifies the UX when dealing with large CSV files that must support newlines in (quoted) values. By default, large CSV files will be repartitioned into multiple parallel range scans. This is great for performance in the common case but when large CSVs contain newlines in values the parallel scan will fail due to splitting on newlines within quotes rather than actual line terminators.
With the current implementation, this behaviour can only be controlled by the session-level
datafusion.optimizer.repartition_file_scans
anddatafusion.optimizer.repartition_file_min_size
settings.What changes are included in this PR?
This commit introduces a
newlines_in_values
option toCsvOptions
and plumbs it through toCsvExec
, which includes it in the test for whether parallel execution is supported. This provides a convenient and searchable way to disable file scan repartitioning on a per-CSV basis.I've added
newlines_in_values
using similar conventions tohas_header
, withCsvOptions
using anOption<bool>
and a default value coming fromdatafusion::common::config::CatalogOptions
.For now, in the interests of being surgical, I've just added a new argument to
CsvExec::new
, which is now triggeringclippy::too_many_arguments
. Before going any further I wanted to see if this was overall a good approach, but I'm happy to refactor this into an options struct or similar.Are these changes tested?
Yes – a new test has been added alongside the existing tests for file scan repartioning in
datafusion/core/src/datasource/file_format/csv.rs
.Are there any user-facing changes?
datafusion::common::config::CatalogOptions::newlines_in_values: bool
field, default:false
.datafusion::common::config::CsvOptions::newlines_in_values: Option<bool>
field, default:None
.datafusion::datasource::file_format::options::CsvReadOptions::newlines_in_values: bool
field, default:false
.newlines_in_values: bool
argument todatafusion::datasource::physical_plan::CsvExec::new
.datafusion::common::config::CsvOptions::with_newlines_in_values
method.datafusion::datasource::file_format::csv::CsvFormat::with_newlines_in_values
method.datafusion::datasource::file_format::options::CsvReadOptions::newlines_in_values
method.datafusion::datasource::physical_plan::CsvExec::newlines_in_values
method.newlines_in_values
to relevant proto files.