
refactor: new impl for loading CSV. #14645

Merged — 10 commits merged into databendlabs:main on Feb 20, 2024
Conversation

@youngsofun (Member) commented on Feb 8, 2024

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

The original InputContext was intended to cover all kinds of file formats, and both streaming load and COPY, which turned out to be too complicated to maintain; parquet and text files also differ a lot in nature.
This PR introduces a new pipeline for text (row-based) formats, focused on COPY.
The code is mainly in databend-common-storages-stage::read::row_based.
CSV is the first format refactored onto this new pipeline.

Logically this is a refactor, but at the code level much of the code is rewritten (and some details are refined along the way), so a new setting enable_new_copy_for_text_formats is added: if there are problems with the new implementation, users can fall back to the old one (see the sketch below).
Because there are already many tests for CSV, I set enable_new_copy_for_text_formats=1 by default.
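For reference, the fallback is just a matter of flipping that setting before running COPY. A minimal sketch, assuming a placeholder table and stage (not taken from this PR):

    -- fall back to the old implementation for text formats
    SET enable_new_copy_for_text_formats = 0;
    COPY INTO my_table FROM @my_stage FILE_FORMAT = (TYPE = CSV);

    -- switch back to the new row-based pipeline (the default)
    SET enable_new_copy_for_text_formats = 1;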

  • Fixes #[Link the issue here]

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):


@github-actions bot added the pr-refactor label (this PR changes the code base without new features or bugfix) on Feb 8, 2024
@youngsofun marked this pull request as draft on February 8, 2024 02:10
@youngsofun marked this pull request as ready for review on February 8, 2024 03:27
@youngsofun (Member, Author) commented on Feb 18, 2024

@b41sh thank you very much! Please review again (take your time, this PR is not in a hurry).

@sundy-li (Member) commented on Feb 19, 2024

Do you have any performance results with the old impl?

You can use copy into a table with the null engine to test the performance.
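For example, the test setup might look roughly like this (a minimal sketch; table, stage, and column names are placeholders, not from this PR):

    -- the Null engine discards rows, so the run measures only reading and parsing
    CREATE TABLE copy_bench (a INT, b INT) ENGINE = Null;
    COPY INTO copy_bench FROM @bench_stage FILE_FORMAT = (TYPE = CSV);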

@youngsofun (Member, Author):

> Do you have any performance results with the old impl?
>
> You can use copy into a table with the null engine to test the performance.

OK, I will try it later.

As a refactor, the only place that may lead to a performance difference is the reading of file data (fewer batches are buffered). If there really is a difference in this aspect, it would only become apparent when running on slower storage.

@youngsofun (Member, Author) commented on Feb 20, 2024

@sundy-li

A preliminary test indicates negligible performance difference (less than 1%).

Settings:

  • a single CSV file of 700 MB with two integer columns
  • tested in both compressed and uncompressed formats

--

That test is too CPU-bound, so I added another (see the sketch below):

  • one column with a 100-char string
  • max_threads set to 1 and to 10

No difference either.
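For completeness, the second variant might look roughly like this (again only a sketch, with placeholder table and stage names):

    CREATE TABLE copy_bench_str (s VARCHAR) ENGINE = Null;
    SET max_threads = 1;   -- repeat the run with max_threads = 10
    COPY INTO copy_bench_str FROM @bench_stage_str FILE_FORMAT = (TYPE = CSV);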

@BohuTANG merged commit 346a955 into databendlabs:main on Feb 20, 2024. 71 checks passed.