refactor: new impl for loading CSV. #14645
src/query/storages/stage/src/read/row_based/processors/separator.rs
@b41sh Thank you very much! Please review again (take your time, this PR is not urgent).
Do you have any performance results compared with the old implementation? You can use copy into a table with the null engine to test the performance.
OK, I will try it later. Since this is a refactor, the only place that may lead to a performance difference is the reading of file data (less is buffered per batch).
A preliminary test indicates a negligible performance difference (less than 1%). Settings: a single 700 MB CSV file with two integer columns. This test is too CPU-bound, so I added another:
No difference there either.
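A comparable input file for such a test could be generated with a small helper like the one below. This is only a sketch: the function name `write_two_int_csv` and the values written (`i` and `i * 2`) are assumptions for illustration, since the comment above does not say what the original file contained.

```rust
use std::io::{self, Write};

/// Hypothetical helper: writes `rows` lines of two comma-separated integers.
/// The value pattern (i, i * 2) is an assumption, not the data used in the PR.
pub fn write_two_int_csv<W: Write>(out: &mut W, rows: u64) -> io::Result<()> {
    for i in 0..rows {
        // One row per line, no header, '\n' as the record delimiter.
        writeln!(out, "{},{}", i, i * 2)?;
    }
    Ok(())
}
```

Wrapping a `std::fs::File` in a `std::io::BufWriter` before passing it in keeps the number of write calls low when generating a file of several hundred megabytes.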
I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/
Summary
The original `InputContext` was intended to cover all file formats, and both streaming load and copy. That turned out to be too complicated to maintain, and Parquet and text files differ in nature in many ways. This PR introduces a new pipeline for text (row-based) formats and focuses on copy.
The code is mainly in `databend-common-storages-stage::read::row_based`.
CSV is the first format refactored onto this new pipeline.
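To illustrate what a row-based reader has to handle, here is a minimal sketch of a separator that splits incoming byte chunks into rows and carries an unfinished row over to the next chunk. It is not the code in `separator.rs`: the type `RowSeparator` and its methods are hypothetical, and quoting, escaping, and multi-byte delimiters are deliberately ignored.

```rust
/// Hypothetical sketch, not Databend's implementation.
/// Splits byte chunks into complete rows on a single-byte delimiter.
pub struct RowSeparator {
    delimiter: u8,
    /// Bytes of the last row that has not seen its delimiter yet.
    remainder: Vec<u8>,
}

impl RowSeparator {
    pub fn new(delimiter: u8) -> Self {
        Self { delimiter, remainder: Vec::new() }
    }

    /// Feed one chunk and return every row completed by it.
    pub fn push(&mut self, chunk: &[u8]) -> Vec<Vec<u8>> {
        let mut rows = Vec::new();
        for &byte in chunk {
            if byte == self.delimiter {
                // Row is complete: hand it out and start a fresh buffer.
                rows.push(std::mem::take(&mut self.remainder));
            } else {
                self.remainder.push(byte);
            }
        }
        rows
    }

    /// Call at end of file: returns the trailing row if the file
    /// did not end with a delimiter.
    pub fn finish(&mut self) -> Option<Vec<u8>> {
        if self.remainder.is_empty() {
            None
        } else {
            Some(std::mem::take(&mut self.remainder))
        }
    }
}
```

Keeping the partial row inside the separator means chunk boundaries can fall anywhere in the file, so the reader is free to choose its batch size independently of row length.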
Logically this is a refactor, but at the code level much of the code is rewritten (some details are refined along the way), so this PR adds a new setting, `enable_new_copy_for_text_formats`.
If there is a problem with the new implementation, users can fall back to the old one.
Because there are already many tests for CSV, `enable_new_copy_for_text_formats=1` is set by default.
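The fallback behaviour can be pictured as a simple dispatch on the setting. The sketch below is an assumption about the shape of that decision, not the PR's code: `CopyReadPath` and `choose_read_path` are hypothetical names, and treating every non-CSV format as staying on the old path is inferred only from the statement that CSV is the first format moved over.

```rust
/// Hypothetical: which implementation a copy from a stage would use.
#[derive(Debug, PartialEq)]
pub enum CopyReadPath {
    /// The new row-based pipeline introduced by this PR.
    NewRowBased,
    /// The pre-existing implementation.
    Legacy,
}

/// Hypothetical dispatch. `setting` stands for the value of
/// `enable_new_copy_for_text_formats` (0 = off, non-zero = on).
pub fn choose_read_path(setting: u64, format: &str) -> CopyReadPath {
    // Assumption: only CSV has been moved to the new pipeline so far.
    let is_csv = format.eq_ignore_ascii_case("csv");
    if setting != 0 && is_csv {
        CopyReadPath::NewRowBased
    } else {
        CopyReadPath::Legacy
    }
}
```

With this shape, setting the value to 0 sends CSV back to the old path without touching any other format.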
Tests
Type of change