feat: add CSV (text) file support #646

EpsilonPrime · 2024-05-30T18:27:36Z

Gluten's version of text file support relies on a named schema, which shouldn't be necessary given that, by its nature, a text file is composed of strings. This version doesn't attempt to define a schema or ways to mangle text into that schema.

westonpace

This seems like a reasonably small set of the most common options. I think any arguing about CSV options quickly becomes subjective but I'll pitch the following:

Common options not included here:

Some ability to set the column types (are you proposing that casts be used for this purpose?)
Compression (many CSV files are compressed with some kind of compression scheme)
Newline character (\r, \n, \r\n, though I think this can often be inferred pretty accurately)
Skip (some number of lines at the top of the file to skip)

Options included here that don't feel very common:

max_block_size strikes me as something only needed in more exotic files

EpsilonPrime · 2024-06-04T23:14:42Z

> This seems like a reasonably small set of the most common options. I think any arguing about CSV options quickly becomes subjective but I'll pitch the following:
>
> Common options not included here:
>
> Some ability to set the column types (are you proposing that casts be used for this purpose?)
> Compression (many CSV files are compressed with some kind of compression scheme)
> Newline character (\r, \n, \r\n, though I think this can often be inferred pretty accurately)
> Skip (some number of lines at the top of the file to skip)
>
> Options included here that don't feel very common:
>
> max_block_size strikes me as something only needed in more exotic files

Setting the schema here doesn't seem appropriate. All of the other formats have a schema defined internally. In the absence of any internal definition, the schema for a CSV is actually text. That said, there is a base schema in the ReadRel relation itself that one could cast the data to if one wanted to force the schema of the relation for ease of use. I do prefer casting, however.

I've added a comment about compression. All of the compression formats are easily determined from the magic number in the file header, so specifying the compression externally seems extraneous.
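To make the magic-number argument concrete, here is a minimal sketch in Python. The `MAGIC_PREFIXES` table and the `sniff_compression` helper are hypothetical names for illustration and are not part of this PR or of Substrait. The byte prefixes are the published signatures for gzip, bzip2, xz, and zstd; formats without a signature (raw deflate, for example) cannot be detected this way.

```python
import bz2
import gzip
import lzma

# Leading "magic" bytes of a few common compression formats.
# Hypothetical table for illustration; a real reader may support more formats.
MAGIC_PREFIXES = {
    b"\x1f\x8b": "gzip",
    b"BZh": "bzip2",
    b"\xfd\x37\x7a\x58\x5a\x00": "xz",
    b"\x28\xb5\x2f\xfd": "zstd",
}


def sniff_compression(head: bytes) -> str:
    """Return the compression implied by the first bytes of a file, or 'none'."""
    for prefix, name in MAGIC_PREFIXES.items():
        if head.startswith(prefix):
            return name
    return "none"


sample = b"a,b\n1,2\n"
print(sniff_compression(gzip.compress(sample)))
print(sniff_compression(bz2.compress(sample)))
print(sniff_compression(lzma.compress(sample)))
print(sniff_compression(sample))
```

One caveat with this approach: an uncompressed text file whose first field happens to begin with `BZh` would be misclassified, so a reader may still want to fall back to plain text when decompression fails.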

The header option removes the first row to get the names (which we don't use). I could replace that with a number of lines to skip instead.

I agree that newlines seem to be fairly standard these days and probably don't need to be included as an option.

jacques-n · 2024-08-01T02:03:06Z

I agree with @westonpace that schema/parsing desires should be included in configuration. As a concrete example, pushing down filters into a CSV read requires the data types to be known, and doing so is pretty common to avoid things like building up large Arrow datasets and then deleting most of the records.
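One way to see why a pushed-down filter needs the column type is that the same predicate gives different answers on text and on numbers. The sketch below uses Python's standard `csv` module with made-up sample data; it illustrates the general issue and is not how any particular Substrait consumer evaluates filters.

```python
import csv
import io

# Made-up sample data: every field is read back as a string.
data = "id,amount\n1,9\n2,10\n3,100\n"
rows = list(csv.reader(io.StringIO(data)))[1:]  # drop the header line

# Comparing the raw text is lexicographic: "9" sorts after "10".
as_text = [r for r in rows if r[1] > "10"]

# Casting first gives the numeric comparison the filter intends.
as_int = [r for r in rows if int(r[1]) > 10]

print(as_text)
print(as_int)
```

The text comparison keeps the rows with amounts 9 and 100, while the numeric comparison keeps only 100, so a reader that filters while scanning has to know which interpretation the plan expects.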

I also agree on skip lines. Super common and frequently useful.

Compression also feels like it should be an option but I'm torn on what that should be and would prefer to solve once we have more real use cases.

CurtHagenlocher · 2024-08-07T14:29:37Z

proto/substrait/algebra.proto

+        // If true, consume the first row as names of the columns.  These names
+        // are not used elsewhere in the plan.
+        uint64 header = 4;
+        // The character(s) used to escape characters in strings.  Backslash is


While the pernicious influence of C has caused its escape style to spread into new areas, in CSV the most common style is still that the quote character is doubled inside quoted strings in order to escape it.

Using a quote to escape a quote should work if provided here. I'll add it for clarity.
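For readers unfamiliar with the two escape styles, the following sketch parses the same logical row written both ways, using Python's standard `csv` module. The sample strings are invented for illustration; `doublequote` and `escapechar` are parameters of Python's module, not names from the proto in this PR.

```python
import csv
import io

# The same logical row, written with the two escape styles.
doubled = '"say ""hi""",plain\n'            # quote character doubled inside quotes
backslash = '"say \\"hi\\"",plain\n'        # C-style backslash escape

# Python's csv reader treats a doubled quote as an escaped quote by default.
rows_doubled = list(csv.reader(io.StringIO(doubled)))

# Backslash escaping has to be requested explicitly.
rows_backslash = list(
    csv.reader(io.StringIO(backslash), doublequote=False, escapechar="\\")
)

print(rows_doubled)
print(rows_backslash)
```

Both calls yield the single row `['say "hi"', 'plain']`; the only difference is which escape convention the reader was told to expect.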

EpsilonPrime · 2024-08-07T23:39:20Z

> I also agree on skip lines. Super common and frequently useful.

Added skip_lines, which skips that number of lines at the top of the file. The treat_first_row_as_header option is defined to apply after that, but we could conceivably drop treat_first_row_as_header altogether since we don't use the names in Substrait.

Edit: Decided to remove treat_first_row_as_header option after all.
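With the header option gone, a header row is just one more line to skip. A minimal sketch of that behavior in Python follows; `read_text_rows` is a hypothetical helper written for this illustration, and only the `skip_lines` name comes from the PR.

```python
import csv
import io
from itertools import islice


def read_text_rows(source, skip_lines=0):
    """Parse CSV rows from an iterable of lines, dropping the first skip_lines lines.

    Hypothetical helper: it skips physical lines before parsing, so it assumes
    the skipped region contains no quoted field that spans a line break.
    """
    remaining = islice(source, skip_lines, None)
    return list(csv.reader(remaining))


# A leading comment line plus a header row, followed by one data row.
text = "# exported 2024\nname,qty\nwidget,3\n"

rows = read_text_rows(io.StringIO(text), skip_lines=2)
print(rows)
```

Every value comes back as a string, which matches the all-strings return type discussed in this thread.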

jacques-n

Thanks for the patience. LGTM +1

EpsilonPrime · 2024-08-08T02:40:04Z

The only topic left that hasn't reached consensus is the type handling. Do we want to assume the return type is all strings (nullable or not) as currently written or should we expect that the reader uses the ReadRel's schema?

jacques-n · 2024-08-08T02:43:54Z

> The only topic left that hasn't reached consensus is the type handling. Do we want to assume the return type is all strings (nullable or not) as currently written or should we expect that the reader uses the ReadRel's schema?

My suggestion is that we merge this and treat schema/type handling (beyond strings) as a later enhancement. For this PR, I suggest all strings are nullable.

EpsilonPrime · 2024-08-09T00:38:19Z

Anyone else interested in weighing in? I believe this PR is ready to merge.

EpsilonPrime requested review from jacques-n, cpcloud, westonpace and vbarua as code owners May 30, 2024 18:27

westonpace reviewed Jun 3, 2024

jacques-n reviewed Aug 1, 2024

jacques-n reviewed Aug 7, 2024

CurtHagenlocher reviewed Aug 7, 2024

CurtHagenlocher reviewed Aug 7, 2024

EpsilonPrime added 4 commits August 7, 2024 15:54

CSV format as copied from Gluten's fork.

7326ddb

Revised CSV format with options appropriate to Substrait.

85a9b02

Updated comment.

f0f3971

handled review notes

ace5553

EpsilonPrime force-pushed the csv_format branch from 08b4d72 to ace5553 Compare August 7, 2024 22:54

EpsilonPrime added 2 commits August 7, 2024 16:18

updated based on review

c4188d5

Added skip_lines

b5bba4c

EpsilonPrime added 3 commits August 7, 2024 16:39

updated type of treat_first_row_as_header

d0605ca

decided to only use skip lines after all

d8ae693

removed note about checking before quotes

27ee344

jacques-n approved these changes Aug 8, 2024

EpsilonPrime added the awaiting SMC approval label Aug 9, 2024

jacques-n merged commit 5d49e04 into substrait-io:main Aug 10, 2024
13 checks passed

Blizzara mentioned this pull request Aug 21, 2024

feat: support new IntervalCompound and IntervalDay update substrait-io/substrait-java#288

Merged

EpsilonPrime deleted the csv_format branch September 26, 2024 04:15

vbarua removed the awaiting SMC approval label Sep 28, 2024
