Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add CSV (text) file support #646

Merged
merged 9 commits into from
Aug 10, 2024

Conversation

EpsilonPrime
Copy link
Member

Gluten's version of text file support relies on a named schema which shouldn't be necessary given that by its given nature a text file is comprised of strings. This version doesn't attempt to define a schema or ways to mangle text into that schema.

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a reasonably small set of the most common options. I think any arguing about CSV options quickly becomes subjective but I'll pitch the following:

Common options not included here:

  • Some ability to set the column types (are you proposing that casts be used for this purpose?)
  • Compression (many CSV files are compressed with some kind of compression scheme)
  • Newline character (\r, \n, \r\n, though I think this can often be inferred pretty accurately)
  • Skip (some number of lines at the top of the file to skip)

Options included here that don't feel very common:

  • max_block_size strikes me as something only needed in more exotic files

proto/substrait/algebra.proto Outdated Show resolved Hide resolved
proto/substrait/algebra.proto Show resolved Hide resolved
proto/substrait/algebra.proto Outdated Show resolved Hide resolved
proto/substrait/algebra.proto Outdated Show resolved Hide resolved
proto/substrait/algebra.proto Show resolved Hide resolved
@EpsilonPrime
Copy link
Member Author

This seems like a reasonably small set of the most common options. I think any arguing about CSV options quickly becomes subjective but I'll pitch the following:

Common options not included here:

  • Some ability to set the column types (are you proposing that casts be used for this purpose?)
  • Compression (many CSV files are compressed with some kind of compression scheme)
  • Newline character (\r, \n, \r\n, though I think this can often be inferred pretty accurately)
  • Skip (some number of lines at the top of the file to skip)

Options included here that don't feel very common:

  • max_block_size strikes me as something only needed in more exotic files

Setting the schema here doesn't seem appropriate. All of the other formats have a schema defined internally. In absence of anything internal definition, the schema for CSVs is actually text. That said there is a base schema in the ReadRel relation itself that one could go try to cast the data to if one wanted to force the schema of the relation for ease of use. I do prefer casting, however.

I've added a comment about compression. All of the compression formats are easily determined by the MAGIC file header so specifying the compression externally seems extraneous.

The header option removes the first row to get the names (which we don't use). I could replace that with a number of lines to skip instead.

I agree that newlines seem to be fairly standard these days and probably don't need to be included as an option.

@jacques-n
Copy link
Contributor

I agree with @westonpace that schema/parsing desires should be included in configuration. As a concrete example, pushing down filters requires data type to be done with CSVs and is pretty common to avoid things like building up large arrow datasets and than deleting most of the records.

I also agree on skip lines. Super common and frequently useful.

Compression also feels like it should be an option but I'm torn on what that should be and would prefer to solve once we have more real use cases.

proto/substrait/algebra.proto Outdated Show resolved Hide resolved
proto/substrait/algebra.proto Outdated Show resolved Hide resolved
proto/substrait/algebra.proto Show resolved Hide resolved
proto/substrait/algebra.proto Outdated Show resolved Hide resolved
proto/substrait/algebra.proto Outdated Show resolved Hide resolved
// If true, consume the first row as names of the columns. These names
// are not used elsewhere in the plan.
uint64 header = 4;
// The character(s) used to escape characters in strings. Backslash is

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While the pernicious influence of C has caused its escape style to spread into new areas, in CSV the most common style is still that the quote character is doubled inside quoted strings in order to escape it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a quote to escape a quote should work if provided here. I'll add it for clarity.

@EpsilonPrime
Copy link
Member Author

EpsilonPrime commented Aug 7, 2024

I also agree on skip lines. Super common and frequently useful.

Added skip_lines which will skip lines. The treat_first_row_as_header option is defined to work after that but we could conceivably just not have a treat_first_row_as_header option since we don't use the names in Substrait.

Edit: Decided to remove treat_first_row_as_header option after all.

Copy link
Contributor

@jacques-n jacques-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patience. LGTM +1

@EpsilonPrime
Copy link
Member Author

The only topic left that hasn't reached consensus is the type handling. Do we want to assume the return type is all strings (nullable or not) as currently written or should we expect that the reader uses the ReadRel's schema?

@jacques-n
Copy link
Contributor

jacques-n commented Aug 8, 2024

The only topic left that hasn't reached consensus is the type handling. Do we want to assume the return type is all strings (nullable or not) as currently written or should we expect that the reader uses the ReadRel's schema?

My suggestion is we merge this and that schema/type handling (beyond strings) can be an enhancement. For this, I suggest all strings are nullable.

@EpsilonPrime
Copy link
Member Author

Anyone else interested in weighing in? I believe this PR is ready to merge.

@jacques-n jacques-n merged commit 5d49e04 into substrait-io:main Aug 10, 2024
13 checks passed
@EpsilonPrime EpsilonPrime deleted the csv_format branch September 26, 2024 04:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants