-
Notifications
You must be signed in to change notification settings - Fork 166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add CSV (text) file support #646
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a reasonably small set of the most common options. I think any arguing about CSV options quickly becomes subjective but I'll pitch the following:
Common options not included here:
- Some ability to set the column types (are you proposing that casts be used for this purpose?)
- Compression (many CSV files are compressed with some kind of compression scheme)
- Newline character (\r, \n, \r\n, though I think this can often be inferred pretty accurately)
- Skip (some number of lines at the top of the file to skip)
Options included here that don't feel very common:
max_block_size
strikes me as something only needed in more exotic files
Setting the schema here doesn't seem appropriate. All of the other formats have a schema defined internally. In absence of anything internal definition, the schema for CSVs is actually text. That said there is a base schema in the ReadRel relation itself that one could go try to cast the data to if one wanted to force the schema of the relation for ease of use. I do prefer casting, however. I've added a comment about compression. All of the compression formats are easily determined by the MAGIC file header so specifying the compression externally seems extraneous. The header option removes the first row to get the names (which we don't use). I could replace that with a number of lines to skip instead. I agree that newlines seem to be fairly standard these days and probably don't need to be included as an option. |
I agree with @westonpace that schema/parsing desires should be included in configuration. As a concrete example, pushing down filters requires data type to be done with CSVs and is pretty common to avoid things like building up large arrow datasets and than deleting most of the records. I also agree on skip lines. Super common and frequently useful. Compression also feels like it should be an option but I'm torn on what that should be and would prefer to solve once we have more real use cases. |
proto/substrait/algebra.proto
Outdated
// If true, consume the first row as names of the columns. These names | ||
// are not used elsewhere in the plan. | ||
uint64 header = 4; | ||
// The character(s) used to escape characters in strings. Backslash is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While the pernicious influence of C has caused its escape style to spread into new areas, in CSV the most common style is still that the quote character is doubled inside quoted strings in order to escape it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a quote to escape a quote should work if provided here. I'll add it for clarity.
08b4d72
to
ace5553
Compare
Added skip_lines which will skip lines. The treat_first_row_as_header option is defined to work after that but we could conceivably just not have a treat_first_row_as_header option since we don't use the names in Substrait. Edit: Decided to remove treat_first_row_as_header option after all. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the patience. LGTM +1
The only topic left that hasn't reached consensus is the type handling. Do we want to assume the return type is all strings (nullable or not) as currently written or should we expect that the reader uses the ReadRel's schema? |
My suggestion is we merge this and that schema/type handling (beyond strings) can be an enhancement. For this, I suggest all strings are nullable. |
Anyone else interested in weighing in? I believe this PR is ready to merge. |
Gluten's version of text file support relies on a named schema which shouldn't be necessary given that by its given nature a text file is comprised of strings. This version doesn't attempt to define a schema or ways to mangle text into that schema.