[FEA] Test CSV schema evolution #5475

revans2 · 2022-05-12T18:42:31Z

rapidsai/cudf#10618 made a change to pandas, and as a part of that discussion I realized that we are not testing:

- What happens if there are duplicate columns? I think Spark is going to return an error when doing schema discovery, but I am not sure what happens when we pass in a schema that includes column names. This is even more interesting if we read a column that is a duplicate, or if we ask for a column name that CUDF would generate from duplicate column names.
- What happens if we have multiple files with different column names, or a different column order (schema evolution)?

revans2 · 2022-05-12T20:59:56Z

I tested this manually, and we are doing the right thing. Everything is based on column position, for good or bad. This means there is no real way to change the order of columns, etc., without playing some games. But it would still be good to add some integration tests for this.
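
To make the consequence of position-based matching concrete, here is a minimal pure-Python sketch. It is only an illustration, not Spark or spark-rapids code: the helper `read_by_position`, the schema, and the sample data are all hypothetical. It shows that when a user-supplied schema is applied by position, a file whose header lists the columns in a different order has its values land under the wrong names.

```python
import csv
import io


def read_by_position(csv_text, schema):
    """Read CSV text, ignore its header, and name fields purely by position.

    Hypothetical helper for illustration; it mimics the position-based
    behavior described above and is not an actual Spark API.
    """
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # the file's own header is skipped, not used for matching
    return [dict(zip(schema, row)) for row in reader]


# User-supplied schema (assumed for this example).
schema = ["id", "name"]

# Two files with the same columns in a different order.
file_a = "id,name\n1,alice\n"
file_b = "name,id\nbob,2\n"

rows = read_by_position(file_a, schema) + read_by_position(file_b, schema)

for row in rows:
    print(row)
```

The second file's `bob` ends up under `id` because only the position matters. An integration test for this behavior could use input files along these lines.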

revans2 added the "? - Needs Triage" and "task" labels on May 12, 2022

revans2 added the "test" label and removed the "task" label on May 12, 2022

sameerz changed the title ~~[FEA] Test CSV scheam evolution~~ [FEA] Test CSV schema evolution May 12, 2022

sameerz added the "good first issue" label and removed the "? - Needs Triage" label on May 17, 2022
