[SPARK-16896][SQL] Handle duplicated field names in header consistently with null or empty strings in CSV #14745
Conversation
Test build #64178 has finished for PR 14745 at commit
```scala
row.zipWithIndex.map { case (value, index) =>
  if (value == null || value.isEmpty || value == options.nullValue) {
    // When there are empty strings or the values set in `nullValue`, put the
    // index as a post-fix.
```
suffix?
Thanks!
looks good to me.
Hm, yea, I think we should take that into account as well.
Test build #64201 has finished for PR 14745 at commit
```scala
/**
 * Generates a header from the given row which is null-safe and duplicates-safe.
 */
private def makeSafeHeader(row: Array[String], options: CSVOptions): Array[String] = {
```
I suggest putting this function in utils and writing a separate unit test for it.
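For reference, a minimal sketch of what such a unit test might look like, assuming `makeSafeHeader` were made reachable from test code with the three-argument signature of the later diff revision; the `CSVOptions` construction and the expected values here are illustrative, not taken from the merged suite:

```scala
// Hypothetical test: assumes makeSafeHeader(row, options, caseSensitive)
// is visible to the test and that CSVOptions accepts an options Map.
test("makeSafeHeader handles null, empty and duplicated field names") {
  val options = new CSVOptions(Map("header" -> "true"))
  val row = Array("fieldA", "fieldB", "", "fieldA", null)
  val header = makeSafeHeader(row, options, caseSensitive = false)
  // Duplicated names get their column index appended; null/empty fields
  // fall back to positional _c<index> names.
  assert(header === Array("fieldA0", "fieldB", "_c2", "fieldA3", "_c4"))
}
```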
@falaki and @felixcheung Do you mind taking another quick look, please?
Test build #64250 has finished for PR 14745 at commit
Cc @rxin as well.
Test build #64630 has finished for PR 14745 at commit
cc @cloud-fan Do you mind taking a look, please?
```scala
    firstRow.zipWithIndex.map { case (value, index) => s"_c$index" }
  }
val caseSensitive = sparkSession.sessionState.conf.caseSensitiveAnalysis
val header = makeSafeHeader(firstRow, csvOptions, caseSensitive)
```
can we just make `makeSafeHeader` a private method in this class?
Sure, let me fix this up.
Test build #66605 has finished for PR 14745 at commit
retest this please
Test build #66608 has finished for PR 14745 at commit
```scala
      }
    }
  } else {
    row.zipWithIndex.map { case (value, index) =>
```
this can be `case (_, index) =>`
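That is, since `value` is unused in this branch (which, per the earlier snippet, only produces the positional `_c<index>` names), a wildcard pattern avoids the unused binding:

```scala
// The value is unused in this branch, so bind only the index.
row.zipWithIndex.map { case (_, index) => s"_c$index" }
```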
```scala
    caseSensitive: Boolean): Array[String] = {
  if (options.headerFlag) {
    val duplicates = {
      val safeRow = if (!caseSensitive) {
```
nit:

```scala
val headerNames = row.filter(_ != null).map(name => if (caseSensitive) name else name.toLowerCase)
```
LGTM
Branch force-pushed from fd24c5b to 2905047.
Test build #66652 has finished for PR 14745 at commit
thanks, merging to master!
…ly with null or empty strings in CSV

## What changes were proposed in this pull request?

Currently, the CSV datasource allows loading duplicated empty-string fields, or fields matching `nullValue`, in the header. It'd be great if this could handle normal fields as well.

This PR proposes handling the duplicates consistently with the existing behaviour, taking case sensitivity (`spark.sql.caseSensitive`) into account, as below:

the data below:

```
fieldA,fieldB,,FIELDA,fielda,,
1,2,3,4,5,6,7
```

is parsed as below:

```scala
spark.read.format("csv").option("header", "true").load("test.csv").show()
```

- when `spark.sql.caseSensitive` is `false` (by default).

```
+-------+------+---+-------+-------+---+---+
|fieldA0|fieldB|_c2|FIELDA3|fieldA4|_c5|_c6|
+-------+------+---+-------+-------+---+---+
|      1|     2|  3|      4|      5|  6|  7|
+-------+------+---+-------+-------+---+---+
```

- when `spark.sql.caseSensitive` is `true`.

```
+-------+------+---+-------+-------+---+---+
|fieldA0|fieldB|_c2| FIELDA|fieldA4|_c5|_c6|
+-------+------+---+-------+-------+---+---+
|      1|     2|  3|      4|      5|  6|  7|
+-------+------+---+-------+-------+---+---+
```

**In more detail**, there is a good reference for this problem, `read.csv()` in R, so I initially wanted to propose similar behaviour. In R, the CSV data below:

```
fieldA,fieldB,,fieldA,fieldA,,
1,2,3,4,5,6,7
```

is parsed as below:

```r
test <- read.csv(file="test.csv", header=TRUE, sep=",")
> test
  fieldA fieldB X fieldA.1 fieldA.2 X.1 X.2
1      1      2 3        4        5   6   7
```

However, the Spark CSV datasource already handles duplicated empty strings and `nullValue` as field names, so the data below:

```
,,,fieldA,,fieldB,
1,2,3,4,5,6,7
```

is parsed as below:

```scala
spark.read.format("csv").option("header", "true").load("test.csv").show()
```

```
+---+---+---+------+---+------+---+
|_c0|_c1|_c2|fieldA|_c4|fieldB|_c6|
+---+---+---+------+---+------+---+
|  1|  2|  3|     4|  5|     6|  7|
+---+---+---+------+---+------+---+
```

R numbers each duplicate per name, whereas Spark appends the positional index to every field that is empty or matches `nullValue`.

In terms of case sensitivity, R appears to be case-sensitive, as below (this does not seem to be configurable):

```
a,a,a,A,A
1,2,3,4,5
```

is parsed as below:

```r
test <- read.csv(file="test.csv", header=TRUE, sep=",")
> test
  a a.1 a.2 A A.1
1 1   2   3 4   5
```

## How was this patch tested?

Unit test in `CSVSuite`.

Author: hyukjinkwon <[email protected]>

Closes apache#14745 from HyukjinKwon/SPARK-16896.
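For illustration, here is a self-contained sketch of the rule described above, detached from `CSVOptions` and the header flag; it approximates the merged logic but is not the exact Spark code:

```scala
// Illustrative sketch of the sanitizing rule from this PR:
// - null, empty, or `nullValue` fields become positional _c<index> names;
// - duplicated names (compared case-insensitively unless caseSensitive)
//   get their column index appended as a suffix.
def makeSafeHeader(
    row: Array[String],
    nullValue: String,
    caseSensitive: Boolean): Array[String] = {
  val headerNames = row.filter(_ != null)
    .map(name => if (caseSensitive) name else name.toLowerCase)
  val duplicates = headerNames.groupBy(identity).filter(_._2.size > 1).keySet
  row.zipWithIndex.map { case (value, index) =>
    if (value == null || value.isEmpty || value == nullValue) {
      s"_c$index"
    } else {
      val name = if (caseSensitive) value else value.toLowerCase
      if (duplicates.contains(name)) s"$value$index" else value
    }
  }
}

// makeSafeHeader(Array("fieldA", "fieldB", "", "FIELDA", "fielda", "", ""), "", caseSensitive = false)
// returns Array("fieldA0", "fieldB", "_c2", "FIELDA3", "fielda4", "_c5", "_c6")
```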
As R has a separator for column headers like
I think you can simply rename the columns manually after loading.
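For example, a rename after loading might look like this (the target names are of course illustrative):

```scala
// Load with the auto-sanitized header, then assign preferred names:
// toDF replaces all column names at once; withColumnRenamed does one.
val df = spark.read.format("csv").option("header", "true").load("test.csv")
val renamedAll = df.toDF("c1", "c2", "c3", "c4", "c5", "c6", "c7")
val renamedOne = df.withColumnRenamed("fieldA0", "fieldA")
```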
…8645) Fixes a mangled-name bug in `read_csv` with duplicate columns, a mismatch with pandas behavior.

#### csv file:

```csv
A,A,A.1,A,A.2,A,A.4,A,A
1,2,3,4.0,a,a,a.4,a,a
2,4,6,8.0,b,b,b.4,b,a
3,6,2,6.0,c,c,c.4,c,c
```

| A | A | A.1 | A | A.2 | A | A.4 | A | A |
|-|-|-|-|-|-|-|-|-|
| A | A.1 | A.1.1 | A.2 | A.2.1 | A.3 | A.4 | A.4.1 | A.5 |

#### Pandas:

```python
In [1]: import pandas as pd

In [2]: pd.read_csv("test.csv")
Out[2]:
   A  A.1  A.1.1  A.2 A.2.1 A.3  A.4 A.4.1 A.5
0  1    2      3  4.0     a   a  a.4     a   a
1  2    4      6  8.0     b   b  b.4     b   a
2  3    6      2  6.0     c   c  c.4     c   c
```

#### cudf: (21.08 nightly docker)

```python
In [1]: import cudf

In [2]: cudf.__version__
Out[2]: '21.08.00a+238.gfba09e66d8'

In [3]: cudf.read_csv("test.csv")
Out[3]:
   A  A.1 A.2 A.3 A.4 A.5
0  1    3   a   a   a   a
1  2    6   b   b   b   a
2  3    2   c   c   c   c
```

This PR fixes this issue:

```python
In [2]: cudf.read_csv("test.csv")
Out[2]:
   A  A.1  A.1.1  A.2 A.2.1 A.3  A.4 A.4.1 A.5
0  1    2      3  4.0     a   a  a.4     a   a
1  2    4      6  8.0     b   b  b.4     b   a
2  3    6      2  6.0     c   c  c.4     c   c
```

Related info (Spark duplicate column naming):
https://issues.apache.org/jira/browse/SPARK-16896
apache/spark#14745

cudf's Spark addon doesn't use libcudf names, so this PR does not affect it.

Authors:
- Karthikeyan (https://github.com/karthikeyann)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- Elias Stehle (https://github.com/elstehle)
- Ram (Ramakrishna Prabhu) (https://github.com/rgsl888prabhu)
- Mike Wilson (https://github.com/hyperbolic2346)

URL: #8645
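For comparison with Spark's positional scheme above, here is a sketch of the pandas-style mangling that this cudf change matches, transcribed to Scala (illustrative only; pandas and cudf implement this in their own codebases):

```scala
import scala.collection.mutable

// Pandas-style dedup: the n-th repeat of a name gets ".<n>" appended, and
// if the mangled name collides with one already produced (e.g. "A.4" in
// the example above), mangling is applied again ("A.4.1").
def dedupNames(names: Seq[String]): Seq[String] = {
  val counts = mutable.Map.empty[String, Int].withDefaultValue(0)
  names.map { original =>
    var col = original
    var curCount = counts(col)
    while (curCount > 0) {
      counts(col) = curCount + 1
      col = s"$col.$curCount"
      curCount = counts(col)
    }
    counts(col) = curCount + 1
    col
  }
}

// dedupNames(Seq("A", "A", "A.1", "A", "A.2", "A", "A.4", "A", "A"))
// returns Seq("A", "A.1", "A.1.1", "A.2", "A.2.1", "A.3", "A.4", "A.4.1", "A.5")
```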