[SPARK-16896][SQL] Handle duplicated field names in header consistently with null or empty strings in CSV #14745
Conversation
Test build #64178 has finished for PR 14745 at commit
```scala
row.zipWithIndex.map { case (value, index) =>
  if (value == null || value.isEmpty || value == options.nullValue) {
    // When there are empty strings or the values set in `nullValue`, put the
    // index as a post-fix.
```
suffix?
Thanks!
looks good to me.
Hm, yea, I think we should take that into account as well.
Test build #64201 has finished for PR 14745 at commit
```scala
/**
 * Generates a header from the given row which is null-safe and duplicates-safe.
 */
private def makeSafeHeader(row: Array[String], options: CSVOptions): Array[String] = {
```
I suggest putting this function in utils and writing a separate unit test for it.
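For reference, a minimal sketch of what such a unit test might look like, assuming `makeSafeHeader` were made reachable from test code with the three-argument signature of the later diff revision; the `CSVOptions` construction and the expected values here are illustrative, not taken from the merged suite:

```scala
// Hypothetical test: assumes makeSafeHeader(row, options, caseSensitive)
// is visible to the test and that CSVOptions accepts an options Map.
test("makeSafeHeader handles null, empty and duplicated field names") {
  val options = new CSVOptions(Map("header" -> "true"))
  val row = Array("fieldA", "fieldB", "", "fieldA", null)
  val header = makeSafeHeader(row, options, caseSensitive = false)
  // Duplicated names get their column index appended; null/empty fields
  // fall back to positional _c<index> names.
  assert(header === Array("fieldA0", "fieldB", "_c2", "fieldA3", "_c4"))
}
```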
@falaki and @felixcheung Do you mind taking another quick look, please?
Test build #64250 has finished for PR 14745 at commit
Cc @rxin as well.
Test build #64630 has finished for PR 14745 at commit
cc @cloud-fan Do you mind taking a look, please?
```scala
    firstRow.zipWithIndex.map { case (value, index) => s"_c$index" }
  }
val caseSensitive = sparkSession.sessionState.conf.caseSensitiveAnalysis
val header = makeSafeHeader(firstRow, csvOptions, caseSensitive)
```
can we just make `makeSafeHeader` a private method in this class?
Sure, let me fix this up.
Test build #66605 has finished for PR 14745 at commit
retest this please
Test build #66608 has finished for PR 14745 at commit
```scala
      }
    }
  } else {
    row.zipWithIndex.map { case (value, index) =>
```
this can be `case (_, index) =>`
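That is, since `value` is unused in this branch (which, per the earlier snippet, only produces the positional `_c<index>` names), a wildcard pattern avoids the unused binding:

```scala
// The value is unused in this branch, so bind only the index.
row.zipWithIndex.map { case (_, index) => s"_c$index" }
```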
```scala
    caseSensitive: Boolean): Array[String] = {
  if (options.headerFlag) {
    val duplicates = {
      val safeRow = if (!caseSensitive) {
```
nit:

```scala
val headerNames = row.filter(_ != null).map(name => if (caseSensitive) name else name.toLowerCase)
```
LGTM
Branch force-pushed from fd24c5b to 2905047.
Test build #66652 has finished for PR 14745 at commit
thanks, merging to master!
…ly with null or empty strings in CSV

## What changes were proposed in this pull request?

Currently, the CSV datasource allows loading duplicated empty-string fields, or fields matching `nullValue`, in the header. It'd be great if this could handle normal fields as well.

This PR proposes handling the duplicates consistently with the existing behaviour, taking case sensitivity (`spark.sql.caseSensitive`) into account, as below:

the data below:

```
fieldA,fieldB,,FIELDA,fielda,,
1,2,3,4,5,6,7
```

is parsed as below:

```scala
spark.read.format("csv").option("header", "true").load("test.csv").show()
```

- when `spark.sql.caseSensitive` is `false` (by default).

```
+-------+------+---+-------+-------+---+---+
|fieldA0|fieldB|_c2|FIELDA3|fieldA4|_c5|_c6|
+-------+------+---+-------+-------+---+---+
|      1|     2|  3|      4|      5|  6|  7|
+-------+------+---+-------+-------+---+---+
```

- when `spark.sql.caseSensitive` is `true`.

```
+-------+------+---+-------+-------+---+---+
|fieldA0|fieldB|_c2| FIELDA|fieldA4|_c5|_c6|
+-------+------+---+-------+-------+---+---+
|      1|     2|  3|      4|      5|  6|  7|
+-------+------+---+-------+-------+---+---+
```

**In more detail**, there is a good reference for this problem, `read.csv()` in R, so I initially wanted to propose similar behaviour. In R, the CSV data below:

```
fieldA,fieldB,,fieldA,fieldA,,
1,2,3,4,5,6,7
```

is parsed as below:

```r
test <- read.csv(file="test.csv", header=TRUE, sep=",")
> test
  fieldA fieldB X fieldA.1 fieldA.2 X.1 X.2
1      1      2 3        4        5   6   7
```

However, the Spark CSV datasource already handles duplicated empty strings and `nullValue` as field names, so the data below:

```
,,,fieldA,,fieldB,
1,2,3,4,5,6,7
```

is parsed as below:

```scala
spark.read.format("csv").option("header", "true").load("test.csv").show()
```

```
+---+---+---+------+---+------+---+
|_c0|_c1|_c2|fieldA|_c4|fieldB|_c6|
+---+---+---+------+---+------+---+
|  1|  2|  3|     4|  5|     6|  7|
+---+---+---+------+---+------+---+
```

R numbers each duplicate per name, whereas Spark appends the positional index to every field that is empty or matches `nullValue`.

In terms of case sensitivity, R appears to be case-sensitive, as below (this does not seem to be configurable):

```
a,a,a,A,A
1,2,3,4,5
```

is parsed as below:

```r
test <- read.csv(file="test.csv", header=TRUE, sep=",")
> test
  a a.1 a.2 A A.1
1 1   2   3 4   5
```

## How was this patch tested?

Unit test in `CSVSuite`.

Author: hyukjinkwon <[email protected]>

Closes apache#14745 from HyukjinKwon/SPARK-16896.
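For illustration, here is a self-contained sketch of the rule described above, detached from `CSVOptions` and the header flag; it approximates the merged logic but is not the exact Spark code:

```scala
// Illustrative sketch of the sanitizing rule from this PR:
// - null, empty, or `nullValue` fields become positional _c<index> names;
// - duplicated names (compared case-insensitively unless caseSensitive)
//   get their column index appended as a suffix.
def makeSafeHeader(
    row: Array[String],
    nullValue: String,
    caseSensitive: Boolean): Array[String] = {
  val headerNames = row.filter(_ != null)
    .map(name => if (caseSensitive) name else name.toLowerCase)
  val duplicates = headerNames.groupBy(identity).filter(_._2.size > 1).keySet
  row.zipWithIndex.map { case (value, index) =>
    if (value == null || value.isEmpty || value == nullValue) {
      s"_c$index"
    } else {
      val name = if (caseSensitive) value else value.toLowerCase
      if (duplicates.contains(name)) s"$value$index" else value
    }
  }
}

// makeSafeHeader(Array("fieldA", "fieldB", "", "FIELDA", "fielda", "", ""), "", caseSensitive = false)
// returns Array("fieldA0", "fieldB", "_c2", "FIELDA3", "fielda4", "_c5", "_c6")
```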
As R has a separator for column headers like
I think you can simply rename the columns manually after loading.
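For example, a rename after loading might look like this (the target names are of course illustrative):

```scala
// Load with the auto-sanitized header, then assign preferred names:
// toDF replaces all column names at once; withColumnRenamed does one.
val df = spark.read.format("csv").option("header", "true").load("test.csv")
val renamedAll = df.toDF("c1", "c2", "c3", "c4", "c5", "c6", "c7")
val renamedOne = df.withColumnRenamed("fieldA0", "fieldA")
```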
…8645) Fixes a mangled-name bug in `read_csv` with duplicate columns, a mismatch with pandas behavior.

#### csv file:

```csv
A,A,A.1,A,A.2,A,A.4,A,A
1,2,3,4.0,a,a,a.4,a,a
2,4,6,8.0,b,b,b.4,b,a
3,6,2,6.0,c,c,c.4,c,c
```

| A | A | A.1 | A | A.2 | A | A.4 | A | A |
|-|-|-|-|-|-|-|-|-|
| A | A.1 | A.1.1 | A.2 | A.2.1 | A.3 | A.4 | A.4.1 | A.5 |

#### Pandas:

```python
In [1]: import pandas as pd

In [2]: pd.read_csv("test.csv")
Out[2]:
   A  A.1  A.1.1  A.2 A.2.1 A.3  A.4 A.4.1 A.5
0  1    2      3  4.0     a   a  a.4     a   a
1  2    4      6  8.0     b   b  b.4     b   a
2  3    6      2  6.0     c   c  c.4     c   c
```

#### cudf: (21.08 nightly docker)

```python
In [1]: import cudf

In [2]: cudf.__version__
Out[2]: '21.08.00a+238.gfba09e66d8'

In [3]: cudf.read_csv("test.csv")
Out[3]:
   A  A.1 A.2 A.3 A.4 A.5
0  1    3   a   a   a   a
1  2    6   b   b   b   a
2  3    2   c   c   c   c
```

This PR fixes this issue:

```python
In [2]: cudf.read_csv("test.csv")
Out[2]:
   A  A.1  A.1.1  A.2 A.2.1 A.3  A.4 A.4.1 A.5
0  1    2      3  4.0     a   a  a.4     a   a
1  2    4      6  8.0     b   b  b.4     b   a
2  3    6      2  6.0     c   c  c.4     c   c
```

Related info (Spark duplicate column naming):
https://issues.apache.org/jira/browse/SPARK-16896
apache/spark#14745

cudf's Spark addon doesn't use libcudf names, so this PR does not affect it.

Authors:
- Karthikeyan (https://github.com/karthikeyann)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- Elias Stehle (https://github.com/elstehle)
- Ram (Ramakrishna Prabhu) (https://github.com/rgsl888prabhu)
- Mike Wilson (https://github.com/hyperbolic2346)

URL: #8645
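For comparison with Spark's positional scheme above, here is a sketch of the pandas-style mangling that this cudf change matches, transcribed to Scala (illustrative only; pandas and cudf implement this in their own codebases):

```scala
import scala.collection.mutable

// Pandas-style dedup: the n-th repeat of a name gets ".<n>" appended, and
// if the mangled name collides with one already produced (e.g. "A.4" in
// the example above), mangling is applied again ("A.4.1").
def dedupNames(names: Seq[String]): Seq[String] = {
  val counts = mutable.Map.empty[String, Int].withDefaultValue(0)
  names.map { original =>
    var col = original
    var curCount = counts(col)
    while (curCount > 0) {
      counts(col) = curCount + 1
      col = s"$col.$curCount"
      curCount = counts(col)
    }
    counts(col) = curCount + 1
    col
  }
}

// dedupNames(Seq("A", "A", "A.1", "A", "A.2", "A", "A.4", "A", "A"))
// returns Seq("A", "A.1", "A.1.1", "A.2", "A.2.1", "A.3", "A.4", "A.4.1", "A.5")
```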