[FEA] support ConcatWs sql function #63

revans2 · 2020-05-29T17:50:06Z

Is your feature request related to a problem? Please describe.
It would be great to support the concat_ws SQL function.

rapidsai/cudf#3726 was filed to get support from cudf.

sameerz · 2021-03-30T17:43:00Z

Follow-up work on cudf for concatenating arrays of strings: rapidsai/cudf#7727

tgravescs · 2021-05-13T16:15:11Z

The Spark behavior of concat_ws:

Separator parameter API differences:

SQL can pass a column of strings as the separator
Python and Scala APIs accept only a string as the separator

Null Behavior:

if the separator is NULL, the result is null
if a column value is null, both the value and its separator are left off
if passing in an array of all nulls, it is left out without a separator
if an array contains a null, the null is skipped and its separator is left off -> *** this seems different than cudf (https://github.com/rapidsai/cudf/pull/7929/files) but need to see if we can work around it ***
if all the values of a row are null, it returns an empty string -> this seems different than cudf behavior

Behavior:

0 input columns with a separator specified: return a column of empty strings, except if the separator is null, in which case the value is null.
1 column: just return the column, unless the separator is null, in which case a column of nulls matching the number of rows is needed
if a column value is an empty string, the separator is still added
if passing in an empty array, it is left off without a separator
if passing an array with empty strings, the separator is included
2 or more columns: add the separator between column values
For SQL, if a column is specified as the separator, the separator in the corresponding row is used
An empty array on its own would normally return an empty string, but it is special-cased when joined with something else: the separator for that empty array is left off. A comment below shows an example.
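The rules listed above can be condensed into a small pure-Python model of a single output row. This is only a sketch for reasoning about expected results: `concat_ws_model` is a hypothetical name, it is not Spark's or the plugin's implementation, and it assumes each argument is either a string, a list of strings, or None.

```python
from typing import List, Optional, Union

# One argument for one row: a string, an array of strings, or null (None).
Value = Union[None, str, List[Optional[str]]]


def concat_ws_model(sep: Optional[str], *values: Value) -> Optional[str]:
    """Hypothetical model of one row of concat_ws, following the notes above."""
    # A null separator makes the whole result null.
    if sep is None:
        return None
    parts: List[str] = []
    for v in values:
        if v is None:
            # Null values are dropped together with their separator.
            continue
        if isinstance(v, list):
            # Arrays are flattened; null elements are skipped, so an empty
            # array or an all-null array contributes nothing at all.
            parts.extend(x for x in v if x is not None)
        else:
            # Empty strings are kept, so they still get a separator.
            parts.append(v)
    # With nothing left to join, the result is an empty string, not null.
    return sep.join(parts)


# A per-row separator (the SQL-only case) is just a different `sep` per row.
seps = ["-", None]
names = ["a", "b"]
per_row = [concat_ws_model(s, n, "z") for s, n in zip(seps, names)]
```

Under this model an empty array followed by another value yields just that value, which is the special case called out in the last bullet.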

tgravescs · 2021-05-13T21:59:51Z

Note that concat and concat_ws handle nulls differently when all of the input values in a row are null:

+-------+
|nullcol|
+-------+
|   null|
|notnull|
+-------+

>>> spark.sql("select concat_ws('-', nullcol, nullcol) as res from df").show(truncate=False)
+---------------+
|res            |
+---------------+
|               |
|notnull-notnull|
+---------------+

>>> spark.sql("select concat(nullcol, nullcol) as res from df").show(truncate=False)
+--------------+
|res           |
+--------------+
|null          |
|notnullnotnull|
+--------------+

>>> spark.sql("select concat(null, d) as res from df").show(truncate=False)
+----+
|res |
+----+
|null|
|null|
+----+
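The difference shown in the outputs above can be modeled in plain Python for string inputs. This is a hedged sketch only: `concat_model` and `concat_ws_scalar_model` are hypothetical names, and the value "123" standing in for column d is an assumption made for illustration.

```python
from typing import List, Optional


def concat_model(*values: Optional[str]) -> Optional[str]:
    """Hypothetical model of concat on strings: any null input gives null."""
    if any(v is None for v in values):
        return None
    return "".join(values)


def concat_ws_scalar_model(sep: Optional[str], *values: Optional[str]) -> Optional[str]:
    """Hypothetical model of concat_ws on strings: null inputs are skipped."""
    if sep is None:
        return None
    return sep.join(v for v in values if v is not None)


# Same two rows as the nullcol column shown above.
nullcol: List[Optional[str]] = [None, "notnull"]

ws_results = [concat_ws_scalar_model("-", v, v) for v in nullcol]
concat_results = [concat_model(v, v) for v in nullcol]
# A literal null argument nulls out every row for concat.
concat_with_null = [concat_model(None, d) for d in ["123", "1234"]]
```

In this model the all-null row becomes an empty string for concat_ws and null for concat, matching the first two outputs above.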

tgravescs · 2021-05-14T13:05:27Z

Similarly, the cudf behavior for arrays with nulls doesn't match Spark: cudf returns null if any element in the array is null, whereas Spark skips null elements. If all elements are null, Spark returns an empty string.

>>> res = spark.sql("select concat_ws('-', array(1, 2, null, 4)) as res from df")
>>> res.show()
+-----+
|  res|
+-----+
|1-2-4|
|1-2-4|
+-----+

>>> spark.sql("select concat_ws('-', array(null, null)) as res from df").show(truncate=False)
+---+
|res|
+---+
|   |
|   |
+---+

Array handling for concat is different, with null elements kept in the result:

>>> spark.sql("select concat(array(null, 1), array(d)) as res from df").show(truncate=False)
+---------------+
|res            |
+---------------+
|[null, 1, 123] |
|[null, 1, 1234]|
+---------------+
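To make the mismatch concrete, the two null policies for joining an array can be written side by side in plain Python, along with array concat. This is an illustrative sketch: the function names are hypothetical, and `join_null_if_any_null` only models the cudf behavior as this comment describes it, not an actual cudf call.

```python
from typing import List, Optional


def join_skip_nulls(sep: str, arr: List[Optional[object]]) -> str:
    """Spark-style policy from the examples above: skip null elements."""
    return sep.join(str(x) for x in arr if x is not None)


def join_null_if_any_null(sep: str, arr: List[Optional[object]]) -> Optional[str]:
    """Policy this comment attributes to cudf: any null element gives null."""
    if any(x is None for x in arr):
        return None
    return sep.join(str(x) for x in arr)


def array_concat_model(*arrays: List[Optional[object]]) -> List[Optional[object]]:
    """Model of concat on arrays: elements are appended and nulls are kept."""
    out: List[Optional[object]] = []
    for a in arrays:
        out.extend(a)
    return out


spark_style = join_skip_nulls("-", [1, 2, None, 4])
cudf_style = join_null_if_any_null("-", [1, 2, None, 4])
all_nulls = join_skip_nulls("-", [None, None])
concatenated = array_concat_model([None, 1], [123])
```

The gap between `spark_style` and `cudf_style` is what a GPU implementation would have to close to match the CPU results.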

tgravescs · 2021-05-18T13:45:32Z

I discovered another weird case on the CPU: if you concatenate an empty array and then another value, it leaves off the separator:

+--------------+
|      arrnames|
+--------------+
|            []|
|[Alice2, Bob2]|
+--------------+

>>> spark.conf.set("spark.rapids.sql.enabled", "false")
>>> res = df.select(concat_ws('-', df.arrnames, lit('z')))
>>> res.show()
+-------------------------+
|concat_ws(-, arrnames, z)|
+-------------------------+
|                        z|
|            Alice2-Bob2-z|
+-------------------------+

tgravescs · 2021-05-19T15:30:21Z

Another example of an array with a null in the middle:

+--------------+
|      arrnames|
+--------------+
|[a, null, ccc]|
|[Alice2, Bob2]|
+--------------+

res = df.select(concat_ws('***',df.arrnames).alias('s'))

+-------------+
|            s|
+-------------+
|      a***ccc|
|Alice2***Bob2|
+-------------+

tgravescs · 2021-05-19T22:47:14Z

Example of Spark with an array of nulls: it doesn't matter how many nulls are in the array; as long as they are all null, Spark skips the array and leaves off the separator.

+-------+
|   name|
+-------+
|beatles|
|  romeo|
+-------+

dfnew.select(concat_ws("-", array(lit(null)), col("name"), lit('a')).alias("s")).show()
+---------+
|        s|
+---------+
|beatles-a|
|  romeo-a|
+---------+

tgravescs · 2021-05-21T19:33:34Z

cudf Java layer PR: rapidsai/cudf#8289

revans2 added the feature request (New feature or request), ? - Needs Triage (Need team to review and classify), and SQL (part of the SQL/Dataframe plugin) labels May 29, 2020

sameerz removed the ? - Needs Triage (Need team to review and classify) label Sep 1, 2020

sameerz added the cudf_dependency (An issue or PR with this label depends on a new feature in cudf) label Feb 18, 2021

razajafri self-assigned this Apr 9, 2021

sameerz added this to the Apr 12 - Apr 23 milestone Apr 9, 2021

sameerz modified the milestones: Apr 12 - Apr 23, Apr 26 - May 7 Apr 22, 2021

sameerz unassigned razajafri Apr 27, 2021

sameerz added the P0 (Must have for release) label Apr 27, 2021

tgravescs self-assigned this Apr 28, 2021

sameerz modified the milestones: Apr 26 - May 7, May 10 - May 21 May 8, 2021

ttnghia mentioned this issue May 19, 2021

Add separator-on-null parameter to strings concatenate APIs rapidsai/cudf#8282

Merged

tgravescs mentioned this issue May 21, 2021

Support concat with separator on GPU #2479

Merged

sameerz modified the milestones: May 10 - May 21, May 24 - Jun 4 May 25, 2021

tgravescs closed this as completed in #2479 May 27, 2021

tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023

Update submodule cudf to f263820 (NVIDIA#63)

19d6bfe

Signed-off-by: spark-rapids automation <[email protected]>
