[FEA] Concatenate column to scalar and scalar to column #3726

Closed
rwlee opened this issue Jan 8, 2020 · 6 comments · Fixed by #5204
Labels
feature request New feature or request Spark Functionality that helps Spark RAPIDS strings strings issues (C++ and Python)

Comments

@rwlee
Contributor

rwlee commented Jan 8, 2020

The string column concatenate function -- found in https://github.com/rapidsai/cudf/blob/branch-0.12/cpp/include/cudf/strings/combine.hpp -- currently supports concatenation of multiple string columns. However, it does not support concatenating the same string to every row in the string column (appending or prepending). This is specifically relevant to replicating Spark's concat string functionality.

Add append and prepend functions.

std::unique_ptr<column> append(column_view const& string_column,
                               string_scalar const& value,
                               string_scalar const& narep = string_scalar("",false),
                               rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource());

std::unique_ptr<column> prepend(column_view const& string_column,
                                string_scalar const& value,
                                string_scalar const& narep = string_scalar("",false),
                                rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource());

in_s = ['aa', null, '', 'aa']

append(in_s, "post", '_')
post_s = ['aapost', '_post', 'post', 'aapost']

prepend(in_s, "pre", '_')
pre_s = ['preaa', 'pre_', 'pre', 'preaa']

An alternative is to materialize a column that repeats the scalar string once per row and feed it to the existing column-to-column concatenate, but that is impractical and slow.

@rwlee rwlee added feature request New feature or request Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS strings strings issues (C++ and Python) labels Jan 8, 2020
@revans2
Contributor

revans2 commented Jan 17, 2020

So I want to clarify a few things.

First, the Spark concat_ws operator supports both columns and scalars for the separator and for all of the values being concatenated. The first argument is the separator; the remaining arguments are what should be concatenated together.

concat_ws (" ", column_a, "TEST", column_b)
concat_ws(sep_col, "A", another_column, "B")

Append and prepend are one way that we could make some of this work in combination with the existing API. But it will not be enough if the separator is a strings_column_view. That part is not a high priority right now, but it is something that we do want to support eventually.

One of the key differences beyond this is that in Spark, if a null separator is passed in, the result of the concat is also null. In the string_scalar version in cudf this results in a logic_error, which is simple for Spark to work around; but if the separator is a strings_column_view, it is much harder for us to work around.

If you want to come up with an API that allows us to pass in any combination of Scalars and string_column_views that would be ideal. If append and prepend fit more with what python needs/wants we can live with them, but it will result in materializing a lot more intermediate results.

@sriramch
Contributor

sriramch commented May 4, 2020

@rwlee @revans2 - a couple of comments/questions:

  • i have edited the description. can you please look and see if it is ok?

  • w.r.t. this:

But it will not be enough if the separator is a strings_column_view. That part is not a high priority right now, but it is something that we do want to support eventually.

from my limited knowledge, it looks like spark's concat_ws always takes in a separator as a string not a column of strings. this is from what i see here. is that no longer true?

  • i'm also trying to understand how and when this api is going to be useful. i understand that we would like to arbitrarily combine a bunch of scalars and columns to create a resultant column (without having to create temporaries). in the absence of an api that takes in a vector<element_view>, where element_view represents either a column or a scalar, wouldn't we end up materializing a greater number of intermediary columns even with this api? here is what i mean by this (cases marked * are where the new api comes out ahead; adjacent scalars are excluded, since the caller can easily concatenate them into a single scalar value):

case 1: number of columns == number of scalars

*[c0, s0] => no intermediary columns with the new api;
1 with the existing api (s0 has to be made a separate column)

[c0, s0, c1, s1] => 2 intermediary columns with the new api {c0+s0 = r0; r0+c1 = r1};
2 with the existing api as well

// more intermediary columns with the new api beyond this

case 2: number of columns > number of scalars

[c0, s0, c1] => 1 intermediary column with the new api;
1 with the existing api as well (s0 into an intermediary column)

[c0, s0, c1, s1, c2] => 3 intermediaries with the new api {c0+s0 = r0; r0+c1 = r1; r1+s1 = r2};
2 with the existing api

// more intermediary columns with the new api beyond this

case 3: number of columns < number of scalars

*[s0, c0, s1] => 1 intermediary column with the new api {s0+c0 = r0};
2 with the existing api

[s0, c0, s1, c1, s2] => 3 intermediaries with the new api
{s0+c0 = r0; r0+s1 = r1; r1+c1 = r2};
3 with the existing api as well

[s0, c0, s1, c1, s2, c2, s3] => 5 intermediaries with the new api;
4 with the existing api

@rwlee
Contributor Author

rwlee commented May 5, 2020

i have edited the description. can you please look and see if it is ok?

Looks good, thanks for catching the null replacement I missed.

from my limited knowledge, it looks like spark's concat_ws always takes in a separator as a string not a column of strings. this is from what i see here. is that no longer true?

The catalyst functionality that we're overriding supports a string column separator.

i'm also trying to understand how and when this api is going to be useful. i understand that we would like to arbitrarily combine a bunch of scalars and columns to create a resultant column (without having to create temporaries). in the absence of an api that takes in a vector<element_view>, where element_view represents either a column or a scalar, wouldn't we end up materializing a greater number of intermediary columns even with this api? here is what i mean by this.

We had planned to use this in the case of a string separator column to avoid an intermediate step of expanding the string. So in the case of [s0, c0, s1] the concatenate would act on [append(separator_col, s0), c0, prepend(separator_col, s1)] instead of [col_s0, separator_col, c0, separator_col, col_s1].

The long term hope of passing in a mixed vector of columns and scalars to remove intermediary columns is still ideal; this was just an intermediate step that allows for a much cleaner temporary implementation of concat_ws.

@revans2
Contributor

revans2 commented May 5, 2020

Thanks for answering @rwlee

I just want to clarify the link to the code that was exposed. Spark supports multiple different APIs, java, python, R, scala, and SQL. A lot of the time not all of these APIs are consistent with each other. In this case all of the APIs except SQL only support a scalar string as the separator, because it is the common case that most people would want to use, but the SQL version is more generic and does support a column of strings. Like I said it is a lower priority to support this, because it is much less common. But we do eventually want to support 100% of what Spark does. I wanted the full set of information in the feature request so that if it is simple to do we could get the full feature done now. If it is not simple then we can wait.

@sriramch
Contributor

sriramch commented May 5, 2020

thanks @rwlee @revans2 for the clarification. i understand that a per-row separator column is preferable for future-proofing, and that the existing concatenate api does not support that. however, w.r.t. materializing intermediary columns - wouldn't the new api still result in materializing as many (if not more) intermediary columns in most cases? even in the example provided by @rwlee:

So in the case of [s0, c0, s1] the concatenate would act on [append(separator_col, s0), c0, prepend(separator_col, s1)] instead of [col_s0, separator_col, c0, separator_col, col_s1].

the append and prepend would result in 2 intermediary columns, which would then be fed into the concatenate api (as a table_view of {append_out, c0, prepend_out}), vs. the same number of intermediaries with {col_s0, separator_col, c0, separator_col, col_s1} (col_s0 and col_s1 being the temporaries here, and assuming the concatenate version takes a separator column). there are also a number of cases (from the earlier comment) where the new api would result in even more intermediary columns than the existing one, as the number of column and scalar combinations grows.

@sriramch
Contributor

sriramch commented May 6, 2020

std::unique_ptr<column> concatenate(
  table_view const& string_columns,
  strings_column_view const& separators,
  string_scalar const& separator_narep   = string_scalar("", false),
  string_scalar const& column_narep      = string_scalar("", false),
  rmm::mr::device_memory_resource* mr    = rmm::mr::get_default_resource());

concatenates string_columns row-wise, using the corresponding row of separators as the separator. this api will work as follows:

  • all columns in string_columns should have the same number of rows, and this should match the number of rows in separators
  • by default, if the separator for a given row is null, the output for that row will be null; if separator_narep is a valid string, then it will be used as the separator to concatenate the string columns for that row
  • by default, if a string column value for a given row is null, it will be skipped; if column_narep is valid, then it will be used in place of the null value when concatenating that row
  • string_columns should have at least 1 column
    • if it has exactly 1 column, the output depends on the separator value for the given row
      • if the separator value is not null, the output for that row will be the input column value (replaced by column_narep if the column value for that row is null)
      • if the separator value is null and separator_narep is valid, the output for that row will be the input column value (replaced by column_narep if the column value for that row is null)
      • if the separator value is null and separator_narep is invalid, the output for that row will be null

sriramch added a commit to sriramch/cudf that referenced this issue May 14, 2020
  - this Closes rapidsai#3726
  - this emulates `concatenate_ws` spark functionality
  - provides option for a global separator and global column null replacements
  - skips null values in a row to perform concatenation