[FEA] Concatenate column to scalar and scalar to column #3726

Closed
rwlee opened this issue Jan 8, 2020 · 6 comments · Fixed by #5204
Labels
feature request New feature or request Spark Functionality that helps Spark RAPIDS strings strings issues (C++ and Python)

Comments

@rwlee
Contributor

rwlee commented Jan 8, 2020

The string column concatenate function -- found in https://github.com/rapidsai/cudf/blob/branch-0.12/cpp/include/cudf/strings/combine.hpp -- currently supports concatenation of multiple string columns. However, it does not support concatenating the same string to every row in the string column (appending or prepending). This is specifically relevant to replicating Spark's concat string functionality.

Add append and prepend functions.

std::unique_ptr<column> append(column_view const& string_column,
                               string_scalar const& value,
                               string_scalar const& narep = string_scalar("",false),
                               rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource());

std::unique_ptr<column> prepend(column_view const& string_column,
                                string_scalar const& value,
                                string_scalar const& narep = string_scalar("",false),
                                rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource());

in_s = ['aa', null, '', 'aa']

append(in_s, "post", '_')
post_s = ['aapost', '_post', 'post', 'aapost']

prepend(in_s, "pre", '_')
pre_s = ['preaa', 'pre_', 'pre', 'preaa']

An alternative is to materialize a column that repeats the scalar string once per row and feed it to the existing column-to-column concatenate, but that is impractical and slow.

@rwlee rwlee added feature request New feature or request Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS strings strings issues (C++ and Python) labels Jan 8, 2020
@revans2
Contributor

revans2 commented Jan 17, 2020

So I want to clarify a few things.

First, the Spark concat_ws operator supports both columns and scalars for the separator and for all of the values being concatenated. The first argument is the separator; the remaining arguments are what should be concatenated together.

concat_ws (" ", column_a, "TEST", column_b)
concat_ws(sep_col, "A", another_column, "B")

Append and prepend are one way that we could make some of this work in combination with the existing API. But it will not be enough if the separator is a strings_column_view. That part is not a high priority right now, but it is something that we do want to support eventually.

One of the key differences beyond this is that in Spark, if a null separator is passed in, the result of the concat is also null. In the string_scalar version in cudf this results in a logic_error, which is simple for Spark to work around; but if the separator is a strings_column_view, it is much harder for us to work around.

If you want to come up with an API that allows us to pass in any combination of Scalars and string_column_views that would be ideal. If append and prepend fit more with what python needs/wants we can live with them, but it will result in materializing a lot more intermediate results.

@sriramch
Contributor

sriramch commented May 4, 2020

@rwlee @revans2 - a couple of comments/questions:

  • i have edited the description. can you please look and see if it is ok?

  • w.r.t. this:

But it will not be enough if the separator is a strings_column_view. That part is not a high priority right now, but it is something that we do want to support eventually.

from my limited knowledge, it looks like spark's concat_ws always takes in a separator as a string not a column of strings. this is from what i see here. is that no longer true?

  • i'm also trying to understand how and when this api is going to be useful. i understand that we would like to arbitrarily combine a bunch of scalars and columns to create a resultant column (without having to create temporaries). in the absence of an api that takes in a vector<element_view>, where element_view represents either a column or a scalar, wouldn't we end up materializing a greater number of intermediary columns even with this api? here is what i mean by this (cases marked * are where the new api comes out ahead; adjacent scalars are excluded, since the caller can easily concatenate them into a single scalar value):

case 1: number of columns == number of scalars

*[c0, s0] => no intermediary columns with the new api;
1 with the existing api (s0 has to be made a separate column)

[c0, s0, c1, s1] => 2 intermediary columns with the new api {c0+s0 = r0; r0+c1 = r1};
2 with the existing api as well

// more intermediary columns with the new api beyond this

case 2: number of columns > number of scalars

[c0, s0, c1] => 1 intermediary column with the new api;
1 with the existing api as well (s0 into an intermediary column)

[c0, s0, c1, s1, c2] => 3 intermediaries with the new api {c0+s0 = r0; r0+c1 = r1; r1+s1 = r2};
2 with the existing api

// more intermediary columns with the new api beyond this

case 3: number of columns < number of scalars

*[s0, c0, s1] => 1 intermediary column with the new api {s0+c0 = r0};
2 with the existing api

[s0, c0, s1, c1, s2] => 3 intermediaries with the new api
{s0+c0 = r0; r0+s1 = r1; r1+c1 = r2};
3 with the existing api as well

[s0, c0, s1, c1, s2, c2, s3] => 5 intermediaries with the new api;
4 with the existing api

@rwlee
Contributor Author

rwlee commented May 5, 2020

i have edited the description. can you please look and see if it is ok?

Looks good, thanks for catching the null replacement I missed.

from my limited knowledge, it looks like spark's concat_ws always takes in a separator as a string not a column of strings. this is from what i see here. is that no longer true?

The catalyst functionality that we're overriding supports a string column separator.

i'm also trying to understand how and when this api is going to be useful. i understand that we would like to arbitrarily combine a bunch of scalars and columns to create a resultant column (without having to create temporaries). in the absence of an api that takes in a vector<element_view>, where element_view represents either a column or a scalar, wouldn't we end up materializing a greater number of intermediary columns even with this api? here is what i mean by this.

We had planned to use this in the case of a string separator column to avoid an intermediate step of expanding the string. So in the case of [s0, c0, s1] the concatenate would act on [append(separator_col, s0), c0, prepend(separator_col, s1)] instead of [col_s0, separator_col, c0, separator_col, col_s1].

The long term hope of passing in a mixed vector of columns and scalars to remove intermediary columns is still ideal; this was just an intermediate step that allows for a much cleaner temporary implementation of concat_ws.

@revans2
Contributor

revans2 commented May 5, 2020

Thanks for answering @rwlee

I just want to clarify the link to the code that was exposed. Spark supports multiple different APIs, java, python, R, scala, and SQL. A lot of the time not all of these APIs are consistent with each other. In this case all of the APIs except SQL only support a scalar string as the separator, because it is the common case that most people would want to use, but the SQL version is more generic and does support a column of strings. Like I said it is a lower priority to support this, because it is much less common. But we do eventually want to support 100% of what Spark does. I wanted the full set of information in the feature request so that if it is simple to do we could get the full feature done now. If it is not simple then we can wait.

@sriramch
Contributor

sriramch commented May 5, 2020

thanks @rwlee @revans2 for the clarification. i understand that a per-row separator column is preferable for future-proofing, and that the existing concatenate api does not support that. however, w.r.t. materializing intermediary columns - wouldn't the new api still result in materializing as many (if not more) intermediary columns in most cases? even in the example provided by @rwlee:

So in the case of [s0, c0, s1] the concatenate would act on [append(separator_col, s0), c0, prepend(separator_col, s1)] instead of [col_s0, separator_col, c0, separator_col, col_s1].

the append and prepend would result in 2 intermediary columns, which would then be fed into the concatenate api (as a table_view of {append_out, c0, prepend_out}), vs. the same number of intermediaries with {col_s0, separator_col, c0, separator_col, col_s1} (col_s0 and col_s1 being the temporaries here, and assuming the concatenate version takes a separator column). there are also a number of cases (from the earlier comment) where the new api would result in even more intermediary columns than the existing one, as the number of column and scalar combinations grows.

@sriramch
Contributor

sriramch commented May 6, 2020

std::unique_ptr<column> concatenate(
  table_view const& string_columns,
  strings_column_view const& separators,
  string_scalar const& separator_narep   = string_scalar("", false),
  string_scalar const& column_narep      = string_scalar("", false),
  rmm::mr::device_memory_resource* mr    = rmm::mr::get_default_resource());

concatenates string_columns row-wise, using the corresponding row of separators as the separator. this api will work as follows:

  • all columns in string_columns should have the same number of rows, and this should match the number of rows in separators
  • by default, if the separator for a given row is null, the output for that row will be null; if separator_narep is a valid string, then it will be used as the separator to concatenate the string columns for that row
  • by default, if a string column value for a given row is null, it will be skipped; if column_narep is valid, then it will be used in place of the null value when concatenating that row
  • string_columns should have at least 1 column
    • if it has exactly 1 column, the output depends on the separator value for the given row
      • if the separator value is not null, the output for that row will be the input column value (replaced by column_narep if the column value for that row is null)
      • if the separator value is null and separator_narep is valid, the output for that row will be the input column value (replaced by column_narep if the column value for that row is null)
      • if the separator value is null and separator_narep is invalid, the output for that row will be null

sriramch added a commit to sriramch/cudf that referenced this issue May 14, 2020
  - this Closes rapidsai#3726
  - this emulates `concatenate_ws` spark functionality
  - provides option for a global separator and global column null replacements
  - skips null values in a row to perform concatenation