[FEA] Concatenate column to scalar and scalar to column #3726
So I want to clarify a few things. First, the Spark `concat_ws` operator supports both string columns and scalars for the separator and for all of the values being concatenated together. The first argument is the separator; the other arguments are what should be appended together:

`concat_ws(" ", column_a, "TEST", column_b)`
`concat_ws(sep_col, "A", another_column, "B")`

Append and prepend are one way that we could make some of this work in combination with the existing API, but they will not be enough if the separator is a `strings_column_view`. That part is not a high priority right now, but it is something that we do want to support eventually. One of the key differences beyond this is that in Spark, if a null separator is passed in, the result of the concat is also null. In the `string_scalar` version in cudf this results in a `logic_error`, which is simple to work around for Spark, but if the separator is a `strings_column_view` it is much harder for us to work around. If you can come up with an API that allows us to pass in any combination of scalars and `strings_column_view`s, that would be ideal. If append and prepend fit more with what Python needs/wants, we can live with them, but it will result in materializing a lot more intermediate results.
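The null-separator difference described above can be sketched in plain Python. This is a hypothetical, pure-Python model of Spark's `concat_ws` semantics for illustration only; it is not the cudf or Spark API, and the function and its argument handling are assumptions made for the example:

```python
def concat_ws(sep, *cols):
    """Row-wise concat_ws over equal-length lists; None models a null.

    Spark semantics: a null separator yields a null result for every row,
    and null values within a row are skipped rather than nulling the row.
    Python scalars (plain strings) broadcast to every row.
    """
    # Number of rows is taken from the list arguments; scalar-only calls
    # produce a single row.
    n = max((len(c) for c in cols if isinstance(c, list)), default=1)
    if sep is None:
        # Spark: null separator -> null result for every row.
        return [None] * n
    out = []
    for i in range(n):
        parts = []
        for c in cols:
            v = c[i] if isinstance(c, list) else c  # scalar broadcasts
            if v is not None:                       # nulls are skipped
                parts.append(v)
        out.append(sep.join(parts))
    return out
```

For example, `concat_ws(" ", ["a", "b"], "TEST", ["c", None])` gives `["a TEST c", "b TEST"]`, while a `None` separator gives all-null output instead of raising, which is the behavior cudf's `string_scalar` overload currently rejects with a `logic_error`.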
@rwlee @revans2 - a couple of comments/questions:
From my limited knowledge, it looks like Spark's `concat_ws` always takes the separator as a string, not a column of strings; this is from what I see here. Is that no longer true?
Looks good, thanks for catching the null replacement I missed.
The Catalyst functionality that we're overriding supports a string column separator.
We had planned to use this in the case of a string separator column to avoid the intermediate step of expanding the string. The long-term hope of passing a mixed vector of columns and scalars to remove intermediary columns is still ideal; this was just an intermediate step that allowed for a much cleaner temporary implementation of `concat_ws`.
Thanks for answering @rwlee. I just want to clarify the link to the code that was exposed. Spark supports multiple different APIs: Java, Python, R, Scala, and SQL. A lot of the time, not all of these APIs are consistent with each other. In this case, all of the APIs except SQL only support a scalar string as the separator, because that is the common case most people would want to use, but the SQL version is more generic and does support a column of strings. Like I said, it is a lower priority to support this because it is much less common, but we do eventually want to support 100% of what Spark does. I wanted the full set of information in the feature request so that if it is simple to do, we could get the full feature done now. If it is not simple, then we can wait.
Thanks @rwlee @revans2 for the clarification. I understand that a per-row separator string column is preferable to be future proof, and that the existing concatenate API does not support that. However, with respect to materializing intermediary columns with the new API - wouldn't the new API still result in materializing as many (if not more) intermediary columns for most cases? Even in the example provided by @rwlee, the append and prepend would result in 2 intermediary columns, which would then be fed into the concatenate API (as a `table_view`).
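The materialization concern above can be made concrete with a small sketch. The helpers below are hypothetical (`append` and `concatenate` here are illustrative stand-ins, not the cudf API); the point is that emulating `concat_ws(" ", col_a, "TEST", col_b)` via append forces a full-size intermediate column into existence before the final concatenate runs:

```python
def append(col, scalar):
    # Materializes a brand-new column: every row gets `scalar` appended.
    return [None if v is None else v + scalar for v in col]

def concatenate(cols, sep):
    # Simple row-wise join of equal-length columns; any null in a row
    # nulls that row (mirroring cudf's default concatenate behavior).
    out = []
    for row in zip(*cols):
        out.append(None if any(v is None for v in row) else sep.join(row))
    return out

col_a = ["x", "y"]
col_b = ["1", "2"]

# One full-size intermediate column materialized just to splice in the scalar:
tmp = append(col_a, " TEST")             # ["x TEST", "y TEST"]
result = concatenate([tmp, col_b], " ")  # ["x TEST 1", "y TEST 2"]
```

A single variadic API accepting a mixed list of columns and scalars would skip `tmp` entirely and write each output row once.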
- Closes rapidsai#3726
- emulates Spark's `concatenate_ws` functionality
- provides an option for a global separator and global column null replacements
- skips null values in a row when performing the concatenation
The string column concatenate function -- found in https://github.com/rapidsai/cudf/blob/branch-0.12/cpp/include/cudf/strings/combine.hpp -- currently supports concatenation of multiple string columns. However, it does not support concatenating the same string to every row of a string column (appending or prepending). This is specifically relevant for replicating Spark's string concat functionality.
Add append and prepend functions.
Creating a column of a single string repeated for every row in the column to append to is an alternative solution, but impractical and slow.
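The impractical alternative mentioned above can be sketched as follows. The helpers are hypothetical stand-ins (not the cudf API): reusing the existing column-to-column concatenate requires first materializing an entire column holding one repeated string, which in the real library would mean a full device allocation and copy proportional to the row count:

```python
def make_repeated_column(scalar, num_rows):
    # In the real library this would be a full O(num_rows) device
    # allocation, all just to carry one repeated string value.
    return [scalar] * num_rows

def concatenate(cols, sep=""):
    # Row-wise join of equal-length columns.
    return [sep.join(row) for row in zip(*cols)]

col = ["a", "b", "c"]
suffix_col = make_repeated_column("_end", len(col))  # wasteful repeated copy
result = concatenate([col, suffix_col])              # ["a_end", "b_end", "c_end"]
```

A dedicated `append`/`prepend` (or a concatenate that accepts scalars directly) avoids allocating `suffix_col` at all.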