Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

strings::join_list_elements options for empty list inputs #8285

Merged
merged 26 commits into from
May 26, 2021

Conversation

ttnghia
Copy link
Contributor

@ttnghia ttnghia commented May 19, 2021

This PR implements a new option for strings::join_list_elements on top of #8282. In particular, the new option is:

/**
 * @brief Setting for specifying what will be output from `join_list_elements` when an input list
 * is empty.
 */
enum class output_if_empty_list {
  EMPTY_STRING,  ///< Empty list will result in empty string
  NULL_ELEMENT   ///< Empty list will result in a null
};

This new option is necessary for implementing concat_ws in Spark, since the behavior of the output string is required to be different depending on the situation.

Currently blocked from merging by #8282.

@ttnghia ttnghia added 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) DO NOT MERGE Hold off on merging; see PR for details non-breaking Non-breaking change labels May 19, 2021
@ttnghia ttnghia requested a review from davidwendt May 19, 2021 16:52
@ttnghia ttnghia self-assigned this May 19, 2021
@github-actions github-actions bot added CMake CMake build issue Python Affects Python cuDF API. labels May 19, 2021
@ttnghia ttnghia changed the title Refactor string combine APIs Add an option for strings::join_list_elements May 19, 2021
@ttnghia ttnghia added the feature request New feature or request label May 19, 2021
@ttnghia ttnghia changed the title Add an option for strings::join_list_elements Add an option for strings::join_list_elements [skip ci] May 19, 2021
@ttnghia ttnghia changed the title Add an option for strings::join_list_elements [skip ci] Add an option for strings::join_list_elements [skip ci] May 19, 2021
@harrism harrism changed the title Add an option for strings::join_list_elements [skip ci] strings::join_list_elements options for empty list inputs [skip ci] May 20, 2021
@ttnghia ttnghia changed the title strings::join_list_elements options for empty list inputs [skip ci] strings::join_list_elements options for empty list inputs May 20, 2021
@ttnghia ttnghia marked this pull request as ready for review May 20, 2021 16:01
@shwina
Copy link
Contributor

shwina commented May 25, 2021

FWIW, this does not match Pandas' behaviour. Pandas will return None (actually, NaN) in this case, rather than an empty string. In Pandas:

In [36]: s
Out[36]:
0             [a, b, c]
1          [a, b, None]
2    [None, None, None]
3                 None
dtype: object

In [37]: s.str.join(' ')
Out[37]:
0    a b c
1      NaN
2      NaN
3     None
dtype: object

It feels like this might be a pretty a pretty significant edge case. For example, users may want to call to join() with dropna() to remove all rows that contained nulls:

In [38]: s.str.join(' ').dropna()
Out[38]:
0    a b c
dtype: object

# whereas we will return:
# 0    a b c
# 2
# dtype: object

Is there any way to preserve the original behaviour from the Python side? Perhaps make the original behaviour opt-in in libcudf?

@ttnghia
Copy link
Contributor Author

ttnghia commented May 25, 2021

Is there any way to preserve the original behaviour from the Python side? Perhaps make the original behaviour opt-in in libcudf?

The C++ API allows to do that. I will need to change the python interface (it was setup to call C++ API with default parameters to return an empty string on all-null lists).

Copy link
Contributor

@shwina shwina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@shwina
Copy link
Contributor

shwina commented May 25, 2021

@gpucibot merge

@galipremsagar galipremsagar added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond labels May 25, 2021
@kkraus14
Copy link
Collaborator

@gpucibot merge

@ttnghia
Copy link
Contributor Author

ttnghia commented May 25, 2021

Rerun tests.

2 similar comments
@ttnghia
Copy link
Contributor Author

ttnghia commented May 25, 2021

Rerun tests.

@ttnghia
Copy link
Contributor Author

ttnghia commented May 25, 2021

Rerun tests.

@ttnghia
Copy link
Contributor Author

ttnghia commented May 26, 2021

Rerun tests.

4 similar comments
@ttnghia
Copy link
Contributor Author

ttnghia commented May 26, 2021

Rerun tests.

@ttnghia
Copy link
Contributor Author

ttnghia commented May 26, 2021

Rerun tests.

@ttnghia
Copy link
Contributor Author

ttnghia commented May 26, 2021

Rerun tests.

@ttnghia
Copy link
Contributor Author

ttnghia commented May 26, 2021

Rerun tests.

@kkraus14
Copy link
Collaborator

rerun tests

@kkraus14
Copy link
Collaborator

@gpucibot merge

@codecov
Copy link

codecov bot commented May 26, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.06@ef20706). Click here to learn what that means.
The diff coverage is n/a.

❗ Current head 786e1b9 differs from pull request most recent head 1a2f669. Consider uploading reports for the commit 1a2f669 to get more accurate results
Impacted file tree graph

@@               Coverage Diff               @@
##             branch-21.06    #8285   +/-   ##
===============================================
  Coverage                ?   82.88%           
===============================================
  Files                   ?      105           
  Lines                   ?    17872           
  Branches                ?        0           
===============================================
  Hits                    ?    14813           
  Misses                  ?     3059           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ef20706...1a2f669. Read the comment docs.

@rapids-bot rapids-bot bot merged commit 773fc7a into rapidsai:branch-21.06 May 26, 2021
rapids-bot bot pushed a commit that referenced this pull request May 26, 2021
Add java api's to be able to call string concatenate with separators.  

DO NOT MERGE until:  #8285 and #8282 are merged.  We need those changes for the API used in this PR as well as functionality to match what Spark needs.

new arguments were added to the existing concatenate api that takes a scalar for the separator, so I extended that. I added new api's for the concatenate api that takes a column as the separator.  I also added new api's for both join_list_elements api's, one with scalar separator and one with column separator.

Authors:
  - Thomas Graves (https://github.com/tgravescs)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Jason Lowe (https://github.com/jlowe)

URL: #8289
@ttnghia ttnghia deleted the strings_join_list_elements branch May 27, 2021 13:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API. Spark Functionality that helps Spark RAPIDS strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants