Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] concatenate row items using a separator defined per row #5204

Merged
merged 28 commits into from
Jun 1, 2020

Conversation

sriramch
Copy link
Contributor

sriramch added 3 commits May 14, 2020 21:32
  - this Closes rapidsai#3726
  - this emulates `concatenate_ws` spark functionality
  - provides option for a global separator and global column null replacements
  - skips null values in a row to perform concatenation
@sriramch sriramch added feature request New feature or request 3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond Spark Functionality that helps Spark RAPIDS labels May 15, 2020
@sriramch sriramch requested review from devavret and davidwendt May 15, 2020 04:29
@sriramch sriramch requested a review from a team as a code owner May 15, 2020 04:29
@codecov
Copy link

codecov bot commented May 15, 2020

Codecov Report

❗ No coverage uploaded for pull request base (branch-0.15@a784e37). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@              Coverage Diff               @@
##             branch-0.15    #5204   +/-   ##
==============================================
  Coverage               ?   88.38%           
==============================================
  Files                  ?       55           
  Lines                  ?    10489           
  Branches               ?        0           
==============================================
  Hits                   ?     9271           
  Misses                 ?     1218           
  Partials               ?        0           
Impacted Files Coverage Δ
python/cudf/cudf/utils/queryutils.py 94.00% <0.00%> (ø)
python/cudf/cudf/io/avro.py 81.81% <0.00%> (ø)
python/cudf/cudf/utils/cudautils.py 46.52% <0.00%> (ø)
python/dask_cudf/dask_cudf/io/tests/test_s3.py 95.65% <0.00%> (ø)
python/cudf/cudf/_lib/nvtx/colors.py 46.15% <0.00%> (ø)
python/cudf/cudf/core/column/column.py 86.52% <0.00%> (ø)
python/cudf/cudf/core/column/__init__.py 100.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/io/tests/test_csv.py 100.00% <0.00%> (ø)
python/cudf/cudf/utils/dtypes.py 85.38% <0.00%> (ø)
python/cudf/cudf/utils/docutils.py 100.00% <0.00%> (ø)
... and 45 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a784e37...d938fea. Read the comment docs.

Copy link
Contributor

@davidwendt davidwendt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good.
One minor thing. I've been trying to consistently use the plural strings whenever referring to a strings column in the documentation.

cpp/src/strings/combine.cu Outdated Show resolved Hide resolved
cpp/include/cudf/strings/combine.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/strings/combine.hpp Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
Copy link
Contributor

@devavret devavret left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem to directly address the requirement of the associated issue. I can approve but prefer not to merge this until @rwlee and @revans2 confirm that the alternate approach you mentioned is acceptable.

@sriramch
Copy link
Contributor Author

This doesn't seem to directly address the requirement of the associated issue. I can approve but prefer not to merge this until @rwlee and @revans2 confirm that the alternate approach you mentioned is acceptable.

[sc] i have checked with @revans2 offline and he did concur that the alternate api's requested in the ticket will result in as much (or more) intermediaries compared to existing append api. further, this api is required to support the concatenate_ws semantics with a separator per row.

@revans2
Copy link
Contributor

revans2 commented May 22, 2020

This looks fine to me @rwlee does the API look like what you expected?

@devavret
Copy link
Contributor

Just out of curiosity, what can this new API achieve that wasn’t possible with the old concatenate. Eg could you achieve the same result by creating a table_view from {col1, separator_col, col2, separator_col, col3} and an empty string scalar as separator. Is the difference only in the ability to specify different null replacements for value columns and separator columns?

If so, could this be better achieved by making the existing API more generic and take a null replacement per column of the values table? I’m not prescribing that this is how it should be done, just trying to understand the API requirements.

@sriramch
Copy link
Contributor Author

creating a table_view from {col1, separator_col, col2, separator_col, col3} and an empty string scalar as separator.

  • existing api nulls a row even if one column has a null value for the row. this skips nulls
  • existing api requires separator rep. to be valid. this doesn't enforce it, and if a separator for a given row is null, the output is null for that row
  • this api explicitly defines a separator column vs. having to sneak it in between string columns. the caller has to make it to work, which is non-intuitive

Is the difference only in the ability to specify different null replacements for value columns and separator columns?

  • this is one of them

making the existing API more generic and take a null replacement per column of the values table?

  • the existing api can be made to work, but i feel it may result in cluttering the existing api. for instance, the existing use-case requires a valid separator scalar. this cannot be enforced, if this has to be retrofitted to the new requirement, as we cannot interpret the semantics of the different columns contained within the table.

@devavret
Copy link
Contributor

  • the existing api can be made to work, but i feel it may result in cluttering the existing api

Fair enough

@rwlee
Copy link
Contributor

rwlee commented May 26, 2020

This looks fine to me @rwlee does the API look like what you expected?

Yup, this addresses the core functionality we wanted. Thanks @sriramch

@harrism
Copy link
Member

harrism commented May 28, 2020

Retargeting to 0.15 since we enter code freeze tonight.

@harrism
Copy link
Member

harrism commented May 28, 2020

Make sure you communicate with a maintainer when you open PRs. This one was not on my radar and therefore had not been added to the 0.14 project board.

@harrism harrism changed the base branch from branch-0.14 to branch-0.15 May 28, 2020 05:22
@sriramch
Copy link
Contributor Author

@devavret is this good to go?

i have now re-targeted this against 0.15.

@sriramch
Copy link
Contributor Author

sriramch commented Jun 1, 2020

@davidwendt @devavret - can one of you please merge this?

@davidwendt davidwendt merged commit 855e735 into rapidsai:branch-0.15 Jun 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond feature request New feature or request Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Concatenate column to scalar and scalar to column
6 participants