Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add strings udf C++ classes and functions for phase II #11912

Merged
merged 38 commits into from
Nov 2, 2022

Conversation

davidwendt
Copy link
Contributor

Description

Adds the C++ classes and functions for the phase II of strings udf. This specifically includes the device side string class which can be used for building udfs the create or modify strings.
Also included are some basic helper functions for split, strip, case, and numeric conversion.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt davidwendt added feature request New feature or request 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change labels Oct 12, 2022
@davidwendt davidwendt self-assigned this Oct 12, 2022
@github-actions github-actions bot added Python Affects Python cuDF API. and removed libcudf Affects libcudf (C++/CUDA) code. labels Oct 12, 2022
@codecov
Copy link

codecov bot commented Oct 12, 2022

Codecov Report

Base: 87.40% // Head: 88.11% // Increases project coverage by +0.71% 🎉

Coverage data is based on head (ce1260c) compared to base (f72c4ce).
Patch has no changes to coverable lines.

❗ Current head ce1260c differs from pull request most recent head b3a43b8. Consider uploading reports for the commit b3a43b8 to get more accurate results

Additional details and impacted files
@@               Coverage Diff                @@
##           branch-22.12   #11912      +/-   ##
================================================
+ Coverage         87.40%   88.11%   +0.71%     
================================================
  Files               133      133              
  Lines             21833    22003     +170     
================================================
+ Hits              19084    19389     +305     
+ Misses             2749     2614     -135     
Impacted Files Coverage Δ
python/strings_udf/strings_udf/__init__.py 84.31% <0.00%> (-12.57%) ⬇️
python/cudf/cudf/io/text.py 91.66% <0.00%> (-8.34%) ⬇️
python/cudf/cudf/core/_base_index.py 81.28% <0.00%> (-4.27%) ⬇️
python/cudf/cudf/io/json.py 92.06% <0.00%> (-2.68%) ⬇️
python/strings_udf/strings_udf/_typing.py 94.73% <0.00%> (-1.06%) ⬇️
python/cudf/cudf/utils/utils.py 89.91% <0.00%> (-0.69%) ⬇️
python/cudf/cudf/testing/dataset_generator.py 72.83% <0.00%> (-0.42%) ⬇️
python/dask_cudf/dask_cudf/core.py 73.72% <0.00%> (-0.41%) ⬇️
python/cudf/cudf/io/parquet.py 90.45% <0.00%> (-0.39%) ⬇️
python/dask_cudf/dask_cudf/backends.py 84.90% <0.00%> (-0.37%) ⬇️
... and 28 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@davidwendt davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Oct 18, 2022
@davidwendt davidwendt marked this pull request as ready for review October 18, 2022 19:15
@davidwendt davidwendt requested a review from a team as a code owner October 18, 2022 19:15
@davidwendt davidwendt changed the title Add strings udf C++ classes and function for phase II Add strings udf C++ classes and functions for phase II Oct 18, 2022
@davidwendt davidwendt requested a review from vyasr October 19, 2022 19:23
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good! I have a bunch of nitty-gritty comments but the core of the code is sound.

I notice some redundancy in udf_string methods (at vs [], = vs assign). From the ones that I checked they all match std::string methods so I assume that's intentional, just noting that in case there are any extra ones it would be good to clean them out.

Out of curiosity, what prompted the change in name from the earlier d_string?

python/strings_udf/cpp/include/cudf/strings/udf/split.cuh Outdated Show resolved Hide resolved
python/strings_udf/cpp/src/strings/udf/udf_apis.cu Outdated Show resolved Hide resolved
python/strings_udf/cpp/src/strings/udf/udf_apis.cu Outdated Show resolved Hide resolved
}
}

__device__ inline udf_string& udf_string::replace(cudf::size_type pos,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used the wrong variable in my example, which is probably adding to the confusion. I think I had two questions, of which only one is still relevant.

First question: what happens if overwritingpos + count characters leads to out of bounds accesses? Plugging the snippet below into the compiler explorer suggests that the code prevents out of bounds accesses. Is that something we should promise for our implementation too?

#include <iostream>
#include <string>
  
int main()
{
    std::string str{"First"};
    // This line has the same behavior for any value of count >= 5. Is that expected 
    auto count = 100;
    str.replace(0, count, "Second");
    std::cout << str << std::endl;
}

Second question: what happens if the replacement string is larger than count? That use case seems like it is fine and the reallocation is the behavior expected by the standard. I don't know if we need to document that since on second thought it seems fairly obvious that this would be standard behavior since otherwise the API would be very limited if it could only replace a string of equal length.

@davidwendt
Copy link
Contributor Author

rerun tests

@davidwendt davidwendt requested a review from vyasr November 1, 2022 11:35
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All looks fine to me. I have a few small comments, resolve them how you see fit. Thanks @davidwendt!

}
}

__device__ inline udf_string& udf_string::replace(cudf::size_type pos,
Copy link
Contributor

@bdice bdice Nov 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might deserve docs (for developers) since the behavior differs from https://en.cppreference.com/w/cpp/string/basic_string/replace in a non-obvious way. As I understand it, no exception is thrown (not possible), but a partial replacement (up to the end of the string) is performed.

} // namespace detail

// external APIs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd remove this comment.

@davidwendt
Copy link
Contributor Author

I think this might deserve docs (for developers) since the behavior differs from https://en.cppreference.com/w/cpp/string/basic_string/replace in a non-obvious way. As I understand it, no exception is thrown, and a partial replacement (up to the end of the string) is performed.

This works exactly like std::string::replace() except that no exception is thrown for out of range values (as already documented). For reference here is a sample of std::string::replace behavior: https://godbolt.org/z/zxjj6r3na

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks for iterating!

@davidwendt
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 5ace809 into rapidsai:branch-22.12 Nov 2, 2022
@davidwendt davidwendt deleted the udf-string-class branch November 2, 2022 12:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants