Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Large strings support in cudf::concatenate #15195

Merged
merged 33 commits into from
Apr 4, 2024

Conversation

davidwendt
Copy link
Contributor

@davidwendt davidwendt commented Feb 29, 2024

Description

Enables cudf::concatenate to create and return a large strings column (offsets are INT64).

This also introduces the LIBCUDF_LARGE_STRINGS_ENABLED environment variable and utilities around it.
One internal utility checks the value so appropriate logic can either throw an overflow exception or build INT64 offsets as appropriate.

The cudf::test::large_strings_enabler is introduced to set/unset the env var for individual tests are needed.
A follow on PR will attempt to consolidate these kinds of tests with a specialized test fixture using this utility class.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt davidwendt added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Feb 29, 2024
@davidwendt davidwendt self-assigned this Feb 29, 2024
@davidwendt davidwendt changed the base branch from branch-24.04 to branch-24.06 March 18, 2024 13:48
@davidwendt davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Mar 20, 2024
@davidwendt davidwendt marked this pull request as ready for review March 21, 2024 12:48
@davidwendt davidwendt requested a review from a team as a code owner March 21, 2024 12:48
cpp/include/cudf/strings/detail/utilities.hpp Outdated Show resolved Hide resolved
cpp/tests/copying/concatenate_tests.cpp Outdated Show resolved Hide resolved
@@ -226,6 +197,53 @@ TEST_F(StringColumnTest, ConcatenateTooLarge)
EXPECT_THROW(cudf::concatenate(input_cols), std::overflow_error);
}

TEST_F(StringColumnTest, ConcatenateLargeStrings)
{
CUDF_TEST_ENABLE_LARGE_STRINGS();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the long term plan for CUDF_TEST_ENABLE_LARGE_STRINGS? Will we need this forever or is it only until we turn on long strings by default?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be temporary until long strings are turned on by default.
It will be useful until then in case specific tests require them.
Also, a future PR will introduce a new test fixture where this feature is turned on by default.
And then this test may be moved to use that fixture at that time.

Copy link
Contributor

@hyperbolic2346 hyperbolic2346 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how you're making this digestible pieces. Thank you for the trouble you're going through for this.

cpp/include/cudf/strings/detail/utilities.hpp Outdated Show resolved Hide resolved
cpp/src/strings/utilities.cu Show resolved Hide resolved
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beautiful. Also I would echo @hyperbolic2346's comment, I appreciate the work that is going into making this project digestible in small PRs.

@davidwendt
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 4e44d5d into rapidsai:branch-24.06 Apr 4, 2024
70 checks passed
@davidwendt davidwendt deleted the concat-large-strings branch April 4, 2024 20:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants