Implement string list concatenation #7929

Merged
merged 58 commits into from
Apr 26, 2021
Changes from 56 commits
7d4ac5a
Re-organize function declarations, and add new declarations
ttnghia Apr 7, 2021
da41c76
Add a skeleton implementation for the new `concatenate` APIs
ttnghia Apr 7, 2021
208dd51
Change docs
ttnghia Apr 7, 2021
ff6dc6d
Add conditions for checking the parameter validity
ttnghia Apr 7, 2021
e43fe3f
Rename variable
ttnghia Apr 8, 2021
51dcbdd
Implement a function that computes row size of the output strings column
ttnghia Apr 8, 2021
94fd3ea
Finish a draft for the concatenate API
ttnghia Apr 9, 2021
b4aa581
Cleanup `combine.cu`
ttnghia Apr 9, 2021
465b821
Add one test for StringsListsConcatenateTest
ttnghia Apr 9, 2021
13a352a
Merge remote-tracking branch 'origin/branch-0.20' into concat_ws
ttnghia Apr 9, 2021
a31e5f4
Finish ScalarSeparator test
ttnghia Apr 9, 2021
4a46292
Finish SlicedListsWithScalarSeparator test
ttnghia Apr 9, 2021
b61f384
Rewrite InvalidInput test
ttnghia Apr 9, 2021
0b56cdf
Rewrite EmptyInput, ZeroSizeStringsInput, and AllNullsStringsInput tests
ttnghia Apr 9, 2021
b3260ff
Finish ColumnSeparators test
ttnghia Apr 9, 2021
c3b4ecc
Finish SlicedListsWithColumnSeparators test
ttnghia Apr 9, 2021
5b948a4
Rename variables
ttnghia Apr 9, 2021
d1063a9
Fix InvalidInput test
ttnghia Apr 9, 2021
625407c
Fix ZeroSizeStringsInput test
ttnghia Apr 9, 2021
95dcdc2
Fix AllNullsStringsInput test
ttnghia Apr 9, 2021
5dc9fa4
Implement string lists concatenation with scalar separator
ttnghia Apr 9, 2021
ce6868f
Cleanup string lists concatenation functions
ttnghia Apr 9, 2021
227efde
Fix output string size computation
ttnghia Apr 9, 2021
584ea0f
Fix child accessing for lists of strings column
ttnghia Apr 9, 2021
4b235f5
Fix slice indices for tests
ttnghia Apr 9, 2021
2e4e7b4
Fix tests for sliced input column
ttnghia Apr 9, 2021
7c16e27
Fix ClangFormat style
ttnghia Apr 9, 2021
b67e20e
Add comments
ttnghia Apr 9, 2021
468a981
Rename APIs
ttnghia Apr 10, 2021
4f65010
Cleanup and fix ClangFormat style
ttnghia Apr 10, 2021
ebf2e03
Fix ClangFormat style
ttnghia Apr 12, 2021
5f79043
Merge remote-tracking branch 'origin/branch-0.20' into concat_ws
ttnghia Apr 20, 2021
cd6eb88
Remove redundant headers
ttnghia Apr 20, 2021
3df31cd
Resolve merge conflict with branch 0.20
ttnghia Apr 20, 2021
d0eee9e
Add `make_strings_children_with_null_mask` utility function
ttnghia Apr 20, 2021
a2496db
Simplify code by using the new utility function `make_strings_childre…
ttnghia Apr 20, 2021
6715b1d
Remove null_mask if the column does not have any null element
ttnghia Apr 20, 2021
17a3e9a
Revert "Remove null_mask if the column does not have any null element"
ttnghia Apr 20, 2021
097c788
Fix string concatenation tests
ttnghia Apr 20, 2021
391735f
Fix the return null_mask: if null_count is 0 then return an empty buffer
ttnghia Apr 20, 2021
3bdd92a
Re-organize code
ttnghia Apr 20, 2021
628e541
Complete `concatenate_list_elements`
ttnghia Apr 20, 2021
ced89bc
Reorder cmake file list
ttnghia Apr 20, 2021
aab44e0
Update comments
ttnghia Apr 20, 2021
2282010
Update comment
ttnghia Apr 20, 2021
22cd1d3
Reverse changes to `concatenate.cu` and reverse fixes for `combine_te…
ttnghia Apr 20, 2021
3d03bee
Fix ClangFormat style
ttnghia Apr 20, 2021
19a2819
Extract `strings/combine_tests.cpp` into 3 separate cpp files
ttnghia Apr 21, 2021
3cf2d10
Use an additional array of int8_t type to store the validity of the s…
ttnghia Apr 21, 2021
5fe852a
Avoid calling `for_each_fn` the second time if the output chars colum…
ttnghia Apr 21, 2021
691cceb
Refactor functors to remove duplicate code
ttnghia Apr 21, 2021
d1931a4
Fix ClangFormat style
ttnghia Apr 21, 2021
f883278
Rename variable
ttnghia Apr 21, 2021
b33fd2a
Change the print parameter in unit tests
ttnghia Apr 22, 2021
76ba7d1
Rewrite comments, and remove `thrust::uninitialized_fill` for validit…
ttnghia Apr 22, 2021
647ca8f
Fix copyright header and address review comments
ttnghia Apr 22, 2021
0e52459
Add a parameter `exec_size` to allow executing the functor at a diffe…
ttnghia Apr 23, 2021
cbee766
Minor improvement
ttnghia Apr 23, 2021
1 change: 1 addition & 0 deletions cpp/CMakeLists.txt
@@ -322,6 +322,7 @@ add_library(cudf
src/strings/char_types/char_cases.cu
src/strings/char_types/char_types.cu
src/strings/combine/concatenate.cu
src/strings/combine/concatenate_list_elements.cu
src/strings/combine/join.cu
src/strings/contains.cu
src/strings/convert/convert_booleans.cu
179 changes: 138 additions & 41 deletions cpp/include/cudf/strings/combine.hpp
@@ -16,6 +16,7 @@
#pragma once

#include <cudf/column/column.hpp>
#include <cudf/lists/lists_column_view.hpp>
#include <cudf/scalar/scalar.hpp>
#include <cudf/strings/strings_column_view.hpp>
#include <cudf/table/table_view.hpp>
@@ -29,47 +30,6 @@ namespace strings {
* @brief Strings APIs for concatenate and join
*/

/**
* @brief Row-wise concatenates the given list of strings columns and
* returns a single strings column result.
*
* Each new string is created by concatenating the strings from the same
* row delimited by the separator provided.
*
* Any row with a null entry will result in the corresponding output
* row being a null entry unless a narep string is specified to be used
* in its place.
*
* The number of strings in the columns provided must be the same.
*
* @code{.pseudo}
* Example:
* s1 = ['aa', null, '', 'aa']
* s2 = ['', 'bb', 'bb', null]
* r1 = concatenate([s1,s2])
* r1 is ['aa', null, 'bb', null]
* r2 = concatenate([s1,s2],':','_')
* r2 is ['aa:', '_:bb', ':bb', 'aa:_']
* @endcode
*
* @throw cudf::logic_error if input columns are not all strings columns.
* @throw cudf::logic_error if separator is not valid.
*
* @param strings_columns List of string columns to concatenate.
* @param separator String that should be inserted between each string from each row.
* Default is an empty string.
* @param narep String that should be used in place of any null strings
* found in any column. Default of invalid-scalar means any null entry in any column will
* produce a null result for that row.
* @param mr Device memory resource used to allocate the returned column's device memory.
* @return New column with concatenated results.
*/
std::unique_ptr<column> concatenate(
table_view const& strings_columns,
string_scalar const& separator = string_scalar(""),
string_scalar const& narep = string_scalar("", false),
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Concatenates all strings in the column into one new string delimited
* by an optional separator string.
@@ -158,6 +118,143 @@ std::unique_ptr<column> concatenate(
string_scalar const& col_narep = string_scalar("", false),
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @addtogroup strings_combine
* @{
* @file strings/combine.hpp
* @brief Strings APIs for concatenate and join
*/

/**
* @brief Row-wise concatenates the given list of strings columns and
* returns a single strings column result.
*
* Each new string is created by concatenating the strings from the same
* row delimited by the separator provided.
*
* Any row with a null entry will result in the corresponding output
* row being a null entry unless a narep string is specified to be used
* in its place.
*
* The number of strings in the columns provided must be the same.
*
* @code{.pseudo}
* Example:
* s1 = ['aa', null, '', 'aa']
* s2 = ['', 'bb', 'bb', null]
* r1 = concatenate([s1,s2])
* r1 is ['aa', null, 'bb', null]
* r2 = concatenate([s1,s2],':','_')
* r2 is ['aa:', '_:bb', ':bb', 'aa:_']
* @endcode
*
* @throw cudf::logic_error if input columns are not all strings columns.
* @throw cudf::logic_error if separator is not valid.
*
* @param strings_columns List of string columns to concatenate.
* @param separator String that should be inserted between each string from each row.
* Default is an empty string.
* @param narep String that should be used in place of any null strings
* found in any column. Default of invalid-scalar means any null entry in any column will
* produce a null result for that row.
* @param mr Device memory resource used to allocate the returned column's device memory.
* @return New column with concatenated results.
*/
std::unique_ptr<column> concatenate(
table_view const& strings_columns,
string_scalar const& separator = string_scalar(""),
string_scalar const& narep = string_scalar("", false),
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
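
The null-handling rules documented for the row-wise `concatenate` can be checked against a small host-side reference model. This is only a sketch of the documented semantics, not the libcudf implementation: `opt_string`, `str_column`, and `concatenate_rows_model` are hypothetical names, `std::optional` stands in for nullable entries, and a `std::nullopt` `narep` stands in for the invalid-scalar default.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical host-side stand-ins, not libcudf types.
using opt_string = std::optional<std::string>;  // std::nullopt models a null entry
using str_column = std::vector<opt_string>;     // models a strings column

// Models the documented row-wise semantics on the host.
// A std::nullopt `narep` plays the role of the invalid-scalar default.
str_column concatenate_rows_model(std::vector<str_column> const& columns,
                                  std::string const& separator = "",
                                  opt_string const& narep      = std::nullopt)
{
  if (columns.empty()) { return {}; }
  std::size_t const num_rows = columns.front().size();
  // The header requires equal row counts; this sketch only asserts it.
  for (auto const& col : columns) { assert(col.size() == num_rows); }

  str_column result(num_rows);  // every row starts out null
  for (std::size_t row = 0; row < num_rows; ++row) {
    std::string out;
    bool row_is_null = false;
    for (std::size_t c = 0; c < columns.size(); ++c) {
      opt_string const& entry = columns[c][row];
      if (!entry && !narep) {  // null entry and no replacement: the row is null
        row_is_null = true;
        break;
      }
      if (c > 0) { out += separator; }
      out += entry ? *entry : *narep;
    }
    if (!row_is_null) { result[row] = out; }
  }
  return result;
}
```

Feeding the model the `s1` and `s2` columns from the Doxygen example reproduces both documented results, with and without a `narep`.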

/**
* @brief Given a lists column of strings (each row is a list of strings), concatenates the strings
* within each row and returns a single strings column result.
*
* Each new string is created by concatenating the strings from the same row (same list element)
* delimited by the row separator provided in the `separators` strings column.
*
* A null list row will always result in a null string in the output row. Any non-null list row
* having a null element will result in a null output row unless a valid `string_narep` scalar is
* provided to be used in its place. Any null row in the `separators` column will also result in a
* null output row unless a valid `separator_narep` scalar is provided to be used in place of the
* null separators.
*
* @code{.pseudo}
* Example:
* s = [ {'aa', 'bb', 'cc'}, null, {'', 'dd'}, {'ee', null}, {'ff', 'gg'} ]
* sep = ['::', '%%', '!', '*', null]
*
* r1 = concatenate_list_elements(s, sep)
* r1 is ['aa::bb::cc', null, '!dd', null, null]
*
* r2 = concatenate_list_elements(s, sep, ':', '_')
* r2 is ['aa::bb::cc', null, '!dd', 'ee*_', 'ff:gg']
* @endcode
*
* @throw cudf::logic_error if the input column is not a lists column of strings.
* @throw cudf::logic_error if the number of rows in `separators` and `lists_strings_column` do
* not match.
*
* @param lists_strings_column Column containing lists of strings to concatenate
* @param separators Strings column that provides separators for concatenation
* @param separator_narep String that should be used to replace null separator, default is an
* invalid-scalar denoting that rows containing null separator will result in null string in the
* corresponding output rows
* @param string_narep String that should be used to replace null strings in any
* non-null list row, default is an invalid-scalar denoting that list rows containing null strings
* will result in null string in the corresponding output rows
* @param mr Device memory resource used to allocate the returned column's
* device memory
* @return New strings column with concatenated results
*/
std::unique_ptr<column> concatenate_list_elements(
const lists_column_view& lists_strings_column,
const strings_column_view& separators,
string_scalar const& separator_narep = string_scalar("", false),
string_scalar const& string_narep = string_scalar("", false),
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
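
The interplay of the two replacement scalars in the column-separators overload is easier to follow with a host-side reference model. The sketch below only mirrors the documented semantics and is not the libcudf implementation: all type and function names are hypothetical, `std::nullopt` models both null entries and the invalid-scalar defaults, and the row-count check is an `assert` where the header documents a `cudf::logic_error`. The header text does not say what a non-null empty list produces; returning an empty string here is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical host-side stand-ins, not libcudf types.
using opt_string   = std::optional<std::string>;              // nullable string
using str_column   = std::vector<opt_string>;                 // strings column (also one list row)
using lists_column = std::vector<std::optional<str_column>>;  // nullable lists of strings

str_column concat_lists_with_column_separators_model(
  lists_column const& lists,
  str_column const& separators,
  opt_string const& separator_narep = std::nullopt,
  opt_string const& string_narep    = std::nullopt)
{
  // One separator per list row is required.
  assert(lists.size() == separators.size());

  str_column result(lists.size());  // every row starts out null
  for (std::size_t row = 0; row < lists.size(); ++row) {
    if (!lists[row]) { continue; }  // a null list row always gives a null output row

    // A null separator is replaced by `separator_narep` when that is valid.
    opt_string const& sep = separators[row].has_value() ? separators[row] : separator_narep;
    if (!sep) { continue; }  // null separator and no replacement: null row

    std::string out;
    bool row_is_null = false;
    for (std::size_t i = 0; i < lists[row]->size(); ++i) {
      opt_string const& element = (*lists[row])[i];
      if (!element && !string_narep) {  // null element and no replacement: null row
        row_is_null = true;
        break;
      }
      if (i > 0) { out += *sep; }
      out += element ? *element : *string_narep;
    }
    if (!row_is_null) { result[row] = out; }  // assumption: an empty list gives ""
  }
  return result;
}
```

Run on the `s` and `sep` inputs from the Doxygen example, the model yields the documented `r1` with the defaults and the documented `r2` when `':'` is passed as `separator_narep` and `'_'` as `string_narep`.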

/**
* @brief Given a lists column of strings (each row is a list of strings), concatenates the strings
* within each row and returns a single strings column result.
*
* Each new string is created by concatenating the strings from the same row (same list element)
* delimited by the separator provided.
*
* A null list row will always result in a null string in the output row. Any non-null list row
* having a null element will result in a null output row unless a narep string is specified to
* be used in its place.
*
* @code{.pseudo}
* Example:
* s = [ {'aa', 'bb', 'cc'}, null, {'', 'dd'}, {'ee', null}, {'ff'} ]
*
* r1 = concatenate_list_elements(s)
* r1 is ['aabbcc', null, 'dd', null, 'ff']
*
* r2 = concatenate_list_elements(s, ':', '_')
* r2 is ['aa:bb:cc', null, ':dd', 'ee:_', 'ff']
* @endcode
*
* @throw cudf::logic_error if the input column is not a lists column of strings.
* @throw cudf::logic_error if separator is not valid.
*
* @param lists_strings_column Column containing lists of strings to concatenate
* @param separator String that should be inserted between strings of each list row,
* default is an empty string
* @param narep String that should be used to replace null strings in any non-null
* list row, default is an invalid-scalar denoting that list rows containing null strings will
* result in null string in the corresponding output rows
* @param mr Device memory resource used to allocate the returned column's
* device memory
* @return New strings column with concatenated results
*/
std::unique_ptr<column> concatenate_list_elements(
const lists_column_view& lists_strings_column,
string_scalar const& separator = string_scalar(""),
string_scalar const& narep = string_scalar("", false),
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
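
For the scalar-separator overload, the same kind of host-side model shows how a single separator and a single `narep` apply to every list row. As before, this is a sketch of the documented behaviour under stated assumptions, not the libcudf code: the names are hypothetical, `std::nullopt` models nulls and the invalid-scalar default, the separator is a plain `std::string` so the "separator is not valid" error case cannot arise, and an empty non-null list is assumed to yield an empty string.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical host-side stand-ins, not libcudf types.
using opt_string   = std::optional<std::string>;              // nullable string
using str_column   = std::vector<opt_string>;                 // strings column (also one list row)
using lists_column = std::vector<std::optional<str_column>>;  // nullable lists of strings

str_column concat_lists_with_scalar_separator_model(lists_column const& lists,
                                                    std::string const& separator = "",
                                                    opt_string const& narep      = std::nullopt)
{
  str_column result(lists.size());  // every row starts out null
  for (std::size_t row = 0; row < lists.size(); ++row) {
    if (!lists[row]) { continue; }  // a null list row always gives a null output row

    std::string out;
    bool row_is_null = false;
    for (std::size_t i = 0; i < lists[row]->size(); ++i) {
      opt_string const& element = (*lists[row])[i];
      if (!element && !narep) {  // null element and no replacement: null row
        row_is_null = true;
        break;
      }
      if (i > 0) { out += separator; }
      out += element ? *element : *narep;
    }
    if (!row_is_null) { result[row] = out; }  // assumption: an empty list gives ""
  }
  return result;
}
```

With the `s` input from the Doxygen example, the defaults reproduce the documented `r1`, and passing `':'` and `'_'` reproduces the documented `r2`.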

/** @} */ // end of doxygen group
} // namespace strings
} // namespace cudf