Add libcudf strings split API that accepts regex pattern #10128

Merged on Feb 11, 2022 (25 commits)
Changes from 23 commits
Commits
eaba42e
Add libcudf strings split API that accepts regex pattern
davidwendt Jan 26, 2022
a832436
add error-checking gtests
davidwendt Jan 26, 2022
d4e5746
Merge branch 'branch-22.04' into fea-split-with-regex
davidwendt Jan 27, 2022
d33f79b
use count_matches utility
davidwendt Jan 27, 2022
9c74fdf
add split_re declaration
davidwendt Jan 27, 2022
1a89db5
split_re implementation and tests
davidwendt Jan 27, 2022
8599d0c
rename split_record_re.cu to split_re.cu
davidwendt Jan 28, 2022
b6d7453
refactored split_re/rsplit_re functions
davidwendt Jan 31, 2022
9556fc1
Merge branch 'branch-22.04' into fea-split-with-regex
davidwendt Jan 31, 2022
7bc451b
remove unneeded if-check
davidwendt Jan 31, 2022
93887b1
add all empty and all null test cases
davidwendt Jan 31, 2022
0930513
Merge branch 'branch-22.04' into fea-split-with-regex
davidwendt Feb 1, 2022
c88eeae
add more maxsplit gtests
davidwendt Feb 1, 2022
7d9d30d
Merge branch 'branch-22.04' into fea-split-with-regex
davidwendt Feb 1, 2022
c76456d
Merge branch 'branch-22.04' into fea-split-with-regex
davidwendt Feb 3, 2022
22be900
Merge branch 'branch-22.04' into fea-split-with-regex
davidwendt Feb 3, 2022
3609f2b
Merge branch 'branch-22.04' into fea-split-with-regex
davidwendt Feb 3, 2022
773047d
Merge branch 'branch-22.04' into fea-split-with-regex
davidwendt Feb 4, 2022
1e51736
Merge branch 'branch-22.04' into fea-split-with-regex
davidwendt Feb 7, 2022
eb8c326
fix doxygen typo in @throw line
davidwendt Feb 8, 2022
d6ee883
refactor max-tokens calculation into helper function
davidwendt Feb 8, 2022
0d1480b
Merge branch 'branch-22.04' into fea-split-with-regex
davidwendt Feb 8, 2022
f647cf0
fix doxygen brief and examples
davidwendt Feb 8, 2022
ed309c7
Merge branch 'branch-22.04' into fea-split-with-regex
davidwendt Feb 9, 2022
63ec0ac
Merge branch 'branch-22.04' into fea-split-with-regex
davidwendt Feb 9, 2022
1 change: 1 addition & 0 deletions conda/recipes/libcudf/meta.yaml
@@ -216,6 +216,7 @@ test:
- test -f $PREFIX/include/cudf/strings/replace_re.hpp
- test -f $PREFIX/include/cudf/strings/split/partition.hpp
- test -f $PREFIX/include/cudf/strings/split/split.hpp
- test -f $PREFIX/include/cudf/strings/split/split_re.hpp
- test -f $PREFIX/include/cudf/strings/string_view.hpp
- test -f $PREFIX/include/cudf/strings/strings_column_view.hpp
- test -f $PREFIX/include/cudf/strings/strip.hpp
1 change: 1 addition & 0 deletions cpp/CMakeLists.txt
@@ -445,6 +445,7 @@ add_library(
src/strings/search/find_multiple.cu
src/strings/split/partition.cu
src/strings/split/split.cu
src/strings/split/split_re.cu
src/strings/split/split_record.cu
src/strings/strings_column_factories.cu
src/strings/strings_column_view.cpp
232 changes: 232 additions & 0 deletions cpp/include/cudf/strings/split/split_re.hpp
@@ -0,0 +1,232 @@
/*
* Copyright (c) 2022, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#pragma once

#include <cudf/column/column.hpp>
#include <cudf/strings/strings_column_view.hpp>
#include <cudf/table/table.hpp>

namespace cudf {
namespace strings {
/**
* @addtogroup strings_split
* @{
* @file
*/

/**
* @brief Splits strings elements into a table of strings columns
* using a regex pattern to delimit each string.
*
* Each element generates a vector of strings that are stored in corresponding
* rows in the output table -- `table[col,row] = token[col] of strings[row]`
* where `token` is a substring between delimiters.
*
* The number of rows in the output table will be the same as the number of
* elements in the input column. The resulting number of columns will be the
* maximum number of tokens found in any input row.
*
* The `pattern` is used to identify the delimiters within a string
* and splitting stops when either `maxsplit` or the end of the string is reached.
*
* An empty input string will produce an empty string in the
* corresponding row of the first column.
* A null row will produce corresponding null rows in the output table.
*
* @code{.pseudo}
* s = ["a_bc def_g", "a__bc", "_ab cd", "ab_cd "]
* s1 = split_re(s, "[_ ]")
* s1 is a table of strings columns:
* [ ["a", "a", "", "ab"],
* ["bc", "", "ab", "cd"],
* ["def", "bc", "cd", ""],
* ["g", null, null, null] ]
* s2 = split_re(s, "[ _]", 1)
* s2 is a table of strings columns:
* [ ["a", "a", "", "ab"],
* ["bc def_g", "_bc", "ab cd", "cd "] ]
* @endcode
*
* @throw cudf::logic_error if `pattern` is empty.
*
* @param input A column of string elements to be split.
* @param pattern The regex pattern for delimiting characters within each string.
* @param maxsplit Maximum number of splits to perform.
* Default of -1 indicates all possible splits on each string.
* @param mr Device memory resource used to allocate the returned result's device memory.
* @return A table of columns of strings.
*/
std::unique_ptr<table> split_re(
strings_column_view const& input,
std::string const& pattern,
size_type maxsplit = -1,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Splits strings elements into a table of strings columns
* using a regex pattern to delimit each string starting from the end of the string.
*
* Each element generates a vector of strings that are stored in corresponding
* rows in the output table -- `table[col,row] = token[col] of strings[row]`
* where `token` is a substring between delimiters.
*
* The number of rows in the output table will be the same as the number of
* elements in the input column. The resulting number of columns will be the
* maximum number of tokens found in any input row.
*
* Splitting occurs by traversing starting from the end of the input string.
* The `pattern` is used to identify the delimiters within a string
* and splitting stops when either `maxsplit` or the beginning of the string
* is reached.
*
* An empty input string will produce an empty string in the
* corresponding row of the first column.
* A null row will produce corresponding null rows in the output table.
*
* @code{.pseudo}
* s = ["a_bc def_g", "a__bc", "_ab cd", "ab_cd "]
* s1 = rsplit_re(s, "[_ ]")
* s1 is a table of strings columns:
* [ ["a", "a", "", "ab"],
* ["bc", "", "ab", "cd"],
* ["def", "bc", "cd", ""],
* ["g", null, null, null] ]
* s2 = rsplit_re(s, "[ _]", 1)
* s2 is a table of strings columns:
* [ ["a_bc def", "a_", "_ab", "ab_cd"],
* ["g", "bc", "cd", ""] ]
* @endcode
*
* @throw cudf::logic_error if `pattern` is empty.
*
* @param input A column of string elements to be split.
* @param pattern The regex pattern for delimiting characters within each string.
* @param maxsplit Maximum number of splits to perform.
* Default of -1 indicates all possible splits on each string.
* @param mr Device memory resource used to allocate the returned result's device memory.
* @return A table of columns of strings.
*/
std::unique_ptr<table> rsplit_re(
strings_column_view const& input,
std::string const& pattern,
size_type maxsplit = -1,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Splits strings elements into a list column of strings
* using the given regex pattern to delimit each string.
*
* Each element generates an array of strings that are stored in an output
* lists column -- `list[row] = [token1, token2, ...] found in input[row]`
* where `token` is a substring between delimiters.
*
* The number of elements in the output column will be the same as the number of
* elements in the input column. Each individual list item will contain the
* new strings for that row. The resulting number of strings in each row can vary
* from 0 to `maxsplit + 1` when `maxsplit` is positive.
*
* The `pattern` is used to identify the delimiters within a string
* and splitting stops when either `maxsplit` or the end of the string is reached.
*
* An empty input string will produce a corresponding empty list item output row.
* A null row will produce a corresponding null output row.
*
* @code{.pseudo}
* s = ["a_bc def_g", "a__bc", "_ab cd", "ab_cd "]
* s1 = split_record_re(s, "[_ ]")
* s1 is a lists column of strings:
* [ ["a", "bc", "def", "g"],
* ["a", "", "bc"],
* ["", "ab", "cd"],
* ["ab", "cd", ""] ]
* s2 = split_record_re(s, "[ _]", 1)
* s2 is a lists column of strings:
* [ ["a", "bc def_g"],
* ["a", "_bc"],
* ["", "ab cd"],
* ["ab", "cd "] ]
* @endcode
*
* @throw cudf::logic_error if `pattern` is empty.
*
* @param input A column of string elements to be split.
* @param pattern The regex pattern for delimiting characters within each string.
* @param maxsplit Maximum number of splits to perform.
* Default of -1 indicates all possible splits on each string.
* @param mr Device memory resource used to allocate the returned result's device memory.
* @return Lists column of strings.
*/
std::unique_ptr<column> split_record_re(
strings_column_view const& input,
std::string const& pattern,
size_type maxsplit = -1,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Splits strings elements into a list column of strings
* using the given regex pattern to delimit each string starting from the end of the string.
*
* Each element generates a vector of strings that are stored in an output
* lists column -- `list[row] = [token1, token2, ...] found in input[row]`
* where `token` is a substring between delimiters.
*
* The number of elements in the output column will be the same as the number of
* elements in the input column. Each individual list item will contain the
* new strings for that row. The resulting number of strings in each row can vary
* from 0 to `maxsplit + 1` when `maxsplit` is positive.
*
* Splitting occurs by traversing starting from the end of the input string.
* The `pattern` is used to identify the separation points within a string
* and splitting stops when either `maxsplit` or the beginning of the string
* is reached.
*
* An empty input string will produce a corresponding empty list item output row.
* A null row will produce a corresponding null output row.
*
* @code{.pseudo}
* s = ["a_bc def_g", "a__bc", "_ab cd", "ab_cd "]
* s1 = rsplit_record_re(s, "[_ ]")
* s1 is a lists column of strings:
* [ ["a", "bc", "def", "g"],
* ["a", "", "bc"],
* ["", "ab", "cd"],
* ["ab", "cd", ""] ]
* s2 = rsplit_record_re(s, "[ _]", 1)
* s2 is a lists column of strings:
* [ ["a_bc def", "g"],
* ["a_", "bc"],
* ["_ab", "cd"],
* ["ab_cd", ""] ]
* @endcode
*
* @throw cudf::logic_error if `pattern` is empty.
*
* @param input A column of string elements to be split.
* @param pattern The regex pattern for delimiting characters within each string.
* @param maxsplit Maximum number of splits to perform.
* Default of -1 indicates all possible splits on each string.
* @param mr Device memory resource used to allocate the returned result's device memory.
* @return Lists column of strings.
*/
std::unique_ptr<column> rsplit_record_re(
strings_column_view const& input,
std::string const& pattern,
size_type maxsplit = -1,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/** @} */ // end of doxygen group
} // namespace strings
} // namespace cudf
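
The doxygen examples in this header can be checked on the host, without a GPU, using a minimal sketch of the documented splitting semantics. The sketch uses `std::regex` instead of libcudf's regex engine, so it illustrates the behaviour and is not the PR's implementation. `find_delimiters`, `split_tokens`, and the `from_end` flag are hypothetical names introduced here. Two assumptions go beyond what the header states: only a negative `maxsplit` is treated as "no limit", and the reverse split keeps the last `maxsplit` left-to-right matches, which may differ from the library's traversal for patterns more complex than a character class. An empty input returns an empty token list, following the record-style documentation.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

using Range = std::pair<std::size_t, std::size_t>;

// Byte ranges [begin, end) of each non-empty delimiter match, left to right.
std::vector<Range> find_delimiters(std::string const& s, std::regex const& re)
{
  std::vector<Range> out;
  for (std::sregex_iterator it(s.begin(), s.end(), re), last; it != last; ++it) {
    if (it->length() == 0) { continue; }  // ignore zero-length matches
    auto const begin = static_cast<std::size_t>(it->position());
    out.emplace_back(begin, begin + static_cast<std::size_t>(it->length()));
  }
  return out;
}

// Hypothetical host-side model of one row of split_record_re / rsplit_record_re.
// Assumption: a negative maxsplit means "split on every delimiter".
std::vector<std::string> split_tokens(std::string const& s,
                                      std::string const& pattern,
                                      int maxsplit  = -1,
                                      bool from_end = false)
{
  if (pattern.empty()) { throw std::logic_error("pattern must not be empty"); }
  if (s.empty()) { return {}; }  // documented: empty string -> empty list

  std::regex const re(pattern);
  std::vector<Range> delims = find_delimiters(s, re);

  // Keep at most maxsplit delimiters: the first ones for a forward split,
  // the last ones when splitting from the end of the string.
  if (maxsplit >= 0 && delims.size() > static_cast<std::size_t>(maxsplit)) {
    if (from_end) {
      delims.erase(delims.begin(), delims.end() - maxsplit);
    } else {
      delims.resize(static_cast<std::size_t>(maxsplit));
    }
  }

  // Tokens are the substrings between the kept delimiters.
  std::vector<std::string> tokens;
  std::size_t begin = 0;
  for (Range const& d : delims) {
    tokens.push_back(s.substr(begin, d.first - begin));
    begin = d.second;
  }
  tokens.push_back(s.substr(begin));

  // Matches the documented bound of at most maxsplit + 1 strings per row.
  if (maxsplit >= 0) { assert(tokens.size() <= static_cast<std::size_t>(maxsplit) + 1); }
  return tokens;
}
```

With these definitions, `split_tokens("ab_cd ", "[ _]", 1)` gives `{"ab", "cd "}` and `split_tokens("ab_cd ", "[ _]", 1, true)` gives `{"ab_cd", ""}`, which agrees with the `split_record_re` and `rsplit_record_re` examples in the header.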
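
The table-returning functions (`split_re`, `rsplit_re`) and the list-returning functions (`split_record_re`, `rsplit_record_re`) differ only in how the per-row tokens are laid out. The host-side sketch below shows the table layout described in the header, `table[col,row] = token[col] of strings[row]`, with rows that have fewer tokens padded with nulls. The `Row` and `Cell` structs and `to_table` are hypothetical stand-ins for libcudf columns and validity masks, used only to show the indexing.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Tokens of one input row; valid == false models a null input row.
struct Row {
  bool valid;
  std::vector<std::string> tokens;
};

// One output cell; valid == false models a null entry.
struct Cell {
  bool valid;
  std::string value;
};

// Arrange per-row tokens column-major: table[col][row] = token[col] of row.
std::vector<std::vector<Cell>> to_table(std::vector<Row> const& rows)
{
  // The number of output columns is the largest token count of any row.
  std::size_t num_columns = 0;
  for (Row const& row : rows) {
    if (row.valid) { num_columns = std::max(num_columns, row.tokens.size()); }
  }

  // Start with every cell null, then fill in the tokens that exist.
  std::vector<std::vector<Cell>> table(num_columns,
                                       std::vector<Cell>(rows.size(), Cell{false, ""}));
  for (std::size_t r = 0; r < rows.size(); ++r) {
    if (!rows[r].valid) { continue; }  // a null row stays null in every column
    for (std::size_t c = 0; c < rows[r].tokens.size(); ++c) {
      table[c][r] = Cell{true, rows[r].tokens[c]};
    }
  }
  return table;
}
```

Feeding in the four token lists from the `split_record_re` example produces four columns, where the last column holds `"g"` for the first row and null for the other three, as in the `split_re` example.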