[BUG] extract when matching an empty string will sometimes return a null #5157

Closed
revans2 opened this issue May 11, 2020 · 1 comment · Fixed by #5167
Labels: bug, libcudf, Spark, strings

revans2 commented May 11, 2020

Describe the bug
In some cases strings::extract produces a null where it should have matched an empty string and returned it.

Steps/Code to reproduce bug

diff --git a/cpp/tests/strings/extract_tests.cpp b/cpp/tests/strings/extract_tests.cpp
index 5f38c142f..816877ff1 100644
--- a/cpp/tests/strings/extract_tests.cpp
+++ b/cpp/tests/strings/extract_tests.cpp
@@ -28,6 +28,84 @@
 struct StringsExtractTests : public cudf::test::BaseFixture {
 };
 
+TEST_F(StringsExtractTests, EmptyExtractTest)
+{
+  std::vector<const char*> h_strings{
+    "First Last", "Joe Schmoe", "John Smith", "Jane Smith", "Beyonce", "Sting", nullptr, ""};
+
+  cudf::test::strings_column_wrapper strings(
+    h_strings.begin(),
+    h_strings.end(),
+    thrust::make_transform_iterator(h_strings.begin(), [](auto str) { return str != nullptr; }));
+  auto strings_view = cudf::strings_column_view(strings);
+
+  std::vector<const char*> h_expecteds{"First",
+                                       "Joe",
+                                       "John",
+                                       "Jane",
+                                       "Beyonce",
+                                       "Sting",
+                                       nullptr,
+                                       "",
+                                       "Last",
+                                       "Schmoe",
+                                       "Smith",
+                                       "Smith",
+                                       "",
+                                       "",
+                                       nullptr,
+                                       ""};
+
+  std::string pattern = "\\A(\\w*) ?(\\w*)\\Z";
+  auto results        = cudf::strings::extract(strings_view, pattern);
+
+  cudf::test::strings_column_wrapper expected1(
+    h_expecteds.data(),
+    h_expecteds.data() + h_strings.size(),
+    thrust::make_transform_iterator(h_expecteds.begin(), [](auto str) { return str != nullptr; }));
+  cudf::test::strings_column_wrapper expected2(
+    h_expecteds.data() + h_strings.size(),
+    h_expecteds.data() + h_expecteds.size(),
+    thrust::make_transform_iterator(h_expecteds.data() + h_strings.size(),
+                                    [](auto str) { return str != nullptr; }));
+  std::vector<std::unique_ptr<cudf::column>> columns;
+  columns.push_back(expected1.release());
+  columns.push_back(expected2.release());
+  cudf::experimental::table expected(std::move(columns));
+  cudf::test::expect_tables_equal(*results, expected);
+}
+
+TEST_F(StringsExtractTests, EmptyExtractTest2)
+{
+  std::vector<const char*> h_strings{
+    nullptr, "AAA", "AAA_A", "AAA_AAA_", "A__", ""};
+
+  cudf::test::strings_column_wrapper strings(
+    h_strings.begin(),
+    h_strings.end(),
+    thrust::make_transform_iterator(h_strings.begin(), [](auto str) { return str != nullptr; }));
+  auto strings_view = cudf::strings_column_view(strings);
+
+  std::vector<const char*> h_expecteds{nullptr,
+                                       "AAA",
+                                       "A",
+                                       "",
+                                       "",
+                                       ""};
+
+  std::string pattern = "([^_]*)\\Z";
+  auto results        = cudf::strings::extract(strings_view, pattern);
+
+  cudf::test::strings_column_wrapper expected1(
+    h_expecteds.data(),
+    h_expecteds.data() + h_strings.size(),
+    thrust::make_transform_iterator(h_expecteds.begin(), [](auto str) { return str != nullptr; }));
+  std::vector<std::unique_ptr<cudf::column>> columns;
+  columns.push_back(expected1.release());
+  cudf::experimental::table expected(std::move(columns));
+  cudf::test::expect_tables_equal(*results, expected);
+}
+
 TEST_F(StringsExtractTests, ExtractTest)
 {
   std::vector<const char*> h_strings{

Expected behavior
Return an empty string in every case where the pattern matches an empty string, and a null only where the pattern does not match.
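
The expected behavior can be sketched with Python's standard `re` module as a stand-in reference. This is only an illustration of the intended null-versus-empty semantics, not libcudf's implementation; the helper name `reference_extract` is hypothetical, and it assumes that `\A`, `\Z`, and `\w` behave the same way in both regex engines for these inputs.

```python
import re
from typing import List, Optional, Tuple

Row = Tuple[Optional[str], ...]


def reference_extract(pattern: str, strings: List[Optional[str]]) -> List[Row]:
    """Model the expected extract semantics (hypothetical helper, not a cudf API).

    - null input      -> null in every output column
    - no match        -> null in every output column
    - match, even an
      empty one       -> the captured text, which may be ""
    """
    compiled = re.compile(pattern)
    rows: List[Row] = []
    for value in strings:
        if value is None:
            rows.append((None,) * compiled.groups)
            continue
        match = compiled.search(value)
        if match is None:
            rows.append((None,) * compiled.groups)
        else:
            rows.append(match.groups())
    return rows


# Inputs and patterns copied from the two tests in the diff above.
names = ["First Last", "Joe Schmoe", "John Smith", "Jane Smith",
         "Beyonce", "Sting", None, ""]
name_rows = reference_extract(r"\A(\w*) ?(\w*)\Z", names)

suffix_rows = reference_extract(
    r"([^_]*)\Z", [None, "AAA", "AAA_A", "AAA_AAA_", "A__", ""])
```

Under these semantics the second pattern matches every non-null string, because an empty match at the end of the string always succeeds, so only the null input row should come back as null.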

revans2 added the bug and Needs Triage labels May 11, 2020
davidwendt commented

Looks like Pandas is returning empty strings as well:

>>> import pandas as pd
>>> ps = pd.Series(["AAA","A","","",""])
>>> ps.str.extract("([^_]*)\\Z",expand=True)
     0
0  AAA
1    A
2     
3     
4     
>>> import cudf
>>> ds = cudf.Series(["AAA","A","","",""])
>>> ds.str.extract("([^_]*)\\Z")
      0
0   AAA
1     A
2  None
3  None
4  None
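
To make the difference between the two outputs explicit, here is a small sketch that flags the rows where one result holds a null and the other an empty string. The two lists are copied by hand from the session above rather than produced by pandas or cudf, and `null_instead_of_empty` is a hypothetical helper name.

```python
from typing import List, Optional, Sequence


def null_instead_of_empty(expected: Sequence[Optional[str]],
                          actual: Sequence[Optional[str]]) -> List[int]:
    """Return row indices where `expected` has "" but `actual` has a null."""
    if len(expected) != len(actual):
        raise ValueError("columns must have the same length")
    return [
        index
        for index, (want, got) in enumerate(zip(expected, actual))
        if want == "" and got is None
    ]


# Column 0 of each result, transcribed from the REPL output above.
pandas_column = ["AAA", "A", "", "", ""]
cudf_column = ["AAA", "A", None, None, None]

mismatched_rows = null_instead_of_empty(pandas_column, cudf_column)
print(mismatched_rows)  # rows 2, 3 and 4 differ
```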

harrism added the strings, libcudf, and Spark labels May 12, 2020
davidwendt self-assigned this May 12, 2020
revans2 removed the Needs Triage label May 13, 2020