Arrow comparison LIKE/ILIKE/NLIKE kernels do not escape all special characters #1069

Closed
alamb opened this issue Dec 20, 2021 · 3 comments · Fixed by #1085

Labels
arrow (Changes to the arrow crate), bug, good first issue (Good for newcomers)

Comments

@alamb
Contributor

alamb commented Dec 20, 2021

Describe the bug

Characters such as [ and . in a LIKE pattern are sometimes interpreted as regular expression metacharacters rather than as literals.

The arrow comparison kernels such as like_utf8 https://github.com/apache/arrow-rs/blob/master/arrow/src/compute/kernels/comparison.rs#L311-L323 take limited SQL-style string matching patterns (e.g. %).

However, under the covers a regular expression matching library is used, and special regular expression characters in the pattern are not escaped before it is compiled. @ovr added code to handle ( and ) in #1042, but other special characters are still passed through unescaped.
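To make the failure mode concrete, here is a minimal sketch of an unescaped LIKE-to-regex translation. It is a simplified model and not the crate's actual code; `naive_like_to_regex` is a hypothetical name introduced only for illustration.

```rust
/// Simplified model of an unescaped LIKE-to-regex translation.
/// `naive_like_to_regex` is a hypothetical name, not an arrow API.
fn naive_like_to_regex(pattern: &str) -> String {
    // `%` matches any run of characters, `_` matches exactly one character.
    let body = pattern.replace('%', ".*").replace('_', ".");
    // Anchor both ends so the whole value has to match, as LIKE requires.
    format!("^{}$", body)
}
```

With this model, the pattern `foo%.*` becomes the regex string `^foo.*.*$`. The `.` and `*` typed by the user are indistinguishable from the ones generated for `%`, so a regex engine would accept the value `foo`.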

To Reproduce

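    use arrow::array::{BooleanArray, StringArray};

    // None of these values contains a literal ".*".
    let array: StringArray = vec!["foo", "bar", "baz"]
        .into_iter()
        .map(Some)
        .collect();

    // The "." and "*" in the pattern should be matched literally.
    let comparison = arrow::compute::like_utf8_scalar(&array, "foo%.*").unwrap();

    // No value ends in ".*", so no row should match.
    let expected: BooleanArray = vec![false, false, false]
        .into_iter()
        .map(Some)
        .collect();

    assert_eq!(comparison, expected);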

Expected behavior
This test should pass; it matches what Postgres produces:

alamb=# select * from foo;
  x  
-----
 foo
 bar
 baz
(3 rows)

alamb=# select x, x like 'foo%.*' from foo;
  x  | ?column? 
-----+----------
 foo | f
 bar | f
 baz | f
(3 rows)
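The expected results above can also be checked without any regex engine. The following is a sketch of a reference matcher for the limited syntax discussed in this issue, assuming only `%` and `_` are special and ignoring backslash escapes in the pattern. `like_reference` is a hypothetical helper, not part of arrow.

```rust
/// Reference matcher for a limited LIKE syntax: `%` matches any run of
/// characters (including none), `_` matches exactly one character, and
/// every other character is a literal.
/// Hypothetical helper for illustration; backslash escapes are not handled.
fn like_reference(value: &str, pattern: &str) -> bool {
    let v: Vec<char> = value.chars().collect();
    // reachable[i] is true when the pattern consumed so far can match v[..i].
    let mut reachable = vec![false; v.len() + 1];
    reachable[0] = true;
    for pc in pattern.chars() {
        let mut next = vec![false; v.len() + 1];
        match pc {
            '%' => {
                // Once any prefix is reachable, every longer prefix is too.
                let mut seen = false;
                for i in 0..=v.len() {
                    seen = seen || reachable[i];
                    next[i] = seen;
                }
            }
            _ => {
                // `_` consumes any one character; a literal must be equal.
                for i in 0..v.len() {
                    if reachable[i] && (pc == '_' || pc == v[i]) {
                        next[i + 1] = true;
                    }
                }
            }
        }
        reachable = next;
    }
    reachable[v.len()]
}
```

A matcher like this could serve as an oracle in tests: for each value and pattern, the kernel's result should agree with `like_reference`.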

Additional context
Follow on to #1042 where @ovr fixed the parenthesis issue

@alamb alamb added the bug label Dec 20, 2021
@alamb
Contributor Author

alamb commented Dec 20, 2021

@jwdeitch suggests the use of escape here: #1042 (comment)

@alamb alamb added the arrow and good first issue labels Dec 20, 2021
@alamb
Contributor Author

alamb commented Dec 20, 2021

Fixing this bug is likely a matter of using the code suggested by @jwdeitch and writing some tests. 🙏 thank you to whoever picks it up
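A rough sketch of what an escaping translation could look like, using only the standard library. `escape_regex_literal` is a hand-written stand-in for the escape function mentioned above, and its list of metacharacters is an assumption made for this example. `like_to_regex` is likewise a hypothetical name, not the code that was eventually merged.

```rust
/// Stand-in for an escaping helper: prefixes regex metacharacters with `\`.
/// The character list is an assumption made for this sketch.
fn escape_regex_literal(c: char, out: &mut String) {
    const META: &str = r"\.+*?()|[]{}^$#&-~";
    if META.contains(c) {
        out.push('\\');
    }
    out.push(c);
}

/// Translate a LIKE pattern into an anchored regex string, treating every
/// character other than `%` and `_` as a literal.
fn like_to_regex(pattern: &str) -> String {
    let mut out = String::from("^");
    for c in pattern.chars() {
        match c {
            '%' => out.push_str(".*"),
            '_' => out.push('.'),
            other => escape_regex_literal(other, &mut out),
        }
    }
    out.push('$');
    out
}
```

Under these assumptions, `foo%.*` translates to `^foo.*\.\*$`, which no longer matches `foo`. An alternative would be to escape the whole pattern first and then substitute `%` and `_`, which works as long as the escaping helper leaves those two characters alone.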

@Dandandan
Contributor

I am working on this - was annoyed by this already. Using escape seems like a great solution.
