Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Search - add case insensitive support for regex queries. #59441

Merged
merged 9 commits into from
Aug 25, 2020

Conversation

markharwood
Copy link
Contributor

@markharwood markharwood commented Jul 13, 2020

While Lucene has more complex bitmask flags for regex match settings like case insensitivity the JSON representation we add here is a simple case_sensitive boolean parameter which defaults to true. Currently this case logic only works with ASCII characters.

To implement this feature I have forked a copy of Lucene’s RegexpQuery and RegExp classes from Lucene master.
These classes can be removed when 8.7 Lucene is released which is the first release with the case insensitivity option.

@markharwood markharwood added >enhancement :Search/Search Search-related issues that do not fall into other categories v8.0.0 v7.9.0 labels Jul 13, 2020
@markharwood markharwood self-assigned this Jul 13, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (:Search/Search)

@elasticmachine elasticmachine added the Team:Search Meta label for search team label Jul 13, 2020
@markharwood markharwood requested a review from jpountz July 14, 2020 09:07
Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we can't make the request case-sensitive if the analyzer/normalizer lowercased the content, I wonder if we should reject setting case_sensitive: true. (I think this question was initially raised by @jimczi.)

@markharwood
Copy link
Contributor Author

markharwood commented Jul 15, 2020

I wonder if we should reject setting case_sensitive: true

We do state in the docs that case sensitivity (or insensitivity) is in relation to the indexed content so we have a justification for the logic. I don't have a strong opinion either way on whether to error or not.

The bit where I come unstuck is thinking about how to describe the default value (something we normally document).

@markharwood markharwood force-pushed the fix/59235 branch 4 times, most recently from e0f9244 to 98b0e04 Compare July 16, 2020 11:32
@markharwood
Copy link
Contributor Author

Following discussion with @jimczi we may want to reconsider the implementation here.

Rather than an explicit flag in the JSON we could consider adding passing the case insensitivity option as part of the regular expression string. There are a number of ways regex engines do this - either globally using /...../i syntax or locally within a section of the regex using ...(?i)...(?-i)

The advantage of expressing the option as part of the regex string is that query-string and KQL based inputs could be extended to make these choices (if the users know to use this syntax).
The alternative is that we stick with Elasticsearch APIs that express options in less cryptic ways and rely more on GUIs to let users select query options.

@jimczi
Copy link
Contributor

jimczi commented Jul 22, 2020

Rather than an explicit flag in the JSON we could consider adding passing the case insensitivity option as part of the regular expression string. There are a number of ways regex engines do this - either globally using /...../i syntax or locally within a section of the regex using ...(?i)...(?-i)

I made this comment for the query_string query since this is used by Kibana KQL internally. I did not meant that we should do the same for the regex query where a flag or a boolean is ok to me.

@markharwood
Copy link
Contributor Author

markharwood commented Jul 22, 2020

I made this comment for the query_string query since this is used by Kibana KQL internally. I did not meant that we should do the same for the regex query where a flag or a boolean is ok to me.

Ah - I thought we'd agreed we would want KQL, query_string and RegExpQuery to all support the same string syntax if we were to expand it. If there's two ways to express the option (in-string and in JSON boolean) there's the potential for conflicting arguments.

@markharwood markharwood force-pushed the fix/59235 branch 7 times, most recently from 870371b to 6111094 Compare July 29, 2020 12:05
@markharwood
Copy link
Contributor Author

run elasticsearch-ci/docs

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change looks good to me. I left some comments regarding the testing of the new options.

@markharwood
Copy link
Contributor Author

Thanks for the review, @jimczi
The one remaining concern I have is that user of the HLRC will make the mistake of using the RegExp.ALL flag when creating queries. In Lucene 8.6 that is 0xffff and we now reject that in RegExpQueryBuilder because RegExp87.ALL is only the range up to 0xff.
I suspect HLRC clients are unlikely to pass RegExp.ALL (RegExpQueryBuilder has DEFAULT_FLAGS_VALUE) but there's no type safety to prevent them from doing so.

Otherwise is this good to go?

@jimczi
Copy link
Contributor

jimczi commented Aug 18, 2020

I suspect HLRC clients are unlikely to pass RegExp.ALL (RegExpQueryBuilder has DEFAULT_FLAGS_VALUE) but there's no type safety to prevent them from doing so.

Can we be more lenient ? It's a breaking change imo so we could apply 0xff as a mask to clear out the extra bits ?
We pass the flags value in the rest serialization so currently all clients would fail if they don't use the updated version ?

@markharwood
Copy link
Contributor Author

Good to go @jimczi ?

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left one question for cleanup, LGTM otherwise

@markharwood markharwood merged commit 7adf766 into elastic:master Aug 25, 2020
markharwood added a commit to markharwood/elasticsearch that referenced this pull request Aug 25, 2020
(Backport)
Added case insensitive support for regex queries.
Forks a copy of Lucene’s RegexpQuery and RegExp from Lucene master.
This can be removed when 8.7 Lucene is released.

Closes elastic#59235
markharwood added a commit that referenced this pull request Aug 25, 2020
…1532)

Backport to add case insensitive support for regex queries. 
Forks a copy of Lucene’s RegexpQuery and RegExp from Lucene master.
This can be removed when 8.7 Lucene is released.

Closes #59235
@kevinqfu
Copy link

kevinqfu commented Oct 6, 2022

Hi, not sure if caused by this PR. We use regex in query_string and want the search to be case insensitive. It has been working until recently we upgraded our AWS ES cluster from 7.7 to 7.10. The same regex query became case sensitive now (it only works with lower case as in the standard analyzer). I'm unable to set case_insensitive param in query_string query.

Changing regex to wildcard makes it case insensitive again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team v7.10.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants