`analysis-common`: make `UniqueTokenFilter` public #14179

rursprung · 2024-06-11T15:04:38Z

Description

this way the filter can also be used from plugins which implement analyzers and want to use it.
the current workaround is that the plugin has to implement the usage from it in a package with the same name, which is just an ugly hack.

Related Issues

no issue, but see this discussion on slack.

Check List

(none applicable)

Functionality includes testing.
API changes companion pull request created, if applicable.
Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

rursprung · 2024-06-11T15:08:20Z

i checked: this is the only *Filter i could find in analysis-common, everything else are analyzers (already public) or factories (also public).

i will need it in the upcoming opensearch-phone-number-analyzer (#11326), see the hack done in elasticsearch-phone to work around this.

i intentionally didn't add this to the changelog as i don't think that it's noteworthy, so the skip-changelog label would be nice :)

andrross · 2024-06-11T15:17:20Z

This seems reasonable to me. @reta @msfroh Any objections to making this class public?

dblock · 2024-06-11T15:36:42Z

This is not anything that needs to be annotated with @PublicApi or something like that?

rursprung · 2024-06-11T15:40:42Z

This is not anything that needs to be annotated with @PublicApi or something like that?

none of the factories & analyzers in analysis-common are annotated with this either, so i presume not?

reta · 2024-06-11T15:40:48Z

This seems reasonable to me. @reta @msfroh Any objections to making this class public?

I understand the issue but I want to point out that nonetheless all *Factory are public, all constructors, including the UniqueTokenFilter are still package private. How does making the class public help?

github-actions · 2024-06-11T15:59:10Z

❌ Gradle check result for 4c77270: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

rursprung · 2024-06-11T19:26:21Z

❌ Gradle check result for 4c77270: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

gradle check seems to have run into a bug in the implementation?

[..]
Workflow running status :true
Still running, wait for another 30 seconds before checking again, max timeout 7200
Jenkins Workflow Url: https://build.ci.opensearch.org/job/gradle-check/40689/
time pass: 3180
parse error: Invalid numeric literal at line 1, column 7
Workflow running status :Attempt 1/10 failed. The jq processing failed with exit code: 4. Retrying...
true
Complete the run, checking results now......
Please check jenkins url for logs: https://build.ci.opensearch.org/job/gradle-check/40689/
Result: null

the actual jenkins build itself succeeded:

Finished: SUCCESS

the only issue i found about this is #13210, but that's closed with the comment "Looks like Jenkins crashed" which isn't the case here.

andrross · 2024-06-11T19:31:00Z

@dblock Currently we only use those annotations within the :server module.

this way the filter can also be used from plugins which implement analyzers and want to use it. the current workaround is that the plugin has to implement the usage from it in a package with the same name, which is just an ugly hack. Signed-off-by: Ralph Ursprung <[email protected]>

rursprung · 2024-06-11T20:42:51Z

This seems reasonable to me. @reta @msfroh Any objections to making this class public?

I understand the issue but I want to point out that nonetheless all *Factory are public, all constructors, including the UniqueTokenFilter are still package private. How does making the class public help?

i missed the constructor, thanks for pointing that out @reta!🤦
i've now made it public here and tested with my plugin that i can use it successfully (via deployToMavenLocal for the time being).
i've still just made the constructor public for UniqueTokenFilter - i guess if/when this is needed for the other classes it can be done there as well.

reta · 2024-06-11T21:00:38Z

i've still just made the constructor public for UniqueTokenFilter - i guess if/when this is needed for the other classes it can be done there as well.

@rursprung now I do have questions (sorry): what prevents you from copying UniqueTokenFilter to your plugin? I understand that reuse is often better but I believe in this case we just expose yet another class which we would have to maintain as a public API, I am -1 to that.

rursprung · 2024-06-11T21:21:26Z

@rursprung now I do have questions (sorry): what prevents you from copying UniqueTokenFilter to your plugin? I understand that reuse is often better but I believe in this case we just expose yet another class which we would have to maintain as a public API, I am -1 to that.

i think it already (sort-of) is a public API as it can be referenced from JSON (OS docs just mention it in a list, ES docs go into more details).

copying it would IMHO be the worse solution - esp. since i very much hope that the plugin i'm building will end up in opensearch-project anyway, so it'll have the same maintainers 😄
then i'd rather just say "screw it, let's just have duplicate tokens" (i don't think it actually has any real adverse effect? we're not talking about hundreds of duplicate tokens but 1-2 per field with that tokenizer) or "screw it, let's add a class in the same package to circumvent it".

reta · 2024-06-11T21:33:51Z

copying it would IMHO be the worse solution - esp

we are trully talking about ~50 lines of code with no external deps

copying it would IMHO be the worse solution - esp. since i very much hope that the plugin i'm building will end up in opensearch-project anyway, so it'll have the same maintainers 😄

what happens will happen and when happens, the impl is going to adapt to that

github-actions · 2024-06-11T22:36:44Z

✅ Gradle check result for e22345b: SUCCESS

codecov · 2024-06-11T22:38:26Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.65%. Comparing base (b15cb0c) to head (e22345b).
Report is 849 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main   #14179      +/-   ##
============================================
+ Coverage     71.42%   71.65%   +0.23%     
- Complexity    59978    61914    +1936     
============================================
  Files          4985     5117     +132     
  Lines        282275   291173    +8898     
  Branches      40946    42067    +1121     
============================================
+ Hits         201603   208648    +7045     
- Misses        63999    65284    +1285     
- Partials      16673    17241     +568

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

dblock · 2024-06-12T14:24:04Z

@rursprung I think you're arguing that this class is truly reusable functionality (API). What are the reasons? If it didn't exist, would you want to add its code to OpenSearch core and subclass it in your plugin?

rursprung · 2024-06-20T07:17:57Z

@rursprung I think you're arguing that this class is truly reusable functionality (API). What are the reasons? If it didn't exist, would you want to add its code to OpenSearch core and subclass it in your plugin?

i've been thinking about this for a while and i think it depends on the answer to the following questions: does the order of the tokens have any impact on matching (speed), score or anything else? and what's the impact of having the same token more than once?

because in my implementation i anyway have all tokens at once in a List - if the order is not relevant i can just use a Set instead and be done with it. then we can abandon this PR. even if the order were important i could use a LinkedHashSet. so most likely this would be good enough either way?
the advantage of using this filter here would be that it'd maybe make it more visible from the outside what's happening (though arguably it might also be a downside as using the tokenizer directly would then mean that you still get duplicate tokens).

dblock · 2024-06-20T22:23:11Z

I am out of my depth on the token-specific questions, let's try to get help from @msfroh?

opensearch-trigger-bot · 2024-07-22T15:22:12Z

This PR is stalled because it has been open for 30 days with no activity.

opensearch-trigger-bot · 2024-08-24T15:21:23Z

This PR is stalled because it has been open for 30 days with no activity.

rursprung · 2024-09-05T08:27:26Z

pinging @msfroh to maybe get a feedback?

msfroh · 2024-09-06T03:34:45Z

Hey -- sorry! I only just figured out how to enable @ notifications that rise above the flood of general GitHub activity email. I'll take a look at this tomorrow morning.

msfroh · 2024-09-06T17:42:19Z

So, I'm not a huge fan of taking a dependency on a module, just from a dependency management standpoint. If we want to make something available for reuse, I would move it into something under /libs. I don't think any of the existing libraries really apply, though. Maybe a dedicated library for "OpenSearch Lucene components that haven't been pushed down into Lucene"?

Regarding this specific token filter, I was thinking "Why is this useful?". Lucene has RemoveDuplicatesTokenFilter, which I think UniqueTokenFilter copied and modified. The distinction is that RemoveDuplicatesTokenFilter only discards repeated tokens in the same position (which could pop up under some synonym expansions, maybe some cases with NGramTokenFilter if the same substring occurs multiple times, I think), whereas UniqueTokenFilter will drop duplicates across the whole stream. There was a little bit of discussion on the solr-user mailing list about this almost 10 years ago: https://lists.apache.org/thread/7kdrjqzr76f8cytdfg3zg1yhlo3zogtz. The relevant reply from Walter Underwood was:

Why is that useful? It breaks phrase search.

If you want to ignore term frequency in ranking, change the Similarity class.

I can kind of see his point. If you don't want to keep track of all the positions where a term occurs, and you don't care how often the term occurs, you could set "index_options":"docs" for the field. That pushes the problem down to Lucene, which only writes one copy of each term per segment anyway, but (with the default "index_options":"positions") would otherwise keep track of the positions and frequency of repeated occurrences. Nowadays, you could use the match_only_text field type as well -- that sets "index_options":"docs", but also enables (slow) phrase matching by running an analyzer against docs with term matches at search time.

So, I guess my real question is "Why do you need this specific token filter?". It looks like it was something added to Elasticsearch back in 2011, but where the Lucene/Solr community didn't think it was a good idea. I'm inclined to agree with the Lucene/Solr folks of the day. If anything, I would be more inclined mark this token filter for deprecation and removal. (The "good" use is already covered by the remove_duplicates filter.)

I would go back and check the Slack conversation, but we just crossed the 3-month retention window. (I can see your messages blurred behind the "Upgrade to Slack Pro!" banner.)

rursprung · 2024-09-06T21:29:21Z

thanks for your reply, @msfroh!

I would go back and check the Slack conversation, but we just crossed the 3-month retention window. (I can see your messages blurred behind the "Upgrade to Slack Pro!" banner.)

i have the relevant links in my first reply on this issue:

i will need it in the upcoming opensearch-phone-number-analyzer (#11326), see the hack done in elasticsearch-phone to work around this.

as mentioned in this comment i've now implemented a version which just does the filtering by using a Set instead of a List: #11326 (comment)
this way no dependency is needed, but i don't know how "nice"/"clean" this is from a Lucene/OS point of view?

msfroh · 2024-09-06T23:25:42Z

Aha! Got it.

From a Lucene standpoint, I don't think deduping is really necessary. Given that the tokens also don't seem to have/need a particular position (or at least all of the ngrams should probably share a position), you could set the PositionIncrementAttribute to 0. Then RemoveDuplicatesTokenFilter would deduplicate any repeated tokens.

msfroh · 2024-09-06T23:27:06Z

Incidentally, I would absolutely add your analyzer to the analysis-common module.

rursprung · 2024-09-12T16:56:56Z

no longer required based on the feedback received here. thanks everyone!

see #15915 for the new approach :)

rursprung requested review from anasalkouz, andrross, Bukhtawar, CEHENKLE, dblock, dbwiddis, dreamer-89, gbbafna, kotwanikunal, mch2, msfroh, nknize, owaiskazi19, reta, Rishikesh1159, sachinpkale, saratvemulapalli, shwetathareja, sohami, tlfeng and VachaShah as code owners June 11, 2024 15:04

andrross added the skip-changelog label Jun 11, 2024

andrross approved these changes Jun 11, 2024

View reviewed changes

rursprung force-pushed the make-UniqueTokenFilter-public branch from 4c77270 to e22345b Compare June 11, 2024 20:42

rursprung mentioned this pull request Jun 26, 2024

[RFC] new plugin with normalizer & analyzer for phone numbers #11326

Closed

opensearch-trigger-bot bot added stalled Issues that have stalled and removed stalled Issues that have stalled labels Jul 22, 2024

opensearch-trigger-bot bot added stalled Issues that have stalled and removed stalled Issues that have stalled labels Aug 24, 2024

rursprung closed this Sep 12, 2024

rursprung deleted the make-UniqueTokenFilter-public branch September 12, 2024 16:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`analysis-common`: make `UniqueTokenFilter` public #14179

`analysis-common`: make `UniqueTokenFilter` public #14179

rursprung commented Jun 11, 2024

rursprung commented Jun 11, 2024 •

edited

Loading

andrross commented Jun 11, 2024

dblock commented Jun 11, 2024

rursprung commented Jun 11, 2024

reta commented Jun 11, 2024

github-actions bot commented Jun 11, 2024

rursprung commented Jun 11, 2024

andrross commented Jun 11, 2024

rursprung commented Jun 11, 2024

reta commented Jun 11, 2024

rursprung commented Jun 11, 2024

reta commented Jun 11, 2024 •

edited

Loading

github-actions bot commented Jun 11, 2024

codecov bot commented Jun 11, 2024 •

edited

Loading

dblock commented Jun 12, 2024 •

edited

Loading

rursprung commented Jun 20, 2024

dblock commented Jun 20, 2024

opensearch-trigger-bot bot commented Jul 22, 2024

opensearch-trigger-bot bot commented Aug 24, 2024

rursprung commented Sep 5, 2024

msfroh commented Sep 6, 2024

msfroh commented Sep 6, 2024 •

edited

Loading

rursprung commented Sep 6, 2024

msfroh commented Sep 6, 2024

msfroh commented Sep 6, 2024

rursprung commented Sep 12, 2024

analysis-common: make UniqueTokenFilter public #14179

analysis-common: make UniqueTokenFilter public #14179

Conversation

rursprung commented Jun 11, 2024

Description

Related Issues

Check List

rursprung commented Jun 11, 2024 • edited Loading

andrross commented Jun 11, 2024

dblock commented Jun 11, 2024

rursprung commented Jun 11, 2024

reta commented Jun 11, 2024

github-actions bot commented Jun 11, 2024

rursprung commented Jun 11, 2024

andrross commented Jun 11, 2024

rursprung commented Jun 11, 2024

reta commented Jun 11, 2024

rursprung commented Jun 11, 2024

reta commented Jun 11, 2024 • edited Loading

github-actions bot commented Jun 11, 2024

codecov bot commented Jun 11, 2024 • edited Loading

Codecov Report

dblock commented Jun 12, 2024 • edited Loading

rursprung commented Jun 20, 2024

dblock commented Jun 20, 2024

opensearch-trigger-bot bot commented Jul 22, 2024

opensearch-trigger-bot bot commented Aug 24, 2024

rursprung commented Sep 5, 2024

msfroh commented Sep 6, 2024

msfroh commented Sep 6, 2024 • edited Loading

rursprung commented Sep 6, 2024

msfroh commented Sep 6, 2024

msfroh commented Sep 6, 2024

rursprung commented Sep 12, 2024

`analysis-common`: make `UniqueTokenFilter` public #14179

`analysis-common`: make `UniqueTokenFilter` public #14179

rursprung commented Jun 11, 2024 •

edited

Loading

reta commented Jun 11, 2024 •

edited

Loading

codecov bot commented Jun 11, 2024 •

edited

Loading

dblock commented Jun 12, 2024 •

edited

Loading

msfroh commented Sep 6, 2024 •

edited

Loading