Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

analysis-common: make UniqueTokenFilter public #14179

Conversation

rursprung
Copy link
Contributor

Description

this way the filter can also be used from plugins which implement analyzers and want to use it.
the current workaround is that the plugin has to implement the usage from it in a package with the same name, which is just an ugly hack.

Related Issues

no issue, but see this discussion on slack.

Check List

(none applicable)

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@rursprung
Copy link
Contributor Author

rursprung commented Jun 11, 2024

i checked: this is the only *Filter i could find in analysis-common, everything else are analyzers (already public) or factories (also public).

i will need it in the upcoming opensearch-phone-number-analyzer (#11326), see the hack done in elasticsearch-phone to work around this.

i intentionally didn't add this to the changelog as i don't think that it's noteworthy, so the skip-changelog label would be nice :)

@andrross
Copy link
Member

This seems reasonable to me. @reta @msfroh Any objections to making this class public?

@dblock
Copy link
Member

dblock commented Jun 11, 2024

This is not anything that needs to be annotated with @PublicApi or something like that?

@rursprung
Copy link
Contributor Author

This is not anything that needs to be annotated with @PublicApi or something like that?

none of the factories & analyzers in analysis-common are annotated with this either, so i presume not?

@reta
Copy link
Collaborator

reta commented Jun 11, 2024

This seems reasonable to me. @reta @msfroh Any objections to making this class public?

I understand the issue but I want to point out that nonetheless all *Factory are public, all constructors, including the UniqueTokenFilter are still package private. How does making the class public help?

Copy link
Contributor

❌ Gradle check result for 4c77270: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@rursprung
Copy link
Contributor Author

❌ Gradle check result for 4c77270: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

gradle check seems to have run into a bug in the implementation?

[..]
Workflow running status :true
Still running, wait for another 30 seconds before checking again, max timeout 7200
Jenkins Workflow Url: https://build.ci.opensearch.org/job/gradle-check/40689/
time pass: 3180
parse error: Invalid numeric literal at line 1, column 7
Workflow running status :Attempt 1/10 failed. The jq processing failed with exit code: 4. Retrying...
true
Complete the run, checking results now......
Please check jenkins url for logs: https://build.ci.opensearch.org/job/gradle-check/40689/
Result: null

the actual jenkins build itself succeeded:

Finished: SUCCESS

the only issue i found about this is #13210, but that's closed with the comment "Looks like Jenkins crashed" which isn't the case here.

@andrross
Copy link
Member

@dblock Currently we only use those annotations within the :server module.

this way the filter can also be used from plugins which implement
analyzers and want to use it.
the current workaround is that the plugin has to implement the usage
from it in a package with the same name, which is just an ugly hack.

Signed-off-by: Ralph Ursprung <[email protected]>
@rursprung rursprung force-pushed the make-UniqueTokenFilter-public branch from 4c77270 to e22345b Compare June 11, 2024 20:42
@rursprung
Copy link
Contributor Author

This seems reasonable to me. @reta @msfroh Any objections to making this class public?

I understand the issue but I want to point out that nonetheless all *Factory are public, all constructors, including the UniqueTokenFilter are still package private. How does making the class public help?

i missed the constructor, thanks for pointing that out @reta!🤦
i've now made it public here and tested with my plugin that i can use it successfully (via deployToMavenLocal for the time being).
i've still just made the constructor public for UniqueTokenFilter - i guess if/when this is needed for the other classes it can be done there as well.

@reta
Copy link
Collaborator

reta commented Jun 11, 2024

i've still just made the constructor public for UniqueTokenFilter - i guess if/when this is needed for the other classes it can be done there as well.

@rursprung now I do have questions (sorry): what prevents you from copying UniqueTokenFilter to your plugin? I understand that reuse is often better but I believe in this case we just expose yet another class which we would have to maintain as a public API, I am -1 to that.

@rursprung
Copy link
Contributor Author

@rursprung now I do have questions (sorry): what prevents you from copying UniqueTokenFilter to your plugin? I understand that reuse is often better but I believe in this case we just expose yet another class which we would have to maintain as a public API, I am -1 to that.

i think it already (sort-of) is a public API as it can be referenced from JSON (OS docs just mention it in a list, ES docs go into more details).

copying it would IMHO be the worse solution - esp. since i very much hope that the plugin i'm building will end up in opensearch-project anyway, so it'll have the same maintainers 😄
then i'd rather just say "screw it, let's just have duplicate tokens" (i don't think it actually has any real adverse effect? we're not talking about hundreds of duplicate tokens but 1-2 per field with that tokenizer) or "screw it, let's add a class in the same package to circumvent it".

@reta
Copy link
Collaborator

reta commented Jun 11, 2024

copying it would IMHO be the worse solution - esp

we are trully talking about ~50 lines of code with no external deps

copying it would IMHO be the worse solution - esp. since i very much hope that the plugin i'm building will end up in opensearch-project anyway, so it'll have the same maintainers 😄

what happens will happen and when happens, the impl is going to adapt to that

Copy link
Contributor

✅ Gradle check result for e22345b: SUCCESS

Copy link

codecov bot commented Jun 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.65%. Comparing base (b15cb0c) to head (e22345b).
Report is 849 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #14179      +/-   ##
============================================
+ Coverage     71.42%   71.65%   +0.23%     
- Complexity    59978    61914    +1936     
============================================
  Files          4985     5117     +132     
  Lines        282275   291173    +8898     
  Branches      40946    42067    +1121     
============================================
+ Hits         201603   208648    +7045     
- Misses        63999    65284    +1285     
- Partials      16673    17241     +568     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@dblock
Copy link
Member

dblock commented Jun 12, 2024

@rursprung I think you're arguing that this class is truly reusable functionality (API). What are the reasons? If it didn't exist, would you want to add its code to OpenSearch core and subclass it in your plugin?

@rursprung
Copy link
Contributor Author

@rursprung I think you're arguing that this class is truly reusable functionality (API). What are the reasons? If it didn't exist, would you want to add its code to OpenSearch core and subclass it in your plugin?

i've been thinking about this for a while and i think it depends on the answer to the following questions: does the order of the tokens have any impact on matching (speed), score or anything else? and what's the impact of having the same token more than once?

because in my implementation i anyway have all tokens at once in a List - if the order is not relevant i can just use a Set instead and be done with it. then we can abandon this PR. even if the order were important i could use a LinkedHashSet. so most likely this would be good enough either way?
the advantage of using this filter here would be that it'd maybe make it more visible from the outside what's happening (though arguably it might also be a downside as using the tokenizer directly would then mean that you still get duplicate tokens).

@dblock
Copy link
Member

dblock commented Jun 20, 2024

I am out of my depth on the token-specific questions, let's try to get help from @msfroh?

@opensearch-trigger-bot
Copy link
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added stalled Issues that have stalled and removed stalled Issues that have stalled labels Jul 22, 2024
@opensearch-trigger-bot
Copy link
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added stalled Issues that have stalled and removed stalled Issues that have stalled labels Aug 24, 2024
@rursprung
Copy link
Contributor Author

pinging @msfroh to maybe get a feedback?

@msfroh
Copy link
Collaborator

msfroh commented Sep 6, 2024

Hey -- sorry! I only just figured out how to enable @ notifications that rise above the flood of general GitHub activity email. I'll take a look at this tomorrow morning.

@msfroh
Copy link
Collaborator

msfroh commented Sep 6, 2024

So, I'm not a huge fan of taking a dependency on a module, just from a dependency management standpoint. If we want to make something available for reuse, I would move it into something under /libs. I don't think any of the existing libraries really apply, though. Maybe a dedicated library for "OpenSearch Lucene components that haven't been pushed down into Lucene"?

Regarding this specific token filter, I was thinking "Why is this useful?". Lucene has RemoveDuplicatesTokenFilter, which I think UniqueTokenFilter copied and modified. The distinction is that RemoveDuplicatesTokenFilter only discards repeated tokens in the same position (which could pop up under some synonym expansions, maybe some cases with NGramTokenFilter if the same substring occurs multiple times, I think), whereas UniqueTokenFilter will drop duplicates across the whole stream. There was a little bit of discussion on the solr-user mailing list about this almost 10 years ago: https://lists.apache.org/thread/7kdrjqzr76f8cytdfg3zg1yhlo3zogtz. The relevant reply from Walter Underwood was:

Why is that useful? It breaks phrase search.

If you want to ignore term frequency in ranking, change the Similarity class.

I can kind of see his point. If you don't want to keep track of all the positions where a term occurs, and you don't care how often the term occurs, you could set "index_options":"docs" for the field. That pushes the problem down to Lucene, which only writes one copy of each term per segment anyway, but (with the default "index_options":"positions") would otherwise keep track of the positions and frequency of repeated occurrences. Nowadays, you could use the match_only_text field type as well -- that sets "index_options":"docs", but also enables (slow) phrase matching by running an analyzer against docs with term matches at search time.

So, I guess my real question is "Why do you need this specific token filter?". It looks like it was something added to Elasticsearch back in 2011, but where the Lucene/Solr community didn't think it was a good idea. I'm inclined to agree with the Lucene/Solr folks of the day. If anything, I would be more inclined mark this token filter for deprecation and removal. (The "good" use is already covered by the remove_duplicates filter.)

I would go back and check the Slack conversation, but we just crossed the 3-month retention window. (I can see your messages blurred behind the "Upgrade to Slack Pro!" banner.)

@rursprung
Copy link
Contributor Author

thanks for your reply, @msfroh!

I would go back and check the Slack conversation, but we just crossed the 3-month retention window. (I can see your messages blurred behind the "Upgrade to Slack Pro!" banner.)

i have the relevant links in my first reply on this issue:

i will need it in the upcoming opensearch-phone-number-analyzer (#11326), see the hack done in elasticsearch-phone to work around this.

as mentioned in this comment i've now implemented a version which just does the filtering by using a Set instead of a List: #11326 (comment)
this way no dependency is needed, but i don't know how "nice"/"clean" this is from a Lucene/OS point of view?

@msfroh
Copy link
Collaborator

msfroh commented Sep 6, 2024

Aha! Got it.

From a Lucene standpoint, I don't think deduping is really necessary. Given that the tokens also don't seem to have/need a particular position (or at least all of the ngrams should probably share a position), you could set the PositionIncrementAttribute to 0. Then RemoveDuplicatesTokenFilter would deduplicate any repeated tokens.

@msfroh
Copy link
Collaborator

msfroh commented Sep 6, 2024

Incidentally, I would absolutely add your analyzer to the analysis-common module.

@rursprung
Copy link
Contributor Author

no longer required based on the feedback received here. thanks everyone!

see #15915 for the new approach :)

@rursprung rursprung closed this Sep 12, 2024
@rursprung rursprung deleted the make-UniqueTokenFilter-public branch September 12, 2024 16:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants