analysis-common: make UniqueTokenFilter public #14179
i checked: this is the only i will need it in the upcoming opensearch-phone-number-analyzer (#11326), see the hack done in elasticsearch-phone to work around this. i intentionally didn't add this to the changelog as i don't think that it's noteworthy, so the |
This is not anything that needs to be annotated with |
none of the factories & analyzers in |
❌ Gradle check result for 4c77270: null Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
gradle check seems to have run into a bug in the implementation?
the actual jenkins build itself succeeded:
the only issue i found about this is #13210, but that's closed with the comment "Looks like Jenkins crashed" which isn't the case here.
@dblock Currently we only use those annotations within the |
this way the filter can also be used from plugins which implement analyzers and want to use it. the current workaround is that the plugin has to implement the usage from it in a package with the same name, which is just an ugly hack. Signed-off-by: Ralph Ursprung <[email protected]>
4c77270 to e22345b
i missed the constructor, thanks for pointing that out @reta! 🤦
@rursprung now I do have questions (sorry): what prevents you from copying |
i think it already (sort-of) is a public API as it can be referenced from JSON (OS docs just mention it in a list, ES docs go into more details). copying it would IMHO be the worse solution - esp. since i very much hope that the plugin i'm building will end up in |
we are truly talking about ~50 lines of code with no external deps
what happens will happen, and when it happens, the impl is going to adapt to that
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #14179      +/-   ##
============================================
+ Coverage     71.42%   71.65%    +0.23%
- Complexity    59978    61914     +1936
============================================
  Files          4985     5117      +132
  Lines        282275   291173     +8898
  Branches      40946    42067     +1121
============================================
+ Hits         201603   208648     +7045
- Misses        63999    65284     +1285
- Partials      16673    17241      +568
☔ View full report in Codecov by Sentry.
@rursprung I think you're arguing that this class is truly reusable functionality (API). What are the reasons? If it didn't exist, would you want to add its code to OpenSearch core and subclass it in your plugin?
i've been thinking about this for a while and i think it depends on the answer to the following questions: does the order of the tokens have any impact on matching (speed), score or anything else? and what's the impact of having the same token more than once? because in my implementation i anyway have all tokens at once in a |
I am out of my depth on the token-specific questions, let's try to get help from @msfroh?
This PR is stalled because it has been open for 30 days with no activity.
pinging @msfroh to maybe get some feedback?
Hey -- sorry! I only just figured out how to enable @ notifications that rise above the flood of general GitHub activity email. I'll take a look at this tomorrow morning.
So, I'm not a huge fan of taking a dependency on a module, just from a dependency management standpoint. If we want to make something available for reuse, I would move it into something under
Regarding this specific token filter, I was thinking "Why is this useful?". Lucene has
I can kind of see his point. If you don't want to keep track of all the positions where a term occurs, and you don't care how often the term occurs, you could set
So, I guess my real question is "Why do you need this specific token filter?". It looks like it was something added to Elasticsearch back in 2011, but where the Lucene/Solr community didn't think it was a good idea. I'm inclined to agree with the Lucene/Solr folks of the day. If anything, I would be more inclined to mark this token filter for deprecation and removal. (The "good" use is already covered by the
I would go back and check the Slack conversation, but we just crossed the 3-month retention window. (I can see your messages blurred behind the "Upgrade to Slack Pro!" banner.)
thanks for your reply, @msfroh!
i have the relevant links in my first reply on this issue:
as mentioned in this comment i've now implemented a version which just does the filtering by using a |
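For readers following along: the collection type used in that implementation is not visible in this thread. The block below is only a rough sketch of what filtering duplicates inside a plugin, without UniqueTokenFilter, could look like. It assumes the tokens are already available as plain strings, and the class name, method name, sample tokens and the choice of LinkedHashSet are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class TokenDedupSketch {

    // Hypothetical helper (not from the PR): drops repeated tokens while
    // keeping the order in which each token first appeared.
    static List<String> uniqueTokens(List<String> tokens) {
        // LinkedHashSet ignores elements it already contains and iterates
        // in insertion order, so first occurrences keep their positions.
        return new ArrayList<>(new LinkedHashSet<>(tokens));
    }

    public static void main(String[] args) {
        // Made-up sample tokens, e.g. overlapping fragments of a phone number.
        List<String> tokens = List.of("41", "4179", "41", "79", "4179");
        System.out.println(uniqueTokens(tokens)); // prints [41, 4179, 79]
    }
}
```

If the order of the tokens turns out not to matter, a plain HashSet would do the same job. Keeping insertion order only makes the output deterministic.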
Aha! Got it. From a Lucene standpoint, I don't think deduping is really necessary. Given that the tokens also don't seem to have/need a particular position (or at least all of the ngrams should probably share a position), you could set the |
Incidentally, I would absolutely add your analyzer to the analysis-common module.
no longer required based on the feedback received here. thanks everyone! see #15915 for the new approach :)
Description
this way the filter can also be used from plugins which implement analyzers and want to use it.
the current workaround is that the plugin has to implement the usage from it in a package with the same name, which is just an ugly hack.
Related Issues
no issue, but see this discussion on slack.
Check List
(none applicable)
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.