Update lucene to version 8.11.2 #15
…queries, search as before
I think "a" is not a stop word in this context, because it is a verb here. But my French is not that good.
I don't speak all of these languages, so I sometimes just changed the test to reflect the output. At least that should prevent regression.
@@ -17,7 +17,7 @@
 * © CrossWire Bible Society, 2008 - 2016
 *
 */
-package org.crosswire.jsword.index.lucene.analysis;
+package org.apache.lucene.analysis;
why are these moved away from our namespace (org.crosswire)?
Because I needed to access some protected methods of Lucene classes in order to implement AbstractBookAnalyzer. Outside of a subclass, access to protected members is only allowed from the same package.
Hmm, then this solution is somewhat hacky. Options to consider:
- Fork the Lucene analysis lib and remove protected from that particular class (and make an upstream PR). Use the fork while it is needed.
- Maybe it is protected for a reason? Use some other way if the lib authors suggest one.
- Accept the hackiness and just leave it like this.
I found out that there's a public abstract class, StopwordAnalyzerBase, that could probably be used as a base class. At least that is the base class, within the Lucene core lib, that is used for the per-language classes there.
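To illustrate the idea of building on a public base class instead of moving code into the library's package, here is a minimal, library-free sketch. It does not use Lucene: `BaseAnalyzer`, `EnglishLikeAnalyzer` and `StopwordSketch` are hypothetical stand-ins, and the whitespace tokenizer is an assumption made only to keep the example short. The point is that a subclass can override a protected hook and read a protected field of its base class, which also holds when the subclass lives in a different package.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Entry point comes first so the file can be launched directly with `java`.
class StopwordSketch {
    public static void main(String[] args) {
        BaseAnalyzer analyzer = new EnglishLikeAnalyzer(Set.of("the", "a"));
        System.out.println(analyzer.analyze("The Word became a Light"));
    }
}

// Hypothetical stand-in for a library base class (NOT Lucene's API).
abstract class BaseAnalyzer {
    // Protected state, visible to subclasses.
    protected final Set<String> stopwords;

    protected BaseAnalyzer(Set<String> stopwords) {
        this.stopwords = stopwords;
    }

    // Protected hook: each subclass decides how raw text is split into tokens.
    protected abstract List<String> tokenize(String text);

    // Public entry point: tokenize, lower-case, then drop stop words.
    public final List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String token : tokenize(text)) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!lower.isEmpty() && !stopwords.contains(lower)) {
                out.add(lower);
            }
        }
        return out;
    }
}

// Hypothetical per-language analyzer. Being a subclass is what grants
// access to the protected constructor, field and hook above.
final class EnglishLikeAnalyzer extends BaseAnalyzer {
    EnglishLikeAnalyzer(Set<String> stopwords) {
        super(stopwords);
    }

    @Override
    protected List<String> tokenize(String text) {
        // Assumption for the sketch: plain whitespace splitting.
        return List.of(text.trim().split("\\s+"));
    }
}
```

Whether this pattern is enough for AbstractBookAnalyzer depends on which protected Lucene methods it actually needs, so treat it as a direction to check rather than a drop-in fix.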
The question arises: are all the custom per-language analyzers still really needed, or could we simplify the code by using the analyzers from Lucene core directly?
AbstractBookAnalyzer carries book info, and it is passed to some filter classes, but none of those seems to use that information. I have a feeling that all of that could be simplified greatly.
I agree, I'll look into it
import org.apache.lucene.util.Version;

/**
 * An Analyzer whose {@link TokenStream} is built from a
- * {@link ArabicLetterTokenizer} filtered with {@link LowerCaseFilter},
+ * {@link StandardTokenizer} filtered with {@link LowerCaseFilter},
- Arabic needs to be tested
This gave access to some new features in Lucene, such as regular expression search. This is a major refactor because I upgraded Lucene by 5 major versions.
I tested several languages (English, Czech, Chinese, Japanese, Thai), and search works in these languages. I am not able to judge whether the stemming is good for all languages, so some more testing by native speakers is necessary.