forked from crosswire/jsword
Update lucene to version 8.11.2 #15
Closed
Changes from all commits (18 commits)
- Pull translations (tuomas2, 8258128)
- Compiles (JJK96, 41a8b6d)
- Uncleaned version that supports regex searching (JJK96, fbeaac7)
- For regex queries search in full non-canonical text, while for other … (JJK96, 982ce80)
- Add switch for regex search type (JJK96, 4239e9c)
- Make Regex search case insensitive (JJK96, 4c92c9c)
- Fix Thai analyzer (JJK96, a06ecda)
- Fix Hebrew analyser (JJK96, c784ccc)
- Fix Arabic (JJK96, 7c43cca)
- Fix Persian (JJK96, d7616bc)
- Remove local.properties (JJK96, 02fa61f)
- Fix analyzer references (JJK96, 54c73b6)
- Fix tests (JJK96, a4f26c2)
- Add local.properties to gitignore (JJK96, c3933c7)
- Add smartcn analyzer (JJK96, d26a312)
- Fix Chinese and Japanese (JJK96, f00f512)
- Fix French stemmer test (JJK96, f355696)
- Fix all tests (JJK96)
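Several of the commits above add regex searching and make it case-insensitive. The query-side code is not part of this excerpt, so the following is only a rough sketch of what a case-insensitive regex search against a Lucene 8.11 index can look like; the index path, the "content" field name and the lower-casing trick are assumptions, not taken from the PR.

```java
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.RegexpQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class RegexSearchSketch {
    public static void main(String[] args) throws Exception {
        // Index location and field name are placeholders, not taken from the PR.
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/tmp/jsword-index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // The analyzers in this PR lower-case tokens at index time, so one simple way to make
            // a regex query case-insensitive is to lower-case the pattern before building the query.
            String pattern = "Belov.*".toLowerCase();
            RegexpQuery query = new RegexpQuery(new Term("content", pattern));
            TopDocs hits = searcher.search(query, 10);
            System.out.println("matches: " + hits.totalHits);
        }
    }
}
```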
.gitignore

```diff
@@ -15,3 +15,5 @@ rebel.xml
 /.gradle/
 /build/
 atlassian-ide-plugin.xml
+.DS_Store
+local.properties
```
ArabicLuceneAnalyzer.java

```diff
@@ -17,7 +17,7 @@
  * © CrossWire Bible Society, 2009 - 2016
  *
  */
-package org.crosswire.jsword.index.lucene.analysis;
+package org.apache.lucene.analysis;
 
 import java.io.IOException;
 import java.io.Reader;
@@ -26,14 +26,14 @@
 import org.apache.lucene.analysis.StopFilter;
 import org.apache.lucene.analysis.TokenStream;
 import org.apache.lucene.analysis.ar.ArabicAnalyzer;
-import org.apache.lucene.analysis.ar.ArabicLetterTokenizer;
 import org.apache.lucene.analysis.ar.ArabicNormalizationFilter;
 import org.apache.lucene.analysis.ar.ArabicStemFilter;
+import org.apache.lucene.analysis.standard.StandardTokenizer;
 import org.apache.lucene.util.Version;
 
 /**
  * An Analyzer whose {@link TokenStream} is built from a
- * {@link ArabicLetterTokenizer} filtered with {@link LowerCaseFilter},
+ * {@link StandardTokenizer} filtered with {@link LowerCaseFilter},
  * {@link ArabicNormalizationFilter}, {@link ArabicStemFilter} (optional) and
  * Arabic {@link StopFilter} (optional).
  *
@@ -45,50 +45,20 @@ public ArabicLuceneAnalyzer() {
         stopSet = ArabicAnalyzer.getDefaultStopSet();
     }
 
-    /* (non-Javadoc)
-     * @see org.apache.lucene.analysis.Analyzer#tokenStream(java.lang.String, java.io.Reader)
-     */
     @Override
-    public final TokenStream tokenStream(String fieldName, Reader reader) {
-        TokenStream result = new ArabicLetterTokenizer(reader);
-        result = new LowerCaseFilter(result);
+    protected TokenStreamComponents createComponents(String fieldName) {
+        Tokenizer source = new StandardTokenizer();
+        TokenStream result = new LowerCaseFilter(source);
         result = new ArabicNormalizationFilter(result);
         if (doStopWords && stopSet != null) {
-            result = new StopFilter(false, result, stopSet);
+            result = new StopFilter(result, (CharArraySet) stopSet);
         }
 
         if (doStemming) {
             result = new ArabicStemFilter(result);
         }
 
-        return result;
+        return new TokenStreamComponents(source, result);
     }
 
-    /* (non-Javadoc)
-     * @see org.apache.lucene.analysis.Analyzer#reusableTokenStream(java.lang.String, java.io.Reader)
-     */
-    @Override
-    public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
-        SavedStreams streams = (SavedStreams) getPreviousTokenStream();
-        if (streams == null) {
-            streams = new SavedStreams(new ArabicLetterTokenizer(reader));
-            streams.setResult(new LowerCaseFilter(streams.getResult()));
-            streams.setResult(new ArabicNormalizationFilter(streams.getResult()));
-
-            if (doStopWords && stopSet != null) {
-                streams.setResult(new StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion), streams.getResult(), stopSet));
-            }
-
-            if (doStemming) {
-                streams.setResult(new ArabicStemFilter(streams.getResult()));
-            }
-
-            setPreviousTokenStream(streams);
-        } else {
-            streams.getSource().reset(reader);
-        }
-        return streams.getResult();
-    }
-
-    private final Version matchVersion = Version.LUCENE_29;
 }
```
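For context on the shape of this change: since Lucene 4, an Analyzer no longer overrides tokenStream/reusableTokenStream; it implements the single protected createComponents factory, and the Analyzer base class takes care of stream reuse. A minimal sketch of how such an analyzer is consumed (the field name and sample text are illustrative, not from the PR):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.ArabicLuceneAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerUsageSketch {
    public static void main(String[] args) throws Exception {
        // ArabicLuceneAnalyzer is the class from this PR; any createComponents-based
        // Analyzer is consumed the same way.
        try (Analyzer analyzer = new ArabicLuceneAnalyzer();
             TokenStream ts = analyzer.tokenStream("content", "illustrative sample text")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();                      // required before the first incrementToken()
            while (ts.incrementToken()) {
                System.out.println(term.toString());
            }
            ts.end();                        // consume end-of-stream state
        }
    }
}
```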
src/main/java/org/apache/lucene/analysis/CzechLuceneAnalyzer.java (new file, 48 additions)

```java
/**
 * Distribution License:
 * JSword is free software; you can redistribute it and/or modify it under
 * the terms of the GNU Lesser General Public License, version 2.1 or later
 * as published by the Free Software Foundation. This program is distributed
 * in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even
 * the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 * See the GNU Lesser General Public License for more details.
 *
 * The License is available on the internet at:
 *     http://www.gnu.org/copyleft/lgpl.html
 * or by writing to:
 *     Free Software Foundation, Inc.
 *     59 Temple Place - Suite 330
 *     Boston, MA 02111-1307, USA
 *
 * © CrossWire Bible Society, 2007 - 2016
 *
 */
package org.apache.lucene.analysis;

import org.apache.lucene.analysis.core.LetterTokenizer;
import org.apache.lucene.analysis.cz.CzechAnalyzer;

/**
 * An Analyzer whose {@link TokenStream} is built from a
 * {@link LetterTokenizer} filtered with {@link LowerCaseFilter and @link StopFilter} (optional).
 * Stemming not implemented yet
 *
 * @see gnu.lgpl.License The GNU Lesser General Public License for details.
 * @author Sijo Cherian
 * @author DM SMITH
 */
final public class CzechLuceneAnalyzer extends AbstractBookAnalyzer {
    public CzechLuceneAnalyzer() {
        stopSet = CzechAnalyzer.getDefaultStopSet();
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new LetterTokenizer();
        TokenStream result = new LowerCaseFilter(source);
        if (doStopWords && stopSet != null) {
            result = new StopFilter(result, (CharArraySet) stopSet);
        }
        return new TokenStreamComponents(source, result);
    }
}
```
src/main/java/org/apache/lucene/analysis/EnglishLuceneAnalyzer.java (new file, 63 additions)

```java
/**
 * Distribution License:
 * JSword is free software; you can redistribute it and/or modify it under
 * the terms of the GNU Lesser General Public License, version 2.1 or later
 * as published by the Free Software Foundation. This program is distributed
 * in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even
 * the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 * See the GNU Lesser General Public License for more details.
 *
 * The License is available on the internet at:
 *     http://www.gnu.org/copyleft/lgpl.html
 * or by writing to:
 *     Free Software Foundation, Inc.
 *     59 Temple Place - Suite 330
 *     Boston, MA 02111-1307, USA
 *
 * © CrossWire Bible Society, 2007 - 2016
 *
 */
package org.apache.lucene.analysis;

import org.apache.lucene.analysis.core.LetterTokenizer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.en.PorterStemFilter;

/**
 * English Analyzer works like lucene SimpleAnalyzer + Stemming.
 * (LowerCaseTokenizer > PorterStemFilter). Like the AbstractAnalyzer,
 * {@link StopFilter} is off by default.
 *
 * @see gnu.lgpl.License The GNU Lesser General Public License for details.
 * @author sijo cherian
 */
final public class EnglishLuceneAnalyzer extends AbstractBookAnalyzer {

    public EnglishLuceneAnalyzer() {
        stopSet = EnglishAnalyzer.ENGLISH_STOP_WORDS_SET;
    }

    /**
     * Constructs a {@link LetterTokenizer} with {@link LowerCaseFilter} filtered by a language filter
     * {@link StopFilter} and {@link PorterStemFilter} for English.
     */
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new LetterTokenizer();
        TokenStream result = new LowerCaseFilter(source);

        if (doStopWords && stopSet != null) {
            result = new StopFilter(result, (CharArraySet) stopSet);
        }

        // Using Porter Stemmer
        if (doStemming) {
            result = new PorterStemFilter(result);
        }

        return new TokenStreamComponents(source, result);
    }

}
```
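The PR's own test changes ("Fix tests", "Fix French stemmer test", "Fix all tests") are not visible in this excerpt. As a sketch only, a check for the English analyzer could be written with Lucene's BaseTokenStreamTestCase; the expected tokens below assume AbstractBookAnalyzer enables stemming by default, which this diff does not show.

```java
import org.apache.lucene.analysis.BaseTokenStreamTestCase;
import org.apache.lucene.analysis.EnglishLuceneAnalyzer;

public class EnglishLuceneAnalyzerSketchTest extends BaseTokenStreamTestCase {
    public void testPorterStemming() throws Exception {
        // Assumes doStemming is on by default in AbstractBookAnalyzer (not shown in this diff).
        // LetterTokenizer splits on non-letters, LowerCaseFilter lower-cases, and the Porter
        // stemmer reduces "Loving" to "love" and "kindness" to "kind".
        assertAnalyzesTo(new EnglishLuceneAnalyzer(), "Loving kindness",
                new String[] { "love", "kind" });
    }
}
```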
Conversation
Why are these moved away from our namespace (org.crosswire)?
Because I needed to access some protected methods of Lucene classes in order to implement AbstractBookAnalyzer. Outside of a subclass, protected access is only allowed from within the same package.
Hmm, then the solution is somewhat hacky. Options to consider:
I found out that there's a public abstract class StopwordAnalyzerBase that could probably be used as a base class. At least that is the base class that the per-language analyzers within the Lucene core lib use.
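For reference, a minimal sketch (not part of the PR) of what one of these analyzers might look like if it extended Lucene's StopwordAnalyzerBase and stayed in the org.crosswire namespace. The class name is invented, and the doStemming/doStopWords switches from AbstractBookAnalyzer are left out.

```java
package org.crosswire.jsword.index.lucene.analysis;

import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.StopwordAnalyzerBase;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LetterTokenizer;
import org.apache.lucene.analysis.cz.CzechAnalyzer;

public final class CzechAnalyzerSketch extends StopwordAnalyzerBase {
    public CzechAnalyzerSketch() {
        super(CzechAnalyzer.getDefaultStopSet()); // stop set still comes from Lucene's CzechAnalyzer
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new LetterTokenizer();
        TokenStream result = new LowerCaseFilter(source);
        result = new StopFilter(result, stopwords); // 'stopwords' is StopwordAnalyzerBase's protected field
        return new TokenStreamComponents(source, result);
    }
}
```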
The question arises whether all the custom per-language analyzers are still really needed, or whether we could simplify the code by using the analyzers from Lucene core directly.
AbstractBookAnalyzer carries book info and passes it to some filter classes, but none of them seems to use that information. I have a feeling that all of this could be simplified greatly.
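A sketch of that direction, indexing with a per-language analyzer that ships with Lucene instead of a custom subclass; the field name, sample text and in-memory directory are illustrative only.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.cz.CzechAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;

public class CoreAnalyzerSketch {
    public static void main(String[] args) throws Exception {
        // Lucene's own Czech analyzer already combines tokenization, lower-casing,
        // stop words and stemming, so no subclass in org.apache.lucene.analysis is needed.
        Analyzer analyzer = new CzechAnalyzer();
        try (IndexWriter writer = new IndexWriter(new ByteBuffersDirectory(),
                new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("content", "Na počátku stvořil Bůh nebe a zemi.", Field.Store.NO));
            writer.addDocument(doc);
        }
    }
}
```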
I agree, I'll look into it