Update lucene to version 8.11.2 #15

Closed · wants to merge 18 commits
2 changes: 2 additions & 0 deletions .gitignore
@@ -15,3 +15,5 @@ rebel.xml
 /.gradle/
 /build/
 atlassian-ide-plugin.xml
+.DS_Store
+local.properties
10 changes: 6 additions & 4 deletions build.gradle.kts
@@ -16,7 +16,7 @@ tasks.withType<Test>() {
}

group = "org.crosswire"
-version = "2.3"
+version = "2.4"

repositories {
mavenCentral()
@@ -25,11 +25,13 @@ repositories {
dependencies {
// implementation("org.jetbrains.kotlin:kotlin-stdlib")
 implementation("org.apache.commons:commons-compress:1.12")
+implementation("com.chenlb.mmseg4j:mmseg4j-analysis:1.8.6")
+implementation("com.chenlb.mmseg4j:mmseg4j-dic:1.8.6")

 implementation("org.jdom:jdom2:2.0.6.1")
-implementation("org.apache.lucene:lucene-analyzers:3.6.2")
+implementation("org.apache.lucene:lucene-analyzers-common:8.11.2")
+implementation("org.apache.lucene:lucene-analyzers-smartcn:8.11.2")
+implementation("org.apache.lucene:lucene-analyzers-kuromoji:8.11.2")
+implementation("org.apache.lucene:lucene-queryparser:8.11.2")

 // To upgrade Lucene, change to
 // implementation("org.apache.lucene:lucene-analyzers-common:x")

@@ -17,7 +17,7 @@
* © CrossWire Bible Society, 2007 - 2016
*
*/
-package org.crosswire.jsword.index.lucene.analysis;
+package org.apache.lucene.analysis;

import java.util.Set;

@@ -17,7 +17,7 @@
* © CrossWire Bible Society, 2008 - 2016
*
*/
-package org.crosswire.jsword.index.lucene.analysis;
+package org.apache.lucene.analysis;
Reviewer: Why are these moved away from our namespace (org.crosswire)?

Author: Because I needed to access some protected methods of Lucene classes in order to implement AbstractBookAnalyzer. Access to protected members is only allowed from within the same package.

Reviewer: Hmm, then the solution is somewhat hacky. Options to consider:

  1. Fork the Lucene analysis lib and remove protected from that particular class (and make an upstream PR). Use the fork while it is needed.
  2. Maybe protected is there for a reason? Use some other way if the lib authors suggest something.
  3. Accept the hackiness and just leave it like this.

Reviewer: I found out that there is a public abstract class StopwordAnalyzerBase that could probably be used as a base class. At least, that is the base class within the Lucene core lib that is used for the per-language classes there.
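For reference, a minimal sketch of that approach, assuming Lucene 8.11.2 (the class name SketchBookAnalyzer and the choice of EnglishAnalyzer's stop set are illustrative only): StopwordAnalyzerBase is public, so a subclass can stay in the org.crosswire namespace and still reach the protected stopwords field and constructor.

```java
package org.crosswire.jsword.index.lucene.analysis;

import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.StopwordAnalyzerBase;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LetterTokenizer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;

/** Illustrative analyzer built on the public StopwordAnalyzerBase base class. */
public final class SketchBookAnalyzer extends StopwordAnalyzerBase {

    public SketchBookAnalyzer() {
        // The protected constructor takes the stop set; English is just an example here.
        super(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new LetterTokenizer();
        TokenStream result = new LowerCaseFilter(source);
        // "stopwords" is the protected CharArraySet inherited from StopwordAnalyzerBase.
        result = new StopFilter(result, stopwords);
        return new TokenStreamComponents(source, result);
    }
}
```

This would avoid placing code in the org.apache.lucene.analysis package while keeping the same createComponents structure used in the rest of this PR.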

Reviewer: The question arises whether all the custom per-language analyzers are still really needed, or whether we could simplify the code by using analyzers from Lucene core directly.

AbstractBookAnalyzer carries book info and passes it to some filter classes, but none of them seems to use that information. I have a feeling that all of this could be simplified greatly.

Author: I agree, I'll look into it.
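If that simplification pans out, one possibility (the factory name and language list below are purely hypothetical, and this assumes the book metadata and the per-book switches really are unused) is to map language codes straight to the stock Lucene analyzers:

```java
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.ar.ArabicAnalyzer;
import org.apache.lucene.analysis.cz.CzechAnalyzer;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;

/** Hypothetical replacement for the custom per-language analyzers. */
final class StockAnalyzerFactory {

    // Stock analyzers bundle their own tokenizer, stop words and stemming.
    private static final Map<String, Analyzer> BY_LANGUAGE = Map.of(
            "ar", new ArabicAnalyzer(),
            "cs", new CzechAnalyzer(),
            "de", new GermanAnalyzer(),
            "en", new EnglishAnalyzer());

    static Analyzer forLanguage(String code) {
        // Fall back to English for languages without a dedicated analyzer.
        return BY_LANGUAGE.getOrDefault(code, new EnglishAnalyzer());
    }
}
```

The trade-off is losing the doStopWords/doStemming toggles of AbstractBookAnalyzer, so this sketch only applies if those options turn out to be unused.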


import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
@@ -17,7 +17,7 @@
* © CrossWire Bible Society, 2007 - 2016
*
*/
-package org.crosswire.jsword.index.lucene.analysis;
+package org.apache.lucene.analysis;

import java.io.IOException;

@@ -17,7 +17,7 @@
* © CrossWire Bible Society, 2009 - 2016
*
*/
-package org.crosswire.jsword.index.lucene.analysis;
+package org.apache.lucene.analysis;

import java.io.IOException;
import java.io.Reader;
@@ -26,14 +26,14 @@
 import org.apache.lucene.analysis.StopFilter;
 import org.apache.lucene.analysis.TokenStream;
 import org.apache.lucene.analysis.ar.ArabicAnalyzer;
-import org.apache.lucene.analysis.ar.ArabicLetterTokenizer;
 import org.apache.lucene.analysis.ar.ArabicNormalizationFilter;
 import org.apache.lucene.analysis.ar.ArabicStemFilter;
+import org.apache.lucene.analysis.standard.StandardTokenizer;
 import org.apache.lucene.util.Version;

/**
* An Analyzer whose {@link TokenStream} is built from a
-* {@link ArabicLetterTokenizer} filtered with {@link LowerCaseFilter},
+* {@link StandardTokenizer} filtered with {@link LowerCaseFilter},
Reviewer: Arabic needs to be tested.
* {@link ArabicNormalizationFilter}, {@link ArabicStemFilter} (optional) and
* Arabic {@link StopFilter} (optional).
*
@@ -45,50 +45,20 @@ public ArabicLuceneAnalyzer() {
stopSet = ArabicAnalyzer.getDefaultStopSet();
}

-    /* (non-Javadoc)
-     * @see org.apache.lucene.analysis.Analyzer#tokenStream(java.lang.String, java.io.Reader)
-     */
     @Override
-    public final TokenStream tokenStream(String fieldName, Reader reader) {
-        TokenStream result = new ArabicLetterTokenizer(reader);
-        result = new LowerCaseFilter(result);
+    protected TokenStreamComponents createComponents(String fieldName) {
+        Tokenizer source = new StandardTokenizer();
+        TokenStream result = new LowerCaseFilter(source);
         result = new ArabicNormalizationFilter(result);
         if (doStopWords && stopSet != null) {
-            result = new StopFilter(false, result, stopSet);
+            result = new StopFilter(result, (CharArraySet) stopSet);
         }

         if (doStemming) {
             result = new ArabicStemFilter(result);
         }

-        return result;
+        return new TokenStreamComponents(source, result);
     }

-    /* (non-Javadoc)
-     * @see org.apache.lucene.analysis.Analyzer#reusableTokenStream(java.lang.String, java.io.Reader)
-     */
-    @Override
-    public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
-        SavedStreams streams = (SavedStreams) getPreviousTokenStream();
-        if (streams == null) {
-            streams = new SavedStreams(new ArabicLetterTokenizer(reader));
-            streams.setResult(new LowerCaseFilter(streams.getResult()));
-            streams.setResult(new ArabicNormalizationFilter(streams.getResult()));
-
-            if (doStopWords && stopSet != null) {
-                streams.setResult(new StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion), streams.getResult(), stopSet));
-            }
-
-            if (doStemming) {
-                streams.setResult(new ArabicStemFilter(streams.getResult()));
-            }
-
-            setPreviousTokenStream(streams);
-        } else {
-            streams.getSource().reset(reader);
-        }
-        return streams.getResult();
-    }
-
-    private final Version matchVersion = Version.LUCENE_29;
 }
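Per the review note that Arabic needs to be tested: in Lucene 8 a rewritten analyzer can be smoke-tested by consuming its TokenStream directly. A small helper sketch (the class name AnalyzerDebug is illustrative, assuming Lucene 8.x):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

final class AnalyzerDebug {

    /** Collects the terms an analyzer emits for the given text. */
    static List<String> tokens(Analyzer analyzer, String text) throws IOException {
        List<String> out = new ArrayList<>();
        try (TokenStream ts = analyzer.tokenStream("content", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();                      // mandatory before incrementToken()
            while (ts.incrementToken()) {
                out.add(term.toString());
            }
            ts.end();                        // try-with-resources closes the stream
        }
        return out;
    }
}
```

Calling this with ArabicLuceneAnalyzer and a few Arabic sample strings would show whether the StandardTokenizer swap changes the emitted terms compared to the old ArabicLetterTokenizer behavior.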
@@ -17,28 +17,23 @@
* © CrossWire Bible Society, 2007 - 2016
*
*/
-package org.crosswire.jsword.index.lucene.analysis;
+package org.apache.lucene.analysis;

 import java.io.IOException;
 import java.io.Reader;
 import java.util.HashMap;
 import java.util.Map;
 import java.util.Set;

-import org.apache.lucene.analysis.LowerCaseTokenizer;
-import org.apache.lucene.analysis.StopAnalyzer;
 import org.apache.lucene.analysis.StopFilter;
 import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.core.LetterTokenizer;
 import org.apache.lucene.analysis.de.GermanAnalyzer;
+import org.apache.lucene.analysis.en.EnglishAnalyzer;
 import org.apache.lucene.analysis.fr.FrenchAnalyzer;
 import org.apache.lucene.analysis.nl.DutchAnalyzer;
 import org.apache.lucene.analysis.snowball.SnowballFilter;
 import org.apache.lucene.util.Version;
 import org.crosswire.jsword.book.Book;

/**
* An Analyzer whose {@link TokenStream} is built from a
-* {@link LowerCaseTokenizer} filtered with {@link SnowballFilter} (optional)
+* {@link LetterTokenizer} filtered with {@link SnowballFilter} and {@link org.apache.lucene.analysis.LowerCaseFilter} (optional)
* and {@link StopFilter} (optional) Default behavior: Stemming is done, Stop
* words not removed A snowball stemmer is configured according to the language
* of the Book. Currently it takes following stemmer names (available stemmers
@@ -73,46 +68,20 @@ final public class ConfigurableSnowballAnalyzer extends AbstractBookAnalyzer {
public ConfigurableSnowballAnalyzer() {
}

-    /**
-     * Filters {@link LowerCaseTokenizer} with {@link StopFilter} if enabled and
-     * {@link SnowballFilter}.
-     */
     @Override
-    public final TokenStream tokenStream(String fieldName, Reader reader) {
-        TokenStream result = new LowerCaseTokenizer(reader);
+    protected TokenStreamComponents createComponents(String fieldName) {
+        Tokenizer source = new LetterTokenizer();
+        TokenStream result = new LowerCaseFilter(source);
         if (doStopWords && stopSet != null) {
-            result = new StopFilter(false, result, stopSet);
+            result = new StopFilter(result, (CharArraySet) stopSet);
         }

         // Configure Snowball filter based on language/stemmerName
         if (doStemming) {
             result = new SnowballFilter(result, stemmerName);
         }

-        return result;
-    }
-
-    /* (non-Javadoc)
-     * @see org.apache.lucene.analysis.Analyzer#reusableTokenStream(java.lang.String, java.io.Reader)
-     */
-    @Override
-    public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
-        SavedStreams streams = (SavedStreams) getPreviousTokenStream();
-        if (streams == null) {
-            streams = new SavedStreams(new LowerCaseTokenizer(reader));
-            if (doStopWords && stopSet != null) {
-                streams.setResult(new StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion), streams.getResult(), stopSet));
-            }
-
-            if (doStemming) {
-                streams.setResult(new SnowballFilter(streams.getResult(), stemmerName));
-            }
-
-            setPreviousTokenStream(streams);
-        } else {
-            streams.getSource().reset(reader);
-        }
-        return streams.getResult();
+        return new TokenStreamComponents(source, result);
     }

@Override
@@ -173,8 +142,7 @@ public void pickStemmer(String languageCode) {
defaultStopWordMap.put("fr", FrenchAnalyzer.getDefaultStopSet());
defaultStopWordMap.put("de", GermanAnalyzer.getDefaultStopSet());
defaultStopWordMap.put("nl", DutchAnalyzer.getDefaultStopSet());
-defaultStopWordMap.put("en", StopAnalyzer.ENGLISH_STOP_WORDS_SET);
+defaultStopWordMap.put("en", EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);
}

-private final Version matchVersion = Version.LUCENE_29;
}
48 changes: 48 additions & 0 deletions src/main/java/org/apache/lucene/analysis/CzechLuceneAnalyzer.java
@@ -0,0 +1,48 @@
/**
* Distribution License:
* JSword is free software; you can redistribute it and/or modify it under
* the terms of the GNU Lesser General Public License, version 2.1 or later
* as published by the Free Software Foundation. This program is distributed
* in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even
* the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
* See the GNU Lesser General Public License for more details.
*
* The License is available on the internet at:
* http://www.gnu.org/copyleft/lgpl.html
* or by writing to:
* Free Software Foundation, Inc.
* 59 Temple Place - Suite 330
* Boston, MA 02111-1307, USA
*
* © CrossWire Bible Society, 2007 - 2016
*
*/
package org.apache.lucene.analysis;

import org.apache.lucene.analysis.core.LetterTokenizer;
import org.apache.lucene.analysis.cz.CzechAnalyzer;

/**
* An Analyzer whose {@link TokenStream} is built from a
* {@link LetterTokenizer} filtered with {@link LowerCaseFilter} and {@link StopFilter} (optional).
* Stemming is not implemented yet.
*
* @see gnu.lgpl.License The GNU Lesser General Public License for details.
* @author Sijo Cherian
* @author DM SMITH
*/
final public class CzechLuceneAnalyzer extends AbstractBookAnalyzer {
    public CzechLuceneAnalyzer() {
        stopSet = CzechAnalyzer.getDefaultStopSet();
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new LetterTokenizer();
        TokenStream result = new LowerCaseFilter(source);
        if (doStopWords && stopSet != null) {
            result = new StopFilter(result, (CharArraySet) stopSet);
        }
        return new TokenStreamComponents(source, result);
    }
}
@@ -0,0 +1,63 @@
/**
* Distribution License:
* JSword is free software; you can redistribute it and/or modify it under
* the terms of the GNU Lesser General Public License, version 2.1 or later
* as published by the Free Software Foundation. This program is distributed
* in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even
* the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
* See the GNU Lesser General Public License for more details.
*
* The License is available on the internet at:
* http://www.gnu.org/copyleft/lgpl.html
* or by writing to:
* Free Software Foundation, Inc.
* 59 Temple Place - Suite 330
* Boston, MA 02111-1307, USA
*
* © CrossWire Bible Society, 2007 - 2016
*
*/
package org.apache.lucene.analysis;

import org.apache.lucene.analysis.core.LetterTokenizer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.en.PorterStemFilter;

/**
* English Analyzer works like lucene SimpleAnalyzer + Stemming.
* (LowerCaseTokenizer &gt; PorterStemFilter). Like the AbstractAnalyzer,
* {@link StopFilter} is off by default.
*
*
* @see gnu.lgpl.License The GNU Lesser General Public License for details.
* @author sijo cherian
*/
final public class EnglishLuceneAnalyzer extends AbstractBookAnalyzer {

    public EnglishLuceneAnalyzer() {
        stopSet = EnglishAnalyzer.ENGLISH_STOP_WORDS_SET;
    }

    /**
     * Constructs a {@link LetterTokenizer} with {@link LowerCaseFilter}, filtered by a language
     * {@link StopFilter} and a {@link PorterStemFilter} for English.
     */
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new LetterTokenizer();
        TokenStream result = new LowerCaseFilter(source);

        if (doStopWords && stopSet != null) {
            result = new StopFilter(result, (CharArraySet) stopSet);
        }

        // Using the Porter stemmer
        if (doStemming) {
            result = new PorterStemFilter(result);
        }

        return new TokenStreamComponents(source, result);
    }
}