WIP Port Lucene.Net.Analysis.Ko #645
The current released version of Lucene.NET does not implement the Korean Analyzer that Java Lucene provides. This commit ports the logic from the Java repo to C#; however, it only contains the logic for the Analyzer class.
…rean-analyzer
# Conflicts:
# src/Lucene.Net.Analysis.Common/Analysis/Ko/DecompoundToken.cs
# src/Lucene.Net.Analysis.Common/Analysis/Ko/Dict/BinaryDictionary.cs
# src/Lucene.Net.Analysis.Common/Analysis/Ko/Dict/CharacterDefinition.cs
# src/Lucene.Net.Analysis.Common/Analysis/Ko/Dict/ConnectionCosts.cs
# src/Lucene.Net.Analysis.Common/Analysis/Ko/Dict/Dictionary.cs
# src/Lucene.Net.Analysis.Common/Analysis/Ko/Dict/TokenInfoDictionary.cs
# src/Lucene.Net.Analysis.Common/Analysis/Ko/Dict/TokenInfoFST.cs
# src/Lucene.Net.Analysis.Common/Analysis/Ko/Dict/UnknownDictionary.cs
# src/Lucene.Net.Analysis.Common/Analysis/Ko/DictionaryToken.cs
# src/Lucene.Net.Analysis.Common/Analysis/Ko/KoreanAnalyzer.cs
# src/Lucene.Net.Analysis.Common/Analysis/Ko/TokenAttributes/PartOfSpeechAttributes.cs
# src/Lucene.Net.Analysis.Common/Analysis/Ko/TokenAttributes/ReadingAttributes.cs
# src/Lucene.Net.Analysis.Common/Analysis/Ko/TokenAttributes/ReadingAttributesImpl.cs
implementation of the number filter and the respective filter factories
Wow! Very interesting contribution. It does not look like Java Lucene 4.8.0 or 4.8.1 contain the […]

Which Java Lucene version is this contribution a port of?
Thanks for the contribution. There was a prior attempt at porting the nori analyzer on the feature/analysis-nori branch. However, there were 2 issues preventing it from functioning on Lucene.Net 4.8.0:
We are more concerned with getting past the first issue than the second, since it would be trivial to exclude […]

Issues to Fix
FST

At the time I attempted the feature/analysis-nori branch, the FST API seemed to fit; however, due to some design changes, it produced completely different results than the version I had ported it from (unfortunately, I don't recall what version it is based upon). At the time I thought that FST was tied deeply into other Lucene components and that having multiple incompatible versions in the project wouldn't work. However, I have since learned that FST is only used in specific scenarios that end users won't need to plug together, so having a copy of a later version of FST in the […]

To be able to debug, we need to be able to step through the code and get FST to return the exact results from the Lucene version this is based upon. So, we need a fresh port of FST from the same version of Lucene that nori is based upon. The convention we are following is to put "extra" components such as this into a folder named […]
Please be mindful that we will be using similar namespace conventions as we are currently (the namespace may not necessarily match the name of the project it belongs to). For now, please put the new FST port into the […]

BigDecimal

We definitely don't want to take on a dependency to IKVM, both because it is large and because it doesn't support most of the .NET runtimes that we do. Please try one of the following (in order of preference):
Nice to meet you. This is a port of Lucene 8.11.0. The problem with the CJK Analyzer that I ran into was the method […]
Thank you for such a detailed response. Are these unresolved issues directly related to the output of the methods used by […]
If by this you mean that you are using the same index files in both C# and Scala, this won't work because of differences in the binary format between the 4.x and 8.x codecs. 8.x doesn't have support for reading 4.x indexes; that support ends at 5.x. Backward compatibility of codecs is read-only, but if that works for your use case (i.e. no docs added from Scala), porting the 4.6 codec from Lucene 5.5.5 to plug into Lucene 8.11.0 is likely possible, given that the codec interface is pluggable and only the binary format changes between implementations.

Another option is to use Lucene 8.11.0 and use IKVM's […]

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <TargetFramework>netcoreapp3.1</TargetFramework>
  </PropertyGroup>
  <ItemGroup>
    <PackageReference Include="IKVM.Maven.Sdk" Version="1.0.1" />
  </ItemGroup>
  <ItemGroup>
    <MavenReference Include="org.apache.lucene:lucene-analyzers-nori" Version="8.11.0" Debug="true" />
  </ItemGroup>
</Project>

Alternatively, you could use […]

Use the Maven copy-dependencies plugin to bring all of the .jar files into a specific local directory.

pom.xml

Use this […]

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.mycompany.app</groupId>
  <artifactId>my-app</artifactId>
  <version>1.0-SNAPSHOT</version>
  <dependencies>
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-analyzers-nori</artifactId>
      <version>8.11.0</version>
    </dependency>
  </dependencies>
</project>

mvn org.apache.maven.plugins:maven-dependency-plugin:3.3.0:copy-dependencies -DoutputDirectory="F:\deps" -DincludeScope=compile

To figure out what depends on what, you can use the depgraph-maven-plugin:

mvn com.github.ferstl:depgraph-maven-plugin:3.3.0:for-artifact -DoutputFileName=".\depgraph.txt" -DgraphFormat=text -DclasspathScope=compile -Dscopes=compile -Dversion=1.0-SNAPSHOT -DgroupId=com.mycompany.app -DartifactId=my-app -DshowVersions=true -DshowDuplicates -DshowGroupIds -DshowConflicts
Not all of them. We need to mirror the organization of the Lucene project. Nori is a separate module from analyzers-common for good reason: to allow most users to exclude it if all they need is […]

Unfortunately, KoreanAnalyzer wasn't added until Lucene 8.0, and was backported to 7.4. Although most of it seems to work (only 3 of the tests that apply to it fail), the […]

I have since attempted updating the feature/analysis-nori branch (which I have determined must be 8.2.0 by eliminating the other possibilities) and porting FST from 8.2.0, but it is set up not to function with any version below 6, and that is by design. FST is used directly by the codecs, so although there are no references, I am pretty sure that they must be equivalent to read/write the same binary format from/to the index.

I also made an attempt to re-export the mecab-ko-dic using the 4.6 codec, but for some reason that also doesn't seem to fix the problems. It may be due to a bug in the nori port, but without anything to run to directly compare it to, it is difficult to determine what the problem actually is. There are instructions on how to build mecab-ko-dic which I am providing here for reference, but I have also added a test to build the dictionary with the same settings as the nori build. The mecab dictionary can be downloaded here.

In theory, there is no real reason why KoreanAnalyzer cannot be ported to 4.8.0, but it will require cooperation with someone on the Lucene team to help us backport it, since we have absolutely nothing in the real world to compare it with. Maybe it could be done just by studying the binary formats of the codecs and the FST specifications that Lucene is based on, but I suspect it will also require someone who is familiar enough with the history of the Lucene binary formats and the changes to the test framework/codecs for us to have passing tests in 4.8.0. There are only 3 test failures that would need to be addressed (see option 3 below).
There are more failures due to the […]

Options
Note that for BigDecimal, there is a decent port of one here: https://github.com/Singulink/Singulink.Numerics.BigDecimal. It may take some effort, though, to work out how to convert the rounding modes from Java to the equivalent in .NET.
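To make the rounding-mode concern concrete, here is a minimal Java sketch of how a few `java.math.RoundingMode` values treat the same inputs. It uses only the JDK; the class name `RoundingModeDemo` and the helper `round` are made up for illustration, and nothing here is taken from the nori source or from the Singulink API. Output like this could serve as a reference table when checking that a .NET replacement rounds the same way.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class RoundingModeDemo {
    // Hypothetical helper: rounds a decimal string to `scale` fraction digits
    // with the given Java rounding mode and returns the plain-string form.
    static String round(String value, int scale, RoundingMode mode) {
        return new BigDecimal(value).setScale(scale, mode).toPlainString();
    }

    public static void main(String[] args) {
        String[] inputs = {"2.5", "-2.5"};
        RoundingMode[] modes = {
            RoundingMode.HALF_UP,   // ties go away from zero
            RoundingMode.HALF_EVEN, // ties go to the even neighbour
            RoundingMode.DOWN,      // truncates toward zero
            RoundingMode.FLOOR      // rounds toward negative infinity
        };
        for (RoundingMode mode : modes) {
            for (String input : inputs) {
                System.out.println(mode + " " + input + " -> " + round(input, 0, mode));
            }
        }
        // HALF_UP:   2.5 -> 3, -2.5 -> -3
        // HALF_EVEN: 2.5 -> 2, -2.5 -> -2
        // DOWN:      2.5 -> 2, -2.5 -> -2
        // FLOOR:     2.5 -> 2, -2.5 -> -3
    }
}
```

The negative inputs are the ones worth comparing most carefully, since modes that agree on positive ties (DOWN and FLOOR here) diverge below zero.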
The dotnet implementation of the Lucene library has yet to release a version containing the Analyzer class for Korean.
To mirror the Java releases, the Korean Analyzer and its respective dependencies have been implemented.