One character is missing in class ASCIIFoldingFilter #618
Comments
Thanks for the report. As this is a line-by-line port from Java Lucene 4.8.0 (for the most part), we have faithfully reproduced the ASCIIFoldingFilter in its entirety, although we have admittedly included some patches from later versions of Lucene where they affect usability. If you wish to pursue adding more characters to ASCIIFoldingFilter, however, do note this isn't the only filter included in the box that is capable of removing diacritics. One alternative is to create a custom folding filter by using a similar approach to the ICUFoldingFilter implementation (ported from Lucene 7.1.0); there is a tool you can port to generate one. Alternatively, a custom folding filter can be chained together with the ASCIIFoldingFilter in the analysis chain, for example:

public TokenStream GetTokenStream(string fieldName, TextReader reader)
{
TokenStream result = new StandardTokenizer(reader);
result = new StandardFilter(result);
result = new LowerCaseFilter(result);
// etc etc ...
result = new StopFilter(result, yourSetOfStopWords);
result = new MyCustomFoldingFilter(result);
result = new ASCIIFoldingFilter(result);
return result;
} |
@NightOwl888 where do you use that GetTokenStream? Thanks! |
My bad. It looks like the example I pulled was from an older version of Lucene. However, "Answer 2" in this link shows an example from 4.9.0, which is similar enough to 4.8.0:

// Accent insensitive analyzer
public class CustomAnalyzer : StopwordAnalyzerBase {
public CustomAnalyzer (LuceneVersion matchVersion)
: base(matchVersion, StopAnalyzer.ENGLISH_STOP_WORDS_SET)
{
}
protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
{
Tokenizer tokenizer = new KeywordTokenizer(reader);
TokenStream result = new StopFilter(m_matchVersion, tokenizer, m_stopwords);
result = new LowerCaseFilter(m_matchVersion, result);
result = new CustomFoldingFilter(result);
result = new StandardFilter(m_matchVersion, result);
result = new ASCIIFoldingFilter(result);
return new TokenStreamComponents(tokenizer, result);
}
}

And of course, the whole idea of the last example is to implement another folding filter named CustomFoldingFilter. Alternatively, use the ICUFoldingFilter. |
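For illustration, here is a minimal sketch of what such a CustomFoldingFilter could look like against the Lucene.NET 4.8 API. The class shape and the Ʀ-to-R mapping are assumptions for the example, not code taken from this thread.

using Lucene.Net.Analysis;
using Lucene.Net.Analysis.TokenAttributes;

namespace LuceneExtensions
{
    // Hypothetical folding filter: maps Ʀ (U+01A6) to R; add further mappings as needed.
    public sealed class CustomFoldingFilter : TokenFilter
    {
        private readonly ICharTermAttribute termAtt;

        public CustomFoldingFilter(TokenStream input)
            : base(input)
        {
            termAtt = AddAttribute<ICharTermAttribute>();
        }

        public override bool IncrementToken()
        {
            if (!m_input.IncrementToken())
                return false;

            // Fold the current term and write it back to the term attribute.
            string folded = termAtt.ToString().Replace('\u01A6', 'R');
            termAtt.SetEmpty().Append(folded);
            return true;
        }
    }
}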
Thanks @NightOwl888 for the quick response! I added the necessary package to VS for ICU and changed the analyzer accordingly.
But I must be missing something, because it's still not matching across case changes or accents. Am I missing something? For search I'm building this query:
For example, in the original database (and saved to the index) I had "Carlos Pírez". It seems like I'm missing something to force those filters added to the analyzer to be applied, right? Thanks |
Ok, some progress here:
I changed the analyzer so the TokenStream result starts from a StandardTokenizer instead of a StopFilter, and it's filtering at least (the previous version in my earlier post wasn't even filtering). |
Nope, it isn't valid to use multiple tokenizers in the same Analyzer, as there are strict consuming rules to adhere to. It would be great to build code analysis components to ensure developers adhere to these tokenizer rules while typing. I built a demo showing how to set up testing on custom analyzers here: https://github.com/NightOwl888/LuceneNetCustomAnalyzerDemo (as well as showing how the above example fails the tests). The functioning analyzer just uses a WhitespaceTokenizer followed by an ICUFoldingFilter:

using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Icu;
using Lucene.Net.Util;
using System.IO;
namespace LuceneExtensions
{
public sealed class CustomAnalyzer : Analyzer
{
private readonly LuceneVersion matchVersion;
public CustomAnalyzer(LuceneVersion matchVersion)
{
this.matchVersion = matchVersion;
}
protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
{
// Tokenize...
Tokenizer tokenizer = new WhitespaceTokenizer(matchVersion, reader);
TokenStream result = tokenizer;
// Filter...
result = new ICUFoldingFilter(result);
// Return result...
return new TokenStreamComponents(tokenizer, result);
}
}
}

using Lucene.Net.Analysis;
using NUnit.Framework;
namespace LuceneExtensions.Tests
{
public class TestCustomAnalyzer : BaseTokenStreamTestCase
{
[Test]
public virtual void TestRemoveAccents()
{
Analyzer a = new CustomAnalyzer(TEST_VERSION_CURRENT);
// removal of latin accents (composed)
AssertAnalyzesTo(a, "résumé", new string[] { "resume" });
// removal of latin accents (decomposed)
AssertAnalyzesTo(a, "re\u0301sume\u0301", new string[] { "resume" });
// removal of latin accents (multi-word)
AssertAnalyzesTo(a, "Carlos Pírez", new string[] { "carlos", "pirez" });
}
}
}

For other ideas about what test conditions you may use, I suggest having a look at Lucene.Net's extensive analyzer tests, including the ICU tests. You may also refer to the tests to see if you can find a similar use case to yours for building queries (although do note that the tests don't show .NET best practices for disposing objects). |
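As a hedged sketch of the query-building side (the index path, field name, and helper method below are invented for illustration, not taken from the demo), searching with the same analyzer while disposing the Lucene objects might look roughly like this:

using System.IO;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;
using LuceneExtensions;

public static class SearchExample
{
    public static int CountMatches(string indexPath, string userInput)
    {
        const LuceneVersion version = LuceneVersion.LUCENE_48;

        using (var analyzer = new CustomAnalyzer(version))
        using (var directory = FSDirectory.Open(new DirectoryInfo(indexPath)))
        using (var reader = DirectoryReader.Open(directory))
        {
            var searcher = new IndexSearcher(reader);
            // "name" is a hypothetical field; the query text goes through the same analyzer,
            // so "Pírez" and "pirez" fold to the same indexed form.
            var parser = new QueryParser(version, "name", analyzer);
            Query query = parser.Parse(userInput);
            TopDocs hits = searcher.Search(query, 10);
            return hits.TotalHits;
        }
    }
}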
FYI - there is also another demo showing additional ways to build analyzers here: https://github.com/NightOwl888/LuceneNetDemo |
Thanks again!!
I just lowercase the text the user enters for searching, and I find all combinations of accents and case. |
Just out of curiosity, do all of your use cases work without the LowerCaseFilter? Lowercasing is not the same as case folding (which is what the ICUFoldingFilter does). For example:
AssertAnalyzesTo(a, "Fuß", new string[] { "fuss" }); // German
AssertAnalyzesTo(a, "QUİT", new string[] { "quit" }); // Turkish While this might not matter for your use case, it is also worth noting that performance will be improved without the In addition, search performance and accuracy can be improved by using a |
Thanks! I just removed the LowerCaseFilter and replaced the StandardFilter with the StopFilter, and it's working fine with case and diacritics searches. |
One thing I noticed: there is a field that has a format like "M-4-20" or "B-7-68" ...
Is there a way to escape the dash in the term, or to skip analysis for that field? |
FYI - There is a generic Spanish stop word list that can be accessed through SpanishAnalyzer.DefaultStopSet.
If all of the data in the field can be considered a single token, there is a KeywordAnalyzer that emits the entire field value as one token. |
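One hypothetical way to wire up a per-field choice like that (the field name, and the use of KeywordAnalyzer with PerFieldAnalyzerWrapper, are assumptions for illustration):

using System.Collections.Generic;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Miscellaneous;
using Lucene.Net.Util;
using LuceneExtensions;

var version = LuceneVersion.LUCENE_48;
Analyzer defaultAnalyzer = new CustomAnalyzer(version);
Analyzer finalAnalyzer = new PerFieldAnalyzerWrapper(
    defaultAnalyzer,
    new Dictionary<string, Analyzer>
    {
        // The code-like field ("M-4-20", "B-7-68") is indexed as a single, unmodified token.
        ["code"] = new KeywordAnalyzer()
    });
// Pass finalAnalyzer to both the IndexWriterConfig and the QueryParser so that
// indexing and searching analyze each field the same way.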
Thanks again! Your suggestions helped me a lot! I'm currently doing it like this:
The WhitespaceAnalyzer did not help my case of the code format ("M-12-14", "B-10-39", etc.), but I will try another, more suitable one. And I'm using the finalAnalyzer for indexing and search. |
Ok, I'm still using the WhitespaceAnalyzer for those special columns; the problem was that I was lowercasing the search term meant for the CustomAnalyzer. So for those columns I actually uppercase it, since I know the column only holds uppercase characters and the analyzer doesn't lowercase it. |
This seems to have been resolved now. |
I think one character in class ASCIIFoldingFilter is missing
Character: Ʀ
Nº: 422
UTF-16: 01A6
Source code that might need to be added to the method FoldToASCII(char[] input, int inputPos, char[] output, int outputPos, int length):
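A rough sketch of such a case, assuming Ʀ should fold to R and following the pattern of the other case groups in FoldToASCII (the mapping target is an assumption, not confirmed by the thread):

case '\u01A6': // Ʀ  [LATIN LETTER YR]
    output[outputPos++] = 'R';
    break;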
Links about this character:
https://codepoints.net/U+01A6
https://en.wikipedia.org/wiki/%C6%A6