-
Notifications
You must be signed in to change notification settings - Fork 641
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TokenStream.IncrementToken() is called after Dispose() is called #271
Comments
It is possible that this issue was due to a bug in Lucene.Net.TestFramework that has been addressed already. We need to confirm that it was the test framework that was errantly causing the double-call, not a feature of Lucene.NET. Also, the ICUCollationKeyFilter no longer implements Dispose() - that was when we were referencing icu-dotnet (which had unmanaged resources) instead of ICU4N. However, the behavior can be observed in any existing disposable TokenStream or by overriding Dispose(bool) in one that isn't disposable. by nightowl888 |
I looked into this a little, and it appears that the bug that was fixed in the test framework didn't solve this. When adding a This is occuring in At this point I haven't confirmed this is how it works in Java so it is still unclear if this is expected behavior or a bug. If the former, we should consider the possibility of making |
I started by trying to solve the two TestRandomChains failures (see #269), but it ended up unraveling back to this issue again. I isolated a specific case that fails in .NET: [Test]
public void TestWordDelimiterFilterChain()
{
Analyzer a = Analyzer.NewAnonymous(createComponents: (fieldName, reader) =>
{
Tokenizer tokenizer = new NGramTokenizer(LuceneVersion.LUCENE_48, reader, 70, 79);
TokenStream filter = tokenizer;
filter = new WordDelimiterFilter(LuceneVersion.LUCENE_48, filter, WordDelimiterFlags.CATENATE_NUMBERS | WordDelimiterFlags.SPLIT_ON_CASE_CHANGE, new CharArraySet(LuceneVersion.LUCENE_48, new HashSet<string> { "htxodcrfoi", "zywcmh", "pkw", "urdvcqr", "gbuskh" }, ignoreCase: true));
filter = new ShingleFilter(filter, "EMAIL");
return new TokenStreamComponents(tokenizer, filter);
});
string text = "ufsczdev -[({0, \u20d0\u20f4\u20f4\u20d4\u20e0\u20e4\u20d0\u20ef\u20e7\u20dd\u20f7\u20d2\u20e1 kdoxod \u175b\u1758\u174a pirsk --><scr \ufe5d\ufe51\ufe6c\ufe5d\ufe5e\ufe61 \ud802\udf57 \ua6fb\ua6cd \ufc78\uf998\u06e9\u1b3a \ud802\ude2c\ud802\ude59\ud802\ude22\ud802\ude3b lkpnukew rqniibv \ue990\ud8d0\udfe6\ue6a8[ lmev wzpts \u306a\u304a\u3093\u3080\u3092";
bool useCharFilter = false;
bool offsetsAreCorrect = true;
long seed = 0x21773254b5ac8fa0L;
Random random = new J2N.Randomizer(seed);
try
{
CheckAnalysisConsistency(random, a, useCharFilter, text, offsetsAreCorrect);
}
catch (Exception e) when (e.IsThrowable())
{
Console.WriteLine("Exception from random analyzer: " + a);
throw;
}
} And after translating it to Java, it still fails: public void testWordDelimiterFilterChain() throws Throwable {
Analyzer a = new Analyzer() {
public TokenStreamComponents createComponents(String fieldName, Reader reader) {
Tokenizer tokenizer = new org.apache.lucene.analysis.ngram.NGramTokenizer(Version.LUCENE_48, reader, 70, 79);
TokenStream filter = tokenizer;
filter = new WordDelimiterFilter(Version.LUCENE_48, filter, WordDelimiterFilter.CATENATE_NUMBERS | WordDelimiterFilter.SPLIT_ON_CASE_CHANGE, new CharArraySet(Version.LUCENE_48, asSet("htxodcrfoi", "zywcmh", "pkw", "urdvcqr", "gbuskh"), /*ignoreCase:*/ true));
filter = new org.apache.lucene.analysis.shingle.ShingleFilter(filter, "EMAIL");
return new TokenStreamComponents(tokenizer, filter);
}
};
String text = "ufsczdev -[({0, \u20d0\u20f4\u20f4\u20d4\u20e0\u20e4\u20d0\u20ef\u20e7\u20dd\u20f7\u20d2\u20e1 kdoxod \u175b\u1758\u174a pirsk --><scr \ufe5d\ufe51\ufe6c\ufe5d\ufe5e\ufe61 \ud802\udf57 \ua6fb\ua6cd \ufc78\uf998\u06e9\u1b3a \ud802\ude2c\ud802\ude59\ud802\ude22\ud802\ude3b lkpnukew rqniibv \ue990\ud8d0\udfe6\ue6a8[ lmev wzpts \u306a\u304a\u3093\u3080\u3092";
boolean useCharFilter = false;
boolean offsetsAreCorrect = false;
long seed = 0x21773254b5ac8fa0L;
java.util.Random random = new java.util.Random(seed);
try
{
checkAnalysisConsistency(random, a, useCharFilter, text, offsetsAreCorrect);
}
catch (Throwable e)
{
System.out.println("Exception from random analyzer: " + a);
throw e;
}
} I discovered this is primarily due to the fact that some flags are not supported during indexing: https://stackoverflow.com/a/66478863/, and confirmed that removing the In Long story short, this circles back to the fact that we converted the Since most of Lucene's tests do not call Some Options to Consider
private bool isValid;
~TokenStream() // Only called if we don't meet our preconditions
{
Dispose(false);
}
public void Dispose()
{
Dispose(true);
GC.SuppressFinalize(this);
}
protected virtual void Dispose(bool disposing)
{
if (disposing)
{
isValid = Validate(out string reason);
if (!isValid)
throw new InvalidOperationException(reason); // Exception short circuits the call to SuppressFinalize
DoDispose();
}
if (!isValid)
DoDispose();
}
internal void DoDispose()
{
// Dispose file handles here...
} Of course, that means we need to wait for the finalizer to execute before beginning the next test. From the finalizer docs:
Options for fixing the tests:
Since the preconditions are only meant to prevent programming bugs that aren't likely to make it into a production application, it seems reasonable to rely on a finalizer as a fallback for this edge case. Further ConsiderationBeing that Lucene may call But we definitely need to revert |
Throwing exceptions in Dispose() is bad form.
|
Thanks. So that leaves the question - how do we enforce the TokenStream workflow contract in .NET? The exception is meant to be a guide to help at development time to ensure all of the requirements of the contract are followed. Once the contract is being followed, this exception will never occur at runtime. So, that means the part that they are worried about - adding an exception handler inside a finally block - is not a thing that will ever need to happen. One option I have considered is to build a Roslyn code analyzer to enforce this contract at design time, but I am not sure how complex it would be to analyze APIs of the user code no matter how they have it configured and ensure that all of the methods are called in the right sequence and the right number of times in the code. They could be building one or more abstractions that do some of the Some of the rules, such as enforcing the calls to However, this still doesn't get us out of the situation we are in where According to this error message, it is supposed to be invalid to call |
After some research, I have determined the following:
The proposed solution to fix these issues is:
This neatly puts the lifetime of Analyzer analyzer = new MyAnalyzer(); to using Analyzer analyzer = new MyAnalyzer(); Analyzer won't need to be disposed during normal usage, only for cases where there are expensive long-lived components to dispose. I am toying with the idea of creating a |
@NightOwl888 Found that ReusableStringReader should also implement this proposed ICloseable interface instead of IDisposable, especially given that its design to be reused is right there in the name. |
I'll take this one.
@NightOwl888 Are you good with the |
There is no public Support namespace, so, no. It should be in the Support folder, but perhaps this belongs in the |
Next question: on the IOUtils restoring the |
FYI my branch for this issue is https://github.com/paulirwin/lucene.net/tree/issue/271 - note that I might occasionally force-push to rebase this against latest master as other PRs are merged, as I expect this one to take a bit to complete. I'll also be checking off @NightOwl888's checklist above as I go, but this is just an informal progress tracker. |
I don't recall adding any overloads, just moving over the ones that existed. I see this defined upstream: Is that not what you are referring to? |
Oh interesting, I was looking at the 4.8.1 code. It looks like those were removed in 4.8.1: https://github.com/apache/lucene/blob/releases/lucene-solr/4.8.1/lucene/core/src/java/org/apache/lucene/util/IOUtils.java Should we go ahead and remove them and target 4.8.1 compatibility? |
If they are unused, I guess we can remove them. So, clearly Lucene isn't following semantic versioning (or, at least wasn't at the time). |
FYI, here is the 4.8.1 commit that removed the usage of them: apache/lucene@56ab1d1 Also note that they are currently unused because the Disposable versions are used instead; if we target 4.8.1 we can clearly remove them. If we must target 4.8.0 then I haven't yet evaluated if these will be used, or if the code will continue to use the Disposable versions (because their types would not be ICloseable). |
I spent quite a bit of time looking at this over the past few days, and I've come to what I think is a surprising idea, or at least it would have surprised past me when I started looking into this: I think we should remove IDisposable from TokenStream and its descendants completely, in favor of ICloseable. This is, I believe, the simplest solution, as well as the one that most closely matches Lucene. Then we avoid the whole IDisposable problem completely. As noted above, these token streams are intended to be reused after Close is called. The fact that they are stored in an Analyzer via a "ReuseStrategy" speaks to this. So we know that having ICloseable is the right design choice here. This has the added benefit of more closely matching Lucene. I went through and moved all reusable close logic to Close overrides, and that was pretty easy (although time-consuming to go through ~200 implementations to review them all). I made Dispose call Close, because Close is supposed to be idempotent, so it should be able to be called multiple times, i.e. if you call The problem with IDisposable TokenStreams is that you're expected to dispose of IDisposables when you're done, and as noted above, TokenStreams are intended to be reused by the Analyzer, per the ReuseStrategy. We cannot make ReuseStrategy disposable, because it is supposed to use the singleton instances I asked myself: why is this not a problem in Lucene? And after digging, and asking ChatGPT, the answer is that Analyzers and TokenStreams are not supposed to hold state that can't be garbage collected or at least released via Close (in the case of TokenStreams). And if you do need one that has to be disposed, then you can track the lifetime of that outside of the analyzer, and dispose of it yourself, because users are usually instantiating Analyzers directly anyways. There's nothing stopping someone from adding IDisposable to their custom Tokenizer or even Analyzer... but we don't have to make Lucene.NET unnecessarily complex and divergent from Lucene in order to accommodate this. (Aside: there was a thought above that in Java, AutoCloseable means that Close is sometimes called implicitly, which implies automatically, and this is not exactly true. It is called implicitly when the AutoCloseable is used in a try-with-resources statement (akin to a using statement in C#), but it is not called automatically as part of resource cleanup.) You also might be asking: what about the IDisposable TextReader in Tokenizer? Well, that is already going to be disposed in Close as a result of this change, but also it is passed into the Tokenizer via SetReader, so really it's the responsibility of the caller to dispose of it if Close is not called. The only case where this realistically could be a problem (since most cases use StringReader which is safely GC-able without being disposed) is if you use a TextReader Field value to some non-GC-able implementation of TextReader, in which case you're probably already safely disposing of that in a using block/statement anyways. There is very little good reason within the Lucene.NET code to have TokenStream be IDisposable, as there are only two cases that I found in the entire codebase, after reviewing all TokenStream subclasses, where the cleanup logic is not reusable (and thus would not already have its Dispose logic moved into Close):
So out of the O(200) derived types from TokenStream, only two have non-reusable state that needs to be cleaned up, and even then, both are private implementation details, and it appears that the only valid intended use of these is with in-memory data where it's safe to not clean up the enumerator in Dispose, and one of them even has updated code that avoids the problem. So we'd be going through the hassle of making TokenStream and its descendants IDisposable for no discernible benefit, when users that need their custom implementations to be disposable can handle that outside of our library code with a custom or anonymous Analyzer. So if we remove IDisposable from TokenStream in favor of ICloseable, it simplifies this dramatically, and avoids a lot of the concerns expressed in the comments above. We can still have I'm going to prototype this approach to determine if it's viable, but I have a hard time believing it wouldn't be at this point. |
@paulirwin thanks for the detailed information, At the beginning of the port, it was clearly that the closable Java matches with the net disposable However, looks like it's not a real dispose, especially when this can be reused after dispose How complex this might be? |
Thanks so much for the detailed analysis, Paul. So, what I gather, you are saying a disposable class would be injected like this? using var someDisposable = new SomeDisposable();
Analyzer analyzer = Analyzer.NewAnonymous(createComponents: (fieldName, reader) =>
{
var src = new StandardTokenizer(m_matchVersion, reader);
TokenStream tok = new StandardFilter(m_matchVersion, src);
tok = new LowerCaseFilter(m_matchVersion, tok);
tok = new StopFilter(m_matchVersion, tok, m_stopwords);
tok = new MyFilter(someDisposable, tok);
return new TokenStreamComponents(src, tok);
}); Alternatively, the user would need to implement a whole Analyzer subclass, but they could pass the disposable instance into the constructor to accomplish the same thing. I have to admit, doing it that way hadn't crossed my mind, but I am not sure it covers everything we need to fix. It may address all usability issues, though. The main motivation for this issue is to fix the There were 3 main issues we ran into with
It was the 3rd problem that made me open this issue, as using icu-dotnet with Lucene.NET was completely impossible without having a My thought was to separate the concepts of
A good proof of concept would be to ensure the
It is that last condition that made me think there was no way out except for having disposable token stream components, since the original configuration would always leave the file handle open upon test failure. |
That's the idea, yeah. Although your example of the icu-dotnet case is a good thought experiment. After some thinking, I think we can made TokenStream IDisposable as well as ICloseable, although the virtual To do this, I think we would need to make TokenStreamComponents implement IDisposable and dispose of its two fields, although we would have to note that because there is no guarantee that the sink ends up disposing of its source, that we would have to allow for Dispose to be idempotent and able to be called multiple times, because if we just i.e. not dispose of the TokenStreamComponents source field (assuming the sink would dispose it), then there's a chance that it might not, and then something could leak. We could add a trivial check to not redundantly call Dispose on the sink if it ReferenceEquals the source, but other than that I think we have to allow for Dispose to be called more than once. We then need to change DisposableThreadLocal to dispose of all of its Values that are IDisposable. I'll prototype this and see how it goes. Note that I do not think we should do the suggestion earlier in this thread of throwing if any of the other TokenStream methods are called on it after it has been disposed. The Closeable implementation does not do this in the upstream code, and we'd have to override many methods from AttributeSource, and make IncrementToken virtual instead of abstract, which would significantly deviate from upstream, and cause a performance penalty, just to prevent someone from doing this in their custom types (as again, we would not be overriding Dispose in our code). I think we should make that the responsibility of the implementer if they care about that, and just let it be.
They do pass, even with high iteration counts. I've removed the AwaitsFix attribute from them.
This also seems to work fine. I think it was due to wrapping the IDisposables in using statements previously like you said, and now I've replaced those with
Do we have any tests that you know of that do this that I can confirm? But again, if they were passing previously, they're still passing now, as there's basically little difference in runtime behavior than before. |
Also, I can still reproduce this test failure: #271 (comment) I'm going to attempt removing the try/finally from those BaseTokenStreamTestCase methods to match upstream and see what happens. |
I don't have a solution to that failing test, but I've created a draft PR with my work on this: #1058. It worked out pretty well I think to have TokenStreamComponents be IDisposable, and to create a disposable wrapper around JCG.Dictionary for PerFieldReuseStrategy. |
As pointed out above, the backported test failed in Java as well.
However, IIRC there were still some problems with managing the state of the
There is some context that I failed to document between the last two paragraphs, but it shouldn't be too hard to figure out why I thought this was failing due to the addition of the try-finally blocks. |
I spent more time looking at this last night and updated the BaseTokenStreamTestCase to be much closer to the original, mostly by removing some code that didn't exist upstream. The only thing that really differs from upstream is that TestRandomChains still randomly fails (fairly reliably if you put a Repeat(100) on it), but it doesn't cause cascading failures, and I'm not convinced it's failing because of this issue. I'm not entirely sure where to go next from here, so I could use some help on this one with not only reviewing what I've got, but trying it out to figure out if more work needs to be done here or not. If TestRandomChains randomly failing is related to this issue, I'd appreciate some enlightenment as to why, as that isn't clear to me. If the core problem with this issue was that failing analysis tests (TestRandomChains or otherwise) caused other cascading failures, that does not seem to be the case anymore. I am 99% sure I've solved the issue of TokenStreams being reused after Close is called. Next up I'll make a test that tests to make sure Dispose is not called before reuse (as that should not be done in a reusable way), and make those two types mentioned above reusable (even though they aren't causing any failing tests). |
When overriding Dispose(bool) in either TokenStream subclasses, Lucene.Net will call Dispose() before it is done using the TokenStream, and call IncrementToken() again.
The behavior can be observed in the Lucene.Net.Collation.TestICUCollationKeyFilter.TestCollationKeySort() when the ICUCollationKeyFilter implements Dispose().
JIRA link - [LUCENENET-611] created by nightowl888The text was updated successfully, but these errors were encountered: