-
Notifications
You must be signed in to change notification settings - Fork 173
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Thread-safe version for C# / Moving the state out of the Stemmer instance #146
Comments
Looking at the compiler more closely, the snowball/compiler/generator_csharp.c Line 1194 in 8b49140
snowball/compiler/generator_csharp.c Line 1239 in 8b49140
|
Please open a PR if you want to propose a change - then the CI will test it, and I won't have to ham-fistedly try to interpret what you're suggesting through the lens of my very limited grasp of C#! |
Hahaha - no problem, I'll give it a shot. |
Cool. FWIW, making the data structures such as the I wonder if you could inline all the snowball routines to give a single C# function so then the cursor, slice ends, limits, etc can just be local variables of that (as indeed could the Snowball variables). Recursion couldn't be handled, but I've never seen a Snowball stemmer which was recursive so in practice that isn't really a limitation. It's a bit ugly to only support a subset of the language for a particular backend though... |
Yeah, it should be possible - I've seen something similar in someone else's version of the Porter algorithm in C# (theirs was pretty allocate-y as they kept regenerating strings throughout the calculating). When doing so, it should be possible with maybe 2 or 3 sets of allocations:
Using In terms of being recursive, that should be possible without issue. We can pass a struct by reference to a recursive version of |
Oh, can you allocate that struct on the stack so it doesn't needs to be GC-ed? |
Yep! Basically a var myArray = new char[] { 'A', 'B', 'C' };
var mySpan = myArray.AsSpan(); // A, B, C
var frontSlice = mySpan.Slice(0, 2); // A, B
var backSlice = mySpan.Slice(1); // B, C While the initial array is an allocation that needs to be GC'd, the spans representing the portions of the array are entirely stack-based. For the stemmer, this is really useful for all trimming logic as you can initially trim off bits you don't need via slicing without creating any allocations. When the stemmer needs to add a character, it would need to create a new eg. var mySpan = new char[] { 'A', 'B', 'C' }.AsSpan();
var stem = mySpan.Slice(0, 2);
...
// Add "D" to the span
var tmp = new char[stem.Length + 1];
stem.CopyTo(tmp);
tmp[stem.Length] = 'D'; |
Fun fact though, you could also initialise a character array on the stack too: Span<char> characters = stackalloc char[3]; This would help avoid even more allocations during processing though you would need to be careful not to allocate too much on the stack. At the end of the stemmer, you would still have to allocate a final string but that is fine and makes sense. |
Oh, I should note - the use of Quick background on .NET Framework/.NET Core - basically the future of C# and .NET is .NET Core/.NET 5. The older, slower and more annoying .NET Framework is effectively a legacy platform. It is still "supported" by Microsoft but that is due to it being a component of Windows... Personally, I think you'll be fine with |
Hmmmm, I looked at making the For example, see this with the Finnish stemmer: ...
new Among("n", -1, 7),
new Among("han", 11, 1),
new Among("den", 11, -1, r_VI),
new Among("seen", 11, -1, r_LONG),
new Among("hen", 11, 2),
new Among("tten", 11, -1, r_VI),
new Among("hin", 11, 3),
new Among("siin", 11, -1, r_VI),
new Among("hon", 11, 4),
new Among("h\u00E4n", 11, 5),
new Among("h\u00F6n", 11, 6),
... Where private bool r_LONG()
{
if (find_among_b(a_5) == 0)
{
return false;
}
return true;
}
private bool r_VI()
{
if (!(eq_s_b("i")))
{
return false;
}
if (in_grouping_b(g_V2, 97, 246, false) != 0)
{
return false;
}
return true;
} Would require a larger redesign of the generated output to be able to work around that. |
Yes - those are optional conditions on the among cases, used by a few stemmers. I had an idea for the C stemmer to try generating nested switch statements for among, which I've seen used effectively elsewhere. I don't have a useful patch yet, but this would probably work for C# too, and would mean the among tables would be replaced by generated code at the call site and the |
I've merged #147 with a tweak to leave amongs with functions as non-static, which gets us most of the benefit of that change without breaking Finnish, etc. |
Thanks for the fix for #147 - makes sense to do what you've done. 🙂
That would be pretty clever if you work out how to do it and yeah, I don't see why that wouldn't work for C# too. Might have a look at other optimizations we can do for C#, potentially relating to using nested switch statements - just need to get a bit more of a feel for the data structure of the generator. May need to change some of the |
FYI: Snowball's Java generator was recently updated to use a That might be useful to look at as inspiration for implementing some of the changes discussed above. In particular, as well as providing the existing method to return the result as a string, the underlying buffer can be obtained by the caller instead, which avoids overhead there if they don't particularly want it as a string. |
Currently from the generated code for C#, the stemmer uses an internal state to keep track of position etc. Unfortunately this prevents re-use of the same stemmer instance across multiple threads. Creating a new instance of a stemmer per thread (or in a web application, per request) means the constructor for the stemmer (with all the
Among
-type arrays generated) occurs frequently, generating the same data, creating additional GC pressure.In moving the state into the main stem-processing method, it should also allow for the ability to also switch to using
Span<char>
to do a lot of the work where you haveStringBuilder
, avoiding further allocations too.I'm a C# dev, not a C dev, so while I want to help wherever I can, I would be somewhat limited in my capacity to help in the generational side of things.
The text was updated successfully, but these errors were encountered: