Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Thread-safe version for C# / Moving the state out of the Stemmer instance #146

Open
Turnerj opened this issue Apr 5, 2021 · 14 comments
Open

Comments

@Turnerj
Copy link
Contributor

Turnerj commented Apr 5, 2021

Currently from the generated code for C#, the stemmer uses an internal state to keep track of position etc. Unfortunately this prevents re-use of the same stemmer instance across multiple threads. Creating a new instance of a stemmer per thread (or in a web application, per request) means the constructor for the stemmer (with all the Among-type arrays generated) occurs frequently, generating the same data, creating additional GC pressure.

In moving the state into the main stem-processing method, it should also allow for the ability to also switch to using Span<char> to do a lot of the work where you have StringBuilder, avoiding further allocations too.

I'm a C# dev, not a C dev, so while I want to help wherever I can, I would be somewhat limited in my capacity to help in the generational side of things.

@Turnerj
Copy link
Contributor Author

Turnerj commented Apr 5, 2021

Looking at the compiler more closely, the Among-type declarations could be pushed to a static properties with a static constructor (thus are shared across instances of the same stemmer) which would relieve some pressure from creating new instances of a stemmer.

w(g, "~Mprivate readonly Among[] a_~I0;~N");

w(g, "~Mpublic ~n()~N~{");

@ojwb
Copy link
Member

ojwb commented Apr 5, 2021

Please open a PR if you want to propose a change - then the CI will test it, and I won't have to ham-fistedly try to interpret what you're suggesting through the lens of my very limited grasp of C#!

@Turnerj
Copy link
Contributor Author

Turnerj commented Apr 5, 2021

Hahaha - no problem, I'll give it a shot.

@ojwb
Copy link
Member

ojwb commented Apr 6, 2021

Cool.

FWIW, making the data structures such as the Among stuff static sounds a good idea to me.

I wonder if you could inline all the snowball routines to give a single C# function so then the cursor, slice ends, limits, etc can just be local variables of that (as indeed could the Snowball variables). Recursion couldn't be handled, but I've never seen a Snowball stemmer which was recursive so in practice that isn't really a limitation. It's a bit ugly to only support a subset of the language for a particular backend though...

@Turnerj
Copy link
Contributor Author

Turnerj commented Apr 6, 2021

Yeah, it should be possible - I've seen something similar in someone else's version of the Porter algorithm in C# (theirs was pretty allocate-y as they kept regenerating strings throughout the calculating).

When doing so, it should be possible with maybe 2 or 3 sets of allocations:

  • Initially converting a string to a char[]
  • Any time we need to add characters to the string for stemming, we'd need to re-allocate the array and copy the data
  • Turning the array back out into the final string

Using Span<char> allows us to do slicing to logically remove characters from the start or end of the stemming process without allocations (it sits on the stack and is effectively a pointer with a length property).

In terms of being recursive, that should be possible without issue. We can pass a struct by reference to a recursive version of stem() or whatever function we need to.

@ojwb
Copy link
Member

ojwb commented Apr 6, 2021

Oh, can you allocate that struct on the stack so it doesn't needs to be GC-ed?

@Turnerj
Copy link
Contributor Author

Turnerj commented Apr 6, 2021

Yep! Basically a Span<T> is just a span of memory - a string, an array or a real pointer. It lives on the stack and is specifically designed for high performance/low allocation scenarios.

var myArray = new char[] { 'A', 'B', 'C' };
var mySpan = myArray.AsSpan(); // A, B, C
var frontSlice = mySpan.Slice(0, 2); // A, B
var backSlice = mySpan.Slice(1); // B, C

While the initial array is an allocation that needs to be GC'd, the spans representing the portions of the array are entirely stack-based.

For the stemmer, this is really useful for all trimming logic as you can initially trim off bits you don't need via slicing without creating any allocations. When the stemmer needs to add a character, it would need to create a new char array and copy values to it.

eg.

var mySpan = new char[] { 'A', 'B', 'C' }.AsSpan();
var stem = mySpan.Slice(0, 2);

...

// Add "D" to the span
var tmp = new char[stem.Length + 1];
stem.CopyTo(tmp);
tmp[stem.Length] = 'D';

@Turnerj
Copy link
Contributor Author

Turnerj commented Apr 6, 2021

Fun fact though, you could also initialise a character array on the stack too:

Span<char> characters = stackalloc char[3];

This would help avoid even more allocations during processing though you would need to be careful not to allocate too much on the stack.

At the end of the stemmer, you would still have to allocate a final string but that is fine and makes sense.

@Turnerj
Copy link
Contributor Author

Turnerj commented Apr 6, 2021

Oh, I should note - the use of Span<T> is a new feature of C# 7 (we're on C# 9 as of last year). With .NET Framework, it requires a dependency on System.Memory (a backwards compat library by Microsoft). With .NET Core/.NET 5, it works out-of-the-box.

Quick background on .NET Framework/.NET Core - basically the future of C# and .NET is .NET Core/.NET 5. The older, slower and more annoying .NET Framework is effectively a legacy platform. It is still "supported" by Microsoft but that is due to it being a component of Windows...

Personally, I think you'll be fine with Span<T> and stackalloc. Anyone including the generated C# stemmer code into their project can easily add the one dependency if they are using .NET Framework.

@Turnerj
Copy link
Contributor Author

Turnerj commented Apr 8, 2021

Hmmmm, I looked at making the Among values static though it looks like a few stemmers do something I didn't expect - they have actions that are usually passed in an instance-method.

For example, see this with the Finnish stemmer:

...
                new Among("n", -1, 7),
                new Among("han", 11, 1),
                new Among("den", 11, -1, r_VI),
                new Among("seen", 11, -1, r_LONG),
                new Among("hen", 11, 2),
                new Among("tten", 11, -1, r_VI),
                new Among("hin", 11, 3),
                new Among("siin", 11, -1, r_VI),
                new Among("hon", 11, 4),
                new Among("h\u00E4n", 11, 5),
                new Among("h\u00F6n", 11, 6),
...

Where r_VI and r_LONG are:

        private bool r_LONG()
        {
            if (find_among_b(a_5) == 0)
            {
                return false;
            }
            return true;
        }

        private bool r_VI()
        {
            if (!(eq_s_b("i")))
            {
                return false;
            }
            if (in_grouping_b(g_V2, 97, 246, false) != 0)
            {
                return false;
            }
            return true;
        }

Would require a larger redesign of the generated output to be able to work around that.

@ojwb
Copy link
Member

ojwb commented Apr 11, 2021

Yes - those are optional conditions on the among cases, used by a few stemmers.

I had an idea for the C stemmer to try generating nested switch statements for among, which I've seen used effectively elsewhere. I don't have a useful patch yet, but this would probably work for C# too, and would mean the among tables would be replaced by generated code at the call site and the Among class wouldn't exist.

@ojwb
Copy link
Member

ojwb commented Apr 11, 2021

I've merged #147 with a tweak to leave amongs with functions as non-static, which gets us most of the benefit of that change without breaking Finnish, etc.

@Turnerj
Copy link
Contributor Author

Turnerj commented Apr 12, 2021

Thanks for the fix for #147 - makes sense to do what you've done. 🙂

I had an idea for the C stemmer to try generating nested switch statements for among, which I've seen used effectively elsewhere. I don't have a useful patch yet, but this would probably work for C# too, and would mean the among tables would be replaced by generated code at the call site and the Among class wouldn't exist.

That would be pretty clever if you work out how to do it and yeah, I don't see why that wouldn't work for C# too.

Might have a look at other optimizations we can do for C#, potentially relating to using nested switch statements - just need to get a bit more of a feel for the data structure of the generator. May need to change some of the Stemmer.cs private methods with some ideas I have. Hopefully with each step like this, we can make it more thread safe while also making it even faster.

@ojwb
Copy link
Member

ojwb commented Aug 7, 2024

FYI: Snowball's Java generator was recently updated to use a char[] buffer to reduce string allocations and overhead: #195

That might be useful to look at as inspiration for implementing some of the changes discussed above. In particular, as well as providing the existing method to return the result as a string, the underlying buffer can be obtained by the caller instead, which avoids overhead there if they don't particularly want it as a string.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants