Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding GetChunks which allow efficient scanning of a StringBuilder #26207

Closed
vancem opened this issue May 17, 2018 · 8 comments
Closed

Adding GetChunks which allow efficient scanning of a StringBuilder #26207

vancem opened this issue May 17, 2018 · 8 comments
Labels
api-approved API was approved in API review, it can be implemented area-System.Runtime
Milestone

Comments

@vancem
Copy link
Contributor

vancem commented May 17, 2018

This issue started as #22371 but has morphed so much that it makes sense to open a new issue.

See coreclr pr dotnet/coreclr#17530 for more. This is just the API review.

EDIT @stephentoub 5/21. Copied proposal from below to here:
EDIT: Changed to final result (which has just the ChunkEnumerator type).

class StringBuilder {
        public ChunkEnumerator GetChunks();

        public struct ChunkEnumerator
        {
            public ChunkEnumerator GetEnumerator();

            public bool MoveNext();
            public ReadOnlyMemory<char> Current { get; }
        }
}
@karelz
Copy link
Member

karelz commented May 17, 2018

@vancem can you please post here the final API proposal?

@vancem
Copy link
Contributor Author

vancem commented May 17, 2018

We need a way of efficiently enumerating the chunks of a StringBuilder. The suggestion here is to provide the new method StringBuilder.EnumerateChunks which returns a (nested but public) type ChunkEnumerable that implements the IEnumerable<ReadOnlyMemory>. Here is the new surface area.

class StringBuilder {
        public ChunkEnumerable EnumerateChunks();

        public struct ChunkEnumerable
        {
            public ChunkEnumerator GetEnumerator();
        }
        public struct ChunkEnumerator
        {
            public bool MoveNext();
            public ReadOnlyMemory<char> Current { get; }
        }
}

From a user's point of view the only new API is the StringBuilder.EnumerateChunks() method. The rest is just the pattern to make the C# foreach statement just work on the result.. Here is an example for efficiently implementing 'IndexOf' using EnumerateChunks() and foreach

        static int IndexOf(StringBuilder sb, char c)
        {
            int pos = 0;
            foreach (ReadOnlyMemory<char> chunk in sb.EnumerateChunks())
            {
                var span = chunk.Span;  // We create a local variable because it is more efficient
                for (int i = 0; i < span.Length; i++)
                    if (span[i] == c)
                        return pos + i;
                pos += span.Length;
            }
            return -1;
        }

The PR contains the actual implementation. Note that ChunkEnumerable does not actually declare that it implements IEnumerable<ReadOnlyMemory>, it just implemented all the right methods (which make foreach work).

We considered having it return ReadOnlySpan, but decided that it was too fragile in that Span can't be used across 'await' statements (easily) and can't be passed to methods that MAY be async (these methods always take ReadOnlyMemory). ReadOnlyMemory does not have this problem (but is a bit less efficient since it does require you to convert it to a Span before accessing the characters (and also notice you need to cache the span in a local variable to keep span creation out of the inner loop).

We considered also providing both an API to produce Spans and one that makes Memorys. We could do this, but it is a bunch of cloned code, and it SHOULD be the case that the JIT could eliminate the inefficiency (although it will probably require targeted work).

API issues to consider include

  1. Do we like the names.
  2. We are OK using ReadOnlyMemory instead of ReadOnlySpan
  3. We are OK with Chunk* not actually implementing IEnumer* interfaces (we can add later if desired)
  4. We are OK with the Chunk* classed being public but nested (because you should not need to name them most of the time.

See dotnet/coreclr#17530 for the actual implementation.

@terrajobst @stephentoub

@terrajobst
Copy link
Contributor

terrajobst commented May 22, 2018

Video

Looks good, only change is that we think we should collapse the two types:

class StringBuilder {
        public ChunkEnumerator GetChunks();

        public struct ChunkEnumerator
        {
            [EditorBrowsable(Never)] // Only here to make foreach work
            public ChunkEnumerator GetEnumerator();
            public bool MoveNext();
            public ReadOnlyMemory<char> Current { get; }
        }
}
  • Potentially we can add CurrentSpan to make it cheaper to get the span
  • We could also in the future expose a CurrentSpanReadWrite if we ever need a mutating version.

@danmoseley
Copy link
Member

As an aside, there was observation in the room that it might be time to make StringBuilder doubly linked. This would make the implementation of ChunkEnumerator simpler and make it possible to offer IEnumerable. (Indexer would not be any faster though)

@vancem
Copy link
Contributor Author

vancem commented May 22, 2018

I have updated this issue to the API suggested by @terrajobst as well as the implementation.

@vancem
Copy link
Contributor Author

vancem commented May 22, 2018

I have tests for this code (and they pass for me locally) but they go in the CoreFX repo and I need to check in this implementation and wait for a new build of CoreCLR before I can complete that.

So I would like to check this in. If anyone has concerns please raise them i the next 1/2 hour or so, as I would like to check in before the nightly build kicks off.

@danmoseley
Copy link
Member

danmoseley commented May 22, 2018

If the PR is green/signed off and CoreFX tests pass locally then you can certainly merge.

@vancem vancem changed the title Adding EnumerateChunks which allow efficient scanning of a StringBuilder Adding GetChunks which allow efficient scanning of a StringBuilder May 24, 2018
@vancem
Copy link
Contributor Author

vancem commented May 25, 2018

Closing the issue as the StringBuilder.GetChunks was exposed in CoreFX with dotnet/corefx#29877.

@vancem vancem closed this as completed May 25, 2018
@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 3.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 16, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Runtime
Projects
None yet
Development

No branches or pull requests

5 participants