Adding GetChunks which allow efficient scanning of a StringBuilder #17530

vancem · 2018-04-12T15:00:40Z

This change is the start of addressing the issue CoreFX issue

https://github.com/dotnet/corefx/issues/21248

which wants to have an efficient way of reading the characters of a StringBuilder. However it has morphed alot since then (from a callback to an iterator model). The API review part of this is the following issue.

https://github.com/dotnet/corefx/issues/29770

This PR is an implementation of that spec. In particular we implement the following new method on StringBuilder.

class StringBuilder { 
      ChunkEnumerator GetChunks();
}

ChunkEnumerator is a struct that follows the IEnumerable<ReadOnlyMemory> pattern and thus works with the C# foreach statement. Here is an example for efficiently implementing 'IndexOf' using GetChunks

        static int IndexOf(StringBuilder sb, char c)
        {
            int pos = 0;
            foreach (ReadOnlyMemory<char> chunk in sb.GetChunks())
            {
                var span = chunk.Span;  // We create a local variable because it is more efficient
                for (int i = 0; i < span .Length; i++)
                    if (span [i] == c)
                        return pos + i;
                pos += span.Length;
            }
            return -1;
        }

The PR contains the actual implementation. Note that ChunkEnumerable does not actually implement IEnumerable<ReadOnlyMemory>.

We considered having it return ReadOnlySpan, but decided that it was too fragile in that Span can't be used across 'await' calls and can't be passed to methods that MAY be async. ReadOnlyMemory does not have this problem (but is a bit less efficient), but does require you to convert it to a Span before accessing the characters (and notice you need to cache the span in a local variable to keep span creation out of the inner loop).

This is not ideal, but is this issue will resurface other places (whenever we want to return chunks), and so it is likely we will invest in JIT optimization to optimize away any inefficiency of using ReadOnlyMemory and converting immediately to Span (which is what would happen in the synchronous case).

At this point it is ready for API review. This amounts to signing off on using ReadOnlyMemory instead of ReadOnlySpan, and the names (is 'Chunks' the right suffix?) and the fact that ChunkEnumerable does not actually implement IEnumerable<ReadOnlyMemory> (It could, it is just more code (which we can add later)).

I have tests, however there seems to be a chicken and egg problem in that you need to have an implementation, before CoreFX can expose it (and have tests for it). I have tested the code (all code paths are covered), but I will need to check in tests as a separate PR.

vancem · 2018-04-12T15:15:58Z

@jaredpar @VSadov - above is a proposal for a way of getting chunks out of an StringBuilder. It is very simple, it is an Enumerator that returns ReadOnlySpan.

The stack-only ness of Span DOES NOT PREVENT building an Enumerator that returns it, but at least currently it would cause some grief if you try to use it in an async method. Consider this example where i put some awaits in various points.

        static int IndexOf(StringBuilder sb, char c)
        {
            int pos = 0;
            foreach (ReadOnlySpan<char> span in sb.EnumerateChunks())
            {
                await Task.Delay(10);  // Point 1
                for (int i = 0; i < span.Length; i++)
                    if (span[i] == c)
                        await Task.Delay(10);  // Point 2
                pos += span.Length;
                await Task.Delay(10);  // Point 3
            }
            return -1;
        }

Today only Point 3 would be allowed as the span is 'dead' at that point and thus it would not need to be captured.

However the compiler could be smart, and frankly after every 'await' it could insert a 'span=enumerator.Current' right after every 'await' and NOT have to capture the span into the closure.

The question to you is : What do you think about this as a feature? If you like it, how hard is it?

While this feature would be handy for this particular API, it really is a general problem of enumerating a collection of spans (in an async environment). We can fix it just by avoiding spans (using memory instead), but that is both less efficient, and more semantically complex.

What do you think? (Today people can work-around this issue by expanding the 'foreach' by hand, but that is reasonably ugly....)

cc: @davidfowl, @stephentoub, @jkotas

jaredpar · 2018-04-12T15:22:19Z

The question to you is : What do you think about this as a feature? If you like it, how hard is it?

This is introducing hidden side effects into developer code. How would you explain such a feature to users? Consider cases where var is used. How do you explain the set of variables and functions that execute at the conclusion of an await block.

Also consider the times when await executes promptly. Do we call sb.Current anyways? If not then that means the times, and potential side effects, of executing this hidden code is random-ish.

vancem · 2018-04-12T15:50:19Z

@jaredpar

I think part of the makes this idea tractable is that there is a strong expectation that Enumerator.Current is side-effect free. Thus you could simply state that it is undefined what happens when this is not true (that is the compiler is allowed to insert as many calls to Enumerator.Current as it wants). You could narrow this if you like to say that this freedom only applies to enumerators that return ref-like structs and use await.

Alternatively, you could say that semantics is that yes, logically there is EXACTLY a call to Enumerator.Current at variable initialization, and then after every await statement, but the compiler is free to optimize these away if it can. To first order, it simply does not (the JIT may or may not do it). This is a bit sub-optimal, but I could easily live with it.

I want to note, however that the CURRENT behavior is pretty unhelpful (if you try to use enumerations of Span in an async method, you will get errors that you can't do that (even though there is an obvious way to make it work). This is rather unfortunate.

If it were me I would define the semantics such that compiler is free to call Enumerator.Current as many times as it wants (that is it is the behavior can vary if Enumerator.Current has side effects).

jaredpar · 2018-04-12T16:16:56Z

I think part of the makes this idea tractable is that there is a strong expectation that Enumerator.Current is side-effect free.

I do not share that expectation. Customers do bad things with their code all the time. This wouldn't be the first, second, third, etc ... time that a method we strongly expected to be side effect free was indeed not side effect free.

Thus you could simply state that it is undefined what happens when this is not true (that is the compiler is allowed to insert as many calls to Enumerator.Current as it wants).

Undefined behavior is for C++. 😄

vancem · 2018-04-12T17:17:11Z

If you really want to have well defined semantics in the face of Enumerator.Current having side effects, you can do that. Simply state it happens when the variable is created, and after every await. You are done. Perfectly well defined semantics, AND semantics that people WANT (because it turns error behavior into very reasonable non-error behavior).

The only reason the compile does this (and not the user), is because the user CAN'T because the COMPILER hid the variable needed to do it.

VSadov · 2018-04-12T17:50:40Z

Even if we could express purity of methods in CLR somehow, the Current here is clearly not idempotent since it can observe sideeffects. Calling Current twice can return different results.
The overall language design (and common sense) forbids duplicating operations like this.

I understand the overall idea and there could be some way to do what is desired here, but initial proposal for how to do it feels too fragile for something that language will have to live with, perhaps for many years.

I like the part that iterator produces spans. It seems very clever.

vancem · 2018-04-12T18:21:36Z

Here is what the code looks like if you tell people not to use foreach.

I suppose it is not that bad (and only is needed if you need awaits in point 1 or point 2).

        static async Task<int> IndexOf(StringBuilder sb, char c)
        {
            int pos = 0;
            var spanEnum = sb.EnumerateChunks().GetEnumerator();
            while (spanEnum.MoveNext())
                await Task.Delay(10); // Point 1
                var span = spanEnum.Current();
                for (int i = 0; i < span.Length; i++)
                    if (span[i] == c)
                    {
                        await Task.Delay(10); // Point 2
                        span = spanEnum.Current();
                    }
                await Task.Delay(10);  // Point 3
            }
            return -1;
        }

Is the current compiler is smart enough that it knows that 'span' is dead at all the awaits and thus does not try to capture them (which presumably would cause an error)?

VSadov · 2018-04-12T22:05:17Z

The code does not look too bad, but it would not currently compile.

There is a bigger problem. Any kind of Span locals are not allowed in async methods right now. There is a proposal to relax this for { } blocks that do not contain awaits.

The reason why we do not want to key of the capturing is because capturing is an implementation detail. We keep improving/changing the capturing story with every version. As the strategy changes we may capture less (or sometimes more, but hopefully less). Also the strategy is different between Release and Debug. In Debug we capture more aggressively, otherwise locals may vanish after await.

For once I cannot guarantee that Mono C# compiler captures exactly the same set of locals. It most certainly does not.

Basically we do not want to rely on a fluid thing full of optional implementation details and optimizations as a reason to accept or reject programs.

VSadov · 2018-04-12T22:11:59Z

One idea could be that the Current returns a delegate that returns the span. In this way you could keep the delegate however you want and fetch the span between awaits with a simple ().

In the context of async method a delegate invoke should be cheap (evidently you are suggesting calling Current after every await as "ok") , but I would be a bit worried about penalizing the non-async case.

VSadov · 2018-04-12T22:16:14Z

That would still run into the issue with Span locals not being allowed in async methods, but assuming dotnet/csharplang#1331 , it could work.

vancem · 2018-04-12T22:48:33Z

The following code should work today but it needs a helper method.
Basically it passes the span into the helper and the helper does all the work, and returns if it needs to await. This could get arbitrarily ugly to partition your code in this way, but I see that realistic cases are likely to be fine (it is not a hardship. Indeed with dotnet/csharplang#1331 you probably would not need the helper method most of the time and the code would still be clean.

        static async Task<int> IndexOf1(StringBuilder sb, char c)
        {
            var spanEnum = sb.EnumerateChunks().GetEnumerator();
            while (spanEnum.MoveNext())
            {
                await Task.Delay(10); // Point 1

                int i = 0;
                while (Helper(spanEnum.Current, char c))
                    await Task.Delay(10); // Point 2

                await Task.Delay(10); //Point 3
            }
            return -1;
        }

        static bool Helper(ReadOnlySpan<char> span, ref int i, char c)
        {
            while (i < span.Length)
            {
                if (span[i] == c)
                    return true;
                i++;
            }
            return false;
        }

So I think the upshot is that you CAN make things work with an Enumerable of a ReadOnlySpan in an async method, but dotnet/csharplang#1331 makes it better in most cases. It does mean that you can't use foreach over it in any async method. This may be acceptable.

VSadov · 2018-04-12T22:52:45Z

BTW: You can make Helper a local nested method and get very close (syntactically, and operationally, if JIT inlines it) to what dotnet/csharplang#1331 promises.

VSadov · 2018-04-12T23:22:54Z

Something like:

        static async Task<int> IndexOf1(StringBuilder sb, char c)
        {
            var spanEnum = sb.EnumerateChunks().GetEnumerator();
            int i = 0;
            
            while (spanEnum.MoveNext())
            {
                await Task.Delay(10); // Point 1

                bool InnerLoop()
                {
                    foreach (char ch in spanEnum.Current)
                    {
                        if (ch == c)
                            return true;

                        i++;

                        // Can't have await in this block since InnerLoop is an ordinary method.
                        // On the other hand it can use span locals.
                        // await Task.Delay(10); // Point 2
                    }

                    return false;
                }

                if (InnerLoop())
                    return i;                                        
                                   
                await Task.Delay(10); //Point 3
            }
            return -1;
        }

Not the prettiest piece, but I agree with the main point - you CAN make this work even in async, if ok with jumping through some hoops.

vancem · 2018-04-13T00:25:20Z

Ah, I do see that variable capture inside the nested function is pretty nice here. The main pain is that you have to return to do any await, which is a pain, I am not at all convinced that it will happen enough to matter at all.

At this point I think I am becoming comfortable with the API returning an Enumeratable over ReadOnlySpan (we don't need ReadOnlyMemory). It works great for synchronous code, and it is not that bad for async code.

davidfowl · 2018-04-13T14:40:58Z

@vancem what if I want to call an asynchronous TextWriter method?

danmoseley · 2018-04-17T20:41:30Z

@vancem can you clarify what the "chunks" are in this context? Do they map to the chunks in the implementation? That would couple API to implementation (which could change again, indeed StringBuilder was not originally chunk based) and also produce Spans of possibly wildly (and unexpectedly) varying sizes, depending on the history of the StringBuilder. Clearly the span returned cannot straddle a chunk boundary without an allocation somewhere. But would it make sense to cap the size of the span, if only to logically separate the implementation detail?

VSadov · 2018-04-17T20:46:59Z

I think the pattern is more general - a similar approach would work for anything backed by segmented buffers. Of which a "single buffer" is just a subset.

I think varying chunk sizes is ok and expected.
It should probably never return an empty chunk (just because that is silly, but not dangerous really).

vancem · 2018-04-17T20:56:23Z

My response is basically the same as @VSadov. All we want is an API that allows things to give back the characters efficiently in a way that is mostly independent of the implementation. This proposal covers a broad swath (as long as the characters are stored as bobs of characters you can implement this interface efficiently). Sure if you encode the stringbuilder as UTF8 chars you have a problem, but it is only a perf problem and it is no worse then the perf problem they have today (since today we force them to form a string.

stephentoub · 2018-05-15T19:51:36Z

@vancem, I think this should yield ReadOnlyMemory<char> rather than ReadOnlySpan<char>. It's not just await'ing between iterations that's an issue, but rather being able to hand off the value to an asynchronous method as well, e.g.

foreach (ReadOnlyMemory<char> buffer in sb.EnumerateChunks())
{
    await textWriter.WriteAsync(buffer);
}

I can't do that if buffer is a span.

vancem · 2018-05-15T21:38:24Z

I have changed it from ReadOnlySpan back to ReadOnlyMemory. It is fragile being a Span, and the cost of going to Memory is not high.

vancem · 2018-05-17T20:40:57Z

After a bunch of discussion above, I have decided to use ReadOnlyMemory rather than ReadOnlySpan because it allows the chunks to be passed around to APIs that are ASYNC (which can't use Span) (this is Stephen's point above).

The disadvantage of this choice is that

Generally you want to access the characters, which means calling chunk.Span. Thus we made things slightly harder in a common case (but we allow more schenarios.
This is less efficient since you have to make a Memory object just to fetch a span from it and throw it away. JIT Optimization could fix this however.

danmoseley · 2018-05-18T04:09:36Z

src/System.Private.CoreLib/shared/System/Text/StringBuilder.cs

+                _currentChunk = next;
+                return true;
+            }
+            /// <summary>


Nit newline above

danmoseley · 2018-05-18T04:11:09Z

src/System.Private.CoreLib/shared/System/Text/StringBuilder.cs

+
+                // There is a performance-vs-allocation tradeoff.   Because the chunks
+                // are a linked list with each chunk pointing to its PREDECESSOR, walking
+                // the list FORWARD is not efficient.   If there are few chunks (< 8) we


I guess the number 8 is just a judgement call here? I don't have good intuition for the number of chunks in the real world.

Yes, 8 is a judgement call. Bascially you want a number before N*N gets too big. Generally if you do 'nothing' StringBuilders start with 16 bytes and double each time, so this handles strings up to about 2K-4K if you don't ever specify a starting length. Thus 8 keeps things non-allocating for the common case of strings < 256 bytes (e.g. paths). On the other hand, at 2-4K of data, the cost of copying the data is probably now large with respect to the cost of allocating the array, so the value of making the nubmer bigger is probably not a big deal (especially because people using StringBuilder on large strings should be setting a good starting length).

We can tune it as necessary.

danmoseley · 2018-05-18T04:12:47Z

src/System.Private.CoreLib/shared/System/Text/StringBuilder.cs

+                    while (0 <= --chunkCount)
+                    {
+                        _chunks[chunkCount] = stringBuilder;
+                        stringBuilder = stringBuilder.m_ChunkPrevious;


Assert stringBuilder.m_ChunkPrevious?

danmoseley · 2018-05-18T04:15:22Z

src/System.Private.CoreLib/shared/System/Text/StringBuilder.cs

+                // However above this size, we allocate an array to hold pointers to all
+                // the chunks and we can be efficient for large N.    
+                int chunkCount = ChunkCount(stringBuilder);
+                if (8 < chunkCount)


Nit, this is sort of a Yoda conditional

I don't follow the change you want.

if (chunkCount > 8) https://en.wikipedia.org/wiki/Yoda_conditions

I see. Quite a while back, I adopted a convention of always making relational operators go from smallest to largest. (basically I never use > or >=). I find it slightly clearer and less error prone, as it eliminates one source of arbitrary variation, and allows visual position to help in reading the code (this is especially nice if you are testing for a range).

For what it is worth...

That makes sense. I may do it.

stephentoub · 2018-05-18T18:54:43Z

src/System.Private.CoreLib/shared/System/Text/StringBuilder.cs

+                Debug.Assert(stringBuilder != null);    // Because it is only called with a 'this' pointer.    
+                _stringBuilder = stringBuilder;
+            }
+            private StringBuilder _stringBuilder;   // We insure this is never null. 


Nit: insure => ensure

Nit: readonly

stephentoub · 2018-05-18T18:55:20Z

src/System.Private.CoreLib/shared/System/Text/StringBuilder.cs

+        /// It needs to be public (so the compiler can use it when building a foreach statement) 
+        /// but users typically don't use it explicitly (which is why it is a nested type).  
+        /// </summary>
+        public struct ChunkEnumerable


Nit: readonly?

stephentoub · 2018-05-18T18:59:22Z

src/System.Private.CoreLib/shared/System/Text/StringBuilder.cs

+                    }
+                    _chunkPos = -1;
+                }
+                StringBuilder[] _chunks;    // These are in normal order (first chunk first) 


Nit: private readonly (I'll stop commenting on these things)

davidfowl · 2018-05-19T08:20:41Z

@vancem so this just needs to go through an API review and it'll be into 2.2?

vancem · 2018-05-22T20:48:44Z

OK, I have made the updates based on the API review https://github.com/dotnet/corefx/issues/29770, so now we just have ChunkEnumerator, but from a users point of view (that he can use foreach) nothing has changed.

I have tests to add, but they go in the CoreFX repo along with changes to System.Runtime, and have to be done after this merge is complete and we have a good build that CoreFX can refer to.

So fair warning: I am going to merge this later today (so we tonight's build has it), so if you have any reservations, speak now (in the next hour or two).

…erator Adding EnumerateChunks which allow efficient scanning of a StringBuilder Signed-off-by: dotnet-bot <[email protected]>

…erator Adding EnumerateChunks which allow efficient scanning of a StringBuilder Signed-off-by: dotnet-bot-corefx-mirror <[email protected]>

danmoseley · 2018-05-22T23:24:55Z

src/System.Private.CoreLib/shared/System/Text/StringBuilder.cs

+                    _chunkPos = -1;
+                }
+
+                readonly StringBuilder[] _chunks;    // These are in normal order (first chunk first) 


Can you please put fields first, per coding guidelines, and add missing visibility specifiers (private)? in ManyChunkInfo and ChunkEnumerator

danmoseley

LGTM

…erator Adding EnumerateChunks which allow efficient scanning of a StringBuilder Signed-off-by: dotnet-bot <[email protected]>

…erator Adding EnumerateChunks which allow efficient scanning of a StringBuilder Signed-off-by: dotnet-bot-corefx-mirror <[email protected]>

vancem added the * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons) label Apr 12, 2018

dotnet-bot added the 2 - In Progress label Apr 12, 2018

RussKeldorph added the area-System.Runtime label Apr 13, 2018

Adding EnumerateChunks which allow efficient scanning of a StringBuilder

9b1f2b2

vancem force-pushed the StringBuilderEnumerator branch from 338c1a6 to 9b1f2b2 Compare May 15, 2018 21:29

Change Span to Memory

8ae3544

vancem added 3 commits May 16, 2018 18:55

Add Asserts. Fix bug found in testing.

e646dd2

Fix typo

3159048

Better comments

009a549

danmoseley reviewed May 18, 2018

View reviewed changes

Improve comments, review feedback.

014c623

stephentoub reviewed May 18, 2018

View reviewed changes

vancem added 2 commits May 18, 2018 13:38

add readonly attributes

6eeb7b6

Add another readonly attribute

6198c74

vancem added 4 commits May 22, 2018 10:55

Collapse ChunkEnumerable and ChunkEnumerator.

1028e38

Fix attribute placement

57055f1

Fix whitespace

79cbf2e

Fixe whitespace

0ba7f8d

vancem removed the * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons) label May 22, 2018

vancem merged commit f9bb9e1 into dotnet:master May 22, 2018

danmoseley reviewed May 22, 2018

View reviewed changes

vancem mentioned this pull request May 23, 2018

Expose StringBuilder.GetChunks() dotnet/corefx#29877

Merged

vancem changed the title ~~Adding EnumerateChunks which allow efficient scanning of a StringBuilder~~ Adding GetChunks which allow efficient scanning of a StringBuilder May 24, 2018

vancem deleted the StringBuilderEnumerator branch May 31, 2018 00:03

lewurm mentioned this pull request Feb 1, 2019

[2018-08] Bump corert mono/mono#12721

Merged

vancem mentioned this pull request Jan 31, 2020

Adding GetChunks which allow efficient scanning of a StringBuilder dotnet/runtime#26207

Closed

Adding GetChunks which allow efficient scanning of a StringBuilder #17530

Adding GetChunks which allow efficient scanning of a StringBuilder #17530

Conversation

vancem commented Apr 12, 2018 • edited Loading

vancem commented Apr 12, 2018

jaredpar commented Apr 12, 2018

vancem commented Apr 12, 2018

jaredpar commented Apr 12, 2018

vancem commented Apr 12, 2018

VSadov commented Apr 12, 2018

vancem commented Apr 12, 2018

VSadov commented Apr 12, 2018 • edited Loading

VSadov commented Apr 12, 2018 • edited Loading

VSadov commented Apr 12, 2018

vancem commented Apr 12, 2018

VSadov commented Apr 12, 2018

VSadov commented Apr 12, 2018 • edited Loading

vancem commented Apr 13, 2018

davidfowl commented Apr 13, 2018

danmoseley commented Apr 17, 2018

VSadov commented Apr 17, 2018 • edited Loading

vancem commented Apr 17, 2018

stephentoub commented May 15, 2018

vancem commented May 15, 2018

vancem commented May 17, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stephentoub May 18, 2018 • edited Loading

Choose a reason for hiding this comment

davidfowl commented May 19, 2018

vancem commented May 22, 2018

Choose a reason for hiding this comment

danmoseley left a comment

Choose a reason for hiding this comment

vancem commented Apr 12, 2018 •

edited

Loading

VSadov commented Apr 12, 2018 •

edited

Loading

VSadov commented Apr 12, 2018 •

edited

Loading

VSadov commented Apr 12, 2018 •

edited

Loading

VSadov commented Apr 17, 2018 •

edited

Loading

stephentoub May 18, 2018 •

edited

Loading