Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Adding GetChunks which allow efficient scanning of a StringBuilder #17530

Merged
merged 12 commits into from
May 22, 2018
130 changes: 128 additions & 2 deletions src/System.Private.CoreLib/shared/System/Text/StringBuilder.cs
Original file line number Diff line number Diff line change
@@ -557,6 +557,132 @@ public char this[int index]
}
}

/// <summary>
/// EnumerateChunks returns ChunkEnumerable that follows the IEnumerable pattern and
/// thus can be used in a C# 'foreach' statemens to retreive the data in the StringBuilder
/// as chunks (ReadOnlyMemory) of characters. An example use is:
///
/// foreach (ReadOnlyMemory<char> chunk in sb.EnumerateChunks())
/// { /* operate on chars using 'chunk' */ }
///
/// </summary>
public ChunkEnumerable EnumerateChunks() => new ChunkEnumerable(this);

/// <summary>
/// ChunkEnumerable supports the IEnumerable pattern so foreach works (see EnumerateChunks)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth explaining here why you didn't make String builder implement IEnumerable? (Now you moved off Span, you could?)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is not a strong barrier to implementing IEnumerable. The main reason for not doing so is that it does incur some cost in complexity (more methods (that mostly never get called) and testing, and does not seem to add important value. Its main potential value is that you could pass it to APIs that take a IEnumerable, but that definitely would decrease the performance (you would end up boxing the struct), and chunks are really not a semantic meaningful quantity (it is note the natural thing you would do Linq over). I don't we should be encouraging using Linq over chunks.

So while it may have some value (which is why this question raised in https://github.com/dotnet/corefx/issues/29770), given that we could add it later compatibly, it is lower risk not to include it.

I hesitate to modify this comment however, as it is public facing documentation (and probably should avoid design discussions (which are better in PRs and Issues).

/// It needs to be public (so the compiler can use it when building a foreach statement)
/// but users typically don't use it explicitly (which is why it is a nested type).
/// </summary>
public struct ChunkEnumerable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: readonly?

{
/// <summary>
/// Implements the IEnumerable pattern.
/// </summary>
public ChunkEnumerator GetEnumerator() => new ChunkEnumerator(this._stringBuilder);

#region private
internal ChunkEnumerable(StringBuilder stringBuilder)
{
Debug.Assert(stringBuilder != null); // Because it is only called with at 'this' pointer.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment grammar?

_stringBuilder = stringBuilder;
}
private StringBuilder _stringBuilder; // We insure this is never null.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: insure => ensure

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: readonly

#endregion
}

/// <summary>
/// ChunkEnumerator supports the IEnumerator pattern so foreach works (see EnumerateChunks)
/// It needs to be public (so the compiler can use it when building a foreach statement)
/// but users typically don't use it explicitly (which is why it is a nested type).
/// </summary>
public struct ChunkEnumerator
{
/// <summary>
/// Implements the IEnumerator pattern.
/// </summary>
public bool MoveNext()
{
if (_currentChunk == _firstChunk)
return false;

if (_manyChunks != null)
return _manyChunks.MoveNext(ref _currentChunk);

StringBuilder next = _firstChunk;
while (next.m_ChunkPrevious != _currentChunk)
next = next.m_ChunkPrevious;
_currentChunk = next;
return true;
}
/// <summary>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit newline above

/// Implements the IEnumerator pattern.
/// </summary>
public ReadOnlyMemory<char> Current => new ReadOnlyMemory<char>(_currentChunk.m_ChunkChars, 0, _currentChunk.m_ChunkLength);

#region private
internal ChunkEnumerator(StringBuilder stringBuilder)
{
Debug.Assert(stringBuilder != null);
_firstChunk = stringBuilder;
_currentChunk = null; // MoveNext will find the last chunk if we do this.
_manyChunks = null;

// There is a performance-vs-allocation tradeoff. Because the chunks
// are a linked list with each chunk pointing to its PREDECESSOR, walking
// the list FORWARD is not efficient. If there are few chunks (< 8) we
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the number 8 is just a judgement call here? I don't have good intuition for the number of chunks in the real world.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, 8 is a judgement call. Bascially you want a number before N*N gets too big. Generally if you do 'nothing' StringBuilders start with 16 bytes and double each time, so this handles strings up to about 2K-4K if you don't ever specify a starting length. Thus 8 keeps things non-allocating for the common case of strings < 256 bytes (e.g. paths). On the other hand, at 2-4K of data, the cost of copying the data is probably now large with respect to the cost of allocating the array, so the value of making the nubmer bigger is probably not a big deal (especially because people using StringBuilder on large strings should be setting a good starting length).

We can tune it as necessary.

// simply scan from the start each time, and tolerate the N*N behavior.
// However above this size, we allocate an array to hold pointers to all
// the chunks and we can be efficient for large N.
int chunkCount = ChunkCount(stringBuilder);
if (8 < chunkCount)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, this is sort of a Yoda conditional

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow the change you want.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Quite a while back, I adopted a convention of always making relational operators go from smallest to largest. (basically I never use > or >=). I find it slightly clearer and less error prone, as it eliminates one source of arbitrary variation, and allows visual position to help in reading the code (this is especially nice if you are testing for a range).

For what it is worth...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. I may do it.

_manyChunks = new ManyChunkInfo(stringBuilder, chunkCount);
}

private static int ChunkCount(StringBuilder stringBuilder)
{
int ret = 0;
while (stringBuilder != null)
{
ret++;
stringBuilder = stringBuilder.m_ChunkPrevious;
}
return ret;
}

/// <summary>
/// Used to hold all the chunks indexes when you have many chunks.
/// </summary>
private class ManyChunkInfo
{
public bool MoveNext(ref StringBuilder current)
{
int pos = ++_chunkPos;
if (_chunks.Length <= pos)
return false;
current = _chunks[pos];
return true;
}

public ManyChunkInfo(StringBuilder stringBuilder, int chunkCount)
{
_chunks = new StringBuilder[chunkCount];
while (0 <= --chunkCount)
{
_chunks[chunkCount] = stringBuilder;
stringBuilder = stringBuilder.m_ChunkPrevious;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assert stringBuilder.m_ChunkPrevious?

}
_chunkPos = -1;
}
StringBuilder[] _chunks; // These are in normal order (first chunk first)
Copy link
Member

@stephentoub stephentoub May 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: private readonly (I'll stop commenting on these things)

int _chunkPos;
}

StringBuilder _firstChunk; // The first Stringbuilder chunk (which is the end of the logical string)
StringBuilder _currentChunk; // The chunk that this enumerator is currently returning (Current).
ManyChunkInfo _manyChunks; // Only used for long string builders with many chunks (see constructor)
#endregion
}

/// <summary>
/// Appends a character 0 or more times to the end of this builder.
/// </summary>
@@ -1079,7 +1205,7 @@ public StringBuilder Append(ReadOnlySpan<char> value)
return this;
}

#region AppendJoin
#region AppendJoin

public unsafe StringBuilder AppendJoin(string separator, params object[] values)
{
@@ -1187,7 +1313,7 @@ private unsafe StringBuilder AppendJoinCore<T>(char* separator, int separatorLen
return this;
}

#endregion
#endregion

public StringBuilder Insert(int index, String value)
{