[API Proposal]: SplitFirst/SplitLast methods for strings and (ReadOnly)Span/Memory<T> slices #75317

Open
neon-sunset opened this issue Sep 9, 2022 · 16 comments
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Memory
@neon-sunset
Contributor

neon-sunset commented Sep 9, 2022

Background and motivation

While working with various solutions both internal and external, it appears there is a commonly used pattern to split strings just once which can be improved with an allocation-free helper.

Specifically, such patterns are especially common when processing headers, fields in JSON payloads, delimited pairs of values, etc.
For example, when handling OAuth 2.0 headers, a frequently seen pattern is the following:

var header = "Authorization: Bearer mF_9.B5f-4.1JqM";
var token = header.Split(':')[1].Trim();
// remove 'Bearer' and handle auth

While it has acceptable performance, we may be able to do better by providing convenience methods that handle "split once" cases idiomatically while simultaneously nudging developers towards performance-oriented APIs.

In fact, when dealing with similar scenarios, Rust decided to explicitly have a dedicated function over consuming a split iterator: https://doc.rust-lang.org/std/primitive.str.html#method.split_once.

UPD: Proposal amended with SplitFirst/SplitLast suggestion by @MichalPetryka

API Proposal

Concerns and open questions

Q: Too many overloads / the API is unnecessarily complex
A: It's a suggested tradeoff in order to support string, {ReadOnly}Memory<T> and {ReadOnly}Span<T>. The rationale behind it is to reduce boilerplate and have good discoverability i.e. when writing string.Split it becomes very easy to discover and use .SplitFirst without having to "know" that such API exists beforehand.

Q: StringComparison overloads
A: They have been omitted for brevity. It may be worthwhile to either add them or delegate them to the more complex APIs suggested in #93 or #76186. If you have any concerns, please let me know.

Q: Exception handling, .Empty span/memory vs Exception
A: Since the API is intended to be used as a short form, the suggested behavior is to return the matched slices in Segment and Remainder, with Remainder.IsEmpty being true when the source contains no separator.

public static class MemoryExtensions
{
    public static (string, string) SplitFirst(this string source, char separator);
    public static (string, string) SplitFirst(this string source, ReadOnlySpan<char> separator);

    public static Split<T> SplitFirst<T>(this Span<T> source, T separator) where T : IEquatable<T>;
    public static Split<T> SplitFirst<T>(this Span<T> source, ReadOnlySpan<T> separator) where T : IEquatable<T>;

    public static ReadOnlySplit<T> SplitFirst<T>(this ReadOnlySpan<T> source, T separator) where T : IEquatable<T>;
    public static ReadOnlySplit<T> SplitFirst<T>(this ReadOnlySpan<T> source, ReadOnlySpan<T> separator) where T : IEquatable<T>;

    public static (Memory<T> Segment, Memory<T> Remainder) SplitFirst<T>(this Memory<T> source, T separator)  where T : IEquatable<T>;
    public static (Memory<T> Segment, Memory<T> Remainder) SplitFirst<T>(this Memory<T> source, ReadOnlySpan<T> separator) where T : IEquatable<T>;

    public static (ReadOnlyMemory<T> Segment, ReadOnlyMemory<T> Remainder) SplitFirst<T>(this ReadOnlyMemory<T> source, T separator) where T : IEquatable<T>;
    public static (ReadOnlyMemory<T> Segment, ReadOnlyMemory<T> Remainder) SplitFirst<T>(this ReadOnlyMemory<T> source, ReadOnlySpan<T> separator) where T : IEquatable<T>;

    public static (string, string) SplitLast(this string source, char separator);
    public static (string, string) SplitLast(this string source, ReadOnlySpan<char> separator);

    public static Split<T> SplitLast<T>(this Span<T> source, T separator) where T : IEquatable<T>;
    public static Split<T> SplitLast<T>(this Span<T> source, ReadOnlySpan<T> separator) where T : IEquatable<T>;

    public static ReadOnlySplit<T> SplitLast<T>(this ReadOnlySpan<T> source, T separator) where T : IEquatable<T>;
    public static ReadOnlySplit<T> SplitLast<T>(this ReadOnlySpan<T> source, ReadOnlySpan<T> separator) where T : IEquatable<T>;

    public static (Memory<T> Segment, Memory<T> Remainder) SplitLast<T>(this Memory<T> source, T separator) where T : IEquatable<T>;
    public static (Memory<T> Segment, Memory<T> Remainder) SplitLast<T>(this Memory<T> source, ReadOnlySpan<T> separator) where T : IEquatable<T>;

    public static (ReadOnlyMemory<T> Segment, ReadOnlyMemory<T> Remainder) SplitLast<T>(this ReadOnlyMemory<T> source, T separator) where T : IEquatable<T>;
    public static (ReadOnlyMemory<T> Segment, ReadOnlyMemory<T> Remainder) SplitLast<T>(this ReadOnlyMemory<T> source, ReadOnlySpan<T> separator) where T : IEquatable<T>;

    // RefValueTuple? :)
    public readonly ref struct Split<T>
    {
        public Span<T> Segment { get; }

        public Span<T> Remainder { get; }

        public void Deconstruct(out Span<T> segment, out Span<T> remainder);
    }

    public readonly ref struct ReadOnlySplit<T>
    {
        public ReadOnlySpan<T> Segment { get; }

        public ReadOnlySpan<T> Remainder { get; }

        public void Deconstruct(out ReadOnlySpan<T> segment, out ReadOnlySpan<T> remainder);
    }
}

Reference impl.: https://gist.github.com/neon-sunset/df6fb9fe6bb71f11c2b47fbeae55e6da
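
To make the proposed "no separator" semantics concrete, here is a minimal sketch of one overload. It is not the linked reference implementation: the class name SplitSketch and the constructor on ReadOnlySplit<T> are assumptions added for illustration, and only the single-element separator case is covered.

```csharp
using System;

// Hypothetical result type mirroring the proposed ReadOnlySplit<T>.
// The constructor is an assumption; the proposal only lists the properties.
public readonly ref struct ReadOnlySplit<T>
{
    public ReadOnlySplit(ReadOnlySpan<T> segment, ReadOnlySpan<T> remainder)
    {
        Segment = segment;
        Remainder = remainder;
    }

    public ReadOnlySpan<T> Segment { get; }

    public ReadOnlySpan<T> Remainder { get; }

    public void Deconstruct(out ReadOnlySpan<T> segment, out ReadOnlySpan<T> remainder)
    {
        segment = Segment;
        remainder = Remainder;
    }
}

// Hypothetical host class; the proposal places these methods on MemoryExtensions.
public static class SplitSketch
{
    public static ReadOnlySplit<T> SplitFirst<T>(this ReadOnlySpan<T> source, T separator)
        where T : IEquatable<T>
    {
        int index = source.IndexOf(separator);
        if (index < 0)
        {
            // Separator not found: the whole input becomes Segment, Remainder is empty.
            return new ReadOnlySplit<T>(source, ReadOnlySpan<T>.Empty);
        }

        // Slice around the separator; no copies or allocations are made.
        return new ReadOnlySplit<T>(source.Slice(0, index), source.Slice(index + 1));
    }
}
```

With this sketch, var (key, value) = "name=value".AsSpan().SplitFirst('='); yields the slices "name" and "value", while an input without '=' yields the full input and an empty Remainder.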

API Usage

Simple use

var (name, surname) = "John-Doe".SplitFirst('-');

Method chaining

var middleName = "Stanley Bartholomew Burnside"
    .AsMemory()
    .SplitFirst(' ')
    .Remainder
    .SplitLast(' ')
    .Segment;

Slicing bearer token out of full header string

// In async middleware
var header = "Authorization: Bearer mF_9.B5f-4.1JqM";
var token = header.AsMemory().SplitLast(' ').Remainder.Trim();

if (token.Length is 0)
{
  // set unauthorized
}

Retrieving arbitrary data from a text field

var message = "FinalNotification::Suspended\r\n";

var (messageType, status) = message 
    .AsSpan()
    .TrimEnd("\r\n")
    .SplitFirst("::");

Alternative Designs

Existing string.Split(...) overloads, #93 and #76186

Risks

Minimal. The API is intended to co-exist with more complex versions, such as the split iterator suggested in an alternative design.
If there is demand, it allows for further extension with .Split{First/Last}(this Span<T> source...) overloads.

@neon-sunset neon-sunset added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Sep 9, 2022
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Sep 9, 2022
@ghost

ghost commented Sep 9, 2022

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

While working with various solutions both internal and external, it appears there is a commonly used pattern to split strings just once which can be improved with an allocation-free helper similar to split_once() function in Rust.

Specifically, such patterns are especially common when processing headers and fields in JSON payloads.
For example, when handling OAuth 2.0 headers, a frequently seen pattern is the following:

var header = "Authorization: Bearer mF_9.B5f-4.1JqM";
var token = header.Split(':')[1].Trim();
// remove 'Bearer' and handle auth

While it has acceptable performance, we may be able to do better by providing convenience methods that handle "split once" cases idiomatically while simultaneously nudging developers towards performance-oriented APIs.

API Proposal

namespace System.Text:

public static class SplitHelpers
{
    public static (ReadOnlyMemory<char> Left, ReadOnlyMemory<char> Right) SplitOnce(
        this string source, char separator);

    public static (ReadOnlyMemory<char> Left, ReadOnlyMemory<char> Right) SplitOnce(
        this string source, ReadOnlySpan<char> separator);

    public static (ReadOnlyMemory<char> Left, ReadOnlyMemory<char> Right) SplitOnce(
        this ReadOnlyMemory<char> source, char separator);

    public static (ReadOnlyMemory<char> Left, ReadOnlyMemory<char> Right) SplitOnce(
        this ReadOnlyMemory<char> source, ReadOnlySpan<char> separator);

    public static ReadOnlySpanSplit<char> SplitOnce(this ReadOnlySpan<char> source, char separator);

    public static ReadOnlySpanSplit<char> SplitOnce(
        this ReadOnlySpan<char> source, ReadOnlySpan<char> separator);

    public static SpanSplit<char> SplitOnce(this Span<char> source, char separator);

    public static SpanSplit<char> SplitOnce(this Span<char> source, ReadOnlySpan<char> separator);

    // RefValueTuple? :)
    public readonly ref struct ReadOnlySpanSplit<T>
    {
        public readonly ReadOnlySpan<T> Left;

        public readonly ReadOnlySpan<T> Right;
        ...
        public void Deconstruct(out ReadOnlySpan<T> left, out ReadOnlySpan<T> right)
        {
            left = Left;
            right = Right;
        }
    }

    public readonly ref struct SpanSplit<T>
    {
        public readonly Span<T> Left;

        public readonly Span<T> Right;
        ...
        public void Deconstruct(out Span<T> left, out Span<T> right)
        {
            left = Left;
            right = Right;
        }
    }
}

API Usage

var (name, surname) = "John Doe".SplitOnce(' ');

var messageHeader = "FinalNotification::Suspended\r\n";
var (notificationKind, status) = messageHeader
    .AsSpan()
    .TrimEnd("\r\n")
    .SplitOnce("::");

Header example:

var header = "Authorization: Bearer mF_9.B5f-4.1JqM";

// Get token slice out of string. Because split consists of two ROMs,
// the operations are both allocation free and work in scenarios like async middleware
var token = header.SplitOnce(':').Right.Trim();

Alternative Designs

Either keep using currently available methods or rely on hand-written/nuget extensions.

Risks

Because string.SplitOnce(...) will return a tuple of two ReadOnlyMemory<char> slices, users who haven't worked with spans/memory previously might have to check the documentation.

Author: neon-sunset
Assignees: -
Labels:

api-suggestion, area-System.Memory

Milestone: -

@neon-sunset neon-sunset changed the title [API Proposal] SplitOnce() convenience methods for strings and Span/Memory<char> slices [API Proposal]: SplitOnce() convenience methods for strings and Span/Memory<char> slices Sep 9, 2022
@neon-sunset
Contributor Author

neon-sunset commented Sep 9, 2022

Here are the numbers for prototype impl. that can only handle char separator from https://gist.github.com/neon-sunset/df6fb9fe6bb71f11c2b47fbeae55e6da

[MemoryDiagnoser]
public class SplitBenchmarks
{
    [Params(
        "Authorization: test-token",
        "longstringlongstringlongstringlongstringlongstring:longstringlongstring")]
    public string Value = string.Empty;

    [Benchmark(Baseline = true)]
    public int Split() => Value.Split(':')[1].Length;

    [Benchmark]
    public int SplitOnce() => Value.SplitOnce(':').Right.Length;
}
BenchmarkDotNet=v0.13.1, OS=macOS 13.0 (22A5331f) [Darwin 22.1.0]
Apple M1 Pro, 1 CPU, 8 logical and 8 physical cores
.NET SDK=7.0.100-rc.2.22452.3
  [Host]     : .NET 7.0.0 (7.0.22.42212), Arm64 RyuJIT
  DefaultJob : .NET 7.0.0 (7.0.22.42212), Arm64 RyuJIT
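
| Method    | Value                | Mean      | Error     | StdDev    | Ratio | Gen 0  | Allocated |
|-----------|----------------------|-----------|-----------|-----------|-------|--------|-----------|
| Split     | Autho(...)token [25] | 27.656 ns | 0.0579 ns | 0.0484 ns | 1.00  | 0.0026 | 136 B     |
| SplitOnce | Autho(...)token [25] | 9.679 ns  | 0.0789 ns | 0.0700 ns | 0.35  | -      | -         |
| Split     | longs(...)tring [71] | 38.610 ns | 0.1104 ns | 0.1033 ns | 1.00  | 0.0045 | 232 B     |
| SplitOnce | longs(...)tring [71] | 11.267 ns | 0.0596 ns | 0.0557 ns | 0.29  | -      | -         |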

@neon-sunset neon-sunset changed the title [API Proposal]: SplitOnce() convenience methods for strings and Span/Memory<char> slices [API Proposal]: SplitFirst/SplitLast convenience methods for strings and Span/Memory<char> slices Sep 9, 2022
@ovska

ovska commented Sep 9, 2022

Seems like a large and complicated API surface, considering it can be done with an IndexOf and two slices, as is done in the linked implementation.

@stephentoub
Member

There's also an existing proposal with a lot of discussion and work around splitting support for spans here:
#934

@neon-sunset
Contributor Author

neon-sunset commented Sep 9, 2022

There's also an existing proposal with a lot of discussion and work around splitting support for spans here:
#934

Thanks, I have seen the linked proposal but thought the above use case is a "low hanging fruit" that can serve one of the most often used .Split() scenarios sufficient to warrant a separate proposal.

@stephentoub
Member

but thought the above use case is a "low hanging fruit"

It just doesn't seem like it saves a lot. Presumably all of these would have to throw an exception if the separator wasn't found (putting everything into one of the two sides seems arbitrary and likely bug-inducing), and I'm not sure that actually maps to the majority use case where this would be used (you have a header parsing example, but I struggle to imagine real-world usage for header parsing that would be ok with the proposed semantics when the separator wasn't found). And even then it's not saving a whole lot, e.g.

(ReadOnlySpan<char> firstName, ReadOnlySpan<char> lastName) = name.SplitFirst(' ');

instead of:

int pos = name.IndexOf(' ');
ReadOnlySpan<char> firstName = name[0..pos];
ReadOnlySpan<char> lastName = name[pos+1..];

where the latter is also more flexible, doesn't require learning about a new method, etc.
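
One detail worth spelling out for the manual version: IndexOf returns -1 when the separator is absent, so the slices need an explicit check before they are taken. A minimal sketch of that guarded form follows; the NameParsing class and TrySplitName method are hypothetical names used only for illustration.

```csharp
using System;

// Hypothetical helper showing the IndexOf-plus-two-slices approach
// with the "separator not found" case handled explicitly.
public static class NameParsing
{
    public static bool TrySplitName(
        ReadOnlySpan<char> name,
        out ReadOnlySpan<char> firstName,
        out ReadOnlySpan<char> lastName)
    {
        int pos = name.IndexOf(' ');
        if (pos < 0)
        {
            // No space in the input: report failure instead of slicing with -1.
            firstName = default;
            lastName = default;
            return false;
        }

        firstName = name[..pos];
        lastName = name[(pos + 1)..];
        return true;
    }
}
```

The caller decides what "not found" means (reject the input, fall back to a default, and so on), which is the flexibility the manual approach keeps.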

Personally, it doesn't seem worth it to me. Happy to hear opinions from others, though.

@neon-sunset
Contributor Author

neon-sunset commented Sep 9, 2022

Presumably all of these would have to throw an exception if the separator wasn't found (putting everything into one of the two sides seems arbitrary and likely bug-inducing), and I'm not sure that actually maps to the majority use case where this would be used (you have a header parsing example, but I struggle to imagine real-world usage for header parsing that would be ok with the proposed semantics when the separator wasn't found).

Agreed. I don't have a strong opinion on which is better: throwing an exception or returning the right slice as empty. Implementation-wise, not all code wants exception handling, and the expectation is to simply check right.Length is 0: if there was no separator, there is nothing to "split off".

The intention is to nudge developers who don't know or don't use slice-like types such as ROM<char> towards more performant code. Someone would just write var (_, trailer) = string.SplitLast(...); without thinking twice and get the desired behavior.

@jeffhandley jeffhandley added this to the Future milestone Sep 16, 2022
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Sep 16, 2022
@neon-sunset
Contributor Author

neon-sunset commented Sep 28, 2022

The API proposal has been updated with the following changes:

  1. Align this string source overload to return (string, string) which is consistent with BCL convention for string.Trim() vs span.Trim();
  2. Add short Q&A section to provide additional context
  3. Reformat API definition for readability

@neon-sunset
Contributor Author

Is there anything that can be done to move forward with the proposal?

The pattern appears to be quite popular: https://grep.app/search?q=%5C.Split%5C%28%27.%28.%29%27%5C%29%5C%5B%28.%29%5C%5D&regexp=true&case=true&filter[lang][0]=C%23

@neon-sunset neon-sunset changed the title [API Proposal]: SplitFirst/SplitLast convenience methods for strings and Span/Memory<char> slices [API Proposal]: SplitFirst/SplitLast methods for strings and (ReadOnly)Span/Memory<T> slices Jul 18, 2023
@neon-sunset
Contributor Author

@dotnet/area-system-memory is there any work or discussion required for this proposal to become ready for review?

@stephentoub
Member

stephentoub commented Apr 27, 2024

@dotnet/area-system-memory is there any work or discussion required for this proposal to become ready for review?

I continue to not see it as being worthwhile. A lot of surface area for something that's already supported in multiple ways.

@neon-sunset
Contributor Author

neon-sunset commented Apr 27, 2024

Working with strings in advanced implementations in C# is very nice through spans, and such a scenario is well-covered there, yes.

However, the average user will not be reaching for those, and may even be (incorrectly) advised to avoid them. I believe there is a surprisingly big gap, in how the user arrives at each, between an analyzer suggesting to replace text.Split(',', 2)[0] with text.SplitFirst(',').Segment and reaching for the three-liner with a span.

Not doing so seems to go against the overall trend of C# to offer more APIs that streamline common operations in a performant way.

I understand that it is trivial and not worth it for the caliber of developer that works on optimizing .NET, but that's not the case for the average codebase more preoccupied with moving a JSON payload from an incoming request to an entity in a DB, so the cost accrues.

Please close/reject the proposal if you believe that such an API does not make sense in .NET.

@MaxMahem

FWIW I independently came up with a similar implementation, which I found very useful for working with Spans. So I at least found this API valuable.

@jilles-sg

I think this can be valuable if the API is improved to make it easier to check whether the separator was found. For example (using ReadOnlySpan<char> since it's most restrictive):

  • Returning a boolean and passing the segments via out parameters.
  • Returning a ref struct that pretends to be an array of one or two ReadOnlySpan<char> with a Length property and an indexer.

The latter can be consumed by indexing [0], [1] or [^1] or using a list pattern.
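
A rough sketch of the second option, under stated assumptions: the type name SplitResult, the host class ArrayLikeSplit, and the method names SplitOnce and Describe are all hypothetical. It relies on the C# rule that a type with a Length property and an int indexer can be consumed with [^1] and list patterns.

```csharp
using System;

// Hypothetical ref struct that behaves like an array of one or two spans.
public readonly ref struct SplitResult
{
    private readonly ReadOnlySpan<char> _first;
    private readonly ReadOnlySpan<char> _second;

    // Separator not found: a single element holding the whole input.
    public SplitResult(ReadOnlySpan<char> only)
    {
        _first = only;
        _second = default;
        Length = 1;
    }

    // Separator found: two elements, before and after the separator.
    public SplitResult(ReadOnlySpan<char> first, ReadOnlySpan<char> second)
    {
        _first = first;
        _second = second;
        Length = 2;
    }

    public int Length { get; }

    public ReadOnlySpan<char> this[int index] => index switch
    {
        0 => _first,
        1 when Length == 2 => _second,
        _ => throw new IndexOutOfRangeException(),
    };
}

public static class ArrayLikeSplit
{
    public static SplitResult SplitOnce(ReadOnlySpan<char> source, char separator)
    {
        int index = source.IndexOf(separator);
        return index < 0
            ? new SplitResult(source)
            : new SplitResult(source[..index], source[(index + 1)..]);
    }

    // A list pattern distinguishes "found" (two elements) from "not found" (one element).
    public static string Describe(ReadOnlySpan<char> source, char separator) =>
        SplitOnce(source, separator) is [var key, var value]
            ? $"{key.ToString()} -> {value.ToString()}"
            : "separator not found";
}
```

The design choice here is that Length doubles as the "was the separator found" flag, so no separate boolean or exception is needed.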

Although the number of lines saved per use of the API is low, it is common in some kinds of code.

It might be enough to have methods for ReadOnlySpan<T> where T: IEquatable<T> only (so ReadOnlySpan<char> and ReadOnlySpan<byte>) since the other cases would be rarer, and often when splitting a string it's not needed to materialize both segments as a string.

@MaxMahem

MaxMahem commented Jan 28, 2025

As for the "how to handle delimiter not found" case I think returning the entire string in the first span and an empty segment in the second is the most logical.

  1. This matches the behavior of String.Split, which returns the entire string as the only element of an array if the delimiter is not found.
  2. It feels like a logical response to the instruction given by the method. "Split this span into portions before and after this delimiter." Since the marker is not found, all elements lie before it.
  3. If the result is some sort of deconstructable container, an empty span can be handled via pattern matching, à la:
var (left, right) = span.SplitFirst(delimiter) is (_, not []) split ? split : throw new Exception("Delimiter Not Found");
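
For reference, the String.Split behavior mentioned in point 1 can be checked with a few lines. This only uses the existing string.Split(char) API; the class and method names are placeholders.

```csharp
using System;

public static class SplitNotFoundDemo
{
    public static void Demo()
    {
        string[] found = "key=value".Split('=');
        string[] missing = "keyvalue".Split('=');

        Console.WriteLine(found.Length);   // 2
        Console.WriteLine(missing.Length); // 1
        Console.WriteLine(missing[0]);     // keyvalue

        // missing[1] does not exist, so the common text.Split('=')[1]
        // pattern throws IndexOutOfRangeException for this input.
    }
}
```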

@neon-sunset
Contributor Author

neon-sunset commented Jan 28, 2025

@jilles-sg funnily enough, the indexer segment access (with deconstruction alternative) is what I settled on in the latest iteration of internal helpers to deal with this too. Works nicely whenever the index is a constant. I don't distinguish whether the separator was found for SplitOnce however. If someone needs a stricter implementation, they can do it with IndexOf. The common case usually either cares about slicing off a segment or splitting key-value pairs into two where input is either known to be valid or the downstream logic would fail regardless when e.g. integer parsing. The choices can be yak shaved here but ultimately it is most important for the extension to be completely optimal and easy to use in the common scenario.

Alternatively, we can also mirror parsing: a throwing SplitOnce and a non-throwing TrySplitOnce variant. The Rust implementation does this by returning Option<(&str, &str)>, which is closer to the latter.

What matters most is getting anything that lets us finally do away with every new codebase having tons of text.Split('=')[0] and text.Split('=')[1]s that need to be fixed :)

6 participants