Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[API Proposal]: IndexOfAnyValues<string> #85573

Closed
MihaZupan opened this issue Apr 30, 2023 · 13 comments · Fixed by #88394
Closed

[API Proposal]: IndexOfAnyValues<string> #85573

MihaZupan opened this issue Apr 30, 2023 · 13 comments · Fixed by #88394
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Memory
Milestone

Comments

@MihaZupan
Copy link
Member

MihaZupan commented Apr 30, 2023

Background and motivation

Earlier in .NET 8, we added an IndexOfAnyValues<T> type that represents an immutable set of values optimized for efficient {Last}IndexOfAny{Except} searching: #68328 (comment).
You obtain an instance of IndexOfAnyValues by passing the full set of values to Create, which picks the most efficient algorithm for that set of values and capabilities of the current platform.

We've already exposed Create methods for searching for any byte or char in a given set.
This proposal would allow you to create IndexOfAnyValues<string> instances for searching for multiple substrings in a given text, instead of just individual characters.
Like with bytes and chars, Create for IndexOfAnyValues<string> can analyze the values in advance and pick the most optimal algorithm (e.g., Aho-Corasick, Rabin-Karp, Teddy, ...).

As we're working with strings, we're proposing adding a StringComparison parameter to Create.
This method would only work for Ordinal/OrdinalIgnoreCase. It's unlikely that we can meaningfully accelerate culture-aware multi-substring searching, and the proposed APIs assume that matches are of equal lengths.

Semantically, IndexOfAny would behave the same as doing an IndexOf of every value, and taking the minimum index.

While a *Last*IndexOfAny variant is possible, we are currently only proposing the left-to-right IndexOfAny.

Regex is likely to be the main consumer of this API.

cc: @stephentoub
cc: @tarekgh, @GrabYourPitchforks as we're working with strings

API Proposal

namespace System.Buffers;

public static class IndexOfAnyValues
{
    // Existing
    public static IndexOfAnyValues<byte> Create(ReadOnlySpan<byte> values);
    public static IndexOfAnyValues<char> Create(ReadOnlySpan<char> values);

    // Proposed
    public static IndexOfAnyValues<string> Create(ReadOnlySpan<string> values, StringComparison comparisonType);
}
namespace System;

public static class MemoryExtensions
{
    // Existing
    public static int IndexOfAny<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values);
    public static int IndexOfAnyExcept<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values);
    public static int LastIndexOfAny<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values);
    public static int LastIndexOfAnyExcept<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values);

    // Proposed
    public static int IndexOfAny(this ReadOnlySpan<char> span, IndexOfAnyValues<string> values);
}

API Usage

private static readonly IndexOfAnyValues<string> s_names = IndexOfAnyValues.Create(
    new[] { "Sherlock", "Holmes", "Watson" }, StringComparison.Ordinal);

public static int CountNames(ReadOnlySpan<char> text)
{
    int count = 0;

    while (!text.IsEmpty)
    {
        int matchOffset = text.IndexOfAny(s_names);

        if ((uint)matchOffset >= (uint)text.Length) break;

        int matchLength = text[matchOffset] == 'S' ? 8 : 6;

        text = text.Slice(matchOffset + matchLength);
        count++;
    }

    return count;
}

Alternative Designs

Emphasize that only Ordinal{IgnoreCase} works:

public static IndexOfAnyValues<string> CreateOrdinal(ReadOnlySpan<string> values, StringComparison comparisonType);
// or
public static IndexOfAnyValues<string> CreateOrdinal(ReadOnlySpan<string> values, bool ignoreCase);

An API that returns both a match offset and length (instead of just the offset).
So far we've rejected this variant as it adds a policy to IndexOfAny semantics -- are we returning the leftmost-first vs leftmost-longest match.

public static (int MatchOffset, int MatchLength) IndexOfAny(this ReadOnlySpan<char> span, IndexOfAnyValues<string> values);

Related issues: #69682, #62447

@MihaZupan MihaZupan added api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Memory labels Apr 30, 2023
@MihaZupan MihaZupan added this to the 8.0.0 milestone Apr 30, 2023
@ghost
Copy link

ghost commented Apr 30, 2023

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

Earlier in .NET 8, we added an IndexOfAnyValues<T> type that represents an immutable set of values optimized for efficient {Last}IndexOfAny{Except} searching: #68328 (comment).
You obtain an instance of IndexOfAnyValues by passing the full set of values to Create, which picks the most efficient algorithm for that set of values and capabilities of the current platform.

We've already exposed Create methods for searching for any byte or char in a given set.
This proposal would allow you to create IndexOfAnyValues<string> instances for searching for multiple substrings in a given text, instead of just individual characters.
Like with bytes and chars, Create for IndexOfAnyValues<string> can analyze the values in advance and pick the most optimal algorithm (e.g., Aho-Corasick, Rabin-Karp, Teddy, ...).

As we're working with strings, we're proposing adding a StringComparison parameter to Create.
This method would only work for Ordinal/OrdinalIgnoreCase. It's unlikely that we can meaningfully accelerate culture-aware multi-substring searching, and the proposed APIs assume that matches are of equal lengths.

While a *Last*IndexOfAny variant is possible, we are currently only proposing the left-to-right IndexOfAny.

Regex is likely to be the main consumer of this API.

cc: @stephentoub
cc: @tarekgh, @GrabYourPitchforks as we're working with strings

API Proposal

namespace System.Buffers;

public static class IndexOfAnyValues
{
    // Existing
    public static IndexOfAnyValues<byte> Create(ReadOnlySpan<byte> values);
    public static IndexOfAnyValues<char> Create(ReadOnlySpan<char> values);

    // Proposed
    public static IndexOfAnyValues<string> Create(ReadOnlySpan<string> values, StringComparison comparisonType);
}
namespace System;

public static class MemoryExtensions
{
    // Existing
    public static int IndexOfAny<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values);
    public static int IndexOfAnyExcept<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values);
    public static int LastIndexOfAny<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values);
    public static int LastIndexOfAnyExcept<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values);

    // Proposed
    public static int IndexOfAny(this ReadOnlySpan<char> span, IndexOfAnyValues<string> values);
}

API Usage

private static readonly IndexOfAnyValues<string> s_names = IndexOfAnyValues.Create(
    new[] { "Sherlock", "Holmes", "Watson" }, StringComparison.Ordinal);

public static int CountNames(ReadOnlySpan<char> text)
{
    int count = 0;

    while (!text.IsEmpty)
    {
        int matchOffset = text.IndexOfAny(s_names);

        if ((uint)matchOffset >= (uint)text.Length) break;

        int matchLength = text[matchOffset] == 'S' ? 8 : 6;

        text = text.Slice(matchOffset + matchLength);
        count++;
    }

    return count;
}

Alternative Designs

Emphasize that only Ordinal{IgnoreCase} works:

public static IndexOfAnyValues<string> CreateOrdinal(ReadOnlySpan<string> values, StringComparison comparisonType);
// or
public static IndexOfAnyValues<string> CreateOrdinal(ReadOnlySpan<string> values, bool ignoreCase);

Risks

No response

Author: MihaZupan
Assignees: -
Labels:

api-suggestion, area-System.Memory

Milestone: 8.0.0

@stephentoub stephentoub added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Apr 30, 2023
@DaZombieKiller
Copy link
Contributor

Would it also make sense to extend this to arrays in the future? For example, if you wanted to search for UTF-8 substrings:

private static readonly IndexOfAnyValues<byte[]> s_names = IndexOfAnyValues.Create(
    new[] { "Sherlock"u8.ToArray(), "Holmes"u8.ToArray(), "Watson"u8.ToArray() });

public static int CountNames(ReadOnlySpan<byte> text)
{
    int count = 0;

    while (!text.IsEmpty)
    {
        int matchOffset = text.IndexOfAny(s_names);

        if ((uint)matchOffset >= (uint)text.Length) break;

        int matchLength = text[matchOffset] == 'S' ? 8 : 6;

        text = text.Slice(matchOffset + matchLength);
        count++;
    }

    return count;
}

Maybe byte[] isn't the best fit, but I think it's valuable to have UTF-8 support in some form here.

@stephentoub
Copy link
Member

Would it also make sense to extend this to arrays in the future?

It's possible. I think we should gain some experience with the string versions first, though. Strings are also the variant we'll immediately use elsewhere in the core libraries (Miha and I have an end-to-end implementation with regex).

@panost
Copy link

panost commented May 2, 2023

Your proposal is another example against the closed nature of the current implementation. I did mention some in my proposal of Scanner<T>, for example searching using a table lookup (tables similar to the ones in CharUnicodeInfo for example) or using regular expressions

Also it is not clear to me what is the expected behavior of the consumers of those classes. How are they going to be used?

Are they invoking the IndexOf method once, to determinate the existence of a some characters (whether a string contains invalid path characters for example) or can be used in general searching?
For example, using it for splitting a string, where the IndexOf and perhaps the IndexOfExcept will be called repeatably. If it is the latter then perhaps a Match length is required? To advance the search index for the next match or it is always assumed to be 1?

Finally a few thoughts on your proposal

  1. I do not know how are you planning to implement the Except variant when searching for strings.
  2. What T[] GetValues() returns (In my opinion a useless method used only for DebugView
  3. How you advance the search index after a successful search? I mean searching in a string
    "commented" with the words "comment" and "mented" will you find one or two matches?
  4. I hope common variant "Match whole words" is supported as an option
  5. Since you are using a multi-character matcher note that those require special handling for buffered searches. I mean when you search using a buffer and reading repeatably from a stream for example. You may need a property "WindowSize" or something like this.

@MihaZupan
Copy link
Member Author

I did mention some in my #59649, for example searching using a table lookup

Are you referring to IndexOfAnyValues<char>? These are related, but the usages are different from IndexOfAnyValues<string>. If you want to search for any single character in a set of characters, you wouldn't want to use IndexOfAnyValues<string>.

Are they invoking the IndexOf method once, to determinate the existence of a some characters (whether a string contains invalid path characters for example) or can be used in general searching?

If you want to check things like "does a text contain a forbidden character", you would use IndexOfAnyValues<char>.

For checks like "does a text contain any of these forbidden words", you could use IndexOfAnyValues<string> and only call IndexOfAny once.

For "general searching" (like Regex), they can use the IndexOfAnyValues<string> instance to quickly find the next possible start location of a match, and then do any custom verification after the fact before calling IndexOfAny again (including things like deciding how many characters to advance in the input).

  1. I do not know how are you planning to implement the Except variant when searching for strings.

I don't. Note that IndexOfAny is the only proposed Span<char>, IndexOfAnyValues<string> extension.

  1. What T[] GetValues() returns

The set of values you specified in Create, likely with removed duplicates.

(In my opinion a useless method used only for DebugView

It's an internal method -- an implementation detail of the debug view.

  1. I hope common variant "Match whole words" is supported as an option

Whole value matching is the only thing it supports. It doesn't support other prefix-tree-like operations.

  1. How you advance the search index after a successful search? I mean searching in a string
    "commented" with the words "comment" and "mented" will you find one or two matches?

If you ask for "commented".IndexOfAny( { "comment", "mented" } ), the answer you would get back is 0.
It is up to the caller to decide how to advance the input, or to decide which exact value matched if there are ambiguities (e.g., leftmost-first vs leftmost-longest match).

@panost
Copy link

panost commented May 2, 2023

If you ask for "commented".IndexOfAny( { "comment", "mented" } ), the answer you would get back is 0

I mean how do you proceed after that. The first call returns 0. How do you call it to find the next. You only have an index with a value of 0. You don't know that a word with a length of 7 was matched.

It is up to the caller to decide how to advance the input,

Then it is not an API. Given that I do not know why the IndexOfAnyValues family is a public one and not an internal in the first place.
That or is something fundamental in those proposals that I do not grasp. Is not a generic "find-something" in a string or byte array API. The chosen implementations are limiting it's usage. It is a search in a string based on some values. If you want a generic "find-something" you cannot rely on it. You have to implement another layer of abstraction that sometimes is implemented with an array of values, sometimes is not. Or write multiple implementations. One using IndexOfAnyValue, another using Regex and another using lookup tables

@bartonjs
Copy link
Member

bartonjs commented May 2, 2023

Video

We had a long discussion about the implications of just returning the int match and felt it was OK (we could extend it later to out a match length or the matching term).

Looks good as proposed.

namespace System.Buffers;

public static partial class IndexOfAnyValues
{
    public static IndexOfAnyValues<string> Create(ReadOnlySpan<string> values, StringComparison comparisonType);
}

namespace System;

public static partial class MemoryExtensions
{
    public static int IndexOfAny(this ReadOnlySpan<char> span, IndexOfAnyValues<string> values);

    // From https://github.com/dotnet/runtime/issues/86528
    public static bool ContainsAny(this ReadOnlySpan<char> span, SearchValues<string> values);
}

@bartonjs bartonjs added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels May 2, 2023
@danmoseley
Copy link
Member

@MihaZupan related, is there/was there a proposal for string.ContainsAny(char[]) ?

eg., for use places like this which currently have to use IndexOf. https://github.com/dotnet/msbuild/blob/4ffba3fe0dd35a30cc892bc8c202a006acb8f20a/src/Build/Evaluation/Expander.cs#L401

IIRC, we have some optimizations for Contains that can make it faster than IndexOf.

@stephentoub
Copy link
Member

is there/was there a proposal for string.ContainsAny(char[]) ?

If we were going to add something there, it would be more like:

public static bool ContainsAny(this ReadOnlySpan<char> span, SearchValues<char> values);

Leaving the possibility of adding such methods in the future is why we renamed IndexOfAnyValues to SearchValues.

That is unrelated to this issue, though.

we have some optimizations for Contains that can make it faster than IndexOf.

It's relatively minor, though. Basically once we find that there is a match, for Contains we can just return true whereas for IndexOf we need to determine which element in the vector matched in order to return the exact right index.

@MihaZupan
Copy link
Member Author

ContainsAny is something we've thought about as the pattern is quite common, I just never got around to opening an API proposal issue.
I can measure it just to see, but I expect we could save a few instructions for the current SearchValues<char>/SearchValues<byte>.

For something like SearchValues<string> with multiple strings, there would likely be no difference at all.

@MihaZupan
Copy link
Member Author

@danmoseley I opened a separate issue to track that #86528

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jul 4, 2023
@tannergooding
Copy link
Member

@stephentoub, @MihaZupan; is this critical to land for .NET 8 or is it fine if it slips to .NET 9?

@stephentoub
Copy link
Member

stephentoub commented Jul 25, 2023

is this critical to land for .NET 8 or is it fine if it slips to .NET 9?

It's not critical, but it does complete a key feature we targeted for .NET 8, and it's inches from merging (there's also a follow-up PR already reviewed that'll also go in once that merges).

@MihaZupan MihaZupan modified the milestones: 8.0.0, 9.0.0 Aug 2, 2023
@ghost ghost added in-pr There is an active PR which will close this issue when it is merged and removed in-pr There is an active PR which will close this issue when it is merged labels Aug 20, 2023
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Aug 20, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Sep 19, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Memory
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants