Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add SearchValues<string> #88394

Merged
merged 43 commits into from
Aug 20, 2023
Merged

Add SearchValues<string> #88394

merged 43 commits into from
Aug 20, 2023

Conversation

MihaZupan
Copy link
Member

@MihaZupan MihaZupan commented Jul 4, 2023

Closes #85573
Closes #86528
Contributes to #62447

As discussed during API review, only Ordinal and OrdinalIgnoreCase semantics are available.
Supports OrdinalIgnoreCase fully (including running under Invariant/NLS mode).
Includes Avx512 variants for vectorized paths.

This is an initial version of SearchValues<string>, which includes the following specialized implementations:

  • Teddy-based approach for searching for length=2 or 3 prefixes. Includes a bucketized variant when dealing with more than 8 values.
  • Rabin-Karp-based approach that is only used as a fallback for Teddy for very short inputs.
  • Aho-Corasick-based approach that we use when dealing with many values (e.g. searching for a blocklist of words in a text), or as a fallback when we can't use Teddy for other reasons.
  • SingleStringSearchValuesThreeChars is a variant of http://0x80.pl/articles/simd-strfind.html#algorithm-1-generic-simd that uses 3 precomputed anchor points. On average provides lower per-call overheads than calling text.IndexOf("foo").
  • SingleStringSearchValuesFallback for when we don't have anything fancier available and falling back to Aho-Corasick would be noticeably slower.

These variants have additional generic specializations for different case sensitivity semantics of values: CaseSensitive, CaseInsensitiveAsciiLetters, CaseInsensitiveAscii, and CaseInsensitiveUnicode.

I'll share actual perf numbers when we're happier with the overall shape since there are a lot of interesting needle+haystack combinations to look at.

cc: @stephentoub @teo-tsirpanis @danmoseley

@MihaZupan MihaZupan added this to the 8.0.0 milestone Jul 4, 2023
@MihaZupan MihaZupan requested a review from stephentoub July 4, 2023 22:01
@MihaZupan MihaZupan self-assigned this Jul 4, 2023
@dotnet-issue-labeler
Copy link

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

@ghost
Copy link

ghost commented Jul 4, 2023

Tagging subscribers to this area: @dotnet/area-system-buffers
See info in area-owners.md if you want to be subscribed.

Issue Details

Closes #85573
Contributes to #86528
Contributes to #62447

As discussed during API review, only Ordinal and OrdinalIgnoreCase semantics are available.
Supports OrdinalIgnoreCase fully (including running under Invariant/NLS mode).
Includes Avx512 variants for vectorized paths.

This is an initial version of SearchValues<string>, which includes the following specialized implementations:

  • Teddy-based approach for searching for length=2 or 3 prefixes. Includes a bucketized variant when dealing with more than 8 values.
  • Rabin-Karp-based approach that is only used as a fallback for Teddy for very short inputs.
  • Aho-Corasick-based approach that we use when dealing with many values (e.g. searching for a blocklist of words in a text), or as a fallback when we can't use Teddy for other reasons.
  • SingleStringSearchValuesThreeChars is a variant of http://0x80.pl/articles/simd-strfind.html#algorithm-1-generic-simd that uses 3 precomputed anchor points. On average provides lower per-call overheads than calling text.IndexOf("foo").
  • SingleStringSearchValuesFallback for when we don't have anything fancier available and falling back to Aho-Corasick would be noticeably slower. Has some length-based specializations as well.

These variants have additional generic specializations for different case sensitivity semantics of values: CaseSensitive, CaseInensitiveAsciiLetters, CaseInensitiveAscii, and CaseInensitiveUnicode.

There are still some less-critical TODOs left, and the code could use more comments. Opening this early in case we want to make any more significant changes.

I'll share actual perf numbers when we're happier with the overall shape since there are a lot of interesting needle+haystack combinations to look at.

cc: @stephentoub @teo-tsirpanis @danmoseley

Author: MihaZupan
Assignees: MihaZupan
Labels:

area-System.Buffers

Milestone: 8.0.0

@danmoseley
Copy link
Member

danmoseley commented Jul 5, 2023

do we need anything added to the THIRD-PARTY-NOTICES.txt such as something for http://0x80.pl/articles/simd-strfind.html#algorithm-1-generic-simd or existing implementations like https://github.com/jneem/teddy if they were particularly helpful?

@MihaZupan
Copy link
Member Author

do we need anything added to the THIRD-PARTY-NOTICES.txt

If we don't have anything for http://0x80.pl/articles/simd-strfind.html#algorithm-1-generic-simd yet, then yes, we should add that as we're already using it in our string IndexOf implementations.

It would also be fair to add a mention of https://github.com/BurntSushi/aho-corasick as I've spent a lot of time digging through its code initially to understand how the Teddy algorithm works.

@danmoseley

This comment was marked as resolved.

@danmoseley
Copy link
Member

It would also be fair to add a mention of https://github.com/BurntSushi/aho-corasick as I've spent a lot of time digging through its code initially to understand how the Teddy algorithm works.

kudos to @BurntSushi here.

}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector512<byte> TransformInput(Vector512<byte> input)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tannergooding, this PR feels like a really good candidate for exploring ways of reducing duplication between Vector128/256/512.

@MihaZupan MihaZupan force-pushed the searchvalues-string branch from e8f9307 to 419a352 Compare July 23, 2023 06:04
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[API Proposal]: MemoryExtensions.ContainsAny{Except} [API Proposal]: IndexOfAnyValues<string>
6 participants