This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

StartsWith Linux perf improvements #26621

Merged
merged 10 commits into from
Sep 30, 2019


adamsitnik
Member

@adamsitnik adamsitnik commented Sep 10, 2019

This PR improves the performance of Culture-aware string.StartsWith on Linux.

Unfortunately, ICU does not expose a method that allows for an optimal StartsWith through the Collator API (there is a StartsWith for comparing UnicodeStrings, but it offers no way to specify a culture).

I read some docs, articles and studied the code of ICU itself and came up with this proposal.

The longer the source text and the unluckier the case (a miss), the bigger the gains.

It should help a lot with https://github.com/dotnet/corefx/issues/40674
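
As a rough illustration of the idea described above (not the PR's actual code), a culture-aware prefix check can walk the collation elements of both strings in lockstep and skip ignorable elements. The sketch assumes the elements are already available as plain integer arrays. `ToyIterator`, `CE_IGNORABLE`, `CE_END`, `next_significant` and `toy_starts_with` are hypothetical names, and no ICU function is called.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the sentinels mentioned in this thread. */
#define CE_IGNORABLE 0      /* element that carries no weight at all */
#define CE_END       (-1)   /* iterator has run out of elements */

/* Toy iterator over a precomputed array of collation elements. */
typedef struct {
    const int32_t *elements;
    size_t length;
    size_t position;
} ToyIterator;

/* Returns the next non-ignorable element, or CE_END when exhausted. */
static int32_t next_significant(ToyIterator *it)
{
    assert(it != NULL);
    while (it->position < it->length) {
        int32_t element = it->elements[it->position++];
        if (element != CE_IGNORABLE) {
            return element;
        }
    }
    return CE_END;
}

/* True when every significant pattern element matches the text in order. */
static bool toy_starts_with(ToyIterator *text, ToyIterator *pattern)
{
    for (;;) {
        int32_t p = next_significant(pattern);
        if (p == CE_END) {
            return true;    /* pattern consumed: it is a prefix */
        }
        if (next_significant(text) != p) {
            return false;   /* mismatch, or the text ended first */
        }
    }
}
```

In this toy model a miss is detected at the first differing element, so its cost does not grow with the length of the text. It deliberately leaves out harder cases, such as combining marks that follow the matched prefix.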

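using System.Collections.Generic;
using System.Globalization;
using System.Threading;
using BenchmarkDotNet.Attributes;

public class Perf_StartsWith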
{
    public IEnumerable<object[]> StartsWithArguments()
    {
        yield return new object[] { "a", "a" };
        yield return new object[] { "aaaaaaaaaaa", "a" };
        yield return new object[] { "a", "aaaaaaaaaaa" };
        yield return new object[] { new string('a', 512), "a" };
        yield return new object[] { new string('a', 512), "aaaaaaaaaaa" };
        yield return new object[] { new string('a', 512), new string('a', 512) };
        yield return new object[] { new string('a', 2048), "a" };
        yield return new object[] { new string('a', 2048), "aaaaaaaaaaa" };
        yield return new object[] { new string('a', 2048), new string('a', 2048) };
    }

    [GlobalSetup]
    public void Setup() => Thread.CurrentThread.CurrentCulture = CultureInfo.GetCultureInfo("fr-FR"); // isAsciiEqualityOrdinal=false

    [Benchmark]
    [ArgumentsSource(nameof(StartsWithArguments))]
    public bool StartsWith(string text, string prefix) => text.StartsWith(prefix);
}

| Method | Toolchain | text | prefix | Mean | Ratio |
|---|---|---|---|---:|---:|
| StartsWith | /after/corerun | a | a | 2.930 ns | 1.04 |
| StartsWith | /before/corerun | a | a | 2.807 ns | 1.00 |
| StartsWith | /after/corerun | a | aaaaaaaaaaa | 325.073 ns | 0.05 |
| StartsWith | /before/corerun | a | aaaaaaaaaaa | 6,422.332 ns | 1.00 |
| StartsWith | /after/corerun | aaaaaaaaaaa | a | 362.019 ns | 0.06 |
| StartsWith | /before/corerun | aaaaaaaaaaa | a | 5,589.542 ns | 1.00 |
| StartsWith | /after/corerun | aaaaa(...)aaaaa [512] | a | 510.155 ns | 0.08 |
| StartsWith | /before/corerun | aaaaa(...)aaaaa [512] | a | 6,032.578 ns | 1.00 |
| StartsWith | /after/corerun | aaaaa(...)aaaaa [512] | aaaaaaaaaaa | 664.491 ns | 0.09 |
| StartsWith | /before/corerun | aaaaa(...)aaaaa [512] | aaaaaaaaaaa | 7,143.208 ns | 1.00 |
| StartsWith | /after/corerun | aaaaa(...)aaaaa [512] | aaaaa(...)aaaaa [512] | 2.934 ns | 1.03 |
| StartsWith | /before/corerun | aaaaa(...)aaaaa [512] | aaaaa(...)aaaaa [512] | 2.844 ns | 1.00 |
| StartsWith | /after/corerun | aaaaa(...)aaaaa [2048] | a | 551.200 ns | 0.08 |
| StartsWith | /before/corerun | aaaaa(...)aaaaa [2048] | a | 6,754.139 ns | 1.00 |
| StartsWith | /after/corerun | aaaaa(...)aaaaa [2048] | aaaaaaaaaaa | 760.843 ns | 0.11 |
| StartsWith | /before/corerun | aaaaa(...)aaaaa [2048] | aaaaaaaaaaa | 7,071.727 ns | 1.00 |
| StartsWith | /after/corerun | aaaaa(...)aaaaa [2048] | aaaaa(...)aaaaa [2048] | 2.787 ns | 0.96 |
| StartsWith | /before/corerun | aaaaa(...)aaaaa [2048] | aaaaa(...)aaaaa [2048] | 2.876 ns | 1.00 |

@GrabYourPitchforks
Member

GrabYourPitchforks commented Sep 10, 2019

I don't think the ordinal optimization is valid when performing a culture-aware comparison. Digraphs and denormalized forms of accented characters are two places where this could show up.

Examples:

// should print "False"
Console.WriteLine(CultureInfo.GetCultureInfo("hu-HU").CompareInfo.IsPrefix("dz", "d"));

// should print "False"
Console.WriteLine(CultureInfo.InvariantCulture.CompareInfo.IsPrefix("o\u0308", "o"));
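
To make the digraph concern concrete, here is a toy model in C. It is only a sketch under loud assumptions: the single rule "`dz` forms one collation element" is a made-up simplification of Hungarian tailoring, and `ordinal_starts_with`, `toy_elements`, `toy_culture_starts_with` and `TOY_DZ_ELEMENT` are hypothetical names. Real collation data comes from ICU and is far richer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_DZ_ELEMENT 1000   /* hypothetical element for the digraph "dz" */
#define TOY_CAPACITY   64     /* longer inputs are truncated in this sketch */

/* Ordinal check: compares raw bytes only. */
static bool ordinal_starts_with(const char *text, const char *prefix)
{
    assert(text != NULL && prefix != NULL);
    return strncmp(text, prefix, strlen(prefix)) == 0;
}

/* Toy tokenizer: one element per byte, except that "dz" becomes one element. */
static size_t toy_elements(const char *s, int *out, size_t capacity)
{
    size_t count = 0;
    while (*s != '\0' && count < capacity) {
        if (s[0] == 'd' && s[1] == 'z') {
            out[count++] = TOY_DZ_ELEMENT;
            s += 2;
        } else {
            out[count++] = (unsigned char)*s;
            s += 1;
        }
    }
    return count;
}

/* Prefix check on toy collation elements instead of raw bytes. */
static bool toy_culture_starts_with(const char *text, const char *prefix)
{
    int text_elements[TOY_CAPACITY];
    int prefix_elements[TOY_CAPACITY];
    size_t text_count = toy_elements(text, text_elements, TOY_CAPACITY);
    size_t prefix_count = toy_elements(prefix, prefix_elements, TOY_CAPACITY);

    if (prefix_count > text_count) {
        return false;
    }
    for (size_t i = 0; i < prefix_count; i++) {
        if (text_elements[i] != prefix_elements[i]) {
            return false;
        }
    }
    return true;
}
```

With these definitions, `ordinal_starts_with("dz", "d")` is true while `toy_culture_starts_with("dz", "d")` is false, which mirrors the first example above.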

@adamsitnik
Member Author

@GrabYourPitchforks thanks for pointing this out! I've created dotnet/corefx#41016 to make sure we don't break it

@adamsitnik
Member Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s), but failed to run 1 pipeline(s).

@adamsitnik adamsitnik changed the title [Draft] StartsWith Linux perf improvements StartsWith Linux perf improvements Sep 13, 2019
@adamsitnik adamsitnik marked this pull request as ready for review September 13, 2019 14:43
@tarekgh
Member

tarekgh commented Sep 13, 2019

Thanks @adamsitnik for working on that.

Can we hold off a little on this change? It needs to be reviewed carefully, as I am not sure whether there are specific cases that need to be checked manually for accuracy. Ligatures are one example of that. Also, we are not distinguishing between the error cases and the end-of-string cases. I am not sure how this is handled here, which is why we need to look carefully at the change.

A side point: we are adding more complex code, which will make touching this code or related code more challenging. We still have the option to try to optimize ICU itself if we really see this as critical.

I am not really pushing back on the change; I am just trying to avoid running into issues that may not be clear to us now.

@tarekgh
Member

tarekgh commented Sep 13, 2019

CC @eerhardt

int32_t idx = USEARCH_DONE;
result = ucol_next(pIterator, pErrorCode);
}
while (result == UCOL_IGNORABLE); // we don't check errorCode because on error the result is set to UCOL_NULLORDER
Member


while (result == UCOL_IGNORABLE); // we don't check errorCode because on error the result is set to UCOL_NULLORDER

Not checking the error code can cause weird behavior if an error happens. This method will return UCOL_NULLORDER, which can make SimpleStartsWith_Iterators return true, and that is wrong. Maybe you could just OR all the error codes and then check them once at the end?

Member Author


Very good point, I am going to add a check for that to the calling method.
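
The exchange above can be sketched with a toy iterator whose end-of-input sentinel is also returned on failure. The status is only ever set, never cleared, so a single check after the loop covers both iterators. This is an assumption-laden illustration, not the PR's code: `ToyIterator`, `ToyStatus`, `toy_next`, `toy_starts_with_checked` and the `fail_at` field are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CE_END (-1)   /* hypothetical sentinel: returned at the end AND on error */

typedef enum { TOY_OK = 0, TOY_FAILURE = 1 } ToyStatus;

typedef struct {
    const int32_t *elements;
    size_t length;
    size_t position;
    size_t fail_at;   /* index at which this toy iterator reports an error */
} ToyIterator;

/* Returns the next element; on failure sets *status and returns CE_END. */
static int32_t toy_next(ToyIterator *it, ToyStatus *status)
{
    assert(it != NULL && status != NULL);
    if (it->position == it->fail_at) {
        *status = TOY_FAILURE;   /* sticky: never reset to TOY_OK */
        return CE_END;
    }
    if (it->position >= it->length) {
        return CE_END;
    }
    return it->elements[it->position++];
}

/* Returns 1 (prefix), 0 (not a prefix) or -1 (an iterator failed). */
static int toy_starts_with_checked(ToyIterator *text, ToyIterator *pattern)
{
    ToyStatus status = TOY_OK;
    int result;

    for (;;) {
        int32_t p = toy_next(pattern, &status);
        if (p == CE_END) {
            result = 1;
            break;
        }
        if (toy_next(text, &status) != p) {
            result = 0;
            break;
        }
    }
    /* One check at the end separates "pattern ended" from "iteration failed". */
    return status == TOY_OK ? result : -1;
}
```

Without the final status check, a failure in the pattern iterator would look exactly like the end of the pattern, and the function would report a match.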

}
}

int32_t SimpleStartsWith(const UCollator* pCollator, UErrorCode* pErrorCode, const UChar* pPattern, int32_t patternLength, const UChar* pText, int32_t textLength)
Member


SimpleStartsWith

Did you test this logic with surrogate characters (well-formed and malformed)? I am not expecting any problems, but I just want to ensure we are not missing any case.

Member Author


Did you test this logic with surrogate characters (well-formed and malformed)?

I was relying on the existing tests and the edge cases that I've found and added in dotnet/corefx#41016 and dotnet/corefx#41017

I am going to send one more PR with surrogate characters and malformed ones.

@tarekgh
Member

tarekgh commented Sep 16, 2019

I don't think the ordinal optimization is valid when performing a culture-aware comparison. Digraphs and denormalized forms of accented characters are two places where this could show up.

The optimization here is not ordinal. It is just enumerating over the strings as text items which I believe should be OK.

The only question is: do we think it is worth adding this complexity to the code for this perf gain?

@adamsitnik are you planning to remove IsFastSort too?

@adamsitnik
Member Author

The only question is: do we think it is worth adding this complexity to the code for this perf gain?

I agree that the complexity does not come for free, but it really improves the perf for all non-US cultures when we are searching for a prefix in a big string that does not start with the given prefix.

Example:

    public class Perf_StartsWith
    {
        public IEnumerable<object[]> StartsWithArguments()
        {
            yield return new object[] { new string('a', 2048), "i" };
        }

        [GlobalSetup]
        public void Setup() => Thread.CurrentThread.CurrentCulture = CultureInfo.GetCultureInfo("fr-FR"); // isAsciiEqualityOrdinal=false

        [Benchmark]
        [ArgumentsSource(nameof(StartsWithArguments))]
        public bool StartsWith(string text, string prefix) => text.StartsWith(prefix);
    }

Before: 90,562.510 ns
After: 474.823 ns

are you planning to remove IsFastSort too?

Yes. I also plan to apply similar optimizations to EndsWith once I am done with StartsWith.

@adamsitnik
Member Author

@tarekgh thanks for a great code review!

@tarekgh
Member

tarekgh commented Sep 17, 2019

CC @ahsonkhan as he mentioned he had a test case which we can use in the validation too.

@tarekgh
Member

tarekgh commented Sep 18, 2019

@adamsitnik
Member Author

@tarekgh I've added surrogate and malformed Unicode test cases in dotnet/corefx#41227 and all tests are passing.

Could you PTAL?

@tarekgh
Member

tarekgh commented Sep 20, 2019

Personally, I am not a big fan of asserts, I prefer to add a lot of unit tests. If we ever set the strength to UCOL_PRIMARY all the tests are going to fail so I don't think that we need it.

I am afraid that if anyone adds code in the future that breaks this logic, it will not be easy to catch. Test cases may not be enough, as we won't know which uncovered case can break.

@adamsitnik
Member Author

I am afraid that if anyone adds code in the future that breaks this logic, it will not be easy to catch. Test cases may not be enough, as we won't know which uncovered case can break.

I think that we have really good test coverage now. Also, the Asserts are executed only for Checked builds? I always build everything in Release when I work on perf so they are never executed for me.

If you really want me to add an assert I can add it for the method that creates the Collators to make sure that we never end up with one that has strength=UCOL_PRIMARY

@tarekgh
Member

tarekgh commented Sep 20, 2019

If you really want me to add an assert I can add it for the method that creates the Collators to make sure that we never end up with one that has strength=UCOL_PRIMARY

I'll leave it to you. What I am trying to say is that if we are going with the simple path, we shouldn't have UCOL_PRIMARY strength. This is more about asserting non-UCOL_PRIMARY strength on the simple path. It is not really about how we create the collators. Thanks @adamsitnik for going through all of this.

@eerhardt
Member

Just my 2 cents on asserts.

Sometimes the value of the assert is for new people coming into the code and reading it for the first time. It is a nice way to document "this should always hold true". If at a future time it isn't true, the new person knows that the original author never anticipated this situation.
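
A minimal sketch of an assert used as documentation, in the spirit of the comments above. The names `ToyStrength` and `simple_path_elements_equal` are hypothetical and only mirror the strength levels discussed in this thread; the snippet does not use ICU's own enum or reproduce the PR's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of collation strength levels; not ICU's own enum. */
typedef enum { TOY_PRIMARY, TOY_SECONDARY, TOY_TERTIARY } ToyStrength;

/* Compares two collation elements on the "simple path". */
static bool simple_path_elements_equal(ToyStrength strength,
                                       int32_t text_element,
                                       int32_t pattern_element)
{
    /* States the invariant for future readers: the simple path is never
     * expected to run with primary strength. The check is compiled out
     * when NDEBUG is defined, so it costs nothing in such builds. */
    assert(strength != TOY_PRIMARY);
    (void)strength;   /* avoids an unused-parameter warning under NDEBUG */

    return text_element == pattern_element;
}
```

Because standard C `assert` disappears when `NDEBUG` is defined, a check like this documents the assumption without affecting builds compiled that way.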

@ahsonkhan
Member

CC @ahsonkhan as he mentioned he had a test case which we can use in the validation too.

I'll dig up the scenario and share.

@adamsitnik
Member Author

adamsitnik commented Sep 23, 2019

@tarekgh I've added proper handling for things like "o\u0000\u0308".StartsWith("o") (great catch BTW!). Everything works as expected, including the test case sent offline by Ahson.

I've also added the asserts.

@tarekgh
Member

tarekgh commented Sep 23, 2019

@adamsitnik thanks for all the tests you are doing. This increases our confidence!

@adamsitnik
Member Author

All the tests are green, I am going to sync with master to make sure that they are still green after merging #26759

@adamsitnik
Member Author

@tarekgh all the tests are passing, do you think that the PR is ready to merge? I would like to apply similar optimization to EndsWith in a separate PR.

@tarekgh tarekgh merged commit 8e4050a into dotnet:master Sep 30, 2019
@tarekgh
Member

tarekgh commented Sep 30, 2019

Thanks @adamsitnik

@adamsitnik
Member Author

@tarekgh thanks for all the great reviews, test cases and patience with my Unicode learning process!
