-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Conversation
I don't think the ordinal optimization is valid when performing a culture-aware comparison. Digraphs and denormalized forms of accented characters are two places where this could show up. Examples: // should print "False"
Console.WriteLine(CultureInfo.GetCultureInfo("hu-HU").CompareInfo.IsPrefix("dz", "d"));
// should print "False"
Console.WriteLine(CultureInfo.InvariantCulture.CompareInfo.IsPrefix("o\u0308", "o")); |
@GrabYourPitchforks thanks for pointing this out! I've created dotnet/corefx#41016 to make sure we don't break it |
/azp run |
Azure Pipelines successfully started running 3 pipeline(s), but failed to run 1 pipeline(s). |
Thanks @adamsitnik for working on that, Can we hold a little on this change. It need to be reviewed carefully as I am not sure if there is specific cases need to be checked manually for accuracy. ligatures is one example of that. Also, we are not distinguishing between the error cases and the end of the string cases. I am not sure how this handled here, that is why need to look carefully to the change. Side point, we are adding more complex code which will make touching this code or related code be more challenging. We still have the option try to optimize ICU itself if we are really seeing this is very critical. I am not really pushing back on the change, I am just trying to avoid going to issues that maybe not clear to us now. |
CC @eerhardt |
int32_t idx = USEARCH_DONE; | ||
result = ucol_next(pIterator, pErrorCode); | ||
} | ||
while (result == UCOL_IGNORABLE); // we don't check errorCode because on error the result is set to UCOL_NULLORDER |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while (result == UCOL_IGNORABLE); // we don't check errorCode because on error the result is set to UCOL_NULLORDER [](start = 4, length = 114)
not checking the error code can cause a weird behavior if it happen. this method will return UCOL_NULLORDER which can make SimpleStartsWith_Iterators return true and that is wrong. may be you just OR all error codes and then check it one time at the end?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very good point, I am going to add a check for that to the calling method
} | ||
} | ||
|
||
int32_t SimpleStartsWith(const UCollator* pCollator, UErrorCode* pErrorCode, const UChar* pPattern, int32_t patternLength, const UChar* pText, int32_t textLength) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SimpleStartsWith [](start = 8, length = 16)
Did you test this logic with the Surrogate characters (well and malformed characters)? I am not expecting any problem but just want to ensure we are not missing any case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you test this logic with the Surrogate characters (well and malformed characters)?
I was relying on the existing tests and the edge cases that I've found and added in dotnet/corefx#41016 and dotnet/corefx#41017
I am going to send one more PR with the Surrogate characters and malformed ones.
The optimization here is not ordinal. It is just enumerating over the strings as text items which I believe should be OK. The only thing is, do we think it is worth it adding this complexity to the code for this perf gain? @adamsitnik are you planning to remove IsFaseSort too? |
I agree that the complexity does not come for free, but it really improves the perf for all non-US cultures when we are searching for a prefix in a big string that does not start with the given prefix. Example: public class Perf_StartsWith
{
public IEnumerable<object[]> StartsWithArguments()
{
yield return new object[] { new string('a', 2048), "i" };
}
[GlobalSetup]
public void Setup() => Thread.CurrentThread.CurrentCulture = CultureInfo.GetCultureInfo("fr-FR"); // isAsciiEqualityOrdinal=false
[Benchmark]
[ArgumentsSource(nameof(StartsWithArguments))]
public bool StartsWith(string text, string prefix) => text.StartsWith(prefix);
} Before: 90,562.510 ns
Yes. I also do plan to apply similar optimizations to |
@tarekgh thanks for a great code review! |
CC @ahsonkhan as he mentioned he had a test case which we can use in the validation too. |
@tarekgh I've added surrogate and malformed Unicode test cases in dotnet/corefx#41227 and all tests are passing. Could you PTAL? |
I am afraid in the future if anyone added any code break this logic will not be easy to catch. test cases may not be enough as we'll not know what case we are not covering can break. |
I think that we have really good test coverage now. Also, the Asserts are executed only for Checked builds? I always build everything in Release when I work on perf so they are never executed for me. If you really want me to add an assert I can add it for the method that creates the Collators to make sure that we never end up with one that has strength=UCOL_PRIMARY |
I'll leave it to you. what I am trying to say, if we are going with the simple path we shouldn't having UCOL_PRIMARY strength. This is more asserting non UCOL_PRIMARY strength with simple path. It is not really about how we create the collators. Thanks @adamsitnik for going through all of this. |
Just my 2 cents on asserts. Sometimes the value of the assert is for new people coming into the code and reading it for the first time. It is a nice way to document "this should always hold true". If at a future time it isn't true, the new person knows that the original author never anticipated this situation. |
I'll dig up the scenario and share. |
@tarekgh I've added proper handling for things like I've also added the asserts. |
@adamsitnik thanks for all tests you are doing. this increase our confidence! |
All the tests are green, I am going to sync with master to make sure that they are still green after merging #26759 |
@tarekgh all the tests are passing, do you think that the PR is ready to merge? I would like to apply similar optimization to EndsWith in a separate PR. |
Thanks @adamsitnik |
@tarekgh thanks for all the great reviews, test cases and patience to my Unicode learning process! |
This PR improves the performance of Culture-aware
string.StartsWith
on Linux.Unfortunately, ICU does not expose a method that allows for optimal
StartsWith
using Collator API (there is aStartsWith
that allows for comparingUnicodeStrings
but without a possibility to specify culture).I read some docs, articles and studied the code of ICU itself and came up with this proposal.
The longer the source text and the more unlucky case (miss) the bigger the gains.
It should help a lot with https://github.com/dotnet/corefx/issues/40674