Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve utf8_to_utf16 speed for common path #892

Merged
merged 3 commits into from
Oct 26, 2018
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
23 changes: 19 additions & 4 deletions Release/src/utilities/asyncrt_utils.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -352,14 +352,22 @@ inline size_t count_utf8_to_utf16(const std::string& s)
const size_t sSize = s.size();
const char* const sData = s.data();
size_t result{ sSize };

for (size_t index = 0; index < sSize;)
{
const char c{ sData[index++] };
if ((c & BIT8) == 0)
if( sData[index] > 0 )
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This transformation doesn't work on platforms where char is unsigned -- e.g., most non-Windows platforms. Clearly this is missing some test coverage if that was missed...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I missed that. I fixed the code, still thinking about testability. This would probably require a separate test binary with different char defaults.

Copy link
Member

@BillyONeal BillyONeal Oct 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the tests might be avoiding this with the __GLIBCXX__ bits? (GCC defaults to char being unsigned)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__GLIBCXX__ seems only to be used as a means to determine whether the test code should compare byte-wise (using unsigned char) or via the results of conversion derived from codecvt_utf8_utf16. Where I'm not sure about the reason for this, especially since codecvt_utf8_utf16 is deprecated in C++17 (->https://en.cppreference.com/w/cpp/locale/codecvt_utf8_utf16).

With regard to the test, I was considering whether there is an elegant way to make use of the effects of -funsigned-char and -fsigned-char with GCC (-> http://gcc.gnu.org/onlinedocs/gcc-4.8.0/gcc/C-Dialect-Options.html) or /J Option in VS (-> https://msdn.microsoft.com/en-us/library/0d294k5z.aspx). Where elegant would mean that we at least avoid building the tests with their dependecies twice (once signed once unsigned).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmmm I stand corrected on non-Windows platforms. Maybe it was just an Android thing. https://twitter.com/ubsanitizer/status/1052343628201259008

A static_assert(std::is_signed<char>::value, "Untested for unsigned char") is probably sufficient then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, but this would prevent the change to work on unsigned char platforms without changes. Or is this just a marker to ensure unsigned char checks are done before next release?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thought: We could reinterpret_cast the pointer retrieved from s.data() to signed char (or unsigned whatever fits better to the project) and avoid the platform dependency. From what I see, this should also be fine with regard to strict aliasing rules since we're only working on the buffer via our pointers and are not using any std::string operations.
What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should be OK (because signed char/unsigned char are character types)

{
continue;
// use fast inner loop to skip single byte code points (which are
// expected to be the most frequent)
while ((++index < sSize) && (sData[index] > 0))
;

if (index >= sSize) break;
}

// start special handling for multi-byte code points
const char c{ sData[index++] };

if ((c & BIT7) == 0)
{
throw std::range_error("UTF-8 string character can never start with 10xxxxxx");
Expand Down Expand Up @@ -477,7 +485,14 @@ utf16string __cdecl conversions::utf8_to_utf16(const std::string &s)
}
break;
default: // single byte character, 0x0 to 0x7F
destData[destIndex++] = static_cast<utf16string::value_type>(src);
// try to use a fast inner loop for following single by characters,
// since they quite probable
do
{
destData[destIndex++] = static_cast<utf16string::value_type>(srcData[index++]);
} while (index < srcSize && srcData[index] > 0);
// adjust index since it will be incremented by the for loop
--index;
}
}
return dest;
Expand Down
25 changes: 25 additions & 0 deletions Release/tests/functional/utils/strings.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -205,6 +205,31 @@ TEST(utf8_to_utf16)
#else
VERIFY_ARE_EQUAL(conversion.from_bytes(input), result);
#endif


// 1 byte character followed by 4 byte character
input.clear();
input.push_back( 51u); // 00110011
// U+10000
input.push_back(244u); // 11110100
input.push_back(128u); // 10000000
input.push_back(128u); // 10000000
input.push_back(128u); // 10000000
// U+10FFFF
input.push_back(244u); // 11110100
input.push_back(143u); // 10001111
input.push_back(191u); // 10111111
input.push_back(191u); // 10111111
result = utility::conversions::utf8_to_utf16(input);
#if defined(__GLIBCXX__)
VERIFY_ARE_EQUAL(51, result[0]);
VERIFY_ARE_EQUAL(56256, result[1]);
VERIFY_ARE_EQUAL(56320, result[2]);
VERIFY_ARE_EQUAL(56319, result[3]);
VERIFY_ARE_EQUAL(57343, result[4]);
#else
VERIFY_ARE_EQUAL(conversion.from_bytes(input), result);
#endif
}

TEST(utf16_to_utf8_errors)
Expand Down