-
Notifications
You must be signed in to change notification settings - Fork 168
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix handling of 4-byte UTF8 values on Windows #5803
Changes from 8 commits
c47cbdf
284bdcd
479b026
5f9f51c
cee09ea
0058012
9f380e5
a7697c5
22c91fa
950ee90
e1195ba
fc85d52
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -274,79 +274,6 @@ bool utf8_compare(StringData string1, StringData string2) | |
return false; | ||
} | ||
|
||
// Here is a version for Windows that may be closer to what is ultimately needed. | ||
/* | ||
bool case_map(const char* begin, const char* end, StringBuffer& dest, bool upper) | ||
{ | ||
const int wide_buffer_size = 32; | ||
wchar_t wide_buffer[wide_buffer_size]; | ||
|
||
dest.resize(end-begin); | ||
size_t dest_offset = 0; | ||
|
||
for (;;) { | ||
int num_out; | ||
|
||
// Decode | ||
{ | ||
size_t num_in = end - begin; | ||
if (size_t(32) <= num_in) { | ||
num_out = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, begin, 32, wide_buffer, wide_buffer_size); | ||
if (num_out != 0) { | ||
begin += 32; | ||
goto convert; | ||
} | ||
if (GetLastError() != ERROR_INSUFFICIENT_BUFFER) return false; | ||
} | ||
if (num_in == 0) break; | ||
int n = num_in < size_t(8) ? int(num_in) : 8; | ||
num_out = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, begin, n, wide_buffer, wide_buffer_size); | ||
if (num_out != 0) { | ||
begin += n; | ||
goto convert; | ||
} | ||
return false; | ||
} | ||
|
||
convert: | ||
if (upper) { | ||
for (int i=0; i<num_out; ++i) { | ||
CharUpperW(wide_buffer + i); | ||
} | ||
} | ||
else { | ||
for (int i=0; i<num_out; ++i) { | ||
CharLowerW(wide_buffer + i); | ||
} | ||
} | ||
|
||
encode: | ||
{ | ||
size_t free = dest.size() - dest_offset; | ||
if (int_less_than(std::numeric_limits<int>::max(), free)) free = std::numeric_limits<int>::max(); | ||
int n = WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS, wide_buffer, num_out, | ||
dest.data() + dest_offset, int(free), 0, 0); | ||
if (i != 0) { | ||
dest_offset += n; | ||
continue; | ||
} | ||
if (GetLastError() != ERROR_INSUFFICIENT_BUFFER) return false; | ||
size_t dest_size = dest.size(); | ||
if (int_multiply_with_overflow_detect(dest_size, 2)) { | ||
if (dest_size == std::numeric_limits<size_t>::max()) return false; | ||
dest_size = std::numeric_limits<size_t>::max(); | ||
} | ||
dest.resize(dest_size); | ||
goto encode; | ||
} | ||
} | ||
|
||
dest.resize(dest_offset); | ||
return true; | ||
} | ||
*/ | ||
|
||
|
||
// Converts UTF-8 source into upper or lower case. This function | ||
// preserves the byte length of each UTF-8 character in following way: | ||
// If an output character differs in size, it is simply substituded by | ||
|
@@ -358,28 +285,41 @@ util::Optional<std::string> case_map(StringData source, bool upper) | |
result.resize(source.size()); | ||
|
||
#if defined(_WIN32) | ||
constexpr size_t tmp_buffer_size = 32; | ||
const char* begin = source.data(); | ||
const char* end = begin + source.size(); | ||
auto output = result.begin(); | ||
while (begin != end) { | ||
int n = static_cast<int>(sequence_length(*begin)); | ||
if (n == 0 || end - begin < n) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We still need something like this check don't we? What if there are random bytes at the end of the string which appear to be a multibyte character but are just garbage such that There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Changed the logic a bit so that trailing garbage should be handled. I did not have a chance to test it on a Windows machine today. I will do that tomorrow. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks, I'd be more convinced that this works if there was a test for trailing garbage as well 😄 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Perhaps we should not specify how trailing garbage would be treated and fall back to "garbage in -> garbage out". This is anyway how the non-windows version works. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sure, let's match the non-windows version. Can you add a test with trailing garbage to check the consistency though? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think I can make a test that has the same outcome on both Windows and Linux. On Linux we will just copy over anything that is not an ASCII letter, but on Windows it will fail if there is something that cannot be translated. That was what I meant with my comment: We should not specify how it behaves as is behaves differently. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ok, I will concede platform dependent behaviour here. However, I'd still request a (platform dependent) test of garbage characters to verify that we do not crash. |
||
return util::none; | ||
size_t n = end - begin; | ||
if (n > tmp_buffer_size) { | ||
// Break the input string into chunks - but don't break in the middle of a multibyte character | ||
const char* p = begin; | ||
const char* buffer_end = begin + tmp_buffer_size; | ||
while (p < buffer_end) { | ||
size_t len = sequence_length(*p); | ||
p += len; | ||
if (p > buffer_end) { | ||
p -= len; | ||
break; | ||
} | ||
} | ||
n = p - begin; | ||
} | ||
|
||
wchar_t tmp[2]; // FIXME: Why no room for UTF-16 surrogate | ||
wchar_t tmp[tmp_buffer_size]; | ||
|
||
int n2 = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, begin, n, tmp, 1); | ||
int n2 = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, begin, n, tmp, tmp_buffer_size); | ||
if (n2 == 0) | ||
return util::none; | ||
|
||
REALM_ASSERT(n2 == 1); | ||
tmp[n2] = 0; | ||
if (n2 < tmp_buffer_size) | ||
tmp[n2] = 0; | ||
|
||
// Note: If tmp[0] == 0, it is because the string contains a | ||
// null-chacarcter, which is perfectly fine. | ||
|
||
wchar_t mapped_tmp[2]; | ||
LCMapStringEx(LOCALE_NAME_INVARIANT, upper ? LCMAP_UPPERCASE : LCMAP_LOWERCASE, tmp, 1, mapped_tmp, 2, | ||
wchar_t mapped_tmp[tmp_buffer_size]; | ||
LCMapStringEx(LOCALE_NAME_INVARIANT, upper ? LCMAP_UPPERCASE : LCMAP_LOWERCASE, tmp, n2, mapped_tmp, tmp_buffer_size, | ||
nullptr, nullptr, 0); | ||
|
||
// FIXME: The intention is to use flag 'WC_ERR_INVALID_CHARS' | ||
|
@@ -388,7 +328,8 @@ util::Optional<std::string> case_map(StringData source, bool upper) | |
// the flag is specified, the function fails with error | ||
// ERROR_INVALID_FLAGS. | ||
DWORD flags = 0; | ||
int n3 = WideCharToMultiByte(CP_UTF8, flags, mapped_tmp, 1, &*output, static_cast<int>(end - begin), 0, 0); | ||
auto m = static_cast<int>(end - begin); | ||
int n3 = WideCharToMultiByte(CP_UTF8, flags, mapped_tmp, n2, &*output, m, 0, 0); | ||
if (n3 == 0 && GetLastError() != ERROR_INSUFFICIENT_BUFFER) | ||
return util::none; | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should have put my previous comment on naming here. Since these exceptions are thrown from the query engine now and not necessarily from the query parser, can we change this namespace to something more general such as
query
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not change this in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok sure. In fact, I'd expect all these exceptions will changing soon with the upcoming error handling work.