Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix handling of 4-byte UTF8 values on Windows #5803

Merged
merged 12 commits into from
Sep 15, 2022

Conversation

kneth
Copy link
Contributor

@kneth kneth commented Aug 31, 2022

What, How & Why?

Fixes #5825

Some unused error handling is removed from ParentNode. Replaced by throwing.

☑️ ToDos

  • 📝 Changelog update
  • 🚦 Tests (or not relevant)
  • C-API, if public C++ API changed.

@cla-bot cla-bot bot added the cla: yes label Aug 31, 2022
@kneth kneth self-assigned this Aug 31, 2022
@jedelbo jedelbo force-pushed the kneth/bug/contains-with-emoiji branch from 38dabaa to 479b026 Compare September 6, 2022 10:00
@jedelbo jedelbo changed the title Try to reproduce https://github.com/realm/realm-js/issues/4844 Fix handling of 4-byte UTF8 values on Windows Sep 6, 2022
@jedelbo
Copy link
Contributor

jedelbo commented Sep 6, 2022

@astigsen perhaps you can have a look at the changes in unicode.cpp.

@jedelbo jedelbo requested review from ironage and astigsen September 6, 2022 11:59
@@ -1582,7 +1571,7 @@ class StringNode : public StringNodeBase {
auto upper = case_map(v, true);
auto lower = case_map(v, false);
if (!upper || !lower) {
error_code = "Malformed UTF-8: " + std::string(v);
throw query_parser::InvalidQueryError(util::format("Malformed UTF-8: %1", v));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are not "query parser" errors, they are general query errors. Can we use another name?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok - changed to std::runtime_error.

const char* begin = source.data();
const char* end = begin + source.size();
auto output = result.begin();
while (begin != end) {
int n = static_cast<int>(sequence_length(*begin));
if (n == 0 || end - begin < n)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still need something like this check don't we? What if there are random bytes at the end of the string which appear to be a multibyte character but are just garbage such that sequence_length() returns 4 but there is only one character left in the string?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the logic a bit so that trailing garbage should be handled. I did not have a chance to test it on a Windows machine today. I will do that tomorrow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'd be more convinced that this works if there was a test for trailing garbage as well 😄

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should not specify how trailing garbage would be treated and fall back to "garbage in -> garbage out". This is anyway how the non-windows version works.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, let's match the non-windows version. Can you add a test with trailing garbage to check the consistency though?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I can make a test that has the same outcome on both Windows and Linux. On Linux we will just copy over anything that is not an ASCII letter, but on Windows it will fail if there is something that cannot be translated. That was what I meant with my comment: We should not specify how it behaves as is behaves differently.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I will concede platform dependent behaviour here. However, I'd still request a (platform dependent) test of garbage characters to verify that we do not crash.

@@ -176,6 +176,25 @@ class NoSubscriptionForWrite : public std::runtime_error {
NoSubscriptionForWrite(const std::string& msg);
};

namespace query_parser {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should have put my previous comment on naming here. Since these exceptions are thrown from the query engine now and not necessarily from the query parser, can we change this namespace to something more general such as query?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not change this in this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok sure. In fact, I'd expect all these exceptions will changing soon with the upcoming error handling work.

@jedelbo jedelbo force-pushed the kneth/bug/contains-with-emoiji branch from 7f3d7c3 to 91f94e9 Compare September 14, 2022 09:15
Copy link
Contributor

@ironage ironage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, after addressing clang-format 👍

@jedelbo jedelbo force-pushed the kneth/bug/contains-with-emoiji branch from 91f94e9 to 22c91fa Compare September 15, 2022 09:53
@jedelbo jedelbo assigned jedelbo and unassigned kneth Sep 15, 2022
@jedelbo jedelbo merged commit 6b04398 into master Sep 15, 2022
@jedelbo jedelbo deleted the kneth/bug/contains-with-emoiji branch September 15, 2022 15:14
tgoyne added a commit that referenced this pull request Sep 16, 2022
…nification

* origin/master:
  Stop forcing enums to be 64 bits unnecessarily
  clean up documentation of internal fields in config structs
  SyncConfig should be default constructible
  Traversal functions use IteratorControl values rather than true/false which is more expressive (#5857)
  Fix handling of 4-byte UTF8 values on Windows (#5803)
  Encode links in a way the server can understand (#5835)
  expose `Group::remove_table` in the C API (#5860)
  Disable auto refresh for old realm during migration (#5856)
  Expose `list_find` in the c api (#5848)
  Do not allow asymmetric tables in pbs (#5853)
  Refactor link tracing code (#5796)
  Expose `Obj::get_parent_object` in the C API (#5851)
  Update app.hpp (#5854)
  Fix appending to list ignores existing query (#5850)
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 21, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Realm query using CONTAINS with [c] case insensitive crashes Realm when an emoji is in search string
3 participants