-
Notifications
You must be signed in to change notification settings - Fork 170
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Trying to intern a string multiple times may lead to duplicates #5196
Conversation
src/realm/sync/changeset_encoder.cpp
Outdated
@@ -229,7 +229,9 @@ InternString ChangesetEncoder::intern_string(StringData str) | |||
size_t index = m_intern_strings_rev.size(); | |||
// FIXME: Assert might be able to be removed after refactoring of changeset_parser types? | |||
REALM_ASSERT_RELEASE_EX(index <= std::numeric_limits<uint32_t>::max(), index); | |||
it = m_intern_strings_rev.insert({std::string{str}, uint32_t(index)}).first; | |||
bool inserted; | |||
std::tie(it, inserted) = m_intern_strings_rev.insert({str, uint32_t(index)}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is dangerous. When using std::string, we take a copy of the string as we cannot be sure about the lifetime of the string str
points to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbreams flagged the same to me. I will revert the change and use a std::string_view above in m_intern_strings_rev.find()
test/test_changeset_encoding.cpp
Outdated
// although it was just created before. | ||
encoder.intern_string("Program"); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure how this test demonstrates the problem. I would prefer that the test was failing before the fix is added.
CHANGELOG.md
Outdated
@@ -10,6 +10,7 @@ | |||
* Sending a QUERY message may fail with `Assertion failed: !m_unbind_message_sent` ([#5149](https://github.com/realm/realm-core/pull/5149), since v11.8.0) | |||
* Subscription names correctly distinguish an empty string from a nullptr ([#5160](https://github.com/realm/realm-core/pull/5160), since v11.8.0) | |||
* Converting floats/doubles into Decimal128 would yield imprecise results ([#5184](https://github.com/realm/realm-core/pull/5184), since v6.1.0) | |||
* Trying to intern a string multiple times may lead to duplicates ([#5196](https://github.com/realm/realm-core/pull/5196), since v10.2.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For release notes, we aim to express the changes in end-user terms. Hopefully the SDK's should just be able to copy/paste these notes into their release notes that goes out to users. How would an enduser experience this bug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point, thanks for noticing it. i'll rewrite it.
if (it == m_intern_strings_rev.end()) { | ||
size_t index = m_intern_strings_rev.size(); | ||
// FIXME: Assert might be able to be removed after refactoring of changeset_parser types? | ||
REALM_ASSERT_RELEASE_EX(index <= std::numeric_limits<uint32_t>::max(), index); | ||
it = m_intern_strings_rev.insert({std::string{str}, uint32_t(index)}).first; | ||
bool inserted; | ||
std::tie(it, inserted) = m_intern_strings_rev.insert({std::string{str}, uint32_t(index)}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In other places we just use auto [it, inserted]
for structured bindings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
structured bindings won't work if one of the bindings has already been defined as a variable. std::tie()
works around that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course.
@@ -166,7 +166,7 @@ class StringData { | |||
explicit operator bool() const noexcept; | |||
explicit operator std::string_view() const noexcept | |||
{ | |||
return std::string_view(m_data); | |||
return std::string_view(m_data, m_size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯
if (it == m_intern_strings_rev.end()) { | ||
size_t index = m_intern_strings_rev.size(); | ||
// FIXME: Assert might be able to be removed after refactoring of changeset_parser types? | ||
REALM_ASSERT_RELEASE_EX(index <= std::numeric_limits<uint32_t>::max(), index); | ||
it = m_intern_strings_rev.insert({std::string{str}, uint32_t(index)}).first; | ||
bool inserted; | ||
std::tie(it, inserted) = m_intern_strings_rev.insert({std::string{str}, uint32_t(index)}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
structured bindings won't work if one of the bindings has already been defined as a variable. std::tie()
works around that.
test/test_changeset_encoding.cpp
Outdated
encoder.intern_string("Program"); | ||
// Bug #5193 caused "Program" not to be found through the interned strings | ||
// although it was just created before. | ||
encoder.intern_string("Program"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the original failure here was that when you decoded the changeset produced by this it would fail with a BadChangeset error. Can we round-trip this and check that the changeset encodes and then successfully decodes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added validation on changeset decoding/parsing. Also improved the test.
093ec7a
to
36b58e2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few more comments. I had a few more ideas/suggestions, but I don't think they're blocking merging if the tests pass.
I think we should go further in ripping out StringData, but we don't need to do it here/now.
src/realm/sync/changeset_parser.cpp
Outdated
@@ -290,9 +293,18 @@ void ChangesetParser::State::parse_one() | |||
if (t == InstrTypeInternString) { | |||
uint32_t index = read_int<uint32_t>(); | |||
StringData str = read_string(); | |||
auto it1 = m_intern_strings.find(static_cast<std::string_view>(str)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is def a nit, but you can clean this iterator use up a little bit by doing:
if (auto it = m_intern_strings.find(static_cast<std::string_view>(str)); it != m_intern_strings.end()) {
parser_error(...);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My reason was to avoid long/multi-line if statements. I agree it looks cleaner though. I'll make the change.
@@ -30,6 +30,9 @@ struct ChangesetParser::State { | |||
|
|||
StringBuffer m_buffer; | |||
util::metered::set<uint32_t> m_valid_interned_strings; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want this PR to go on for forever, but at this point it's probably cheaper to just make change these two util::metered::set's into a std::vector<std::pair<std::string, uint32_t>>
and use std::find()
when parsing InternString
instructions. The number of InternString
's per changeset is going to be bounded by the schema of the realm and is likely to be on the order of like 1000 in a large changeset. If we keep the list sorted by the intern string index, we could even use std::lower_bound
to get the same run time as the std::set
when parsing the non-InternString
instructions but with less memory used.
Something like:
std::vector<std::pair<std::string, uint32_t>> m_intern_strings;
...
uint32_t index = read_int<uint32_t>();
std::string_view str = static_cast<std::string_view>(read_string());
auto it = std::find_if(m_intern_strings.begin(), m_intern_strings.end(), [&](const auto& val) {
return (val.first == static_cast<std::string_view>(str) || val.second == index);
});
if (it != m_intern_strings.end()) {
if (it->first == str) {
parser_error("Unexpected intern string");
}
...
}
if (REALM_UNLIKELY(!m_intern_strings.empty() && m_intern_strings.back().second > index)) {
it = std::upper_bound(m_intern_strings.begin(), m_intern_strings.end(), index, [](uint32_t index, const auto& val) {
return val.second < index;
});
}
m_intern_strings.emplace(it, {std::string{str}, index});
Anyways, I know this PR has been going on for forever, so we don't need to do this here/now - just something that came to mind while I was reading this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure how having the vector sorted by the intern index will help. std::find_if
and emplace
have linear complexity (emplace could be constant if the intern strings are already sorted by their index when read). I agree it would use less memory though. But memory should not be a problem given the low number of intern strings.
We don't really need to have them sorted in any way so I would normally use an unordered_set
but it does not have support for heterogenous lookups until C++20.
ClientServerFixture fixture(dir, test_context, std::move(config)); | ||
fixture.start(); | ||
|
||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to do this step where we wait for download complete? I think this should be a no-op.
CHECK_EQUAL(sync::ProtocolError::bad_changeset, ec); | ||
CHECK(is_fatal); | ||
fixture.stop(); | ||
did_fail = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd move did_fail = true
to above fixture.stop()
to make sure we don't race to read did_fail on line 5782
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems some tests were inconsistent so I modified them as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few more comments/suggestions, but I think this should be good to merge after addressing them.
I think we should go further in ripping out StringData, but we don't need to block this bug fix to do it.
What, How & Why?
Internal strings were stored as std::string and looked-up as StringData. As it seems, the less than operator of realm::StringData behaves differently than the one of std::string (especially when it comes to UTF-8 strings). Storing the internal strings in a sorted collection led to wrongfully not finding existing intern strings.
This change makes sure we use std::string's to both look-up and insert intern strings.
☑️ ToDos