Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trying to intern a string multiple times may lead to duplicates #5196

Merged
merged 8 commits into from
Feb 4, 2022

Conversation

danieltabacaru
Copy link
Collaborator

@danieltabacaru danieltabacaru commented Jan 27, 2022

What, How & Why?

Internal strings were stored as std::string and looked-up as StringData. As it seems, the less than operator of realm::StringData behaves differently than the one of std::string (especially when it comes to UTF-8 strings). Storing the internal strings in a sorted collection led to wrongfully not finding existing intern strings.

This change makes sure we use std::string's to both look-up and insert intern strings.

☑️ ToDos

  • 📝 Changelog update
  • 🚦 Tests (or not relevant)

@cla-bot cla-bot bot added the cla: yes label Jan 27, 2022
@@ -229,7 +229,9 @@ InternString ChangesetEncoder::intern_string(StringData str)
size_t index = m_intern_strings_rev.size();
// FIXME: Assert might be able to be removed after refactoring of changeset_parser types?
REALM_ASSERT_RELEASE_EX(index <= std::numeric_limits<uint32_t>::max(), index);
it = m_intern_strings_rev.insert({std::string{str}, uint32_t(index)}).first;
bool inserted;
std::tie(it, inserted) = m_intern_strings_rev.insert({str, uint32_t(index)});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is dangerous. When using std::string, we take a copy of the string as we cannot be sure about the lifetime of the string str points to.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbreams flagged the same to me. I will revert the change and use a std::string_view above in m_intern_strings_rev.find()

// although it was just created before.
encoder.intern_string("Program");
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure how this test demonstrates the problem. I would prefer that the test was failing before the fix is added.

CHANGELOG.md Outdated
@@ -10,6 +10,7 @@
* Sending a QUERY message may fail with `Assertion failed: !m_unbind_message_sent` ([#5149](https://github.com/realm/realm-core/pull/5149), since v11.8.0)
* Subscription names correctly distinguish an empty string from a nullptr ([#5160](https://github.com/realm/realm-core/pull/5160), since v11.8.0)
* Converting floats/doubles into Decimal128 would yield imprecise results ([#5184](https://github.com/realm/realm-core/pull/5184), since v6.1.0)
* Trying to intern a string multiple times may lead to duplicates ([#5196](https://github.com/realm/realm-core/pull/5196), since v10.2.0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For release notes, we aim to express the changes in end-user terms. Hopefully the SDK's should just be able to copy/paste these notes into their release notes that goes out to users. How would an enduser experience this bug?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, thanks for noticing it. i'll rewrite it.

if (it == m_intern_strings_rev.end()) {
size_t index = m_intern_strings_rev.size();
// FIXME: Assert might be able to be removed after refactoring of changeset_parser types?
REALM_ASSERT_RELEASE_EX(index <= std::numeric_limits<uint32_t>::max(), index);
it = m_intern_strings_rev.insert({std::string{str}, uint32_t(index)}).first;
bool inserted;
std::tie(it, inserted) = m_intern_strings_rev.insert({std::string{str}, uint32_t(index)});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other places we just use auto [it, inserted] for structured bindings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

structured bindings won't work if one of the bindings has already been defined as a variable. std::tie() works around that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course.

@@ -166,7 +166,7 @@ class StringData {
explicit operator bool() const noexcept;
explicit operator std::string_view() const noexcept
{
return std::string_view(m_data);
return std::string_view(m_data, m_size);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

if (it == m_intern_strings_rev.end()) {
size_t index = m_intern_strings_rev.size();
// FIXME: Assert might be able to be removed after refactoring of changeset_parser types?
REALM_ASSERT_RELEASE_EX(index <= std::numeric_limits<uint32_t>::max(), index);
it = m_intern_strings_rev.insert({std::string{str}, uint32_t(index)}).first;
bool inserted;
std::tie(it, inserted) = m_intern_strings_rev.insert({std::string{str}, uint32_t(index)});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

structured bindings won't work if one of the bindings has already been defined as a variable. std::tie() works around that.

encoder.intern_string("Program");
// Bug #5193 caused "Program" not to be found through the interned strings
// although it was just created before.
encoder.intern_string("Program");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the original failure here was that when you decoded the changeset produced by this it would fail with a BadChangeset error. Can we round-trip this and check that the changeset encodes and then successfully decodes?

Copy link
Collaborator Author

@danieltabacaru danieltabacaru Jan 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added validation on changeset decoding/parsing. Also improved the test.

@danieltabacaru danieltabacaru force-pushed the dt/intern_strings_may_be_duplicated branch from 093ec7a to 36b58e2 Compare February 2, 2022 13:03
Copy link
Contributor

@jbreams jbreams left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few more comments. I had a few more ideas/suggestions, but I don't think they're blocking merging if the tests pass.

I think we should go further in ripping out StringData, but we don't need to do it here/now.

@@ -290,9 +293,18 @@ void ChangesetParser::State::parse_one()
if (t == InstrTypeInternString) {
uint32_t index = read_int<uint32_t>();
StringData str = read_string();
auto it1 = m_intern_strings.find(static_cast<std::string_view>(str));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is def a nit, but you can clean this iterator use up a little bit by doing:

if (auto it = m_intern_strings.find(static_cast<std::string_view>(str)); it != m_intern_strings.end()) {
   parser_error(...);
}

Copy link
Collaborator Author

@danieltabacaru danieltabacaru Feb 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reason was to avoid long/multi-line if statements. I agree it looks cleaner though. I'll make the change.

@@ -30,6 +30,9 @@ struct ChangesetParser::State {

StringBuffer m_buffer;
util::metered::set<uint32_t> m_valid_interned_strings;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want this PR to go on for forever, but at this point it's probably cheaper to just make change these two util::metered::set's into a std::vector<std::pair<std::string, uint32_t>> and use std::find() when parsing InternString instructions. The number of InternString's per changeset is going to be bounded by the schema of the realm and is likely to be on the order of like 1000 in a large changeset. If we keep the list sorted by the intern string index, we could even use std::lower_bound to get the same run time as the std::set when parsing the non-InternString instructions but with less memory used.

Something like:

std::vector<std::pair<std::string, uint32_t>> m_intern_strings;
...
    uint32_t index = read_int<uint32_t>();
    std::string_view str = static_cast<std::string_view>(read_string());
    auto it = std::find_if(m_intern_strings.begin(), m_intern_strings.end(), [&](const auto& val) {
        return (val.first == static_cast<std::string_view>(str) || val.second == index);
    });
    if (it != m_intern_strings.end()) {
        if (it->first == str) {
            parser_error("Unexpected intern string");
        }
        ...
    }
    if (REALM_UNLIKELY(!m_intern_strings.empty() && m_intern_strings.back().second > index)) {
        it = std::upper_bound(m_intern_strings.begin(), m_intern_strings.end(), index, [](uint32_t index, const auto& val) {
            return val.second < index;
        });
    }
    m_intern_strings.emplace(it, {std::string{str}, index});

Anyways, I know this PR has been going on for forever, so we don't need to do this here/now - just something that came to mind while I was reading this.

Copy link
Collaborator Author

@danieltabacaru danieltabacaru Feb 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure how having the vector sorted by the intern index will help. std::find_if and emplace have linear complexity (emplace could be constant if the intern strings are already sorted by their index when read). I agree it would use less memory though. But memory should not be a problem given the low number of intern strings.

We don't really need to have them sorted in any way so I would normally use an unordered_set but it does not have support for heterogenous lookups until C++20.

ClientServerFixture fixture(dir, test_context, std::move(config));
fixture.start();

{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to do this step where we wait for download complete? I think this should be a no-op.

CHECK_EQUAL(sync::ProtocolError::bad_changeset, ec);
CHECK(is_fatal);
fixture.stop();
did_fail = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd move did_fail = true to above fixture.stop() to make sure we don't race to read did_fail on line 5782

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems some tests were inconsistent so I modified them as well.

Copy link
Contributor

@jbreams jbreams left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few more comments/suggestions, but I think this should be good to merge after addressing them.

I think we should go further in ripping out StringData, but we don't need to block this bug fix to do it.

@danieltabacaru danieltabacaru merged commit 012ef28 into master Feb 4, 2022
@danieltabacaru danieltabacaru deleted the dt/intern_strings_may_be_duplicated branch February 4, 2022 07:49
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 22, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants