Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trying to intern a string multiple times may lead to duplicates #5196

Merged
merged 8 commits into from
Feb 4, 2022
Merged
Show file tree
Hide file tree
Changes from 7 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@
* Subscription names correctly distinguish an empty string from a nullptr ([#5160](https://github.com/realm/realm-core/pull/5160), since v11.8.0)
* Converting floats/doubles into Decimal128 would yield imprecise results ([#5184](https://github.com/realm/realm-core/pull/5184), since v6.1.0)
* Fix some warnings when building with Xcode 13.3.
* Using accented characters in class and field names may end the session ([#5196](https://github.com/realm/realm-core/pull/5196), since v10.2.0)

### Breaking changes
* None.
Expand Down
2 changes: 1 addition & 1 deletion src/realm/string_data.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -166,7 +166,7 @@ class StringData {
explicit operator bool() const noexcept;
explicit operator std::string_view() const noexcept
{
return std::string_view(m_data);
return std::string_view(m_data, m_size);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

}

/// If the StringData is NULL, the hash is 0. Otherwise, the function
Expand Down
8 changes: 5 additions & 3 deletions src/realm/sync/changeset_encoder.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -224,12 +224,14 @@ void ChangesetEncoder::operator()(const Instruction::SetErase& instr)

InternString ChangesetEncoder::intern_string(StringData str)
{
auto it = m_intern_strings_rev.find(str);
auto it = m_intern_strings_rev.find(static_cast<std::string_view>(str));
if (it == m_intern_strings_rev.end()) {
size_t index = m_intern_strings_rev.size();
// FIXME: Assert might be able to be removed after refactoring of changeset_parser types?
REALM_ASSERT_RELEASE_EX(index <= std::numeric_limits<uint32_t>::max(), index);
it = m_intern_strings_rev.insert({std::string{str}, uint32_t(index)}).first;
bool inserted;
std::tie(it, inserted) = m_intern_strings_rev.insert({std::string{str}, uint32_t(index)});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other places we just use auto [it, inserted] for structured bindings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

structured bindings won't work if one of the bindings has already been defined as a variable. std::tie() works around that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course.

REALM_ASSERT_RELEASE_EX(inserted, str);

StringBufferRange range = add_string_range(str);
set_intern_string(uint32_t(index), range);
Expand All @@ -248,7 +250,7 @@ void ChangesetEncoder::set_intern_string(uint32_t index, StringBufferRange range

StringBufferRange ChangesetEncoder::add_string_range(StringData data)
{
m_string_range = data;
m_string_range = static_cast<std::string_view>(data);
REALM_ASSERT(data.size() <= std::numeric_limits<uint32_t>::max());
return StringBufferRange{0, uint32_t(data.size())};
}
Expand Down
2 changes: 1 addition & 1 deletion src/realm/sync/changeset_encoder.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -70,7 +70,7 @@ struct ChangesetEncoder : InstructionHandler {

Buffer m_buffer;
util::metered::map<std::string, uint32_t> m_intern_strings_rev;
StringData m_string_range;
std::string_view m_string_range;
};

// Implementation
Expand Down
12 changes: 12 additions & 0 deletions src/realm/sync/changeset_parser.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,9 @@ struct ChangesetParser::State {

StringBuffer m_buffer;
util::metered::set<uint32_t> m_valid_interned_strings;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want this PR to go on for forever, but at this point it's probably cheaper to just make change these two util::metered::set's into a std::vector<std::pair<std::string, uint32_t>> and use std::find() when parsing InternString instructions. The number of InternString's per changeset is going to be bounded by the schema of the realm and is likely to be on the order of like 1000 in a large changeset. If we keep the list sorted by the intern string index, we could even use std::lower_bound to get the same run time as the std::set when parsing the non-InternString instructions but with less memory used.

Something like:

std::vector<std::pair<std::string, uint32_t>> m_intern_strings;
...
    uint32_t index = read_int<uint32_t>();
    std::string_view str = static_cast<std::string_view>(read_string());
    auto it = std::find_if(m_intern_strings.begin(), m_intern_strings.end(), [&](const auto& val) {
        return (val.first == static_cast<std::string_view>(str) || val.second == index);
    });
    if (it != m_intern_strings.end()) {
        if (it->first == str) {
            parser_error("Unexpected intern string");
        }
        ...
    }
    if (REALM_UNLIKELY(!m_intern_strings.empty() && m_intern_strings.back().second > index)) {
        it = std::upper_bound(m_intern_strings.begin(), m_intern_strings.end(), index, [](uint32_t index, const auto& val) {
            return val.second < index;
        });
    }
    m_intern_strings.emplace(it, {std::string{str}, index});

Anyways, I know this PR has been going on for forever, so we don't need to do this here/now - just something that came to mind while I was reading this.

Copy link
Collaborator Author

@danieltabacaru danieltabacaru Feb 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure how having the vector sorted by the intern index will help. std::find_if and emplace have linear complexity (emplace could be constant if the intern strings are already sorted by their index when read). I agree it would use less memory though. But memory should not be a problem given the low number of intern strings.

We don't really need to have them sorted in any way so I would normally use an unordered_set but it does not have support for heterogenous lookups until C++20.

// Cannot use StringData as key type since m_input_begin may start pointing
// to a new chunk of memory.
util::metered::set<std::string> m_intern_strings;


void parse_one(); // Throws
Expand Down Expand Up @@ -290,9 +293,18 @@ void ChangesetParser::State::parse_one()
if (t == InstrTypeInternString) {
uint32_t index = read_int<uint32_t>();
StringData str = read_string();
auto it1 = m_intern_strings.find(static_cast<std::string_view>(str));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is def a nit, but you can clean this iterator use up a little bit by doing:

if (auto it = m_intern_strings.find(static_cast<std::string_view>(str)); it != m_intern_strings.end()) {
   parser_error(...);
}

Copy link
Collaborator Author

@danieltabacaru danieltabacaru Feb 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reason was to avoid long/multi-line if statements. I agree it looks cleaner though. I'll make the change.

if (it1 != m_intern_strings.end()) {
parser_error("Unexpected intern string");
}
auto it2 = m_valid_interned_strings.find(index);
if (it2 != m_valid_interned_strings.end()) {
parser_error("Unexpected intern index");
}
StringBufferRange range = m_handler.add_string_range(str);
m_handler.set_intern_string(index, range);
m_valid_interned_strings.emplace(index);
m_intern_strings.emplace(std::string{str});
return;
}

Expand Down
18 changes: 18 additions & 0 deletions test/test_changeset_encoding.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -259,3 +259,21 @@ TEST(ChangesetEncoding_Clear)
CHECK_EQUAL(changeset, parsed);
CHECK(**changeset.begin() == instr);
}

TEST(ChangesetEncoding_AccentWords)
{
sync::ChangesetEncoder encoder;

encoder.intern_string("Prógram");
encoder.intern_string("Program");
// Bug #5193 caused "Program" to not be found as an intern string
// although it was just created before.
encoder.intern_string("Program");
auto& buffer = encoder.buffer();

using realm::_impl::SimpleNoCopyInputStream;
SimpleNoCopyInputStream stream{buffer.data(), buffer.size()};
Changeset parsed;
// This will throw if a string is interned twice.
CHECK_NOTHROW(parse_changeset(stream, parsed));
}
50 changes: 50 additions & 0 deletions test/test_sync.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -5733,6 +5733,56 @@ TEST(Sync_BadChangeset)
}


TEST(Sync_GoodChangeset_AccentCharacterInFieldName)
{
TEST_DIR(dir);
TEST_CLIENT_DB(db);

bool did_fail = false;
{
ClientServerFixture::Config config;
config.disable_upload_compaction = true;
ClientServerFixture fixture(dir, test_context, std::move(config));
fixture.start();

{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to do this step where we wait for download complete? I think this should be a no-op.

Session session = fixture.make_bound_session(db);
session.wait_for_download_complete_or_client_stopped();
}

{
WriteTransaction wt(db);
TableRef table = wt.add_table("class_table");
table->add_column(type_Int, "prógram");
table->add_column(type_Int, "program");
auto obj = table->create_object();
obj.add_int("program", 42);
wt.commit();
}

auto listener = [&](ConnectionState state, const Session::ErrorInfo* error_info) {
if (state != ConnectionState::disconnected)
return;
REALM_ASSERT(error_info);
std::error_code ec = error_info->error_code;
bool is_fatal = error_info->is_fatal;
CHECK_EQUAL(sync::ProtocolError::bad_changeset, ec);
CHECK(is_fatal);
fixture.stop();
did_fail = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd move did_fail = true to above fixture.stop() to make sure we don't race to read did_fail on line 5782

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems some tests were inconsistent so I modified them as well.

};

Session session = fixture.make_session(db);
session.set_connection_state_change_listener(listener);
fixture.bind_session(session, "/test");

session.wait_for_upload_complete_or_client_stopped();
session.wait_for_download_complete_or_client_stopped();
}
CHECK_NOT(did_fail);
}


namespace issue2104 {

class IntegrationReporter : public _impl::ServerHistory::IntegrationReporter {
Expand Down