Explicit string size - Core library #67

kspangsege · 2013-02-26T17:57:32Z

For the last couple of weeks my agenda has been to "modernize" the way TightDB works with strings.

Up until now our API has assumed that strings are null-terminated, and therefore that a valid string never contains a null byte.

This is the classical picture of strings in C.

I have changed this such that string sizes are always specified and stored explicitly. This follows the modern trend of abandoning the notion of null-termination, and, crucially, is in line with the design of C++ STL strings.

Compatibility with C style strings is achieved by always storing a string in the database with a terminating null character. This again follows the general trend, and in particular it follows the design of C++ STL strings.

These changes constitute a breaking file format change, breaking binary level API changes, and several breaking source level API changes.

There are three branches involved:

https://github.com/kspangsege/tightdb/tree/save_string_sizes
https://github.com/kspangsege/tightdb/tree/fix_explicit_string_size
https://github.com/kspangsege/tightdb/tree/convert-tool

The first one contains the following two things:

A change of all API functions that either accept or return a string value.
A huge number of internal library changes to make TightDB work consistently with the new kind of string termination.

The second branch contains a small set of changes that updated the file format to be able to handle explicit string sizes. These changes are kept in a separate branch because their application may need to be postponed.

The third branch provides a tool that will convert a TightDB database file from any file format version to the newest one.

These are the pull-requests corresponding to the 2nd and 3rd branches:
#84
#76

Politically:

Most of the world has been trying to move away from null-terminated strings for a long time, and I would like not having TightDB contribute to the inertia.

Performance-wise:

It was not given from the outset that these changes would be beneficial towards performance. There are reasons to expect a shift in both directions. Luckily it seems that the shift is entirely in the direction of improved performance. Here are some measurements that are take from the recent GA case:

my_table.columns().foo.count("US") (running time for 10000 repetitions):
Master: 11216
Branch: 7581    (huge improvement here!)

my_table.where().country.equal("US").count() (running time for 1000 repetitions):
Master: 9295
Branch: 9239

Both count the number of occurrences of "US" in a column with 100,000 entries that alternate between "US" and "DK".

Here is the string related part of the standard benchmark (make benchmark):

Master:
Search (string): 78ms

Branch:
Search (string): 74ms

Technical:

To a very large degree the API changes consist of a systematic replacement of 'const char*' with 'StringData'. StringData is a new class with a 'const char *' member as well as a length member. The name was chosen as a courtesy to 'BinaryData'.

As is the case for BinaryData, StringData assumes no kind of ownership over the referenced memory.

StringData can be constructed implicitly from 'const char*'. This immediately gives us source level backwards compatibility with respect to passing strings as argument. Binary API compatibility is lost though.

With respect to returning strings, no good solution has been found that also retains backwards compatibility. The current consistent solution is to require the following changes of an application that wants to keep working with null-terminated strings:

Before                             After
---------------------------------------------------------------------------
group.get_table_name(...)          group.get_table_name(...).data()
table.get_column_name()            table.get_column_name().data()
table.get_string(...)              table.get_string(...).data()
table.get_mixed(...).get_string()  table.get_mixed(...).get_string().data()

For convenience, and for consistency with STL strings, getting a C style string via the statically typed API looks like this:

const char* my_string = my_table[7].foo.c_str();

The introduction of StringData also gives us a particularly convenient way of gaining full interoperability with C++ STL strings without the risk of problems due to ABI differences between STL implementations.

The trick consists of the following elements:

Offer implicit conversion between StringData and std::string (already done).
Avoid any usage of these conversion operations internally in the library (currently the case).
Avoid any other use of STL strings in the core library unless its use is completely and strictly hidden from the application compiler.

Note that we have already been using STL strings in various places, and we have had some concerns about potential problems due to ABI incompatibilities. A simple solution in all these cases will be to use StringData in place of std::string.

For completeness I should mention that my branch also adds full support for queries on binary data. The reason is that it came almost for free due to the symmetry between BinaryData and StringData.

/cc @astigsen @rrrlasse @bmunkholm

…d in an explicit way (using StringData) rather than relying on zero termination.

Conflicts: test/experiments/testcase.cpp

This reverts commit 4452403.

astigsen · 2013-03-22T06:54:26Z

This is a big commit, so I have not yet reviewed all of it, but from what I have looked at it looks good. One minor thing I noticed is that most functions that used to take const char* as argument now takes StringData. Would it not be more correct with const StringData&?

Otherwise that main issue is if we are willing to break the format right now. Since we do have some using it in production, it could cause problems.

kspangsege · 2013-03-22T09:36:54Z

Why do you think it would be better to use 'const StringData&' than just
'StringData'?

In my opinion 'StringData' is the natural form. In general, one should only
substitute that by 'const StringData&' when there is a specific reason to
do so.

First of all, much of C++'s performance lead comes from the fact that
arguments are generally passed by value rather than ny referenced. This
improves cache locality.

There are two common cases where it is a good idea to switch to the 'const
reference' form, one is when the passed object is big. The other is when a
copy is expensive.

For a typical implementation of an STL string, a copy is relatively
expensive because it manipulates a shared reference count using memory
fences. The mere fact that non-local memory is accessed makes the copy
operation relatively slow.

In summary, I don't think we should use 'const StringData&' in general, but
I reckon that there may be specific places where it would be good for
performance for involved reasons.

On Fri, Mar 22, 2013 at 10:20 AM, Kristian Spangsege [email protected] wrote:

Oops, please disregard my previous post. I misunderstood you, Alexander. I
overlooked your '&'. I'm sorry!

On Fri, Mar 22, 2013 at 10:17 AM, Kristian Spangsege [email protected]:

On Fri, Mar 22, 2013 at 7:54 AM, astigsen [email protected]:

Would it not be more correct with const StringData&?

No. StringData has a 'const char*' member.

Please note that the difference between 'StringData' and 'const
StringData' is the same as the difference between 'const char_' and 'const
char_ const'

In other words, the idea that 'const' on 'StringData' can affect the
constness of the referenced string is wrong. It is not possible.

That, of course, makes constness mandatory. In general, one should think
of StringData as representing an immutable string.

Conflicts: src/tightdb/column_string.hpp src/tightdb/query_engine.hpp src/tightdb/utf8.cpp test/testtable.cpp

bmunkholm · 2013-04-25T06:07:01Z

src/tightdb/array_blob.cpp

-    const size_t newsize = m_len - gapsize + size;
+    size_t remove_size = end - begin;
+    size_t add_size = size;
+    if (add_zero_term) ++add_size;


format: "if"-condition on one line, "action"-statement on next line.

bmunkholm · 2013-04-28T16:30:22Z

src/tightdb/index_string.hpp

-    StringIndex(size_t ref, ArrayParent* parent, size_t pndx, void* target_column, StringGetter get_func, Allocator& alloc);
+    StringIndex(void* target_column, StringGetter get_func, Allocator&);
+    StringIndex(Array::ColumnDef, Allocator&);
+    StringIndex(size_t ref, ArrayParent*, size_t pndx, void* target_column, StringGetter get_func, Allocator&);


Why remove a few of the parameter names? (and kepp the rest)? I suggest to keep the parameter names as they provide info ad to what should be passed.

The reason I remove "parent" from "ArrayParent* parent" is that "parent" adds no new information. It is a feature of C++ that the names can be skipped. Also, in C++ types are supposed to play a much more central role than in C.

bmunkholm · 2013-04-28T18:20:06Z

+1 - pheww - that was a long one....
Only format issues (not following coding standard regarding one statement on each line).
More importantly, testcases needs to be expanded for full coverage of new or changed methods.

…e it

rrrlasse · 2013-04-29T11:38:34Z

src/tightdb/array_string.cpp

+            size_t j = 0;
+            for (;;) {
+                if (TIGHTDB_LIKELY(data[j] != value[j])) break;
+                ++j;


This inner loop is not as tight as it could be - it disassembles into:

size_t j = 0; 000007F63D374498 xor ecx,ecx for (;;) { if (TIGHTDB_LIKELY(data[j] != value[j])) break; 000007F63D37449A cmp byte ptr [rdx],dil

The zero-out of ecx is unneccesary. It's a pity because this is the innermost loop if a frequently called function, and there is so little work taking part on each non-match that the zero out could be relatively significant

Explicit string size - Core library

kspangsege added 16 commits February 25, 2013 22:00

Cleanup session

79d3905

Another cleanup session

cc2a1e7

ArrayString::set() fixed

f171e9f

ArrayString now keeps track of string lengths

700a4d4

Merge branch 'master' into save_string_sizes

5ceced7

Merge branch 'master' into save_string_sizes

2b76909

Replication now supports floating point data

14a4fc7

Merge branch 'master' into save_string_sizes

4af6811

Merge branch 'master' into save_string_sizes

076293d

'StringRef' renamed to 'StringData' as a courtesy to 'BinaryData'

b9a545d

Mixed class is now able to handle strings with explicit length

7683410

Huge update of everything string related. String sizes are now manage…

6ad8c1c

…d in an explicit way (using StringData) rather than relying on zero termination.

Work in progress

625b7d2

Documentation added for Mixed class

a3b4b24

Merge branch 'master' into save_string_sizes

0046b5f

Conflicts: test/experiments/testcase.cpp

Merge branch 'master' into save_string_sizes

4123451

ghost assigned astigsen Mar 21, 2013

kspangsege added 2 commits March 21, 2013 19:26

Support: my_table[7].foo.c_str()

c1e06fd

More use of StringData

f6ee2af

kspangsege mentioned this pull request Mar 21, 2013

Explicit string size - Java realm/realm-java#76

Merged

kspangsege added 3 commits March 22, 2013 00:51

Various consistency fixes

4452403

Revert "Various consistency fixes"

25838dc

This reverts commit 4452403.

Various consistency fixes

7a3f8a8

kspangsege added 5 commits March 23, 2013 16:30

Proper documentation added for StringData class

68af6d8

Use fulle namespace qualification in table macros

3471616

Fixes for: Use fulle namespace qualification in table macros

5e528ab

Merge branch 'master' into save_string_sizes

c2d2c52

Conflicts: src/tightdb/column_string.hpp src/tightdb/query_engine.hpp src/tightdb/utf8.cpp test/testtable.cpp

Removing obsolete BinaryData API

387236d

Documentation updated

d176a47

bmunkholm reviewed Apr 25, 2013
View reviewed changes

kspangsege added 2 commits April 25, 2013 14:56

Break if-statements into two lines

611ec15

Unit tests added for StringData

779a724

bmunkholm reviewed Apr 28, 2013
View reviewed changes

Many more if statements broken up into two lines - I'm not usre I lik…

0c534ec

…e it

rrrlasse reviewed Apr 29, 2013
View reviewed changes

bmunkholm pushed a commit that referenced this pull request Apr 29, 2013

Merge pull request #67 from kspangsege/save_string_sizes

1706791

Explicit string size - Core library

bmunkholm merged commit 1706791 into realm:master Apr 29, 2013

gvravikumar mentioned this pull request Oct 30, 2023

Failure to open a Realm file on Android 14 #7037

Open

sync-by-unito bot mentioned this pull request Feb 16, 2024

App crashing when starting Realm: [realm-core-11.0.2] Assertion failed: r == 0 && "File::unlock() " ... #4777

Closed

github-actions bot locked as resolved and limited conversation to collaborators Mar 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Explicit string size - Core library #67

Explicit string size - Core library #67

kspangsege commented Feb 26, 2013

astigsen commented Mar 22, 2013

kspangsege commented Mar 22, 2013

bmunkholm Apr 25, 2013

bmunkholm Apr 28, 2013

kspangsege Apr 29, 2013

bmunkholm commented Apr 28, 2013

rrrlasse Apr 29, 2013

Explicit string size - Core library #67

Explicit string size - Core library #67

Conversation

kspangsege commented Feb 26, 2013

astigsen commented Mar 22, 2013

kspangsege commented Mar 22, 2013

bmunkholm Apr 25, 2013

Choose a reason for hiding this comment

bmunkholm Apr 28, 2013

Choose a reason for hiding this comment

kspangsege Apr 29, 2013

Choose a reason for hiding this comment

bmunkholm commented Apr 28, 2013

rrrlasse Apr 29, 2013

Choose a reason for hiding this comment