Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explicit string size - Core library #67

Merged
merged 54 commits into from
Apr 29, 2013
Merged

Explicit string size - Core library #67

merged 54 commits into from
Apr 29, 2013

Conversation

kspangsege
Copy link
Contributor

For the last couple of weeks my agenda has been to "modernize" the way TightDB works with strings.

Up until now our API has assumed that strings are null-terminated, and therefore that a valid string never contains a null byte.

This is the classical picture of strings in C.

I have changed this such that string sizes are always specified and stored explicitly. This follows the modern trend of abandoning the notion of null-termination, and, crucially, is in line with the design of C++ STL strings.

Compatibility with C style strings is achieved by always storing a string in the database with a terminating null character. This again follows the general trend, and in particular it follows the design of C++ STL strings.

These changes constitute a breaking file format change, breaking binary level API changes, and several breaking source level API changes.

There are three branches involved:

https://github.com/kspangsege/tightdb/tree/save_string_sizes
https://github.com/kspangsege/tightdb/tree/fix_explicit_string_size
https://github.com/kspangsege/tightdb/tree/convert-tool

The first one contains the following two things:

  • A change of all API functions that either accept or return a string value.
  • A huge number of internal library changes to make TightDB work consistently with the new kind of string termination.

The second branch contains a small set of changes that updated the file format to be able to handle explicit string sizes. These changes are kept in a separate branch because their application may need to be postponed.

The third branch provides a tool that will convert a TightDB database file from any file format version to the newest one.

These are the pull-requests corresponding to the 2nd and 3rd branches:
#84
#76

Politically:

Most of the world has been trying to move away from null-terminated strings for a long time, and I would like not having TightDB contribute to the inertia.

Performance-wise:

It was not given from the outset that these changes would be beneficial towards performance. There are reasons to expect a shift in both directions. Luckily it seems that the shift is entirely in the direction of improved performance. Here are some measurements that are take from the recent GA case:

my_table.columns().foo.count("US") (running time for 10000 repetitions):
Master: 11216
Branch: 7581    (huge improvement here!)

my_table.where().country.equal("US").count() (running time for 1000 repetitions):
Master: 9295
Branch: 9239

Both count the number of occurrences of "US" in a column with 100,000 entries that alternate between "US" and "DK".

Here is the string related part of the standard benchmark (make benchmark):

Master:
Search (string): 78ms

Branch:
Search (string): 74ms

Technical:

To a very large degree the API changes consist of a systematic replacement of 'const char*' with 'StringData'. StringData is a new class with a 'const char *' member as well as a length member. The name was chosen as a courtesy to 'BinaryData'.

As is the case for BinaryData, StringData assumes no kind of ownership over the referenced memory.

StringData can be constructed implicitly from 'const char*'. This immediately gives us source level backwards compatibility with respect to passing strings as argument. Binary API compatibility is lost though.

With respect to returning strings, no good solution has been found that also retains backwards compatibility. The current consistent solution is to require the following changes of an application that wants to keep working with null-terminated strings:

Before                             After
---------------------------------------------------------------------------
group.get_table_name(...)          group.get_table_name(...).data()
table.get_column_name()            table.get_column_name().data()
table.get_string(...)              table.get_string(...).data()
table.get_mixed(...).get_string()  table.get_mixed(...).get_string().data()

For convenience, and for consistency with STL strings, getting a C style string via the statically typed API looks like this:

const char* my_string = my_table[7].foo.c_str();

The introduction of StringData also gives us a particularly convenient way of gaining full interoperability with C++ STL strings without the risk of problems due to ABI differences between STL implementations.

The trick consists of the following elements:

  • Offer implicit conversion between StringData and std::string (already done).
  • Avoid any usage of these conversion operations internally in the library (currently the case).
  • Avoid any other use of STL strings in the core library unless its use is completely and strictly hidden from the application compiler.

Note that we have already been using STL strings in various places, and we have had some concerns about potential problems due to ABI incompatibilities. A simple solution in all these cases will be to use StringData in place of std::string.

For completeness I should mention that my branch also adds full support for queries on binary data. The reason is that it came almost for free due to the symmetry between BinaryData and StringData.

/cc @astigsen @rrrlasse @bmunkholm

@ghost ghost assigned astigsen Mar 21, 2013
@astigsen
Copy link
Contributor

This is a big commit, so I have not yet reviewed all of it, but from what I have looked at it looks good. One minor thing I noticed is that most functions that used to take const char* as argument now takes StringData. Would it not be more correct with const StringData&?

Otherwise that main issue is if we are willing to break the format right now. Since we do have some using it in production, it could cause problems.

@kspangsege
Copy link
Contributor Author

Why do you think it would be better to use 'const StringData&' than just
'StringData'?

In my opinion 'StringData' is the natural form. In general, one should only
substitute that by 'const StringData&' when there is a specific reason to
do so.

First of all, much of C++'s performance lead comes from the fact that
arguments are generally passed by value rather than ny referenced. This
improves cache locality.

There are two common cases where it is a good idea to switch to the 'const
reference' form, one is when the passed object is big. The other is when a
copy is expensive.

For a typical implementation of an STL string, a copy is relatively
expensive because it manipulates a shared reference count using memory
fences. The mere fact that non-local memory is accessed makes the copy
operation relatively slow.

In summary, I don't think we should use 'const StringData&' in general, but
I reckon that there may be specific places where it would be good for
performance for involved reasons.

On Fri, Mar 22, 2013 at 10:20 AM, Kristian Spangsege [email protected] wrote:

Oops, please disregard my previous post. I misunderstood you, Alexander. I
overlooked your '&'. I'm sorry!

On Fri, Mar 22, 2013 at 10:17 AM, Kristian Spangsege [email protected]:

On Fri, Mar 22, 2013 at 7:54 AM, astigsen [email protected]:

Would it not be more correct with const StringData&?

No. StringData has a 'const char*' member.

Please note that the difference between 'StringData' and 'const
StringData' is the same as the difference between 'const char_' and 'const
char_ const'

In other words, the idea that 'const' on 'StringData' can affect the
constness of the referenced string is wrong. It is not possible.

That, of course, makes constness mandatory. In general, one should think
of StringData as representing an immutable string.

const size_t newsize = m_len - gapsize + size;
size_t remove_size = end - begin;
size_t add_size = size;
if (add_zero_term) ++add_size;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

format: "if"-condition on one line, "action"-statement on next line.

StringIndex(size_t ref, ArrayParent* parent, size_t pndx, void* target_column, StringGetter get_func, Allocator& alloc);
StringIndex(void* target_column, StringGetter get_func, Allocator&);
StringIndex(Array::ColumnDef, Allocator&);
StringIndex(size_t ref, ArrayParent*, size_t pndx, void* target_column, StringGetter get_func, Allocator&);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove a few of the parameter names? (and kepp the rest)? I suggest to keep the parameter names as they provide info ad to what should be passed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I remove "parent" from "ArrayParent* parent" is that "parent" adds no new information. It is a feature of C++ that the names can be skipped. Also, in C++ types are supposed to play a much more central role than in C.

@bmunkholm
Copy link
Contributor

+1 - pheww - that was a long one....
Only format issues (not following coding standard regarding one statement on each line).
More importantly, testcases needs to be expanded for full coverage of new or changed methods.

size_t j = 0;
for (;;) {
if (TIGHTDB_LIKELY(data[j] != value[j])) break;
++j;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This inner loop is not as tight as it could be - it disassembles into:

        size_t j = 0;
        000007F63D374498  xor         ecx,ecx  
        for (;;) {
                    if (TIGHTDB_LIKELY(data[j] != value[j])) break;
        000007F63D37449A  cmp         byte ptr [rdx],dil  

The zero-out of ecx is unneccesary. It's a pity because this is the innermost loop if a frequently called function, and there is so little work taking part on each non-match that the zero out could be relatively significant

bmunkholm pushed a commit that referenced this pull request Apr 29, 2013
Explicit string size - Core library
@bmunkholm bmunkholm merged commit 1706791 into realm:master Apr 29, 2013
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 25, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants