-
Notifications
You must be signed in to change notification settings - Fork 168
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Explicit string size - Core library #67
Conversation
…d in an explicit way (using StringData) rather than relying on zero termination.
Conflicts: test/experiments/testcase.cpp
This is a big commit, so I have not yet reviewed all of it, but from what I have looked at it looks good. One minor thing I noticed is that most functions that used to take const char* as argument now takes StringData. Would it not be more correct with const StringData&? Otherwise that main issue is if we are willing to break the format right now. Since we do have some using it in production, it could cause problems. |
Why do you think it would be better to use 'const StringData&' than just In my opinion 'StringData' is the natural form. In general, one should only First of all, much of C++'s performance lead comes from the fact that There are two common cases where it is a good idea to switch to the 'const For a typical implementation of an STL string, a copy is relatively In summary, I don't think we should use 'const StringData&' in general, but On Fri, Mar 22, 2013 at 10:20 AM, Kristian Spangsege [email protected] wrote:
|
Conflicts: src/tightdb/column_string.hpp src/tightdb/query_engine.hpp src/tightdb/utf8.cpp test/testtable.cpp
const size_t newsize = m_len - gapsize + size; | ||
size_t remove_size = end - begin; | ||
size_t add_size = size; | ||
if (add_zero_term) ++add_size; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
format: "if"-condition on one line, "action"-statement on next line.
StringIndex(size_t ref, ArrayParent* parent, size_t pndx, void* target_column, StringGetter get_func, Allocator& alloc); | ||
StringIndex(void* target_column, StringGetter get_func, Allocator&); | ||
StringIndex(Array::ColumnDef, Allocator&); | ||
StringIndex(size_t ref, ArrayParent*, size_t pndx, void* target_column, StringGetter get_func, Allocator&); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why remove a few of the parameter names? (and kepp the rest)? I suggest to keep the parameter names as they provide info ad to what should be passed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I remove "parent" from "ArrayParent* parent" is that "parent" adds no new information. It is a feature of C++ that the names can be skipped. Also, in C++ types are supposed to play a much more central role than in C.
+1 - pheww - that was a long one.... |
size_t j = 0; | ||
for (;;) { | ||
if (TIGHTDB_LIKELY(data[j] != value[j])) break; | ||
++j; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This inner loop is not as tight as it could be - it disassembles into:
size_t j = 0;
000007F63D374498 xor ecx,ecx
for (;;) {
if (TIGHTDB_LIKELY(data[j] != value[j])) break;
000007F63D37449A cmp byte ptr [rdx],dil
The zero-out of ecx is unneccesary. It's a pity because this is the innermost loop if a frequently called function, and there is so little work taking part on each non-match that the zero out could be relatively significant
Explicit string size - Core library
For the last couple of weeks my agenda has been to "modernize" the way TightDB works with strings.
Up until now our API has assumed that strings are null-terminated, and therefore that a valid string never contains a null byte.
This is the classical picture of strings in C.
I have changed this such that string sizes are always specified and stored explicitly. This follows the modern trend of abandoning the notion of null-termination, and, crucially, is in line with the design of C++ STL strings.
Compatibility with C style strings is achieved by always storing a string in the database with a terminating null character. This again follows the general trend, and in particular it follows the design of C++ STL strings.
These changes constitute a breaking file format change, breaking binary level API changes, and several breaking source level API changes.
There are three branches involved:
https://github.com/kspangsege/tightdb/tree/save_string_sizes
https://github.com/kspangsege/tightdb/tree/fix_explicit_string_size
https://github.com/kspangsege/tightdb/tree/convert-tool
The first one contains the following two things:
The second branch contains a small set of changes that updated the file format to be able to handle explicit string sizes. These changes are kept in a separate branch because their application may need to be postponed.
The third branch provides a tool that will convert a TightDB database file from any file format version to the newest one.
These are the pull-requests corresponding to the 2nd and 3rd branches:
#84
#76
Politically:
Most of the world has been trying to move away from null-terminated strings for a long time, and I would like not having TightDB contribute to the inertia.
Performance-wise:
It was not given from the outset that these changes would be beneficial towards performance. There are reasons to expect a shift in both directions. Luckily it seems that the shift is entirely in the direction of improved performance. Here are some measurements that are take from the recent GA case:
Both count the number of occurrences of "US" in a column with 100,000 entries that alternate between "US" and "DK".
Here is the string related part of the standard benchmark (
make benchmark
):Technical:
To a very large degree the API changes consist of a systematic replacement of 'const char*' with 'StringData'. StringData is a new class with a 'const char *' member as well as a length member. The name was chosen as a courtesy to 'BinaryData'.
As is the case for BinaryData, StringData assumes no kind of ownership over the referenced memory.
StringData can be constructed implicitly from 'const char*'. This immediately gives us source level backwards compatibility with respect to passing strings as argument. Binary API compatibility is lost though.
With respect to returning strings, no good solution has been found that also retains backwards compatibility. The current consistent solution is to require the following changes of an application that wants to keep working with null-terminated strings:
For convenience, and for consistency with STL strings, getting a C style string via the statically typed API looks like this:
The introduction of StringData also gives us a particularly convenient way of gaining full interoperability with C++ STL strings without the risk of problems due to ABI differences between STL implementations.
The trick consists of the following elements:
Note that we have already been using STL strings in various places, and we have had some concerns about potential problems due to ABI incompatibilities. A simple solution in all these cases will be to use StringData in place of std::string.
For completeness I should mention that my branch also adds full support for queries on binary data. The reason is that it came almost for free due to the symmetry between BinaryData and StringData.
/cc @astigsen @rrrlasse @bmunkholm