Fix ambiguity between null strings and zero-length strings #264
I implemented option no. 2:
However you will need to take a thorough look, because I basically only looked at the points where tests failed. There may be some forgotten points. Unfortunately, when we only store |
Codecov Report
@@ Coverage Diff @@
## master #264 +/- ##
==========================================
+ Coverage 96.35% 96.38% +0.03%
==========================================
Files 22 22
Lines 7991 8060 +69
==========================================
+ Hits 7700 7769 +69
Misses 291 291
@biojppm Can you look into this? |
I've finally had some occasion to look at this. I've refined the tests and the code for it to pass those tests. While doing this I've found a parser bug that was also fixed in this PR. I pushed those changes to your branch. Basically an important point is that for any given string, (whether null or otherwise), the result of There is a thorny ambiguity with So for now I've left As for the location issue, I have yet to look at it. It is possible that we have to reconsider the whole approach, and adding a flag as you suggest is something that could help. |
@Gei0r sorry for the long delay in looking at this. I hope to be able to do it in the coming week or so. |
@Gei0r I've rebased on master (which has breaking changes). Let me know if I can push to the PR (ie to your remote branch). |
Re biojppm#263
To differentiate between null scalars and zero-length scalars, these are now stored in different ways:
- null values are stored with str == nullptr. Unfortunately we lose the location information for these.
- zero-length strings are stored with str != nullptr, pointing either to their location in the source buffer or into the arena. If the arena is empty (nullptr) and a zero-length scalar is stored, some space is reserved so that this scalar has a non-nullptr str. (However, the reserved space will not actually be used up by the scalar.)
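The storage rule in this commit message can be illustrated with a minimal sketch. The StrView type and the helper names below are hypothetical stand-ins, not rapidyaml's csubstr; they only show how a null scalar and a zero-length scalar can be told apart by the pointer alone.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a non-owning string view (not rapidyaml's csubstr).
struct StrView
{
    const char *str;
    std::size_t len;
};

// A null scalar carries no pointer at all.
inline bool is_null(StrView s) { return s.str == nullptr; }

// A zero-length, non-null scalar has a valid pointer and zero length.
inline bool is_empty_string(StrView s) { return s.str != nullptr && s.len == 0; }

// Build a zero-length view that still points into some buffer, so that it
// stays distinguishable from a null view. No bytes of the buffer are used.
inline StrView make_empty(const char *anchor) { return StrView{anchor, 0}; }
```

A value-initialized StrView{} is null under this scheme, while make_empty(buffer) is an empty string; both have len == 0.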
@Gei0r I went ahead and pushed to the PR. The handling of locations was improved such that when the original node is null, in most cases we get a location "close enough" to the original node. Only in some pathological cases does this yield a null val. It's not perfect but it's an improvement.
Yeah, likewise sorry for not being more engaged with this issue. |
The locations still need some more work; there is still a case where it goes null and didn't need to. But right now I'm really bothered by what must be a compiler error in gcc's Release builds. Given this version of to_arena():
#define ______(id) \
printf("here --- %s --- a.len==%zu a.str=%p(%zu) (a.str==nullptr)==%d (a.str!=nullptr)==%d\n", \
#id, a.len, a.str, (intptr_t)a.str, \
(a.str == nullptr), (a.str != nullptr))
csubstr Tree::to_arena(csubstr a)
{
______(0);
substr rem(m_arena.sub(m_arena_pos));
size_t num = to_chars(rem, a);
______(1);
if(num > rem.len)
{
______(2);
rem = _grow_arena(num);
num = to_chars(rem, a);
RYML_ASSERT(num <= rem.len);
}
else if(num == 0u)
{
______(3);
if(a.str == nullptr) // ?????? a null string must enter this branch!
{
______(3.1);
return csubstr{};
}
else if(m_arena.str == nullptr)
{
______(3.2);
// Arena is empty and we want to store a non-null
// zero-length string.
// Even though the string has zero length, we need
// some "memory" to store a non-nullptr string
rem = _grow_arena(1);
}
}
______(4);
rem = _request_span(num);
return rem;
}
... I'm getting wrong behavior in gcc Release builds (Debug builds are ok). If I pass an empty
TEST(empty_scalar, gcc_error)
{
Tree tree;
csubstr nullstr = {};
ASSERT_EQ(nullstr.str, nullptr);
ASSERT_EQ(nullstr.len, 0);
std::cout << "\nserializing with empty arena...\n";
csubstr result = tree.to_arena(nullstr);
EXPECT_EQ(result.str, nullptr); // fails!
EXPECT_EQ(result.len, 0);
std::cout << "\nserializing with nonempty arena...\n";
result = tree.to_arena(nullstr);
EXPECT_EQ(result.str, nullptr); // fails!
EXPECT_EQ(result.len, 0);
}
This is the output I'm getting:
So despite the fact that the 3.1 branch condition The problem goes away in Debug builds, and everywhere with clang and msvc. So I think this is a GCC optimizer error. |
I reshuffled the branches, and it now works:
modified src/c4/yml/tree.hpp
@@ -997,15 +997,19 @@ public:
* @see alloc_arena() */
csubstr to_arena(csubstr a)
{
- substr rem(m_arena.sub(m_arena_pos));
- size_t num = to_chars(rem, a);
- if(num > rem.len)
+ if(a.len > 0)
{
- rem = _grow_arena(num);
- num = to_chars(rem, a);
- RYML_ASSERT(num <= rem.len);
+ substr rem(m_arena.sub(m_arena_pos));
+ size_t num = to_chars(rem, a);
+ if(num > rem.len)
+ {
+ rem = _grow_arena(num);
+ num = to_chars(rem, a);
+ RYML_ASSERT(num <= rem.len);
+ }
+ return _request_span(num);
}
- else if(num == 0u)
+ else
{
if(a.str == nullptr) // ?????? should enter this branch!
{
@@ -1017,11 +1021,10 @@ public:
// zero-length string.
// Even though the string has zero length, we need
// some "memory" to store a non-nullptr string
- rem = _grow_arena(1);
+ _grow_arena(1);
}
+ return _request_span(0);
}
- rem = _request_span(num);
- return rem;
}
Right, that's one out of the way. This one sucked. |
Three issues remain to be addressed:
I hope to do these in the coming days. Regarding the I am inclined to make the resulting
This is a little over my head, but I suspect this is because in in Reminds me of this blog post by Raymond Chen. |
Here's a short example of what I mean: https://godbolt.org/z/KaYxa1qKG As you can see, with |
What's important to me is that an empty (e.g. default-constructed) Semantically, I'd prefer to make a csubstr from such a The thing with the dangling pointer if |
test/test_empty_scalar.cpp
// See also:
// https://github.com/biojppm/rapidyaml/issues/263
// https://github.com/biojppm/rapidyaml/pulls/264
URL is wrong (pull instead of pulls)
Kudos and thanks, that is exactly the problem. Eg, looking at cppreference, it does say that:
... which is exactly the case that I had: count was zero, so I assumed it was ok to still do the call to avoid the branch, while not aware I was incurring UB by doing so. I will need to check the library for this problem, as there are quite a few calls to |
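To make the zero-count pitfall concrete, here is a minimal sketch. It assumes the call in question behaves like std::memcpy, for which cppreference documents that a null source or destination is undefined behavior even when count is zero. The copy_bytes helper and its signature are hypothetical, not part of rapidyaml; the point is only that the length check comes before the call, which mirrors the reshuffled branches in the diff above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: copy `count` bytes from src into dst, which has
// room for dst_cap bytes. The possibly-null pointers are never handed to
// std::memcpy when there is nothing to copy.
inline std::size_t copy_bytes(char *dst, std::size_t dst_cap,
                              const char *src, std::size_t count)
{
    if(count == 0)
        return 0;                     // src and dst may be null here
    if(count <= dst_cap)
        std::memcpy(dst, src, count); // both pointers are expected to be valid
    return count;                     // the caller compares this with dst_cap
}
```

With the check in front, the optimizer has no call from which to infer that a null pointer is non-null, so a later (src == nullptr) test keeps its meaning.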
I'm curious. Why is it important, what is the use-case warranting that distinction?
Wanting a default-constructed Consider this:
TEST(empty_scalar, std_string)
{
std::string stdstr;
csubstr stdss = to_csubstr(stdstr);
csubstr nullss;
ASSERT_EQ(stdss, nullptr);
ASSERT_EQ(stdss.str, nullptr);
ASSERT_EQ(stdss.len, 0u);
ASSERT_EQ(nullss, nullptr);
ASSERT_EQ(nullss.str, nullptr);
ASSERT_EQ(nullss.len, 0u);
Tree tree = parse_in_arena("{ser: {}, eq: {}}");
tree["ser"]["stdstr"] << stdss;
tree["ser"]["nullss"] << nullss;
tree["eq"]["stdstr"] = stdss;
tree["eq"]["nullss"] = nullss;
EXPECT_EQ(emitrs_yaml<std::string>(tree),
"ser:\n"
" stdstr: \n"
" nullss: \n"
"eq:\n"
" stdstr: \n"
" nullss: \n"
);
}
This is the current situation. Everything is null, including a default-constructed
EXPECT_EQ(emitrs_yaml<std::string>(tree),
"ser:\n"
" stdstr: ''\n"
" nullss: \n"
"eq:\n"
" stdstr: ''\n"
" nullss: \n"
);
I think what you called "my way" is the correct way.
But the yaml standard says:
So these asserts are wrong imo:
ASSERT_EQ(stdss, nullptr);
ASSERT_EQ(stdss.str, nullptr);
My use case: The yaml I emit is read by a different program, which checks the datatype. Some fields must be strings. However, sometimes these strings are empty. The way I see it, currently there is no way to emit an empty string using rapidyaml?
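The requested behavior can be sketched independently of rapidyaml. The StrView type and the emit_keyval function below are hypothetical and only illustrate the rule under discussion: a null value emits nothing after the colon, while a non-null zero-length value emits an explicit pair of single quotes, so that a reader checking datatypes sees a string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a non-owning string view (not rapidyaml's csubstr).
struct StrView
{
    const char *str;
    std::size_t len;
};

// Hypothetical emitter for a single "key: value" line.
inline std::string emit_keyval(const std::string &key, StrView val)
{
    std::string out = key + ": ";
    if(val.str != nullptr)
    {
        if(val.len == 0)
            out += "''";                  // empty string: explicit quotes
        else
            out.append(val.str, val.len); // plain scalar, no quoting logic here
    }
    // a null value leaves nothing after the colon
    out += '\n';
    return out;
}
```

This sketch ignores quoting and escaping of non-empty scalars, which a real emitter has to handle.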
Understood. Let me think about it. I do agree with the rust note, but I want to give some thought to the several issues. But whatever happens, whatever defaults end up in rapidyaml, you should be aware that you need not be constrained by those choices. Eg, even if rapidyaml would have the incoming For example, you could do something like this in your code (assuming you're using <<):
node << nonnull{str};
where
struct nonnull { std::string const& subject; };
// Returns NodeRef by value: a reference to the by-value parameter would dangle.
NodeRef operator<< (NodeRef node, nonnull nn)
{
    if(nn.subject.empty())
        node << csubstr(""); // use a non-null, zero-length string
    else
        node << nn.subject;
    return node;
}
Of course, then you would need to remember to ensure use of
template<class T>
C4_ALWAYS_INLINE void myserialize(NodeRef &n, T const& var) { n << var; }
C4_ALWAYS_INLINE void myserialize(NodeRef &n, std::string const& var) { ... }
// then
myserialize(n, str);
myserialize(n, intvar);
Or you could choose to not use any of Or you could even ensure that I'm highlighting this because it is a conscious design decision for rapidyaml that although it may provide the basic facilities, it is not a framework forcing you to use its approach. It is not a pact with the devil; you are not selling your soul if you choose to use it. The point of rapidyaml is that it should be very easy (and fast) to do what you need, but if it isn't, you still have the freedom to do what's best for you. And if there is something constraining you for which there is no recourse, I would consider it a design bug of rapidyaml.
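The overload-based dispatch suggested above can be tried without rapidyaml. The sketch below is only an illustration under stated assumptions: FakeNode is a hypothetical recorder standing in for NodeRef, and my_serialize is a hypothetical name; the body of the std::string overload is one possible choice, since the original leaves it elided.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a node; it just records what was written to it.
struct FakeNode
{
    std::vector<std::string> written;
};

// Generic path: format arithmetic values as-is.
template<class T>
void my_serialize(FakeNode &n, T const& var)
{
    n.written.push_back(std::to_string(var));
}

// std::string path: the non-template overload is preferred for std::string
// arguments, and writes empty strings as an explicit pair of quotes.
inline void my_serialize(FakeNode &n, std::string const& var)
{
    n.written.push_back(var.empty() ? std::string("''") : var);
}
```

Note that a string literal argument would select the template rather than the std::string overload, so callers of a scheme like this need to pass std::string objects or add an overload for character arrays.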
I checked and apparently since c++11 it is no longer undefined behavior to call
So I will change |
fixes #263