Skip to content
This repository has been archived by the owner on Oct 15, 2024. It is now read-only.

Binary metadata vs flag #4194

Closed
kodebach opened this issue Nov 25, 2021 · 13 comments
Closed

Binary metadata vs flag #4194

kodebach opened this issue Nov 25, 2021 · 13 comments

Comments

@kodebach
Copy link
Member

kodebach commented Nov 25, 2021

We already had this discussion once (*), but I think we need to have it again before we finish the Rust implementation or release 1.0

Current situation

Currently, a binary key is identified by the presence of the meta:/binary meta key. This is a very simply solution, but it has many problems IMO.

Problems

Side effects

Certain functions modify metadata as a side-effect. Apart from the possible performance and memory issues (which are discussed below), this also means that KEY_FLAG_RO_VALUE and KEY_FLAG_RO_META are tied together for binary keys.

Memory and performance impact

Checking if a metakey is present may not be significantly slower than checking for a flag (*), but everything else is very problematic.

Setting a metakey (that isn't present already) in the best case means creating a new Key and doing a ksAppendKey. Even without any benchmarks I can guarantee that this is significantly slower than the single x86 instruction (**) you need to set a bit flag.
In the worst case, we also need to create the metadata KeySet, before we can add the the new Key. This adds even more mallocs.

And then there is the memory impact. A new flag would use a single bit in theory and no memory at all in practice, because we still have space in the current flags field. The metakey on the other hand needs memory for both the Key itself and the KeySet containing it. In the best case, this means we use

// Key + KeySet + Key* in KeySet
sizeof (struct _Key) + sizeof (struct _KeySet) + sizeof (Key *) = 64 B + 64 B + 8B = 136 B

However, in the current setup a KeySet will always allocate at least 16 elements in it's array, so we actually use 256 B of memory to store 1 bit of information. In other words, we use 2048 times more memory than necessary.

This may not matter, if binary keys are rare. But if a user creates a simple config with 16 binary keys and no metadata, this will use 4096 B of memory just for the meta:/binary keys. To make this even more clear: We are only storing 16 bits (= 2 B) of information here. We are wasting 4094 B or 99.95 % (Note: actually we are wasting all 4096B because the flag doesn't need additional memory).
Of course the ratio will improve as soon as other metadata is used (then we "only" waste ~72 B per Key). But I would argue that actually storing binary data in the KDB is not that common, while using a binary Key internally as a generic wrapper for data is quite common. For example, we use it for every plugin's contract (via the KEY_FUNC shortcut). Such keys will almost never use any actual metadata. For all of these keys, we are wasting huge amounts of memory.

(*) I suspect the difference will still be significant. Checking the bit-flag should only be 1-2 instructions.
(**) RISC ISAs will need ~3 instructions (load, or, store)

Metadata vs Data

IMO meta:/binary is not actually metadata, but part of the data itself.

The distinction between metadata and data (in my mind) is that, if you strip away all metadata the data itself can still be interpreted and does not loose its meaning. Without the metadata some information may be lost, but the core information must remain.

As an analog take a JPEG. There is some raw pixel data in the file. Alone it doesn't make much sense, so we also store the width and height of the image (and various bits of compression data, but let's ignore that). The raw pixel data, together with width and height is enough to display the image. So this is the data.
Some JPEGs also have an EXIF section, which may contain the DPI value of the image. This can be useful, in certain contexts (e.g. when printing the image), but the core information (i.e. the image itself) remains the same, whether we know the DPI or not. Therefore, the DPI is metadata.

Now, imaging we have a binary key in Elektra and remove all metadata. Suddenly, it is no longer a binary key. If its value contains a zero byte, it will be interpreted totally differently and we loose core information. Therefore, whether or not a key is binary cannot be metadata.

Exposing implementation details

The fact that in Elektra Key can contain a string value or binary value is part of the public API. However, how a Key is marked as string/binary should be up to the implementation. It is an implementation detail, that internally we just use a union and a single pointer together with some other tag. In another language you might want to use another mechanism (e.g. inheritance/subtyping, sum types, etc.). The fact that all implementations are forced to use the meta:/binary metakey as a marker, exposes an implementation detail.

Broken API

You can do this:

char value[] = "some\0data";
Key * k = keyNew ("user:/foo", KEY_BINARY, KEY_SIZE, sizeof(value), KEY_VALUE, value, KEY_END);
keySetMeta (k, "meta:/binary", NULL);

This is simply wrong and broken. It shouldn't be possible to change the value of a key, by changing it's metadata. But here we remove meta:/binary and therefore the data part of the value will be ignored by anyone using k. Effectively the value is now just some.

Solution

The solution to all of this is incredibly simple. We don't use meta:/binary anymore and instead use a new KEY_FLAG_BINARY.

Notes

Many of the arguments above, can also be made for meta:/type. Especially, when the high-level API is considered. However, since we also store numbers as strings, the type of the key is actually metadata. The core information remains the same. Only in certain contexts (i.e. when converting the key value to a native type like int or double) does the interpretation change.


(*) I didn't find the original issue, if someone does, please link it

cc @markus2330 @lawli3t

@markus2330
Copy link
Contributor

markus2330 commented Nov 28, 2021

I agree the binary keys are a bit awkward as two different interfaces are mixed. Using KEY_FLAG_BINARY, however, does not solve the issue: if you have an interface to unset flags, you create the same problem.

Solutions would be:

1, disable the possibility to remove meta:/binary
2. remove from the API the automatic setting of meta:/binary, KEY_BINARY etc.; and write in the docu that API users are responsible to mark binary keys accordingly

These are the two extremes: 1. basically restricts users more so that they cannot do anything wrong, 2. basically allows this usage but makes it less awkward as the API does not automatically do anything

I don't agree that it is not metadata, imho it is perfectly metadata: it describes how the value should be interpreted., e.g.

value: 5, meta:/type=octet without meta:/binary we interpret the value as 5
value: 5, meta:/type=octet with meta:/binary we interpret the value as 53

@kodebach
Copy link
Member Author

Using KEY_FLAG_BINARY, however, does not solve the issue: if you have an interface to unset flags, you create the same problem.

Yes, the core issue is that you can remove the "binary marker" (whatever that is). Switching to a flag is how I would solve this problem, because:

  1. Disallowing manual modification of meta:/binary creates a weird inconsistency in the API. You can add/remove any other metakey, so a user might wonder why that is not the case for meta:/binary. The easiest way to prevent this confusion is to use a different mechanism for the "binary marker".

    Note: This solution would make it even clearer that meta:/binary is not like other metadata and therefore should be considered something different. This just reinforces my "meta:/binary is not metadata" argument.

  2. Flags are already something that isn't fully transparent to the user. For example, users have no direct control over KEY_FLAG_SYNC, and only limited control over KEY_FLAG_RO_NAME.

These are the two extremes

Yes, and both are equally bad. As stated above, 1 creates an inconsistent API (not all metadata is equal), and 2 almost removes the idea of string vs. binary keys entirely. The worst thing would be that mean that this very innocent piece of code would fail:

Key * k;
keySetBinary (k, "abc", 4);
char buf[4];
keyGetBinary (k, buf, 4); // fails, because k is not marked as binary

Removing the automatic handling of the "binary marker" would only work, if we remove the distinction between string and binary keys entirely.

I don't agree that it is not metadata, imho it is perfectly metadata: it describes how the value should be interpreted., e.g. [...]

Regarding your example: meta:/type and meta:/binary are not compatible. For anything other than meta:/type = any the type plugin expects an actual string key, or simply interprets the value as a string regardless of what the key is marked as.

IMO meta:/binary only becomes metadata, if the API has zero knowledge of string vs. binary. So the API for the key value would just be:

const void *keyValue (const Key *key);
ssize_t keyGetValueSize (const Key *key);

ssize_t keySetValue (Key *key, const void *newValue, size_t valueSize);

However, the moment we automatically set the marker OR check the marker for any reason, it is no longer truely metadata.


PS. You only addressed the "Metadata vs. data" and "Broken API" issues. I also added a section about the performance issues to the top post. I originally ignored performance, but I now think at least the memory impact can be a big problem. In fact the memory impact of meta:/binary is probably second only to the impact of storing key names escaped and unescaped. In any case, the memory wasted by meta:/binary is trivial to reclaim compared to the other memory problems we have.

@markus2330
Copy link
Contributor

I don't like the flag because it bakes a rarely-used feature even more strongly into Elektra. This is also why I did not respond to the memory concerns: the feature does not need any memory if not used.

So the API for the key value would just be:

I like it, maybe better we use char, as strings are the common way to be used as value:

const char *keyValue (const Key *key);
ssize_t keyGetValueSize (const Key *key);

ssize_t keySetValue (Key *key, const void *newValue, size_t valueSize);

Then we can have a feature where valueSize == 0 automatically determines the length by strlen(newValue).

BUT: this would be a huge change...

@kodebach
Copy link
Member Author

I don't like the flag because it bakes a rarely-used feature even more strongly into Elektra

I'd argue that

  1. Having a separate keySetBinary function already makes binary keys explicitly part of the API.
  2. Using a flag actually removes some of the strong ties between this API-side idea of "binary keys" and the actual implementation. (see "Side effects" and "Exposing implementation details" above)

However, in principle I agree that rarely used features should not affect the API (if possible).

the feature does not need any memory if not used.

But it is used; extensively and within a very core part of Elektra. Every plugin creates a keyset with binary keys for its contract. There are also other more internal uses, but this case is very much a public API. So we can't really call binary keys a "rarely-used feature".

I like it

You misunderstood. The code snippet, was NOT a proposal for a change. I just wanted to explain that this would be the only case, in which I'd consider meta:/binary to be actual metadata.

maybe better we use char, as strings are the common way to be used as value:

IMO const char * implies string data in C, while const void * clearly says "any data". Also const char * s = keyValue (k) works, even if keyValue returns a const void *. In fact that's why you use void *, because it can be implicitly cast to any pointer type.

Then we can have a feature where valueSize == 0 automatically determines the length by strlen(newValue).

We can still keep keySetString as an alias in this case:

ssize_t keySetString (Key * key, const char * newString)
{
    return keySetValue (key, newString, 0);
}

BUT: this would be a huge change...

Yes, it would be a massive time investment. I agree that the simple keyValue API would be better, if we were writing Elektra from scratch, but changing it now is not really feasible, unless we also get rid of 80% of plugins and bindings, to limit the work.

So in the end, I still think KEY_FLAG_BINARY is the best solution. It avoids all the problems I mentioned. The change is almost trivial (maybe there is some code that manually checks meta:/binary without keyIsBinary). And most of all, I can't think of any concrete problems with KEY_FLAG_BINARY, unless you can explain in more detail why and how "it bakes [the] feature even more strongly into Elektra".

@markus2330
Copy link
Contributor

I agree that the API in its current form would work better with a KEY_FLAG_BINARY. This is also because the API was designed with a KEY_FLAG_BINARY, metadata was introduced later.

"it bakes [the] feature even more strongly into Elektra".

Because it would need a special handling for binary keys at many places, in particular in storage plugins. Currently xmltool, dump and so on "simply work" as they write out value&metadata, so binary keys are supported without any additional code.

@kodebach
Copy link
Member Author

Because it would need a special handling for binary keys at many places, in particular in storage plugins

That is a pretty weak argument IMHO, considering that storage plugins are already very complex. Adding this little bit of extra code to the very few plugins that don't already need to handle binary keys separately wouldn't make much of a difference.

many places

Do you have more than one example for these "many places"?

I can't think of another use case, where binary vs. string makes a difference, but you don't need extra code. The only thing that comes to mind is keyCopy (d, s, KEY_CP_ALL), where copying the binary marker is already handled by copying the metadata. But that doesn't apply because we also have keyCopy (d, s, KEY_CP_VALUE), where we need to copy the marker separately anyway.

Essentially, the only issue would be if (in some form) A gives a key value (and size) and key metadata, but not the whole Key, to B and B must be know whether the key value is binary or not. This is such a niche case, that I don't think it does come up anywhere but a storage plugin.

Currently xmltool, dump and so on "simply work" as they write out value&metadata, so binary keys are supported without any additional code.

This is not true. dump already handles binary keys differently and in fact the code would work without change for KEY_FLAG_BINARY:

std::string type;
if (binary)
{
type = "binary";
}
else
{
type = "string";
valuesize -= 1;
}
os << "$key " << type << " " << namesize << " " << valuesize << std::endl;
if (namesize > 0)
{
os << &keyName (cur)[rootOffset];
}
os << std::endl;
if (binary)
{
os.write (static_cast<const char *> (keyValue (cur)), valuesize);
os << std::endl;
}
else
{
os << keyString (cur) << std::endl;
}

As to xmltool, if it doesn't handle binary values separately it would be a bug. XML is a text based format, we cannot simply write out binary data. In fact we can't even write out string data without escaping certain special characters. So if xmltool effectively does write (fd, keyValue (k), keyGetValueSize (k) - 1), it would be seriously broken.

The only two storage plugins I know of, where string vs. binary value could be ignored, are mmapstorage and quickdump. In mmapstorage flags are saved as well, so that's not a problem and quickdump does something similar to dump:

if (keyIsBinary (cur))
{
if (fputc ('b', file) == EOF)
{
fclose (file);
return ELEKTRA_PLUGIN_STATUS_ERROR;
}
kdb_unsigned_long_long_t valueSize = keyGetValueSize (cur);
char * value = NULL;
if (valueSize > 0)
{
value = elektraMalloc (valueSize);
if (keyGetBinary (cur, value, valueSize) == -1)
{
fclose (file);
elektraFree (value);
return ELEKTRA_PLUGIN_STATUS_ERROR;
}
}
if (!writeData (file, value, valueSize, parentKey))
{
fclose (file);
elektraFree (value);
return ELEKTRA_PLUGIN_STATUS_ERROR;
}
elektraFree (value);
}
else
{
if (fputc ('s', file) == EOF)
{
fclose (file);
return ELEKTRA_PLUGIN_STATUS_ERROR;
}
kdb_unsigned_long_long_t valueSize = keyGetValueSize (cur) - 1;
if (!writeData (file, keyString (cur), valueSize, parentKey))
{
fclose (file);
return ELEKTRA_PLUGIN_STATUS_ERROR;
}
}

There may be another example, but AFAIK all other formats are text-based and therefore already need to encode binary values somehow, if they support them at all. Even if there were such a plugin, I still think the argument "we can't change the core API, because a handful of plugins need to change" is pretty weak.

@markus2330
Copy link
Contributor

It demonstrates very well how different mindsets on these issues can be and how they reflect on code. It looks like you already implemented dumpv2 as-if binary would not be metadata. dump.cpp now has 589 lines, a few years ago it had only 300 lines.

In my mindset, binary differs in

  1. that it might contain null bytes
  2. that the value might be NULL
  3. the interpretation of values

With 3. being the most important point but we currently do not specify how the types are actually interpreted as the feature is not used (except of storing some function pointers).

Obviously, XML always needs escaping, for both string and binary values.

I cannot follow the argument "it is already complicated, so let us make it more complicated" and prefer the mindset "it is complicated, so let us make it more simple".

Imho to have binary as metadata made handling binary keys more uniform, especially in serialization/deserialization which is a bigger topic and not only storage plugins. Also UIs need less special treatment for binary keys. I agree it is not perfect, especially not in the API.

Is your point that the API cannot be fixed because binary is fundamentally not metadata?

@kodebach
Copy link
Member Author

dump.cpp now has 589 lines, a few years ago it had only 300 lines.

That doesn't mean much. This could also just be formatting changes (e.g. single line if without braces vs with braces). But yes, dump v2 is a bit more complex. Most of this is however, the code handling the fallback to v1. unserialiseVersion2 and unserialiseVersion1+decodeLine are about the same size.

that it might contain null bytes

IMO this is the main difference.

that the value might be NULL

A secondary restriction to make working with string values easier. But not really relevant in practice, since you can use the empty string "" as a null-value in most cases. Behind the scenes both are stored as a NULL pointer.

the interpretation of values

The interpretation of values isn't fully fixed for string values either. But yes, a string value can safely be treated as a string. A binary is opaque unless you have additional context.

I cannot follow the argument "it is already complicated, so let us make it more complicated" and prefer the mindset "it is complicated, so let us make it more simple".

In principle I agree, however, in the specific case of storage plugins it is not possible to "make it more simple". A lot of the complexity comes from parsing/writing the storage format and the subtle (or sometimes not so subtle) differences and conflicts between the storage format and Elektra's internal format.

Imho to have binary as metadata made handling binary keys more uniform, especially in serialization/deserialization which is a bigger topic and not only storage plugins. Also UIs need less special treatment for binary keys. I agree it is not perfect, especially not in the API.

I think this is a false conclusion. It seems to be more uniform, but it actually causes subtle issues and often doesn't make things easier. For example, a UI can't display non-print characters without extra code anyway, so it will always need to check for binary values. The fact that in one case, you get an automatic "visual" indicator for a binary key because you display meta:/binary doesn't help much.

Is your point that the API cannot be fixed ...

I actually don't have a problem with the API, but with the (leaky) implementation. keySetString, keySetBinary, etc. are fine, using meta:/binary for the distinction creates problems (as described above).

... because binary is fundamentally not metadata?

I'd say it is not metadata, IF we keep the current API. Fundamentally, the distinction between data and metadata depends on the context IMHO. For example, in the high-level API meta:/type is not really metadata either. But this creates less issues, and because we depend on the low-level API, can't be changed anyway.

Most of all, I don't think it is worth sacrificing large amounts of memory (and some performance) and accepting the "Side effects" and "Broken API" issues, as a trade of for "slightly more uniform and sometimes easier to handle".

@markus2330
Copy link
Contributor

markus2330 commented Nov 30, 2021

UIs always need to do escaping of various escape characters, the null character is only one of many.

Imho removing keyIsBinary, keyIsString, keySetString and keySetBinary would be the simplest solution. People simply add the metadata themselves as needed. Then there won't be any expectation that removing the metadata converts the key to non-binary.

The main argument for keeping meta:/binary is that currently no flags need to be persisted. With KEY_FLAG_BINARY we would have a flag that needs persistence, creating an inconsistency which is difficult to understand and document. Inconsistencies in the APIs are ugly but at least you have a clear place where to document it. Inconsistencies in concepts (like metadata and flags) are imho much more problematic.

If the binary data in /elektra/modules is a problem, we should fix it in the backend API: stop passing it by contract and instead only by ELEKTRA_PLUGIN_EXPORT. But first we should do a benchmark if normal applications are effected by this at all. And this is probably a 2.0 topic anyway.

@kodebach if you don't fundamentally disagree, I would close this issue and create a new issue to remove keyIsBinary, keyIsString, keySetString and keySetBinary and automatic (un)setting of meta:/binary.

@lawli3t What are your thoughts on that issue?

@kodebach
Copy link
Member Author

kodebach commented Dec 1, 2021

As far as I can tell, your main problem is that to you conceptually a "flag" is an in-memory-only thing. The fact that a value is a string/binary is clearly not in-memory-only, so it cannot be a flag. However, consider it like this: whether a value is string or binary is part of the value itself, without this information the value is incomplete. Now it is obvious that this distinction must be preserved with the value (a fact that is not obvious with meta:/binary, see "Side effects").

It seems to me that once again we have a debate of philosophical reasons vs. technical reasons. With the key names I could accept the technical issue that come from allowing any key name part whatsoever, but here I feel like the technical issues are far too much to accept for philosophical reasons.

More detailed answers below:


Imho removing keyIsBinary, keyIsString, keySetString and keySetBinary would be the simplest solution.

But IMHO this would only make sense, if the new "default" is binary, i.e. a key can contain any chunk of bytes. How these bytes should be interpreted is not defined. Metadata can be used to give some specs.

Also this solution (even if the "default" remains string values) is the simplest in principle, but requires by far the biggest time investment.

The main argument for keeping meta:/binary is that currently no flags need to be persisted. With KEY_FLAG_BINARY we would have a flag that needs persistence, creating an inconsistency which is difficult to understand and document.

IMO flags are an implementation detail. I wouldn't say KEY_FLAG_BINARY needs to be persisted. Instead I'd simply say a key can have either a string value or a binary value (which is obvious from the API, keySetString and keySetBinary both exist). This fact must be persisted by storage plugins and remembered by any implementation of a key. How this is done doesn't matter.

Inconsistencies in concepts (like metadata and flags) are imho much more problematic.

Again, I'd say flags are an implementation detail. The binary vs. string thing to me is actually an attribute of the key value. It should therefore be documented wherever key values are explained (which probably already contains a note about string vs. bianry).

If the binary data in /elektra/modules is a problem, we should fix it in the backend API: stop passing it by contract and instead only by ELEKTRA_PLUGIN_EXPORT. But first we should do a benchmark if normal applications are effected by this at all.

  1. How would you fix this? We use binary keys in a KeySet to allow exporting arbitrary functions.
  2. Last time we talked about this we wanted to go the other way. Only export via the contract to avoid the duplication and possible inconsistencies.
  3. It probably doesn't matter too much (maybe in embedded applications), but it is still ridiculous to me that we use 256 B of memory to store a single bit of information.

f you don't fundamentally disagree, I would close this issue and create a new issue to remove keyIsBinary, keyIsString, keySetString and keySetBinary and automatic (un)setting of meta:/binary.

I don't think this is the best (in terms of cost vs. benefit) solution here. That is unless, you are volunteering to actually do the PR for this massive change.

Like I said, I'm fine with just a simple keyValue/keySetValue API, but the time required for this change is just not feasible IMHO, when the alternative (the flag) could be implemented within a few hours and has basically no technical downsides.

And this is probably a 2.0 topic anyway.

Not sure what part exactly you are referring to, but the root issue of meta:/binary should not be postponed until after 1.0 IMO. It is simply to big a change, with too far reaching consequences to tackle when application might already rely on a stable API for Elektra.

Changing just the /elektra/modules thing after 1.0 would be fine, but I don't even know how to fix it now (without getting rid of `meta:/binary).

@markus2330
Copy link
Contributor

But IMHO this would only make sense, if the new "default" is binary

Good idea, I totally agree that this the current situation is conceptually wrong. I added a draft for a proposal in #4201 (to be discussed on Monday).

How would you fix this? We use binary keys in a KeySet to allow exporting arbitrary functions.

Did you check which test breaks if we do not export them anymore? I think it is a very rare feature to actually call such functions.

Anyway, I don't think we should stop offering binary keys.

Last time we talked about this we wanted to go the other way. Only export via the contract to avoid the duplication and possible inconsistencies.

Unfortunately we didn't write a decision about this. Do you remember where we discussed this? It would be good if you can add this to https://github.com/ElektraInitiative/libelektra/tree/new-backend

It probably doesn't matter too much (maybe in embedded applications), but it is still ridiculous to me that we use 256 B of memory to store a single bit of information.

This doesn't need to be that way, it would be possible to have "special static keys" like type = string which are reused for all keys. (The pointer still needs 4 bytes, though.)

Like I said, I'm fine with just a simple keyValue/keySetValue API

👍

when the alternative (the flag) could be implemented within a few hours and has basically no technical downsides.

Complexity is a big technical downside. The binary/string API is unnecessarily difficult to use and has many weird problems like #3797.

@kodebach
Copy link
Member Author

Did you check which test breaks if we do not export them anymore? I think it is a very rare feature to actually call such functions.

There are at least the checkconf and checkfile functions that are used by the mount tooling. Also we did have the idea of removing the ELEKTRA_PLUGIN_EXPORT exported symbol with just a KeySet. Then every plugin would rely on this functionality.

Do you remember where we discussed this?

I think it was just a sidenote in one of the comments in #3693. IIRC we more or less agreed the issue is independent of the new backend system and we postponed it.

This doesn't need to be that way, it would be possible to have "special static keys" like type = string which are reused for all keys. (The pointer still needs 4 bytes, though.)

Of course hacky optimizations are possible. But such work around make things a lot more complicated to think about, required a lot more code, create more potential for errors, etc.

Complexity is a big technical downside.

I don't see how the flag would be any more complex than the metakey. The complexity is in different places, but it is there in both cases.

Like I said, I'm fine with just a simple keyValue/keySetValue API

+1

Please note, because of the enormous effort involved in actually implementing this change, I still prefer the solution with a flag. Even though that is not the optimal solution.

@markus2330
Copy link
Contributor

As discussed, see #4201

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants