CBOR: Data model serialization in Rust #323
You could consider dropping everything but double precision, even on decode. Our ideal is to have just one way, although our current decoders are quite sloppy in what they expect. The JavaScript encoder currently does a smallest-possible float encode while Go does an always-64-bit one. We're moving to the latter (as per the recent revision to the DAG-CBOR spec) and we don't anticipate there's a whole lot of DAG-CBOR in the wild with floats anyway, so hopefully this isn't a pain for anyone. My advice on building a new DAG-CBOR implementation is to make it minimal and strict and see how far it gets you before it breaks on real-world data. I would hope that you wouldn't find any breakage, and as we move our other encoders to a stricter form the odds will just improve over time.
Your Data Model aligns with the one outlined here https://github.com/ipfs-rust/rust-ipld/blob/a2b6f30631bfd45a0ffdc78a9aa7e12c2b809f1e/core/src/ipld.rs#L8. The only difference is the naming of one thing
@volker thanks for this reference. Actually, thanks for the many references ;) I am making this quick draft implementation to understand the specs in finer detail. Other than that, nothing much.
In situations similar to IoT, where small embedded devices are expected to communicate using DAG-CBOR, do we assume double-precision float on those devices? It might be okay to expect low-end devices to fall short in double-precision support, but for an adequately powered computing node, ignoring single-precision might be a bad excuse. Other than this I prefer the float-64-only design.
Do we have real-world data, say from golang, that we can work with?
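As a concrete illustration of the always-64-bit rule discussed above, here is a minimal sketch (not from the draft implementation; the function name is hypothetical) of encoding a float as CBOR major type 7 with additional info 27, which is the single-byte header 0xfb followed by the big-endian IEEE 754 bits:

```rust
// Sketch: always encode floats as 64-bit CBOR, matching the revised
// DAG-CBOR rule. 0xfb = major type 7 (0xe0) | additional info 27 (0x1b).
fn encode_f64(v: f64, buf: &mut Vec<u8>) {
    buf.push(0xfb);
    buf.extend_from_slice(&v.to_bits().to_be_bytes());
}

fn main() {
    let mut buf = Vec::new();
    encode_f64(1.5, &mut buf);
    // 1.5 has IEEE 754 bits 0x3FF8000000000000
    assert_eq!(buf, vec![0xfb, 0x3f, 0xf8, 0, 0, 0, 0, 0, 0]);
}
```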
Yes, I definitely feel it is overkill. Actually, thanks for bringing it up. Found this in the CBOR spec,
From the above description we understand that, if both Major-type-0 and Major-type-1 are to be used, we are encouraged to use them as signed-64. And maybe, if only Major-type-0 is to be used, it is left to the application's choice.
But the description in the main part of the spec is not clear about this. Maybe it is the bane of designing a serialization spec: we have to keep it a little open-ended. I will book this as a TODO for now.
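The signed-64 interpretation sketched above can be written out as follows (the helper name is hypothetical, not from the draft implementation): major type 0 carries the value directly, major type 1 carries -1 - n on the wire, and restricting the result to i64 rejects unsigned values beyond i64::MAX.

```rust
// Sketch of decoding CBOR major types 0 and 1 into a signed 64-bit value.
// Major type 0: value is n. Major type 1: value is -1 - n.
fn decode_int(major: u8, n: u64) -> Option<i64> {
    match major {
        0 => i64::try_from(n).ok(),
        1 => i64::try_from(n).ok().map(|v| -1 - v),
        _ => None,
    }
}

fn main() {
    assert_eq!(decode_int(0, 10), Some(10));
    assert_eq!(decode_int(1, 9), Some(-10));
    // unsigned values above i64::MAX are rejected under this policy
    assert_eq!(decode_int(0, u64::MAX), None);
}
```

Note that -1 - i64::MAX is exactly i64::MIN, so the full signed-64 range is still reachable under this restriction.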
Yeah, some of these details get a bit too in-the-weeds for the data model and we have to be careful not to drive the data model according to one codec's quirks (or one language's quirks!). It's a juggling act of compromise everywhere. If you read https://github.com/ipld/specs/blob/master/block-layer/codecs/codecs-and-completeness.md carefully you'll see that basically nothing we have now is up to standard, and in some places it's not even clear what the standard is (let's not even get into data model map "ordering" questions ...).
As a tree of bullet points, forgive me if it's terse:
A trait in Rust is a type-system-only concept. But there is something called
I am yet to study the higher-level abstractions in IPLD. But I think I get it.
Sounds like a good idea.
+1
Rust error handling is cooler than in most other languages. With some glue logic we can chain errors across components, and the only noise we might see in our functional logic is the
+1 Overall I agree with the idea of keeping Kind as an abstract type. Now the implementation has two choices:

a. use type-parameters and define traits that match the operational behaviour of Kind:

```rust
enum Kind<V, M>
where
    V: Node,
    M: Node,
{
    Null,
    Bool(bool),
    Integer(i128),
    Float(f64),
    Text(String),
    Bytes(Vec<u8>),
    Link(Cid),
    List(V),
    Dict(M),
}
```

b. use trait-objects.

Leaning towards (a) now. Also keeping Text as String; the String's underlying value can be either checked-utf8 or unchecked binary. An API can be added on top of Kind for utf8 checks and other type-conversions. I think I might have to re-visit all this after studying more on IPLD. Thanks for the comments.
My bad, I didn't realize that I would end up with an HKT issue with (a), that is, parameterize
After looking at the

```rust
/// Kind of data in data-model.
pub enum Kind {
    Null,
    Bool,
    Integer,
    Float,
    Text,
    Bytes,
    Link,
    List,
    Map,
}
```

The IPLD basic data-model can be implemented as:

```rust
/// Basic defines IPLD data-model.
pub enum Basic {
    Null,
    Bool(bool),
    Integer(i128), // TODO: i128 might be an overkill, 8 more bytes than 64-bit !!
    Float(f64),
    Text(String),
    Bytes(Vec<u8>),
    Link(Cid),
    List(Box<dyn Node + 'static>),
    Map(Box<dyn Node + 'static>),
}
```

which is now truly a recursive type, where sub-trees are implemented as Node. The list/map lookup type (key) is implemented as a sum-type; this way we can keep it simple and at the same time give scope for future additions.

```rust
/// A subset of Basic, that can be used to index into recursive types, like
/// list and map. Can be seen as the path-segment.
#[derive(Clone)]
pub enum Key {
    Null,
    Bool(bool),
    Offset(usize),
    Text(String),
    Bytes(Vec<u8>),
}
```

The Key type implements

Nevertheless, the Cbor type still holds on to

Finally the Node interface{} is defined in Rust as:

```rust
/// Every thing is a Node, almost.
pub trait Node {
    /// return the kind.
    fn to_kind(&self) -> Kind;
    /// if kind is recursive type, key lookup.
    fn get(&self, key: &Key) -> Result<&dyn Node>;
    /// iterate over values.
    fn iter<'a>(&'a self) -> Box<dyn Iterator<Item = &dyn Node> + 'a>;
    /// iterate over (key, value) entry, in case of list key is index
    /// offset of value within the list.
    fn iter_entries<'a>(&'a self) -> Box<dyn Iterator<Item = (Key, &dyn Node)> + 'a>;
    /// if kind is container type, return the length.
    fn len(&self) -> Option<usize>;

    fn is_null(&self) -> bool;
    fn to_bool(&self) -> Option<bool>;
    fn to_int(&self) -> Option<i128>;
    fn to_float(&self) -> Option<f64>;
    fn as_string(&self) -> Option<Result<&str>>;
    fn as_ffi_string(&self) -> Option<&str>;
    fn as_bytes(&self) -> Option<&[u8]>;
    fn as_link(&self) -> Option<&Cid>;
}
```
I'm probably going to have to look at that repeatedly, but at first glance, I think that looks like it'll fly!
One more thought I hadn't mentioned earlier in this issue, but realized during other discussions today might be interesting/important... in IPLD Schemas, there is the ability to specify maps with typed keys and typed values. So, in golang, this ended up with us having the MapIterator types having a return signature of

The features IPLD Schemas introduced for typed keys go even further: schemas can have maps with complex keys, as long as they fit certain rules. Namely: you can use even structs and unions as map keys, on the condition that they have representation strategies which can reduce them to strings (because that means we can still serialize them when we get that data all the way to the bottom of the stack). So again, having map iterators use
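The rule that complex keys must reduce to strings can be sketched in Rust as a trait bound on the key type; the trait, the struct, and the comma-joined representation below are all hypothetical illustrations, not go-ipld-prime or the draft implementation.

```rust
use std::collections::BTreeMap;

// Any type usable as a typed map key must reduce to a String at the
// representation layer, so it stays serializable at the bottom of the stack.
trait StringRepr {
    fn repr(&self) -> String;
}

// A struct used as a map key, reduced to a string by joining its fields.
struct FullName {
    surname: String,
    given: String,
}

impl StringRepr for FullName {
    fn repr(&self) -> String {
        format!("{},{}", self.surname, self.given) // "," join is illustrative
    }
}

fn insert<K: StringRepr, V>(map: &mut BTreeMap<String, V>, key: &K, val: V) {
    map.insert(key.repr(), val);
}

fn main() {
    let mut ages: BTreeMap<String, u32> = BTreeMap::new();
    let key = FullName { surname: "Smith".into(), given: "Jane".into() };
    insert(&mut ages, &key, 30);
    assert_eq!(ages.get("Smith,Jane"), Some(&30));
}
```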
Any pointers, say in golang, where to find the schema type information? Maybe with some real examples I can understand this better.
Keeping the

```rust
pub enum Key {
    Bool(bool),
    Offset(usize),
    Text(String),
    Bytes(Vec<u8>),
    Keyable(Box<dyn ToString>),
}
```

Note that we are making ToString a hard constraint for

```rust
pub trait Node {
    fn as_key(&self) -> Option<Key>;
    // ...
}
```
Please note that the "strings" @warpfork mentions can contain arbitrary bytes. So in Rust it's a
@vmx Thanks for bringing it up. If

The way I understood:
I don't understand that question.
Yes. I think this is how it would need to be done. I had hoped the Data Model could just define Strings as being valid Unicode.
No, those always need to be bytes. There is currently an ongoing discussion on how exactly to specify it in a clean way, but those should really be bytes.
@warpfork Correct me if I'm wrong here; my understanding is that the primary reason for the

I'm not certain that this is necessary in Rust; we could take some inspiration from
I think it's similar to what rust-ipld is doing here with the

The DAG-CBOR implementation using those traits is here: https://github.com/ipfs-rust/rust-ipld/tree/a2b6f30631bfd45a0ffdc78a9aa7e12c2b809f1e/dag-cbor/src

You go straight from codec to native types and back.
I am using std::str::from_utf8_unchecked() so I guess it is similar to
Right now as_string() and as_ffi_string() do a positive match for

```rust
match self {
    Basic::Text(val) => Some(err_at!(FailConvert, from_utf8(val.as_bytes()))),
    Basic::Bytes(val) => Some(err_at!(FailConvert, from_utf8(val.as_slice()))),
    _ => None,
}
```
I am guessing the key is going to come from upper layers, say via a path-segment. I might visualize that as,
Note that in the above table we can replace utf8-bytes with 64-bit-uint-big-endian-bytes, but the situation is the same, which is: can we guarantee a consistent key index into the same map-value for all 4 combinations?

We need a precise definition for the key so that the incoming key and the map-keys fall within the same value-domain, aka type, aka kind, like utf8-bytes or u64-big-endian-bytes. And when we are dealing with mixed value-domains, we may have to define a sort order across value-domains (if ordering/iteration is to be implemented on the map).

Now, it is also possible to have something called collation-encoding, which is a form of encoding multiple types into a seamless collection of bytes. Please note that I am not assuming the map is ordered here, just that there is a collation algorithm across multiple kinds (especially those that are defined by the data-model) that not only gives point-lookup into the map, but can also maintain the map as an ordered set, if that is required, and it is byte-represented, memcmp-compatible and isomorphic.
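A collation encoding of that sort can be sketched over the Key kinds discussed in this thread: a leading type tag fixes the cross-kind sort order, and offsets are big-endian so byte order agrees with numeric order. The tag values and the kind ordering are illustrative; also note that variable-length text/bytes keys compare correctly in isolation but would need terminators or escaping if collated keys were ever concatenated.

```rust
pub enum Key {
    Null,
    Bool(bool),
    Offset(usize),
    Text(String),
    Bytes(Vec<u8>),
}

// Encode a Key into bytes whose memcmp order is a total order across kinds.
fn collate(key: &Key) -> Vec<u8> {
    let mut out = Vec::new();
    match key {
        Key::Null => out.push(0),
        Key::Bool(b) => {
            out.push(1);
            out.push(*b as u8);
        }
        Key::Offset(n) => {
            out.push(2);
            // big-endian so bytewise order matches numeric order
            out.extend_from_slice(&(*n as u64).to_be_bytes());
        }
        Key::Text(s) => {
            out.push(3);
            out.extend_from_slice(s.as_bytes());
        }
        Key::Bytes(b) => {
            out.push(4);
            out.extend_from_slice(b);
        }
    }
    out
}

fn main() {
    // numeric order survives the byte encoding
    assert!(collate(&Key::Offset(1)) < collate(&Key::Offset(256)));
    // cross-kind order is fixed by the tag: bool sorts before offset
    assert!(collate(&Key::Bool(true)) < collate(&Key::Offset(0)));
    // utf8 bytewise comparison matches code-point order
    assert!(collate(&Key::Text("ab".into())) < collate(&Key::Text("b".into())));
}
```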
BTW: I just re-read the CBOR spec about major types. The CBOR type for strings (major type 3, text strings) must be encoded as UTF-8. So even if the IPLD Data Model currently supports arbitrary bytes, CBOR does not. So in case you only care about CBOR encoding it might change things. |
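The strict behaviour this implies for a decoder can be sketched in one line: a major type 3 payload must pass UTF-8 validation, which std::str::from_utf8 performs without copying. The function name is hypothetical.

```rust
// Sketch: a strict decoder validates major type 3 (text string) payloads
// as UTF-8 instead of using from_utf8_unchecked.
fn decode_text(payload: &[u8]) -> Result<&str, std::str::Utf8Error> {
    std::str::from_utf8(payload)
}

fn main() {
    assert_eq!(decode_text(b"hello"), Ok("hello"));
    // 0xff can never appear in well-formed UTF-8
    assert!(decode_text(&[0x68, 0xff]).is_err());
}
```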
So, part of the reason for having a

I don't know exactly how some of this will translate to Rust, because I hear that traits are not quite the same as interfaces in other languages. But it's hard for me to imagine getting good foundations to build towards these higher-level features without having some kind of a unifying interface or contract like this, in some way or another.
This writeup focuses on the data-model definition in the Rust language and its (de)serialization in CBOR format. The draft implementation can be found here,
Data model
- kinds.
- treated as errors.
- Cid type is defined as an enumeration over cid-versions, hence it can be considered future proof, at least from the point of serialization and de-serialization.
- String as key.
- Additionally, equality operation due to the presence of floating-point.

Cbor serialization
This section is essentially going to carve out a subset of the CBOR specification that is required to have as much completeness and as much fittedness as possible, to say it in IPLD parlance.
- represent -(unsigned-64-bit-number + 1) that comes on the wire.
- strict sorting order for map-entries.
- tag 42 is used for IPLD links (cid).

A note on recursion: recursive types naturally fit recursive encoding and decoding implementations. But this can lead to stack-overflow issues if data-model values are recursively composed of a large number of list and dict values. A similar problem exists when deserializing incoming wire-data. To avoid this, either we have to convert the recursive encoding and decoding implementation into a loop, or we may have to add a cap on recursion depth.
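The recursion-depth cap can be sketched by threading a depth counter through the recursive encoder and failing once it passes a limit, instead of overflowing the stack. MAX_DEPTH, the Value type and the toy one-byte list header below are illustrative, not the draft implementation.

```rust
const MAX_DEPTH: usize = 256; // illustrative cap

enum Value {
    Int(i64),
    List(Vec<Value>),
}

// Recursive encode that errors out past MAX_DEPTH rather than recursing
// without bound.
fn encode(v: &Value, buf: &mut Vec<u8>, depth: usize) -> Result<(), String> {
    if depth > MAX_DEPTH {
        return Err("max recursion depth exceeded".to_string());
    }
    match v {
        Value::Int(n) => buf.extend_from_slice(&n.to_be_bytes()),
        Value::List(items) => {
            buf.push(items.len() as u8); // toy length header
            for item in items {
                encode(item, buf, depth + 1)?;
            }
        }
    }
    Ok(())
}

fn main() {
    let mut buf = Vec::new();
    assert!(encode(&Value::Int(7), &mut buf, 0).is_ok());

    // 300 levels of nesting trips the cap long before the stack does
    let mut deep = Value::Int(0);
    for _ in 0..300 {
        deep = Value::List(vec![deep]);
    }
    let mut buf2 = Vec::new();
    assert!(encode(&deep, &mut buf2, 0).is_err());
}
```

The loop-based alternative mentioned above (an explicit work stack) avoids the cap entirely, at the cost of a less direct implementation.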
Open points:

- we have an option of doing byte-sort or conforming to a collation spec.
- even if the same codec is used but implemented in a different language, can it be isomorphic?