Serialization for commonly used types #770

achirkin · 2022-08-02T13:40:50Z

This draft pull request adds a simple serialization to commonly used types in raft.

Serialized layout
Simple here refers to the way the data is stored: at the moment, no safety/type/versioning information is preserved; only a bare minimum of information is stored to a byte array and read back.

Target type
In the most basic form, the target of serialization is a raw pointer. Currently, I also provide overloads to store data to vector<uint8_t>.

ContextArgs... args
Plain structures and arithmetic types do not require any context to serialize/deserialize (zero args). The data on gpu devices needs, at least, a cuda stream (one arg). mdarrays / rmm vectors may need a memory resource to tell where to keep the deserialized live object (two args). To allow all of these in a single template, I chose to use a parameter pack in this iteration.

Closes: #752

trivialfis · 2022-08-02T13:46:58Z

Hi, have you considered using a schema like onnx that can be reused by others and provide version info? I think versioning is quite important if we want to keep some stability over serialized object.

mhoemmen

Thanks for asking me to review! I've marked the code that's definitely incorrect with [INCORRECT]. It will be wrong for layout_stride if ! is_exhaustive().

mhoemmen · 2022-08-02T14:37:35Z

cpp/include/raft/core/serialization.hpp

+ * @return the number of bytes (to be) written by the pointer
+ */
+template <typename T, typename... ContextArgs>
+auto serialize(uint8_t* out, const T& obj, ContextArgs&&... args) -> size_t


Would you consider using std::span as the output argument? This would better express how many elements the output array has.

template <typename T, size_t OutputExtent, typename... ContextArgs> auto serialize(std::span<uint8_t, OutputExtent> out, const T& obj, ContextArgs&&... args) -> size_t;

See C++ Core Guidelines F.24.

I'd be happy to add overloads to have span as the output, but raft is currently set to C++17. It's not available till C++20, right?

What if we returned a raft::span?

Ah, great, I can add overloads that accept raft::span!

What if we returned a raft::span?

It does not seem to have python bindings at the moment, right? I guess, we'd need at least one version of serialize that could be easily wrapped by cython for #752

It should be straightforward to expose raft::span in a pxd file. We can also rename the functions as needed (for e.g. to_span/from_span and to_bytes/from_bytes).

mhoemmen · 2022-08-02T14:56:23Z

cpp/include/raft/core/serialization.hpp

+ * @return the serialized state
+ */
+template <typename T, typename... ContextArgs>
+auto serialize(const T& obj, ContextArgs&&... args) -> std::vector<uint8_t>


Would you consider an interface that writes to existing storage, instead of performing a new allocation?

If you mean to perform a new allocation, would you consider letting the user optionally pass in a custom allocator?

The first option would imply a state machine that takes an output buffer, and returns the "rest" of the output buffer and whether it finished with output (or some other error occurred, like running out of space).

struct SerializeNoError; struct SerializeOutOfSpace { size_t remaining_space_required; } struct SerializeOtherError { /* ... */ }; using SerializeErrorCode = std::variant<SerializeOutOfSpace, SerializeOtherError>; struct SerializeState { std::span<uint8_t> rest_of_buffer; SerializeInternalStateRepresentation state; bool done = false; SerializeErrorCode error_code{SerializeNoError{}}; }; bool keep_going(const SerializeReturn& r) { return ! result.done && std::holds_alternative<SerializeNoError>(result.error_code); } template<typename T, typename... ContextArgs> serialize(SerializeState r, const T& obj, ContextArgs&&... args) -> SerializeState;

You then call it in a loop.

const size_t initial_buffer_size = 100; std::vector<uint8_t> buffer(initial_buffer_size); SerializeState state{span{buffer.data()}}; while(keep_going(state)) { state = serialize(state, obj, args...); };

The base implementation (to_bytes) at the moment writes to existing storage, but I don't have any checks whatsoever to see if there is enough space allocated. In the current state, the user is supposed to call serialize with nullptr to find out how much space it needs.

@achirkin Would you consider a design that names the space query as a separate function from serialization? Compare to cuSPARSE, which separates *_bufferSize functions from *_solve functions.

I'd be totally fine separating them, though for some reason I've had an impression that in raft/cuml we follow the CUB approach of using the same function. @cjnolet , which one do you prefer?

This one is hard. @mhoemmen, we've been trying to keep the APIs in RAFT a little more C++ friendly, so generally trying to avoid the two-stage compute buffer size -> perform computation where at all possible. I'm usually not particularly in love w/ returning objects from functions either, unless we are explicitly using something like a factory for a complex object, because then the user cannot control the memory allocation.

In this particular case, if we are only ever expecting the serialized buffer to be in host memory, I could see an argument for having serialize return a vector, or more preferably, accepting a vector and populating it. I think if I was to choose between the various options presented, I most like the idea of returning/populating some sort of streamable/iterable type if we can. I find that to be the most flexible design because a large stream can go right to disk and have to buffer only a small portion into memory at any given time. Further, the stream is also nice because you can collect it into a byte buffer at any point if needed.

@cjnolet wrote:

I think if I was to choose between the various options presented, I most like the idea of returning/populating some sort of streamable/iterable type if we can.

I do like the idea of writing to some kind of output range. If it's a random access range, it would be nice to test if it's long enough and report an error if not, rather than just clobbering it. This is always detectable, and is generally recoverable (allocate more memory) or at least worth reporting ("insufficient disk space").

Please do consider not allocating the return buffer each time. This would make it impossible to write sequential records to contiguous storage. It can also be surprisingly expensive in a tight loop.

Please do consider not allocating the return buffer each time. This would make it impossible to write sequential records to contiguous storage. It can also be surprisingly expensive in a tight loop.

I do wholeheartedly agree with this.

mhoemmen · 2022-08-02T15:05:56Z

cpp/include/raft/core/serialization.hpp

+ * @return the number of bytes read by the pointer
+ */
+template <typename T, typename... ContextArgs>
+auto deserialize(T* p, const void* in, ContextArgs&&... args) -> size_t


What happens if deserialization fails (e.g., because the representation is not correct)? How would you report the error?

It sounds like the intent is for deserialize to do placement new in p. Is that correct?

Would you instead consider an interface that writes to an existing T via T& out? The issue with a raw pointer interface is that users may forget the invariant "an uninitialized host pointer" and pass in a pointer to an existing object. In that case, your interface can totally break any invariants of T, e.g., possibly leaking any heap allocations that T might have inside.

At the moment, there are no safety checks at all. That being said, I think, we could add them in to_bytes/from_bytes, i.e. make it the responsibility of a person who adds serialization to a particular class - just throw one of the raft errors. What do you think? CC @cjnolet

Yes

I'd be fine with that, but I'm not sure how to do that better, while keeping the serial::from_bytes the same. Shall I add an overload that accepts T& out, manually calls its destructor and then passes the &out pointer further?.. sounds hackish.

In general, the reason why I'm hesitant to change serial::from_bytes to take a reference to an initialized object is that some of the classes in raft/cuml don't seem to have constructors that do nothing. On the one hand, the pre-initialization would likely mean some extra overhead and a minor nuisance of passing dummy arguments if there is no nullary constructor available. On the other hand, that would make impossible the value-returning overload of deserialize (deserialize(const std::vector<uint8_t>& in, ContextArgs&&... args) -> T) for the types that don't have the nullary constructor.

Shall I add an overload that accepts T& out, manually calls its destructor and then passes the &out pointer further?.. sounds hackish.

Please do not : - )

On the other hand, that would make impossible the value-returning overload ... for the types that don't have the nullary constructor.

It's possible if you have assignment, but ugly -- https://godbolt.org/z/774Ebfvh5 -- so I understand if you don't want to do it.

I meant to say, it would be impossible if I changed serial::from_bytes to accept the reference instead of the pointer (deserialize_raw in your example). Also, deserialize_using_raw in your example does not call the destructor, does it? :) (nb my union trick to address this and copy elision)

@achirkin Right, I definitely forgot the important detail of manually invoking the destructor on a placement new'd thing.

cpp/include/raft/core/serialization.hpp

mhoemmen · 2022-08-02T15:29:05Z