-
Notifications
You must be signed in to change notification settings - Fork 197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Serialization for commonly used types #770
Conversation
Hi, have you considered using a schema like onnx that can be reused by others and provide version info? I think versioning is quite important if we want to keep some stability over serialized object. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for asking me to review! I've marked the code that's definitely incorrect with [INCORRECT]
. It will be wrong for layout_stride
if ! is_exhaustive()
.
* @return the number of bytes (to be) written by the pointer | ||
*/ | ||
template <typename T, typename... ContextArgs> | ||
auto serialize(uint8_t* out, const T& obj, ContextArgs&&... args) -> size_t |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you consider using std::span
as the output argument? This would better express how many elements the output array has.
template <typename T, size_t OutputExtent, typename... ContextArgs>
auto serialize(std::span<uint8_t, OutputExtent> out, const T& obj, ContextArgs&&... args) -> size_t;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be happy to add overloads to have span as the output, but raft is currently set to C++17
. It's not available till C++20, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we returned a raft::span?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, great, I can add overloads that accept raft::span
!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we returned a raft::span?
It does not seem to have python bindings at the moment, right? I guess, we'd need at least one version of serialize that could be easily wrapped by cython for #752
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be straightforward to expose raft::span
in a pxd
file. We can also rename the functions as needed (for e.g. to_span
/from_span
and to_bytes
/from_bytes
).
* @return the serialized state | ||
*/ | ||
template <typename T, typename... ContextArgs> | ||
auto serialize(const T& obj, ContextArgs&&... args) -> std::vector<uint8_t> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Would you consider an interface that writes to existing storage, instead of performing a new allocation?
- If you mean to perform a new allocation, would you consider letting the user optionally pass in a custom allocator?
The first option would imply a state machine that takes an output buffer, and returns the "rest" of the output buffer and whether it finished with output (or some other error occurred, like running out of space).
struct SerializeNoError;
struct SerializeOutOfSpace { size_t remaining_space_required; }
struct SerializeOtherError { /* ... */ };
using SerializeErrorCode = std::variant<SerializeOutOfSpace, SerializeOtherError>;
struct SerializeState {
std::span<uint8_t> rest_of_buffer;
SerializeInternalStateRepresentation state;
bool done = false;
SerializeErrorCode error_code{SerializeNoError{}};
};
bool keep_going(const SerializeReturn& r) {
return ! result.done && std::holds_alternative<SerializeNoError>(result.error_code);
}
template<typename T, typename... ContextArgs>
serialize(SerializeState r, const T& obj, ContextArgs&&... args) -> SerializeState;
You then call it in a loop.
const size_t initial_buffer_size = 100;
std::vector<uint8_t> buffer(initial_buffer_size);
SerializeState state{span{buffer.data()}};
while(keep_going(state)) {
state = serialize(state, obj, args...);
};
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The base implementation (to_bytes
) at the moment writes to existing storage, but I don't have any checks whatsoever to see if there is enough space allocated. In the current state, the user is supposed to call serialize
with nullptr
to find out how much space it needs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@achirkin Would you consider a design that names the space query as a separate function from serialization? Compare to cuSPARSE, which separates *_bufferSize
functions from *_solve
functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be totally fine separating them, though for some reason I've had an impression that in raft/cuml we follow the CUB approach of using the same function. @cjnolet , which one do you prefer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is hard. @mhoemmen, we've been trying to keep the APIs in RAFT a little more C++ friendly, so generally trying to avoid the two-stage compute buffer size
-> perform computation
where at all possible. I'm usually not particularly in love w/ returning objects from functions either, unless we are explicitly using something like a factory for a complex object, because then the user cannot control the memory allocation.
In this particular case, if we are only ever expecting the serialized buffer to be in host memory, I could see an argument for having serialize return a vector, or more preferably, accepting a vector and populating it. I think if I was to choose between the various options presented, I most like the idea of returning/populating some sort of streamable/iterable type if we can. I find that to be the most flexible design because a large stream can go right to disk and have to buffer only a small portion into memory at any given time. Further, the stream is also nice because you can collect it into a byte buffer at any point if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cjnolet wrote:
I think if I was to choose between the various options presented, I most like the idea of returning/populating some sort of streamable/iterable type if we can.
I do like the idea of writing to some kind of output range. If it's a random access range, it would be nice to test if it's long enough and report an error if not, rather than just clobbering it. This is always detectable, and is generally recoverable (allocate more memory) or at least worth reporting ("insufficient disk space").
Please do consider not allocating the return buffer each time. This would make it impossible to write sequential records to contiguous storage. It can also be surprisingly expensive in a tight loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please do consider not allocating the return buffer each time. This would make it impossible to write sequential records to contiguous storage. It can also be surprisingly expensive in a tight loop.
I do wholeheartedly agree with this.
* @return the number of bytes read by the pointer | ||
*/ | ||
template <typename T, typename... ContextArgs> | ||
auto deserialize(T* p, const void* in, ContextArgs&&... args) -> size_t |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- What happens if deserialization fails (e.g., because the representation is not correct)? How would you report the error?
- It sounds like the intent is for
deserialize
to do placement new inp
. Is that correct? - Would you instead consider an interface that writes to an existing T via
T& out
? The issue with a raw pointer interface is that users may forget the invariant "an uninitialized host pointer" and pass in a pointer to an existing object. In that case, your interface can totally break any invariants ofT
, e.g., possibly leaking any heap allocations thatT
might have inside.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- At the moment, there are no safety checks at all. That being said, I think, we could add them in
to_bytes
/from_bytes
, i.e. make it the responsibility of a person who adds serialization to a particular class - just throw one of the raft errors. What do you think? CC @cjnolet - Yes
- I'd be fine with that, but I'm not sure how to do that better, while keeping the
serial::from_bytes
the same. Shall I add an overload that acceptsT& out
, manually calls its destructor and then passes the&out
pointer further?.. sounds hackish.
In general, the reason why I'm hesitant to change serial::from_bytes
to take a reference to an initialized object is that some of the classes in raft/cuml don't seem to have constructors that do nothing. On the one hand, the pre-initialization would likely mean some extra overhead and a minor nuisance of passing dummy arguments if there is no nullary constructor available. On the other hand, that would make impossible the value-returning overload of deserialize (deserialize(const std::vector<uint8_t>& in, ContextArgs&&... args) -> T
) for the types that don't have the nullary constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall I add an overload that accepts
T& out
, manually calls its destructor and then passes the&out
pointer further?.. sounds hackish.
Please do not : - )
On the other hand, that would make impossible the value-returning overload ... for the types that don't have the nullary constructor.
It's possible if you have assignment, but ugly -- https://godbolt.org/z/774Ebfvh5 -- so I understand if you don't want to do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant to say, it would be impossible if I changed serial::from_bytes
to accept the reference instead of the pointer (deserialize_raw
in your example). Also, deserialize_using_raw
in your example does not call the destructor, does it? :) (nb my union
trick to address this and copy elision)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@achirkin Right, I definitely forgot the important detail of manually invoking the destructor on a placement new'd thing.
if (out) { | ||
out += extents_size; | ||
raft::copy( | ||
reinterpret_cast<ElementType*>(out), obj.data_handle(), obj.size(), handle.get_stream()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[INCORRECT]
This is only correct if the layout is exhaustive (contiguous in memory, not strided or otherwise skipping parts of memory). Otherwise, it will both write memory that are not part of the mdarray
's elements, and it will miss elements of the mdarray
.
The right thing to do would be to define an mdspan
copy operation, as in P1673. Then you can copy from the input mdspan
to a temporary output mdspan
that views out
with the desired output layout.
}; | ||
|
||
template <typename ElementType, typename Extents, typename LayoutPolicy> | ||
struct serial<mdarray<ElementType, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Serializing mdarray
instead of mdspan
implies code duplication. Would you consider instead only defining serialization for mdspan
?
rmm::mr::device_memory_resource* mr = nullptr) -> size_t | ||
{ | ||
Extents exts; | ||
auto extents_size = call_deserialize<Extents>(&exts, in); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is your representation of extents
if all the extents are static? Are you able to distinguish this case from simply not having any data at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, at this moment we don't have any sorts of RTTI at all. That is, in this version the caller should know it has serialized the static extents, by possibly not writing anything at all; then from_bytes
should read the same thing - possibly nothing.
in += extents_size; | ||
typename obj_t::mapping_type layout{exts}; | ||
typename obj_t::container_policy_type policy{handle.get_stream(), mr}; | ||
new (p) obj_t{layout, policy}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This approach first value-initializes all the elements of the mdarray, then deserializes into the mdarray. This iterates over the mdarray twice. If your mdarray has std::vector
as its container, you could replace the raft::copy
call with a range over the input, do ranges::to
initialization of a std::vector
, and then move the vector into the mdarray.
typename obj_t::mapping_type layout{exts}; | ||
typename obj_t::container_policy_type policy{handle.get_stream(), mr}; | ||
new (p) obj_t{layout, policy}; | ||
raft::copy( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[INCORRECT]
This is only correct if the layout is exhaustive (contiguous in memory, not strided or otherwise skipping parts of memory). Please see above note.
Co-authored-by: Mark Hoemmen <[email protected]>
Co-authored-by: Mark Hoemmen <[email protected]>
This PR has been labeled |
This PR implements serialization to file for `ivf_pq::index` and `ivf_flat::index` structures. Index building takes time, therefore downstream projects (like cuML) want to save the index (rapidsai/cuml#4743). But downstream project should not depend on the implementation details of the index, therefore RAFT provides methods to serialize and deserialize the index. This is still experimental: - ideally we want to use a general serialization method for mdspan #770, - instead of directly saving to file, raft should provide a byte string and let the downstream project decide how to save it (e.g. pickle for cuML). Python wrappers are provided for IVF-PQ to save/load the index. Authors: - Tamas Bela Feher (https://github.com/tfeher) - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Corey J. Nolet (https://github.com/cjnolet) URL: #919
Closing in favor of #1173 |
This draft pull request adds a simple serialization to commonly used types in raft.
Serialized layout
Simple here refers to the way the data is stored: at the moment, no safety/type/versioning information is preserved; only a bare minimum of information is stored to a byte array and read back.
Target type
In the most basic form, the target of serialization is a raw pointer. Currently, I also provide overloads to store data to
vector<uint8_t>
.ContextArgs... args
Plain structures and arithmetic types do not require any context to serialize/deserialize (zero args). The data on gpu devices needs, at least, a cuda stream (one arg). mdarrays / rmm vectors may need a memory resource to tell where to keep the deserialized live object (two args). To allow all of these in a single template, I chose to use a parameter pack in this iteration.
Closes: #752