Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Serialization for commonly used types #770

Closed
wants to merge 9 commits into from

Conversation

achirkin
Copy link
Contributor

@achirkin achirkin commented Aug 2, 2022

This draft pull request adds a simple serialization to commonly used types in raft.

Serialized layout
Simple here refers to the way the data is stored: at the moment, no safety/type/versioning information is preserved; only a bare minimum of information is stored to a byte array and read back.

Target type
In the most basic form, the target of serialization is a raw pointer. Currently, I also provide overloads to store data to vector<uint8_t>.

ContextArgs... args
Plain structures and arithmetic types do not require any context to serialize/deserialize (zero args). The data on gpu devices needs, at least, a cuda stream (one arg). mdarrays / rmm vectors may need a memory resource to tell where to keep the deserialized live object (two args). To allow all of these in a single template, I chose to use a parameter pack in this iteration.

Closes: #752

@achirkin achirkin added non-breaking Non-breaking change feature request New feature or request cpp CMake and removed cpp CMake labels Aug 2, 2022
@trivialfis
Copy link
Member

trivialfis commented Aug 2, 2022

Hi, have you considered using a schema like onnx that can be reused by others and provide version info? I think versioning is quite important if we want to keep some stability over serialized object.

Copy link
Contributor

@mhoemmen mhoemmen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for asking me to review! I've marked the code that's definitely incorrect with [INCORRECT]. It will be wrong for layout_stride if ! is_exhaustive().

* @return the number of bytes (to be) written by the pointer
*/
template <typename T, typename... ContextArgs>
auto serialize(uint8_t* out, const T& obj, ContextArgs&&... args) -> size_t
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you consider using std::span as the output argument? This would better express how many elements the output array has.

template <typename T, size_t OutputExtent, typename... ContextArgs>
auto serialize(std::span<uint8_t, OutputExtent> out, const T& obj, ContextArgs&&... args) -> size_t;

See C++ Core Guidelines F.24.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be happy to add overloads to have span as the output, but raft is currently set to C++17. It's not available till C++20, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we returned a raft::span?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, great, I can add overloads that accept raft::span!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we returned a raft::span?

It does not seem to have python bindings at the moment, right? I guess, we'd need at least one version of serialize that could be easily wrapped by cython for #752

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be straightforward to expose raft::span in a pxd file. We can also rename the functions as needed (for e.g. to_span/from_span and to_bytes/from_bytes).

* @return the serialized state
*/
template <typename T, typename... ContextArgs>
auto serialize(const T& obj, ContextArgs&&... args) -> std::vector<uint8_t>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Would you consider an interface that writes to existing storage, instead of performing a new allocation?
  2. If you mean to perform a new allocation, would you consider letting the user optionally pass in a custom allocator?

The first option would imply a state machine that takes an output buffer, and returns the "rest" of the output buffer and whether it finished with output (or some other error occurred, like running out of space).

struct SerializeNoError;
struct SerializeOutOfSpace { size_t remaining_space_required; }
struct SerializeOtherError { /* ... */ };
using SerializeErrorCode = std::variant<SerializeOutOfSpace, SerializeOtherError>;

struct SerializeState {
  std::span<uint8_t> rest_of_buffer;
  SerializeInternalStateRepresentation state;
  bool done = false;
  SerializeErrorCode error_code{SerializeNoError{}};
};

bool keep_going(const SerializeReturn& r) {
  return ! result.done && std::holds_alternative<SerializeNoError>(result.error_code);
}

template<typename T, typename... ContextArgs>
serialize(SerializeState r, const T& obj, ContextArgs&&... args) -> SerializeState;

You then call it in a loop.

const size_t initial_buffer_size = 100;
std::vector<uint8_t> buffer(initial_buffer_size);
SerializeState state{span{buffer.data()}};
while(keep_going(state)) {
  state = serialize(state, obj, args...);
};

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The base implementation (to_bytes) at the moment writes to existing storage, but I don't have any checks whatsoever to see if there is enough space allocated. In the current state, the user is supposed to call serialize with nullptr to find out how much space it needs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@achirkin Would you consider a design that names the space query as a separate function from serialization? Compare to cuSPARSE, which separates *_bufferSize functions from *_solve functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be totally fine separating them, though for some reason I've had an impression that in raft/cuml we follow the CUB approach of using the same function. @cjnolet , which one do you prefer?

Copy link
Member

@cjnolet cjnolet Aug 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is hard. @mhoemmen, we've been trying to keep the APIs in RAFT a little more C++ friendly, so generally trying to avoid the two-stage compute buffer size -> perform computation where at all possible. I'm usually not particularly in love w/ returning objects from functions either, unless we are explicitly using something like a factory for a complex object, because then the user cannot control the memory allocation.

In this particular case, if we are only ever expecting the serialized buffer to be in host memory, I could see an argument for having serialize return a vector, or more preferably, accepting a vector and populating it. I think if I was to choose between the various options presented, I most like the idea of returning/populating some sort of streamable/iterable type if we can. I find that to be the most flexible design because a large stream can go right to disk and have to buffer only a small portion into memory at any given time. Further, the stream is also nice because you can collect it into a byte buffer at any point if needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cjnolet wrote:

I think if I was to choose between the various options presented, I most like the idea of returning/populating some sort of streamable/iterable type if we can.

I do like the idea of writing to some kind of output range. If it's a random access range, it would be nice to test if it's long enough and report an error if not, rather than just clobbering it. This is always detectable, and is generally recoverable (allocate more memory) or at least worth reporting ("insufficient disk space").

Please do consider not allocating the return buffer each time. This would make it impossible to write sequential records to contiguous storage. It can also be surprisingly expensive in a tight loop.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do consider not allocating the return buffer each time. This would make it impossible to write sequential records to contiguous storage. It can also be surprisingly expensive in a tight loop.

I do wholeheartedly agree with this.

* @return the number of bytes read by the pointer
*/
template <typename T, typename... ContextArgs>
auto deserialize(T* p, const void* in, ContextArgs&&... args) -> size_t
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. What happens if deserialization fails (e.g., because the representation is not correct)? How would you report the error?
  2. It sounds like the intent is for deserialize to do placement new in p. Is that correct?
  3. Would you instead consider an interface that writes to an existing T via T& out? The issue with a raw pointer interface is that users may forget the invariant "an uninitialized host pointer" and pass in a pointer to an existing object. In that case, your interface can totally break any invariants of T, e.g., possibly leaking any heap allocations that T might have inside.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. At the moment, there are no safety checks at all. That being said, I think, we could add them in to_bytes/from_bytes, i.e. make it the responsibility of a person who adds serialization to a particular class - just throw one of the raft errors. What do you think? CC @cjnolet
  2. Yes
  3. I'd be fine with that, but I'm not sure how to do that better, while keeping the serial::from_bytes the same. Shall I add an overload that accepts T& out, manually calls its destructor and then passes the &out pointer further?.. sounds hackish.

In general, the reason why I'm hesitant to change serial::from_bytes to take a reference to an initialized object is that some of the classes in raft/cuml don't seem to have constructors that do nothing. On the one hand, the pre-initialization would likely mean some extra overhead and a minor nuisance of passing dummy arguments if there is no nullary constructor available. On the other hand, that would make impossible the value-returning overload of deserialize (deserialize(const std::vector<uint8_t>& in, ContextArgs&&... args) -> T) for the types that don't have the nullary constructor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall I add an overload that accepts T& out, manually calls its destructor and then passes the &out pointer further?.. sounds hackish.

Please do not : - )

On the other hand, that would make impossible the value-returning overload ... for the types that don't have the nullary constructor.

It's possible if you have assignment, but ugly -- https://godbolt.org/z/774Ebfvh5 -- so I understand if you don't want to do it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant to say, it would be impossible if I changed serial::from_bytes to accept the reference instead of the pointer (deserialize_raw in your example). Also, deserialize_using_raw in your example does not call the destructor, does it? :) (nb my union trick to address this and copy elision)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@achirkin Right, I definitely forgot the important detail of manually invoking the destructor on a placement new'd thing.

cpp/include/raft/core/serialization.hpp Show resolved Hide resolved
cpp/include/raft/core/serialization.hpp Outdated Show resolved Hide resolved
if (out) {
out += extents_size;
raft::copy(
reinterpret_cast<ElementType*>(out), obj.data_handle(), obj.size(), handle.get_stream());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[INCORRECT]

This is only correct if the layout is exhaustive (contiguous in memory, not strided or otherwise skipping parts of memory). Otherwise, it will both write memory that are not part of the mdarray's elements, and it will miss elements of the mdarray.

The right thing to do would be to define an mdspan copy operation, as in P1673. Then you can copy from the input mdspan to a temporary output mdspan that views out with the desired output layout.

};

template <typename ElementType, typename Extents, typename LayoutPolicy>
struct serial<mdarray<ElementType,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Serializing mdarray instead of mdspan implies code duplication. Would you consider instead only defining serialization for mdspan?

rmm::mr::device_memory_resource* mr = nullptr) -> size_t
{
Extents exts;
auto extents_size = call_deserialize<Extents>(&exts, in);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is your representation of extents if all the extents are static? Are you able to distinguish this case from simply not having any data at all?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, at this moment we don't have any sorts of RTTI at all. That is, in this version the caller should know it has serialized the static extents, by possibly not writing anything at all; then from_bytes should read the same thing - possibly nothing.

in += extents_size;
typename obj_t::mapping_type layout{exts};
typename obj_t::container_policy_type policy{handle.get_stream(), mr};
new (p) obj_t{layout, policy};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach first value-initializes all the elements of the mdarray, then deserializes into the mdarray. This iterates over the mdarray twice. If your mdarray has std::vector as its container, you could replace the raft::copy call with a range over the input, do ranges::to initialization of a std::vector, and then move the vector into the mdarray.

typename obj_t::mapping_type layout{exts};
typename obj_t::container_policy_type policy{handle.get_stream(), mr};
new (p) obj_t{layout, policy};
raft::copy(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[INCORRECT]

This is only correct if the layout is exhaustive (contiguous in memory, not strided or otherwise skipping parts of memory). Please see above note.

@achirkin achirkin changed the base branch from branch-22.08 to branch-22.10 August 9, 2022 07:30
@github-actions
Copy link

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

rapids-bot bot pushed a commit that referenced this pull request Jan 3, 2023
This PR implements serialization to file for  `ivf_pq::index` and `ivf_flat::index` structures.

Index building takes time, therefore downstream projects (like cuML) want to save the index (rapidsai/cuml#4743). But downstream project should not depend on the implementation details of the index, therefore RAFT provides methods to serialize and deserialize the index.

This is still experimental:
- ideally we want to use a general serialization method for mdspan #770,
- instead of directly saving to file, raft should provide a byte string and let the downstream project decide how to save it (e.g. pickle for cuML).

Python wrappers are provided for IVF-PQ to save/load the index.

Authors:
  - Tamas Bela Feher (https://github.com/tfeher)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #919
@achirkin
Copy link
Contributor Author

Closing in favor of #1173

@achirkin achirkin closed this Jan 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake cpp feature request New feature or request inactive-30d non-breaking Non-breaking change
Projects
Development

Successfully merging this pull request may close these issues.

[FEA] IVF-flat serialization
4 participants