
[WIP] Remove device buffer default ctor #424

Conversation

@jrhemstad (Contributor):

Closes #422

Deletes the default ctor for device_buffer and removes all uses.

@jrhemstad added the "2 - In Progress" label on Jun 23, 2020
@GPUtester (Contributor):

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@harrism (Member) commented Aug 4, 2020:

@jrhemstad what is blocking this?

@jrhemstad (Contributor, Author):

> @jrhemstad what is blocking this?

rapidsai/cudf#5566

As I mentioned in the PR, it turned out to be more tedious than I expected, so I never got around to finishing it. I can try and do so this week.

@germasch (Contributor) commented Aug 5, 2020:

While I have to preface this by saying that I haven't looked at device_uvector in detail, I was hoping to use it as a plug-in replacement for thrust::device_vector in some apps. Not having a default constructor would, however, break compatibility with thrust and make it not easily usable in my case.

I also have a fundamental usability concern about data types without a default constructor. If one uses such a type as a member of a class, that class itself becomes non-default-constructible, and the chain repeats from there. I think a default-constructed empty vector can well be considered a valid state. And as it doesn't have any data yet, it doesn't need to know a stream yet, either. Sometimes you may construct a class containing a device_uvector early in your app, when you don't yet know what stream the work will be done on. The whole thing might just be an empty placeholder that the real vector is moved into later, etc.

Also, the default stream, in particular when using default-stream-per-thread, might well be just what you wanted anyway. (And of course sometimes it's not, and it hurts performance; I see that point, too.)

@jrhemstad (Contributor, Author):

> was hoping to use it as a plug-in replacement for thrust::device_vector in some apps. Not having a default constructor would, however, break compatibility with thrust and make it not easily usable in my case.

Breaking this compatibility is intentional. There are crucial differences between rmm::device_uvector and thrust::device_vector. The lesser of these differences is that device_uvector does not default-initialize its elements. The more important difference is that rmm::device_uvector is asynchronous whereas thrust::device_vector is synchronous. Asynchronous memory allocation is a relatively new concept in CUDA and C++, and it has the potential to be a significant footgun. Consider an example:

thrust::device_vector<int> ints(100);    // synchronizes the device before returning
kernel<<<..., s1>>>(ints.data().get());  // safe to pass to a kernel on a different stream because of the synchronization

vs.

rmm::device_uvector<int> ints(100);  // assume the stream defaulted to 0; no synchronization is performed
kernel<<<..., s1>>>(ints.data());    // Bam. Footgun. You're dead. Illegal to use `ints` on `s1` before a synchronization has been performed

This is why we require the stream to be explicit to make it more likely for users to be aware of this potential error.

rmm::device_uvector<int> ints(100, cudaStream_t{0}); // stream is explicit, should hint the user that it's only legal to use on the specified stream

cudaStreamSynchronize(0); // If `ints` is to be used on another stream, its stream must be synchronized first

kernel<<<..., s1>>>(ints.data()); // This is safe because "ints" is guaranteed to be available thanks to the synchronization

> I also have a fundamental usability concern about data types without a default constructor. If one uses such a type as a member of a class, that class itself becomes non-default-constructible

Not quite. For example:

class my_type {
 public:
  my_type() = default;

 private:
  rmm::device_uvector<int> v{0, cudaStream_t{0}};
};

my_type is still default constructible. You are just required to provide explicit arguments for the device_uvector member.

https://godbolt.org/z/7c3Gvh

> Also, the default stream, in particular when using default-stream-per-thread, might well be just what you wanted anyway.

Sure, I agree the default stream is often what people want. The point is that rmm::device_uvector requires you to be explicit about using the default stream in order to make users more aware of the potential for shooting themselves in the foot that is not present with thrust::device_vector.

Now then, I could be swayed by an argument that we could allow device_uvector to be default constructible so long as every possible operation on a device_uvector that has the potential to incur an allocation or copy requires an explicit stream argument, e.g., resize. All such operations already require an explicit stream argument, so perhaps we can allow the default ctor. I'd want to think on it some more.
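For concreteness, a rough sketch of that direction (editorial; the default ctor below is hypothetical since it is exactly what's under discussion, and `kernel` and `stream` are placeholder names):

#include <cstddef>
#include <rmm/device_uvector.hpp>

__global__ void kernel(int* data, std::size_t n) { /* ... */ }

void example(cudaStream_t stream)
{
  rmm::device_uvector<int> v{};  // hypothetical default ctor: empty, owns no memory, bound to no stream
  v.resize(100, stream);         // anything that allocates or copies names its stream explicitly
  kernel<<<1, 128, 0, stream>>>(v.data(), v.size());  // used on the same stream, so no extra sync needed
}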

@germasch (Contributor) commented Aug 5, 2020:

So I agree with what you're saying from your point of view, but on the other hand there's also a need (at least by a single person, me) for an uninitialized vector type that's a thrust::device_vector drop-in replacement. So maybe the answer is that there should be two types: one for use in a default-stream-per-thread context (or when not using streams at all), where one doesn't have to deal with these subtleties, and another for general use, where one had better be deliberate about streams?

[Sure, in your example above I can explicitly initialize a class member. But that essentially stops working when the class member is of a template type that could be either device_vector or device_uvector. (Maybe it can be worked around with more template magic in the initializer, but then it's really not pretty.)]
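A minimal illustration of that bracketed point (editorial sketch; `field` is a made-up class template):

#include <rmm/device_uvector.hpp>
#include <thrust/device_vector.h>

template <typename Vec>
struct field {
  Vec storage;  // a member initializer can't easily be made generic across both container types
};

field<thrust::device_vector<int>> a;   // fine: device_vector is default-constructible
// field<rmm::device_uvector<int>> b;  // error today: device_uvector has no default ctor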

I might argue, though, that maybe the issue is more with the naming -- when seeing rmm::device_uvector, that doesn't really tell me that this is a stream-aware vector; it makes me think, device_vector but uninitialized. So if this were called device_stream_uvector (that's not great, someone come up with a better name), that would make me think about streams. At that point I'd think, well, if I default-construct this thing, I'd expect it to be on the default stream, just like all the other CUDA stuff behaves if you don't specify a stream. And in fact there might be a point to having a device_stream_vector with more proper C++ semantics (i.e., initialized with default-constructed elements).

@jrhemstad (Contributor, Author):

> on the other hand there's also a need (at least by a single person, me) for an uninitialized vector type that's a thrust::device_vector drop-in replacement.

Sure, a synchronous, uninitialized vector would be useful, but that's in Thrust's wheelhouse, not RMM's.

Thrust is a higher level library of (mostly) synchronous data structures and algorithms that are relatively easy to use.

RMM is a library for high performance, asynchronous memory allocation. That performance comes at the cost of some ease of use. This is the classic trade-off between abstraction and performance. We try to keep RMM as high level and easy to use as we can, but there is no denying that RMM sits at a point on the spectrum closer to performance than abstraction compared to Thrust.

> I might argue, though, that maybe the issue is more with the naming -- when seeing rmm::device_uvector, that doesn't really tell me that this is a stream-aware vector; it makes me think, device_vector but uninitialized.

I can understand the confusion, but I think it stems from confusion about RMM itself as opposed to the particular device_uvector type. RMM, by its very design, is an interface for asynchronous memory allocation. Looking at the device_memory_resource interface, you see that the essential allocate/deallocate functions both take stream arguments and are stream ordered:

void* allocate(std::size_t bytes, cudaStream_t stream = 0)

Everything is built around this core interface; as such, everything* is stream aware. Asynchrony and stream awareness are part and parcel of using RMM.

(*) This is a bit of a white lie since rmm::device_vector is currently just an alias for thrust::device_vector and so no operations are stream ordered, but that will eventually change.
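To make the stream-ordered interface concrete, a small usage sketch (editorial; `mr` and `stream` are assumed to be supplied by the caller):

#include <rmm/mr/device/device_memory_resource.hpp>

void scratch(rmm::mr::device_memory_resource* mr, cudaStream_t stream)
{
  void* p = mr->allocate(1024, stream);  // the allocation is ordered on `stream`
  // ... enqueue kernels that use `p` on `stream` ...
  mr->deallocate(p, 1024, stream);       // the deallocation is stream ordered as well
}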

@germasch (Contributor) commented Aug 5, 2020:

> > on the other hand there's also a need (at least by a single person, me) for an uninitialized vector type that's a thrust::device_vector drop-in replacement.
>
> Sure, a synchronous, uninitialized vector would be useful, but that's in Thrust's wheelhouse, not RMM's.

I do agree with this -- but I don't think this is ever going to happen in thrust, since I think it's been asked for for many years. Obviously that doesn't mean that it's RMM's job to fill the gap, though it would have been convenient, since you already have a uvector, and you generally seem to care about thrust interoperability.

> I can understand the confusion, but I think it stems from confusion about RMM itself as opposed to the particular device_uvector type. RMM, by its very design, is an interface for asynchronous memory allocation. Looking at the device_memory_resource interface, you see that the essential allocate/deallocate functions both take stream arguments and are stream ordered:

If you consider RMM's main feature to be asynchronous allocations/deallocations, I agree. I found RMM because it was recommended to me as a way to avoid the cost of repeated device memory allocations/deallocations, the issue not being that they're synchronizing but that they're just plain slow. You may want to put more emphasis on asynchronous behavior in your README.md, as the overview there really doesn't talk about streams at all, and mentions asynchronicity only with regard to memcpy, not allocation itself.
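For readers arriving with the same motivation, a minimal sketch of the sub-allocation approach (editorial; assumes rmm::mr::pool_memory_resource and the header paths shown, and `stream` is a placeholder):

#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

// The pool grabs device memory up front and sub-allocates from it, so repeated
// allocate/deallocate calls avoid going through cudaMalloc/cudaFree each time.
rmm::mr::cuda_memory_resource cuda_mr;
rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> pool_mr{&cuda_mr};

void use(cudaStream_t stream)
{
  void* p = pool_mr.allocate(1 << 20, stream);  // served from the pool, stream ordered
  // ... use `p` on `stream` ...
  pool_mr.deallocate(p, 1 << 20, stream);
}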

> void* allocate(std::size_t bytes, cudaStream_t stream = 0)
>
> Everything is built around this core interface; as such, everything* is stream aware. Asynchrony and stream awareness are part and parcel of using RMM.

Your example, though, is yet another case where, if no stream is specified, it defaults to the default stream, just as elsewhere in CUDA ;).

In any case, it's obviously up to you guys to define what RMM is aimed at and what you prioritize in your interfaces. I just meant to bring up an argument for keeping uvector default constructible, so it's (easily) usable both by people who don't care about streams at all (or just use per-thread default streams) and by people who do more elaborate stream-based design. But of course I acknowledge that this makes it easier to misuse for people who do use streams directly, so it is a trade-off.

@jrhemstad (Contributor, Author):

> I don't think this is ever going to happen in thrust, since I think it's been asked for for many years.

It's a small team with a lot on their plate. I've talked with them in the past and they'd support adding a device_uvector, but it just hasn't been a priority. They'd be happy to review a PR though :)

It's less work than you might think. I've looked into it, and all it would really require is defining a new type based on contiguous_storage with a static_assert that the type is "trivially relocatable".

So something like:

template <typename T, typename Alloc>
struct device_uvector : contiguous_storage<T, Alloc> {
  using contiguous_storage<T, Alloc>::contiguous_storage;  // inherit the base class constructors
  static_assert(thrust::is_trivially_relocatable<T>::value,
                "Uninitialized vector only supports trivially relocatable types.");
};

Not a lot of work, somebody just has to do it and test it.

> I found RMM because it was recommended to me as a way to avoid the cost of repeated device memory allocations/deallocations, the issue not being that they're synchronizing but that they're just plain slow.

A large part of what makes cudaMalloc/cudaFree slow is the fact that they are synchronous. It's not the only thing, but it is significant.

> In any case, it's obviously up to you guys to define what RMM is aimed at and what you prioritize in your interfaces. I just meant to bring up an argument for keeping uvector default constructible

I definitely appreciate having external feedback on usability. It helps to expose biases and blind spots.

I think I'm okay with making device_uvector default constructible with the requirement I mentioned above (all operations take explicit stream arguments). However, I don't think that will really make you happy, because rmm::device_uvector definitely won't be (and never will be) a drop-in replacement for thrust::device_vector. It lacks many features that thrust::device_vector provides, like an operator[] that automatically copies D2H. We can't support that in device_uvector because operator[] doesn't allow a stream argument.
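To illustrate the difference (editorial sketch; `stream` is a placeholder, and the cudaMemcpyAsync/cudaStreamSynchronize pair stands in for whatever copy utility you prefer):

#include <rmm/device_uvector.hpp>
#include <thrust/device_vector.h>

void read_back(cudaStream_t stream)
{
  thrust::device_vector<int> dv(100);
  int a = dv[42];  // convenient, but performs an implicit, synchronizing D2H copy

  rmm::device_uvector<int> uv(100, stream);
  int b{};
  cudaMemcpyAsync(&b, uv.data() + 42, sizeof(int), cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);  // the copy and the synchronization are both explicit
}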

@germasch (Contributor) commented Aug 5, 2020:

> > I don't think this is ever going to happen in thrust, since I think it's been asked for for many years.
>
> It's a small team with a lot on their plate. I've talked with them in the past and they'd support adding a device_uvector, but it just hasn't been a priority. They'd be happy to review a PR though :)

Okay, thanks, I might try to get around to doing this some time. Based on googling I had thought they were kinda philosophically opposed to this (I think it's not really standard-conforming C++ to have objects in an uninitialized state), but apparently not.

> > I found RMM because it was recommended to me as a way to avoid the cost of repeated device memory allocations/deallocations, the issue not being that they're synchronizing but that they're just plain slow.
>
> A large part of what makes cudaMalloc/cudaFree slow is the fact that they are synchronous. It's not the only thing, but it is significant.

Maybe. I don't know, in my application, I'd guess cudaMalloc is > 100x slower than a cudaDeviceSynchronize(). It's large pieces of memory, though.

> I think I'm okay with making device_uvector default constructible with the requirement I mentioned above (all operations take explicit stream arguments). However, I don't think that will really make you happy, because rmm::device_uvector definitely won't be (and never will be) a drop-in replacement for thrust::device_vector. It lacks many features that thrust::device_vector provides, like an operator[] that automatically copies D2H. We can't support that in device_uvector because operator[] doesn't allow a stream argument.

I'd say you should do whatever you think is best for the uvector you're envisioning. I'll admit that it doesn't fit what I was hoping to use it for, and that's perfectly okay. (FWIW, I personally don't care about operator[] on the host; if anything, that's a way to accidentally slow your code way down, though of course sometimes it's convenient.)

@kkraus14 changed the base branch from branch-0.15 to branch-0.16 on August 17, 2020
@harrism (Member) commented Oct 2, 2020:

@jrhemstad can you update the status of this? Moving to 0.17.

@jrhemstad (Contributor, Author):

Based on the conversation, I think we can leave the default ctor and just require all operations to take an explicit stream.
