[REVIEW] Proposal for "proper" C/C++ API (issue #92) #247
Conversation
I really like the developer_guide idea. Thanks Jiri for putting it out!
Looks good, Jiri! I'm looking forward to using this. I am currently using cusparse in UMAP. It would be nice if the cusparse handle could be added to this API as well.
@cjnolet didn't realize that you were using cusparse. Can you add cusparse_wrappers, just like the cublas_wrappers and cusolver_wrappers, inside ml-prims, please?
Thanks for the review @cjnolet I just added cusparseHandle_t to cumlHandle_impl.
#pragma once

#include <rmm/rmm.h>
So, we expect RMM now to be installed before the cuML build, right?
No. IMHO cuML should not depend on RMM; we should try to keep the external dependencies of cuML minimal. This header is just provided as a convenience for users of cuML. It should not be used inside of cuML. E.g. if a user of cuML wants to use RMM for all device allocations, code like this could be used:
#include <iostream>
#include <memory>

#include <cuML.hpp>
#include <rmm/rmm.h>
#include <common/rmmAllocatorAdapter.hpp>

void foo() {
    rmmOptions_t rmmOptions;
    rmmOptions.allocation_mode = PoolAllocation;
    rmmOptions.initial_pool_size = 0;
    rmmOptions.enable_logging = false;
    rmmError_t rmmStatus = rmmInitialize(&rmmOptions);
    if ( RMM_SUCCESS != rmmStatus )
        std::cerr<<"WARN: Could not initialize RMM: "<<rmmGetErrorString(rmmStatus)<<std::endl;
    std::unique_ptr<ML::cumlHandle> cumlHandle( new ML::cumlHandle() );
    std::shared_ptr<ML::rmmAllocatorAdapter> allocator( new ML::rmmAllocatorAdapter() );
    cumlHandle->setDeviceAllocator( allocator );
    cudaStream_t stream;
    cudaStreamCreate( &stream );
    cumlHandle->setStream( stream );
    // ... run cuML algorithms with cumlHandle ...
    cudaStreamDestroy( stream );
    if ( rmmIsInitialized( NULL ) )
        rmmFinalize();
}
@jirikraus thinking about it, could we put that foo function you just wrote (or something similar) somewhere as a convenience function for creating an RMM-enabled cumlHandle? From the Python library I believe I will always call with RMM as the memory manager for the time being.
Ah! Makes sense now. Thanks Jiri.
@dantegd maybe adding this as a code-block in the doxygen comment would be better for now?
Yeah, probably will end up adding it for the cython call, but in doxygen it's fine for now.
@teju85 I agree this should go into a code-block in the doxygen comment. However, before we have concluded the discussion on the API I don't think it makes sense to invest much in documentation, which is the reason I have not added documentation yet.
@dantegd I think this code should only be in the documentation and examples. It needs to be written by the user of cuML, as that is the one who knows if RMM is already initialized and how they want to initialize it. E.g. if cuML gets called from a Python script which used cuDF to prepare data, RMM was already initialized before. If we exclude RMM init and shutdown the function would only provide
std::shared_ptr<ML::rmmAllocatorAdapter> allocator( new ML::rmmAllocatorAdapter() );
cumlHandle->setDeviceAllocator( allocator );
which IMHO is not worth a separate helper function. Do you agree?
I agree with @jirikraus here. It is not worth such a helper function.
Sure Jiri. Makes sense to document this once we have finalized the interfaces.
cuML/src/common/device_buffer.hpp
Outdated
if ( n > 0 )
{
    _data = static_cast<value_type*>(_allocator->allocate( _capacity*sizeof(value_type), 0 ));
    CUDA_CHECK( cudaStreamSynchronize( 0 ) );
I'm a bit confused here. So effectively we are sync'ing on the default stream every time we create a device_buffer?
Yes, because there is no other stream available in the constructor to pass into the allocator and to synchronize on. I have chosen this approach because I think it would be confusing if a constructor were asynchronous. I would expect that an object is fully usable after construction. If a user-provided stream should be used for the allocation and the synchronization should be avoided, the technique described in the developer guide using resize and release could be used. One alternative I can think of is forcing users of device_buffer and host_buffer to use resize by not offering a constructor that takes a size. What do you think?
Yeah... you don't really want an asynchronous constructor.
But I don't like that if I want asynchronous allocation/free I can't use RAII.
How about an API that returns a std::future to an instance of a device_buffer allocated on a stream? This still allows for asynchronous memory allocation/free while keeping RAII.
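A minimal sketch of that future-based idea, assuming device_buffer gains a move constructor; make_device_buffer_async is a hypothetical name, not part of the proposal:

#include <future>

// Returns a future that becomes ready once the buffer is safe to use, so the
// allocation and synchronization happen off the calling thread (sketch only).
template <typename T>
std::future<ML::device_buffer<T>> make_device_buffer_async(
    std::shared_ptr<ML::deviceAllocator> allocator, std::size_t n, cudaStream_t stream)
{
    return std::async(std::launch::async, [allocator, n, stream]() {
        ML::device_buffer<T> buf(allocator, n);      // allocate via the allocator
        CUDA_CHECK(cudaStreamSynchronize(stream));   // ready only when usable
        return buf;                                  // moved into the future
    });
}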
Makes sense. Thanks for the clarification
I get your point @jrhemstad regarding std::future, however that does not change the issue of the constructor, right? It would replace the constructor with a factory. Also, you still have RAII when it is done like this:
template<typename T>
void foo( ML::cumlHandle* handle, .., cudaStream_t stream )
{
    ...
    ML::device_buffer<T> temp( handle->getDeviceAllocator(), 0 );
    temp.resize(n, stream);
    kernelA<<<grid, block, 0, stream>>>(..., temp.data(), ...);
    kernelB<<<grid, block, 0, stream>>>(..., temp.data(), ...);
    temp.release(stream);
}
in this case the pointer is still owned by temp. I.e. if the call to release does not happen for some reason (e.g. because a developer forgets to call it) the destructor of temp will make sure that the memory is released when temp goes out of scope. Does that make sense to you?
I just pushed an update for device_buffer and host_buffer. Now the constructor takes a stream as an argument and the device_buffer and host_buffer are tied to that stream. The associated stream can be changed by calling resize, reserve or release.
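For illustration, usage with the updated, stream-tied constructor might then look like this (exact parameter order assumed from the description above; kernelA is a placeholder):

void bar(ML::cumlHandle* handle, std::size_t n, cudaStream_t stream)
{
    // the buffer is tied to `stream`; no default-stream synchronization needed
    ML::device_buffer<float> temp(handle->getDeviceAllocator(), stream, n);
    kernelA<<<256, 128, 0, stream>>>(temp.data(), n);
    temp.release(stream);  // frees on the same stream the buffer is tied to
}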
Nice. This now looks very aligned with our idea of not having to use the default stream in cuML or ml-prims!
Lots of good stuff here. You've given me several ideas that I'd like to adapt to cuDF.
cuML/DEVELOPER_GUIDE.md
Outdated
template<typename T>
void foo(ML::cumlHandle* handle, cudaStream_t stream, ... )
{
    T* temp_h = handle->getDeviceAllocator()->allocate(n*sizeof(T), stream);
I think the spirit of what you're suggesting here is good, but the API is a little verbose.
Maybe something more like:
T* p = handle->device_allocate(n*sizeof(T), stream);
Also, what about more specialized allocations, like page-locked memory or UVM? Would there be extra allocators in handle for this usage? Or maybe an enum passed into device_allocate that describes the kind of allocation you want.
@jrhemstad IIUC, page locked memory is what is being proposed with the host_allocator concept below.
Liked your suggestion about reducing verbosity.
Interesting idea regarding the specialized allocators for Unified Memory and pinned host memory. However, before we add these kinds of things I would prefer to see the use case first. Currently I know that we need temporary host memory and temporary device memory. The interface is sufficient to support that and already gives the user of cuML quite some flexibility, e.g. they could pass in an allocator that allocates pinned host memory or one that allocates pageable host memory (the current default is pinned host memory). Similarly for temporary device memory they could use Unified Memory, pinned host memory or plain device memory.
Regarding the less verbose API: I actually had that in an earlier version I tried, but preferred a cleaner interface that does not duplicate functionality between classes. Also, the ML::device_buffer and ML::host_buffer should be preferred for allocations IMHO. A verbose interface makes the alternative less attractive. However I do not have a very strong opinion on this one, so I am open to providing a device_allocate and host_allocate in cumlHandle.
> I would prefer to see the use case first.

I have no idea what the use-case may be :) This is my first time looking at cuML code. That said, we can never predict the future, so it's safest to assume that if something can happen (like wanting to use UVM), it probably will at some point, and the design should be able to handle that.

> Similarly for temporary device memory they could use Unified Memory, pinned host memory or plain device memory.

This would require the user defining their own custom allocator then? I think that's fine, just making sure I understand.

> Also, the ML::device_buffer and ML::host_buffer should be preferred for allocations IMHO. A verbose interface makes the alternative less attractive.

100% agreed that RAII constructs should be preferred. I had a sneaking suspicion that you left the malloc/free-like API verbose to discourage people from using it :)
>> I would prefer to see the use case first.
>
> I have no idea what the use-case may be :) This is my first time looking at cuML code. That said, we can never predict the future, so it's safest to assume that if something can happen (like wanting to use UVM), it probably will at some point, and the design should be able to handle that.

I would say the current design is flexible enough for that. Once the need comes up we can add allocation routines to the proposed allocator interfaces with default implementations, e.g. for managed memory calling through to cudaMallocManaged(). Since we won't need to make allocations with a new type of memory before that happens, it won't require touching old cuML code. And cuML users affected by the default implementation can easily override it. Does that seem reasonable to you?
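As a sketch of what such a later extension could look like (allocateManaged is a hypothetical addition, not part of the current proposal):

class deviceAllocator {
public:
    virtual void* allocate(std::size_t n, cudaStream_t stream) = 0;
    virtual void deallocate(void* p, std::size_t n, cudaStream_t stream) = 0;
    // hypothetical future routine with a default implementation that
    // cuML users can override in their own allocator
    virtual void* allocateManaged(std::size_t n, cudaStream_t) {
        void* p = nullptr;
        CUDA_CHECK(cudaMallocManaged(&p, n));
        return p;
    }
    virtual ~deviceAllocator() {}
};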
>> Similarly for temporary device memory they could use Unified Memory, pinned host memory or plain device memory.
>
> This would require the user defining their own custom allocator then? I think that's fine, just making sure I understand.

Yes. An example of a custom allocator that falls back to cub::CachingDeviceAllocator is:
class cachingDeviceAllocator : public ML::deviceAllocator {
public:
    cachingDeviceAllocator()
        : _allocator(8, 3, cub::CachingDeviceAllocator::INVALID_BIN, cub::CachingDeviceAllocator::INVALID_SIZE)
    {}
    virtual void* allocate( std::size_t n, cudaStream_t stream ) {
        void* ptr = 0;
        _allocator.DeviceAllocate( &ptr, n, stream );
        return ptr;
    }
    virtual void deallocate( void* p, std::size_t, cudaStream_t ) {
        _allocator.DeviceFree(p);
    }
private:
    cub::CachingDeviceAllocator _allocator;
};
That would be passed to cuML like this:
std::shared_ptr<cachingDeviceAllocator> allocator( new cachingDeviceAllocator() );
cumlHandle->setDeviceAllocator( allocator );
Let's just keep the current verbose API as it is. Depending on which one of these interfaces (device_buffer or allocate calls) folks end up preferring, we can then revisit the API restructuring.
    handle->getHostAllocator()->deallocate(temp_h, n*sizeof(T), stream);
}
```
Small host memory heap allocations, e.g. as internally done by STL containers, are fine, e.g. an `std::vector` managing only a handful of integers.
If you're providing a custom allocator for host memory, it's straightforward to provide an allocator for std::vector and similar containers. You could then even provide an alias like cuml::vector that is a std::vector using your custom allocator.
We do this in cuDF with rmm::device_vector -> thrust::device_vector using RMM for the allocator.
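A sketch of what that alias could look like on top of the proposed hostAllocator interface (all names here are illustrative, not part of the proposal):

#include <memory>
#include <vector>

template <typename T>
struct host_std_allocator {
    using value_type = T;
    explicit host_std_allocator(std::shared_ptr<ML::hostAllocator> a) : _a(std::move(a)) {}
    template <typename U>
    host_std_allocator(const host_std_allocator<U>& o) : _a(o._a) {}
    T* allocate(std::size_t n) {
        return static_cast<T*>(_a->allocate(n * sizeof(T), 0));
    }
    void deallocate(T* p, std::size_t n) { _a->deallocate(p, n * sizeof(T), 0); }
    std::shared_ptr<ML::hostAllocator> _a;  // public to keep the sketch short
};

template <typename T, typename U>
bool operator==(const host_std_allocator<T>& a, const host_std_allocator<U>& b) {
    return a._a == b._a;  // equal iff they forward to the same hostAllocator
}
template <typename T, typename U>
bool operator!=(const host_std_allocator<T>& a, const host_std_allocator<U>& b) {
    return !(a == b);
}

template <typename T>
using cuml_vector = std::vector<T, host_std_allocator<T>>;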
I have thought about that quite a bit. The advantages of passing in an allocator as a runtime argument over a template parameter are:
- Irrespective of the used allocator the containers remain type compatible.
- We can easily have multiple different allocator objects, e.g. one per device, one per thread, ...
With a template argument we would need to reference a global object. Does that make sense to you?
All good points.
I suppose since you are providing a host allocator and host_buffer, there's little reason for someone to use a std::vector.
Another point in favor of "allocators as parameters" vs "template parameters" is that it allows stateful allocators, e.g., an allocator w/ a stream. This wouldn't be possible with an allocator as a template argument.
You don't have to reference a global argument with the latest Thrust device_vector -- the constructor has an allocator parameter. The same is true of std::vector: https://en.cppreference.com/w/cpp/container/vector/vector (see constructor #2 for example)
vector( size_type count, const T& value, const Allocator& alloc = Allocator());
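e.g., with a hypothetical stateful allocator type carrying a stream, that constructor allows:

my_allocator<int> alloc(stream);                      // stateful, assumed type
std::vector<int, my_allocator<int>> v(n, 0, alloc);   // constructor #2 above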
cuML/DEVELOPER_GUIDE.md
Outdated
void foo( ML::cumlHandle* handle, .., cudaStream_t stream )
{
    ...
    ML::device_buffer<T> temp( handle->getDeviceAllocator(), 0 )
Is the implication of calling it buffer instead of vector that the memory is not default-initialized? I.e., a std::vector and thrust::device_vector are default-initialized, which oftentimes is unnecessary, and you can save some performance by avoiding the initialization.
The intention of calling it buffer instead of vector is to make clear that this is a simple container enabling:
- asynchronous RAII
- use of the proposed allocator interface
and not to offer the interface of a std::vector or thrust::device_vector, as that would duplicate quite some code and would come with a significant need for testing.
cuML/DEVELOPER_GUIDE.md
Outdated
void foo( ML::cumlHandle* handle, .. )
{
    ML::thrustAllocatorAdapter alloc( handle->getDeviceAllocator(), handle->getStream() );
    auto execution_policy = thrust::cuda::par(alloc).on(handle->getStream());
It's possible to do this in a one-liner, we do this in RMM.
https://github.com/rapidsai/rmm/blob/branch-0.6/include/rmm/thrust_rmm_allocator.h#L94
Note that the use of a unique_ptr is a workaround for a deficiency in Thrust's execution policies being unable to bind to temporaries, but that will change in the next release of Thrust.
Thanks for pointing this out. This is a great idea. I just tried it, but to my understanding the implementation of exec_policy requires C++14 and I could not find an elegant way to express it with C++11 (which cuML is currently at). So I added an implementation of exec_policy only conditionally and updated the developer guide mentioning the new possibilities with C++14. Let me know if I am missing something and we can make it work with C++11.
C++14 is only required for the auto return type.
It's possible to support C++11 using a trailing return type: https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.3.0/com.ibm.zos.v2r3.cbclx01/trailing_return.htm
It would just be a little more verbose, something like...
using T = decltype(thrust::cuda::par(rmm_allocator<char>{}));
inline auto exec_policy(cudaStream_t stream = 0) -> std::unique_ptr<T, std::function<void(T*)>>;
But again, the unique_ptr ugliness goes away w/ the update to Thrust.
inline auto exec_policy(cudaStream_t stream = 0) {
return thrust::cuda::par(rmm_allocator<char>{stream}).on(stream);
}
And then usage is just:
thrust::sort(exec_policy(stream), ...);
Thanks @jrhemstad I could make it work with C++11 and just pushed the related updates.
cuML/src/common/cumlHandle.cpp
Outdated
void cumlHandle_impl::destroyResources()
{
    while ( !_cusparse_handles.empty() )
This is purely an aesthetic design critique, but a repeated pattern like this usually tells me there's a need for a more generic abstraction.
Perhaps a for_each over the elements of each vector and then vector.clear().
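e.g., a rough sketch of that pattern for one of the handle vectors:

std::for_each(_cusparse_handles.begin(), _cusparse_handles.end(),
              [](cusparseHandle_t h) { cusparseDestroy(h); });  // destroy each handle
_cusparse_handles.clear();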
I agree with your statement. However, given that I need to call a different destroy function for every vector, I do not see how I would save a lot of code even using lambdas and for_each. However I am open to concrete suggestions :-)
> However I am open to concrete suggestions :-)
Fair enough!
Okay, so... this is a bit more of an invasive change, but hear me out because I think it's beneficial for more than just solving this problem.
So for your vectors of resources, instead of maintaining a vector of the "raw" resource, you could create a simple wrapper class for each resource. For example, streams:
struct stream_wrapper{
    stream_wrapper(){
        CUDA_CHECK( cudaStreamCreate(&stream) );
    }
    ~stream_wrapper(){
        CUDA_CHECK( cudaStreamDestroy(stream) );
    }
    // This allows using a `stream_wrapper` implicitly as if it were a cudaStream_t,
    // i.e., the fact that it's a wrapper is invisible to the user
    operator cudaStream_t() { return stream; }
private:
    cudaStream_t stream;
};
And then... your vector of cudaStream_t would become a vector of stream_wrappers...
std::vector<stream_wrapper> _streams;
And then in destroyResources you can simply do _streams.clear() and let the wrapper's destructor handle all of the resource-specific cleanup!
You could then create similar wrappers for all of the other resources.
What do you think?
I see how this would make the code safer. However, it will also add quite some code. Since this is an implementation detail and we are primarily discussing the API here, I would leave the destroy-resources code as it currently is. Any objections?
> I see how this would make the code safer. However, it will also add quite some code. Since this is an implementation detail and we are primarily discussing the API here, I would leave the destroy-resources code as it currently is. Any objections?

Absolutely just an implementation detail, don't let it block this PR.
@jrhemstad can you file an issue on this one so that we don't miss this, please?
std::vector<cublasHandle_t> _cublas_handles;
std::vector<cusolverDnHandle_t> _cusolverDn_handles;
std::vector<cusparseHandle_t> _cusparse_handles;
std::shared_ptr<deviceAllocator> _deviceAllocator;
I'm trying to understand the need for the allocators to be shared_ptrs instead of a unique_ptr member.
Shouldn't the handle be in charge of maintaining the lifetime of the allocator? That is, who else would have a reference to the allocator that requires using a shared_ptr?
I have chosen a shared pointer as this allows the cuML user to reuse an allocator in other places in a safe way. Does that make sense?
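e.g., a sketch of the sharing this enables (handleA/handleB are hypothetical handles; types from the proposal):

std::shared_ptr<ML::deviceAllocator> alloc(new ML::rmmAllocatorAdapter());
handleA.setDeviceAllocator(alloc);
handleB.setDeviceAllocator(alloc);       // both handles share ownership
void* p = alloc->allocate(n, stream);    // and the user can keep using it directly
alloc->deallocate(p, n, stream);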
Yeah, I get that, but couldn't you also accomplish the same safety with a unique_ptr?
My default is to always prefer a unique_ptr to a shared_ptr unless I absolutely need a shared_ptr, and I don't understand well enough yet why a shared_ptr is required here.
My understanding is that there's a one-to-one relationship between an instance of a ML::handle and an instance of an allocator, and as such, the handle should have sole ownership over its corresponding allocator. Is that understanding correct?
From the point of view of anything that is implemented in cuML, yes. However, not from the point of view of a cuML user. If setDeviceAllocator took a unique_ptr, a cuML user couldn't use a deviceAllocator that needs to outlive the cuML handle. Makes sense? Or am I missing something?
> If setDeviceAllocator took a unique_ptr, a cuML user couldn't use a deviceAllocator that needs to outlive the cuML handle. Makes sense? Or am I missing something?

Ah yes, so if someone wants to reuse the same allocator between two cuML API calls, using a unique_ptr would be difficult.
Thanks for explaining it to me, I believe a shared_ptr is absolutely appropriate here.
cuML/src/common/device_buffer.hpp
Outdated
device_buffer& operator=(const device_buffer& other) = delete;

device_buffer(std::shared_ptr<deviceAllocator> allocator, size_type n = 0)
It's a bit cumbersome to always have to pass an allocator as a parameter whenever I want to do an allocation with RAII.
It'd be nicer if it could be a template parameter such that you could have an alias for a device_buffer that always uses my_allocator.
I have thought about that quite a bit. The advantages of passing in an allocator as a runtime argument over a template parameter are:
- Irrespective of the used allocator the containers remain type compatible.
- We can easily have multiple different allocator objects, e.g. one per device, one per thread, ...
With a template argument we would need to reference a global object. Does that make sense to you?
@jirikraus I don't see the …
ah.. never mind my previous comment. The conversations still show up with the old code, whereas the commits themselves have your changes. I got confused by this distinction.
I just added a new section to the proposed developer guide about Multi GPU Programming. Please take a look at it.
Multi GPU programming info LGTM.
cuML/DEVELOPER_GUIDE.md
Outdated
```cpp
void foo( ML::cumlHandle* handle, ...)
{
    cublasHandle_t cublasHandle = handle->getImpl()->getCublasHandle();
@jirikraus all of our cu*Handle objects typically support a setStream method, and IIRC they all have a default stream initialized in them? Maybe we should prohibit devs from calling setStream on these handles?
Or maybe update our cumlHandle_impl::setStream to in turn also set streams for these handles appropriately?
That is a good point. setStream is not really meant for cuML developers; it is meant for cuML users. Therefore I think it is better to pass around const ML::cumlHandle& in cuML, which would prohibit changing the user-provided stream. This also ties in with a comment from @jrhemstad regarding how to pass around the handle.
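e.g., a sketch of the signature this implies for internal functions:

// a const reference cannot be used to call the non-const setStream,
// so the user-provided stream stays untouched (fooImpl is a placeholder)
void fooImpl(const ML::cumlHandle& handle, cudaStream_t stream /*, ... */);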
Sorry Jiri if my previous comments confused you. I meant the following kind of usage pattern. In other words, should we allow developers to perform cublasSetStream on the internal cublas handle, explicitly, for example?
cublasHandle_t cublasHandle = handle->getImpl()->getCublasHandle();
cudaStream_t someStreamPickedByMe;
cublasSetStream(cublasHandle, someStreamPickedByMe);
On one hand, it sounds like a valid use-case if we want to allow devs to use internal streams and schedule cublas/cusolver calls on these streams.
On the other hand, these cublasSetStream kind of calls are persistent, easily causing false dependencies when the same cublasHandle gets called somewhere else in the pipeline!
I also thought a bit more about this, and one possibility would be that we drop the assumption that e.g. the cublasHandle managed by handle has any stream associated with it. Instead one should assume that it is required to set the stream before every cuBLAS call. For cuML and ml-prims developers I think everybody should use the wrappers in cublas_wrappers.h, and we add a stream argument to the interface of those wrappers. E.g. for gemm:
template <typename T>
cublasStatus_t cublasgemm(cublasHandle_t handle, cublasOperation_t transA,
cublasOperation_t transB, int m, int n, int k,
const T *alfa, const T *A, int lda, const T *B,
int ldb, const T *beta, T *C, int ldc, cudaStream_t stream);
which would have the following specialization for dgemm
template <>
inline cublasStatus_t
cublasgemm(cublasHandle_t handle, cublasOperation_t transA,
cublasOperation_t transB, int m, int n, int k, const double *alfa,
const double *A, int lda, const double *B, int ldb,
const double *beta, double *C, int ldc, cudaStream_t stream) {
cublasStatus_t status = cublasSetStream(handle, stream);
if ( CUBLAS_STATUS_SUCCESS != status )
return status;
return cublasDgemm(handle, transA, transB, m, n, k, alfa, A, lda, B, ldb,
beta, C, ldc);
}
The rationale for this idea is that changing streams is a cheap operation, and after an operation is enqueued it is fine to change the stream. The only issue I see with this is potential races in a multi-threaded environment, e.g.
Thread A: sets stream to stream A on handle
Thread B: sets stream to stream B on handle
Thread A: launches Dgemm on handle in stream B (oops!)
Thread B: launches Dgemm on handle in stream B
We could either manage that by providing a cublasHandle_t wrapper containing a lock in addition to the handle, which could be used in the cuBLAS wrappers. Or simply define a cuML/ml-prims developer rule that the wrappers are not thread safe and that one handle should be created per thread? The latter would then require the cumlHandle_t to manage multiple cublasHandles, or that we disallow multi-threading in cuML and ml-prims implementations. Considering the cuBLAS documentation, having one handle per thread is probably what we should prefer:
> 2.1.3. Thread Safety
> The library is thread safe and its functions can be called from multiple host threads, even with the same handle. When multiple threads share the same handle, extreme care needs to be taken when the handle configuration is changed because that change will affect potentially subsequent CUBLAS calls in all threads. It is even more true for the destruction of the handle. So it is not recommended that multiple thread share the same CUBLAS handle.
What do you think?
My opinion (might be biased, because I like the OPG approach!) is that we should disallow the usage of the same cumlHandle across multiple threads.
Your cublas_wrappers idea is a much better one.
Implementation details:
If we allow stream as a default argument in the cublasgemm method you defined above, then I think this change can be made within the branch-0.6 timeframe itself. Else, it'll cause a huge amount of changes across existing ml-prims and cuML code (not to mention the rework needed for ongoing PRs!).
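A sketch of that transitional default (the default would later be removed again, per the follow-up below):

template <typename T>
cublasStatus_t cublasgemm(cublasHandle_t handle, cublasOperation_t transA,
                          cublasOperation_t transB, int m, int n, int k,
                          const T *alfa, const T *A, int lda, const T *B,
                          int ldb, const T *beta, T *C, int ldc,
                          cudaStream_t stream = 0);  // default only for migration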
Have filed issue #256 as our first step towards updating these *_wrapper header files. We shall update the existing methods to expose a stream as a default argument.
Will soon file another issue to remove the default-ness of the stream argument across all ml-prims methods. We should not be launching work on default streams inside cuML or ml-prims!
Agree. It makes sense that the implementation of cuML and ml-prims is single threaded; in that case we won't have the issues I described. I just added two sentences to the Thread safety section of DEVELOPER_GUIDE.md to document it and put it up for discussion.
Thanks @teju85 @jrhemstad for the thorough review and your good feedback. I currently have the following open items on my list: …
Please correct me if I am wrong, but I think I have either addressed your other comments or there is an open discussion regarding them. Thanks, Jiri
Agreed. Better to pass around const-ref. One other item is the … I'm pretty sure we'd almost need all of the info inside the …
@harrism should probably also review this for thoughts on allocator abstractions as well as for ideas about adopting something similar in cuDF.
I am not sure if cumlHandle_impl should be moved to ml-prims. I see that we want to avoid duplicating resources such as cublas or cusolver handles, and that we also don't want to make the ml-prims API extremely verbose. However, I see potential issues with sharing these resources, especially CUDA streams, between a cuML and an ml-prims handle, as that could cause false dependencies. I therefore would suggest introducing a new ml-prims handle and moving the ownership of the allocators there. cumlHandle_impl would hold an ml-prims handle and reference the allocators in there. What do you think?
You'll probably want to add an MPI communicator resource to the handle eventually, but I don't think that needs to block the current PR, as it's mainly an API proposal and you've sufficiently demonstrated how requesting resources will work.
@jirikraus My main motivation behind writing that comment was to reduce duplication of resources (and thus code) between ml-prims and cuML. I'm ok with either a full migration of the impl structure, or a new handle in ml-prims with this impl struct being composed of it. I think it's better addressed in a separate issue/PR.
@harrism thanks a lot for your thorough review and good feedback. I have incorporated almost all of your suggestions.
Regarding one remark to which I could not reply inline for some reason: as you say, that would come with a dependency on the latest Thrust version for device_vector. Regarding the motivation for device_buffer and host_buffer in general, I have added something to the developer guide.
Can one of the admins verify this patch?
This is a proposal for an improved C and C++ API. When reviewing this I suggest starting with cuML/DEVELOPER_GUIDE.md as that explains the concepts used.