
[REVIEW] Proposal for "proper" C/C++ API (issue #92) #247

Merged
merged 27 commits into from
Mar 12, 2019

Conversation

jirikraus
Contributor

This is a proposal for an improved C and C++ API. When reviewing this I suggest starting with cuML/DEVELOPER_GUIDE.md as that explains the concepts used.

Member

@teju85 teju85 left a comment

I really like the developer_guide idea. Thanks Jiri for putting it out!

@dantegd dantegd added the "2 - In Progress" and "proposal" labels Feb 27, 2019
@cjnolet
Member

cjnolet commented Feb 27, 2019

Looks good, Jiri! I'm looking forward to using this.

I am currently using cusparse in UMAP. It would be nice if the cusparse handle could be added to this API as well.

@cjnolet cjnolet changed the title Proposal for "proper" C/C++ API (issue #92) [REVIEW] Proposal for "proper" C/C++ API (issue #92) Feb 27, 2019
@teju85
Member

teju85 commented Feb 27, 2019

@cjnolet didn't realize that you were using cusparse. Can you add cusparse_wrappers, just like the cublas_wrappers and cusolver_wrappers inside ml-prims, please?

@jirikraus
Contributor Author

Thanks for the review @cjnolet. I just added cusparseHandle_t to cumlHandle_impl.


#pragma once

#include <rmm/rmm.h>
Member

So, we expect rmm now to be installed before cuml build, right?

Contributor Author

@jirikraus jirikraus Feb 27, 2019

No. IMHO cuML should not depend on RMM; we should try to keep the external dependencies of cuML to a minimum. This header is just provided as a convenience for users of cuML. It should not be used inside of cuML. E.g. if a user of cuML wants to use RMM for all device allocations, code like this could be used:

#include <cuML.hpp>
#include <rmm/rmm.h>
#include <common/rmmAllocatorAdapter.hpp>

void foo() {
   rmmOptions_t rmmOptions;
   rmmOptions.allocation_mode = PoolAllocation;
   rmmOptions.initial_pool_size = 0;
   rmmOptions.enable_logging = false;
   rmmError_t rmmStatus = rmmInitialize(&rmmOptions);
   if ( RMM_SUCCESS != rmmStatus )
       std::cerr<<"WARN: Could not initialize RMM: "<<rmmGetErrorString(rmmStatus)<<std::endl;

   std::unique_ptr<ML::cumlHandle> cumlHandle( new ML::cumlHandle() );

   std::shared_ptr<ML::rmmAllocatorAdapter> allocator( new ML::rmmAllocatorAdapter() );

   cumlHandle->setDeviceAllocator( allocator );

   cudaStream_t stream;
   cudaStreamCreate( &stream );
   cumlHandle->setStream( stream );
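   // ... calls into cuML algorithms using cumlHandle and stream would go here ...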

   cudaStreamDestroy( stream );
   if ( rmmIsInitialized( NULL ) )
       rmmFinalize();
}

Member

@jirikraus thinking about it, could we put that foo function you just wrote (or something similar) somewhere as a convenience function for creating an RMM-enabled cumlHandle? From the Python library I believe I will always call with RMM as the memory manager for the time being.

Member

Ah! Makes sense now. Thanks Jiri.

Member

@dantegd maybe adding this as a code-block in the doxygen comment would be better for now?

Member

Yeah, probably will end up adding it for the cython call, but in doxygen is fine for now

Contributor Author

@teju85 I agree this should go into a code-block in the doxygen comment. However, before we have concluded the discussion on the API I don't think it makes sense to invest much in documentation, which is the reason I have not added documentation yet.
@dantegd I think this code should only be in the documentation and examples. It needs to be written by the user of cuML, as that is the one who knows whether RMM is already initialized and how they want to initialize it. E.g. if cuML gets called from a Python script which used cuDF to prepare data, RMM was already initialized before. If we exclude RMM init and shutdown the function would only provide

std::shared_ptr<ML::rmmAllocatorAdapter> allocator( new ML::rmmAllocatorAdapter() );
cumlHandle->setDeviceAllocator( allocator );

which IMHO is not worth a separate helper function. Do you agree?

Member

I agree with @jirikraus here. It is not worth such a helper function.

Sure Jiri. Makes sense to document this once we have finalized the interfaces.

if ( n > 0 )
{
_data = static_cast<value_type*>(_allocator->allocate( _capacity*sizeof(value_type), 0 ));
CUDA_CHECK( cudaStreamSynchronize( 0 ) );
Member

I'm a bit confused here. So effectively we are sync'ing on the default stream every time we create a device_buffer?

Contributor Author

Yes, because there is no other stream available in the constructor to pass into the allocator and to synchronize on. I have chosen this approach because I think it would be confusing if a constructor were asynchronous. I would expect that an object is fully usable after construction. If a user provided stream should be used for the allocation and the synchronization should be avoided, the technique described in the developer guide using resize and release can be used. One alternative I can think of is forcing users of device_buffer and host_buffer to use resize by not offering a constructor that takes a size. What do you think?

Contributor

Yeah... you don't really want an asynchronous constructor.

But I don't like that if I want asynchronous allocation/free I can't use RAII.

How about an API that returns a std::future to an instance of a device_buffer allocated on a stream? This still allows for asynchronous memory allocation/free that still uses RAII.
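For reference, a minimal sketch of what such a future-returning factory could look like (hypothetical helper make_device_buffer_async; it assumes device_buffer is movable and gains a stream-aware constructor, which is not the case at this point in the thread):

#include <future>
#include <memory>

template <typename T>
std::future<ML::device_buffer<T>> make_device_buffer_async(
    std::shared_ptr<ML::deviceAllocator> allocator, std::size_t n, cudaStream_t stream )
{
    return std::async( std::launch::async, [=]() {
        // allocate on the caller's stream and only fulfil the future once the
        // allocation is actually usable
        ML::device_buffer<T> buf( allocator, n, stream );
        CUDA_CHECK( cudaStreamSynchronize( stream ) );
        return buf;
    } );
}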

Member

Makes sense. Thanks for the clarification

Contributor Author

I get your point @jrhemstad regarding std::future, however that does not change the issue of the constructor, right? It would replace the constructor with a factory. Also you still have RAII when it is done like this

template<typename T>
void foo( ML::cumlHandle* handle, .., cudaStream_t stream )
{
    ...
    ML::device_buffer<T> temp( handle->getDeviceAllocator(), 0 );
    
    temp.resize(n, stream);
    kernelA<<<grid, block, 0, stream>>>(..., temp.data(), ...);
    kernelB<<<grid, block, 0, stream>>>(..., temp.data(), ...);
    temp.release(stream);
}

in this case the pointer is still owned by temp. I.e. if the call to release does not happen for some reason (e.g. because a developer forgets to call it) the destructor of temp will make sure that the memory is released when temp goes out of scope. Does that make sense to you?

Contributor Author

I just pushed an update for device_buffer and host_buffer. Now the constructor takes a stream as an argument and the device_buffer and host_buffer are tied to that stream. The associated stream can be changed to a different stream by calling resize, reserve or release.
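A usage sketch of the updated containers (the argument order of the new constructor is assumed here and might differ from the actual header):

template<typename T>
void foo( ML::cumlHandle* handle, .., cudaStream_t stream )
{
    ...
    // the buffer is allocated on and tied to the user provided stream,
    // so no synchronization on the default stream is needed anymore
    ML::device_buffer<T> temp( handle->getDeviceAllocator(), stream, n );

    kernelA<<<grid, block, 0, stream>>>(..., temp.data(), ...);
    // optional early release on the same stream; otherwise the destructor cleans up
    temp.release(stream);
}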

Member

Nice. This now looks very aligned with our idea of not having to use default stream in cuml or ml-prims!

@jirikraus jirikraus force-pushed the fea-ext-cuML-cpp-iface branch 2 times, most recently from 1be12f1 to f8ff22a Compare February 27, 2019 16:36
Contributor

@jrhemstad jrhemstad left a comment

Lots of good stuff here. You've given me several ideas that I'd like to adapt to cuDF.

template<typename T>
void foo(ML::cumlHandle* handle, cudaStream_t stream, ... )
{
T* temp_h = handle->getDeviceAllocator()->allocate(n*sizeof(T), stream);
Contributor

I think the spirit of what you're suggesting here is good, but the API is a little verbose.

Maybe something more like:
T* p = handle->device_allocate(n*sizeof(T), stream);

Contributor

Also, what about for more specialized allocations? Like page locked memory? Or UVM? Would there be extra allocators in handle for this usage? Or maybe an enum passed into device_allocate that describes the kind of allocation you want.

Member

@jrhemstad IIUC, page locked memory is what is being proposed with the host_allocator concept below.

Member

Liked your suggestion about reducing verbosity.

Contributor Author

Interesting idea regarding the specialized allocators for Unified Memory and pinned host memory. However, before we add these kinds of things I would prefer to see the use case first. Currently I know that we need temporary host memory and temporary device memory. The interface is sufficient to support that and already gives the user of cuML quite some flexibility, e.g. they could pass in an allocator that allocates pinned host memory or one that allocates pageable host memory (the current default is pinned host memory). Similarly, for temporary device memory they could use Unified Memory, pinned host memory or plain device memory.
Regarding the less verbose API: I actually had that in an earlier version I tried, but preferred a cleaner interface that does not duplicate functionality between classes. Also, ML::device_buffer and ML::host_buffer should be preferred for allocations IMHO, and a verbose interface makes the alternative less attractive. However, I do not have a very strong opinion on this one, so I am open to providing a device_allocate and host_allocate in cumlHandle.

Contributor

I would prefer to see the use case first.

I have no idea what the use-case may be :) This is my first time looking at cuML code. That said, we can never predict the future, so it's safest to assume that if something can happen (like wanting to use UVM), it probably will at some point and the design should be able to handle that.

Similar for temporary device memory he could use Unified Memory, pinned host memory or pinned device memory.

This would require the user defining their own custom allocator then? I think that's fine, just making sure I understand.

Also the ML::device_buffer and ML::host_buffer should be preferred for allocations IMHO. A verbose interface make the alternative less attractive.

100% agreed that RAII constructs should be preferred. I had a sneaking suspicion that you left the malloc/free-like API verbose to discourage people from using it :)

Contributor Author

I would prefer to see the use case first.

I have no idea what the use-case may be :) This is my first time looking at cuML code. That said, we can never predict the future, so it's safest to assume that if something can happen (like wanting to use UVM), it probably will at some point and the design should be able to handle that.

I would say the current design is flexible enough for that. Once the need comes up we can add allocation routines to the proposed allocator interfaces with default implementations, e.g. for managed memory calling through to cudaMallocManaged(). Since we won't need to make allocations with a new type of memory before that happens, this won't require touching existing cuML code. And cuML users affected by the default implementation can easily override it. Does that seem reasonable to you?
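To make the "add it later" point concrete: a managed memory variant would just be another implementation of the same interface (sketch only, using the allocate/deallocate signatures from this proposal; the class name is hypothetical):

class managedAllocator : public ML::deviceAllocator {
public:
    virtual void* allocate( std::size_t n, cudaStream_t ) {
        void* ptr = 0;
        CUDA_CHECK( cudaMallocManaged( &ptr, n ) );
        return ptr;
    }
    virtual void deallocate( void* p, std::size_t, cudaStream_t ) {
        CUDA_CHECK( cudaFree( p ) );
    }
};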

Similar for temporary device memory he could use Unified Memory, pinned host memory or pinned device memory.

This would require the user defining their own custom allocator then? I think that's fine, just making sure I understand.

Yes. An example of a custom allocator, based on cub::CachingDeviceAllocator, is

class cachingDeviceAllocator : public ML::deviceAllocator {
public:
    cachingDeviceAllocator()
        : _allocator(8, 3, cub::CachingDeviceAllocator::INVALID_BIN, cub::CachingDeviceAllocator::INVALID_SIZE)
    {}

    virtual void* allocate( std::size_t n, cudaStream_t stream ) {
        void* ptr = 0;
        _allocator.DeviceAllocate( &ptr, n, stream );
        return ptr;
    }

    virtual void deallocate( void* p, std::size_t, cudaStream_t ) {
        _allocator.DeviceFree(p);
    }
    
private:
    cub::CachingDeviceAllocator _allocator;
};

That would be passed to cuML like this

std::shared_ptr<ML::deviceAllocator> allocator( new cachingDeviceAllocator() );
cumlHandle->setDeviceAllocator( allocator );

Member

Let's just keep the current verbose API as it is. Depending on which one of these interfaces (device_buffer or allocate calls) folks end up preferring, we can then revisit the API restructuring.

handle->getHostAllocator()->deallocate(temp_h, n*sizeof(T), stream);
}
```
Small host memory heap allocations, e.g. as internally done by STL containers, are fine, e.g. an `std::vector` managing only a handful of integers.
Contributor

If you're providing a custom allocator for host memory, it's straightforward to provide an allocator for std::vector and similar containers. You could then even provide an alias like cuml::vector that is a std::vector using your custom allocator.

We do this in cuDF with rmm::device_vector -> thrust::device_vector using RMM for the allocator.
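A rough C++11 sketch of such an adapter on top of the proposed host allocator interface (here assumed to be ML::hostAllocator; the stdAllocatorAdapter name is hypothetical, and a fully conforming allocator would also need rebinding, cross-type copy construction and equality operators):

template <typename T>
class stdAllocatorAdapter {
public:
    typedef T value_type;

    stdAllocatorAdapter( std::shared_ptr<ML::hostAllocator> allocator, cudaStream_t stream )
        : _allocator(allocator), _stream(stream) {}

    T* allocate( std::size_t n ) {
        return static_cast<T*>( _allocator->allocate( n*sizeof(T), _stream ) );
    }
    void deallocate( T* p, std::size_t n ) {
        _allocator->deallocate( p, n*sizeof(T), _stream );
    }

private:
    std::shared_ptr<ML::hostAllocator> _allocator;
    cudaStream_t _stream;
};

// e.g.: std::vector<int, stdAllocatorAdapter<int>> v( n, 0, stdAllocatorAdapter<int>( handle->getHostAllocator(), stream ) );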

Contributor Author

I have thought about that quite a bit. The advantages of passing in an allocator as a runtime argument over a template parameter are:

  • Irrespective of the allocator used, the containers remain type compatible.
  • We can easily have multiple different allocator objects, e.g. one per device, one per thread, ...

With a template argument we would need to reference a global object. Does that make sense to you?

Contributor

All good points.

I suppose since you are providing a host allocator and host_buffer, there's little reason for someone to use a std::vector.

Another point in favor of "allocators as parameters" vs "template parameters" is that it allows stateful allocators, e.g., an allocator w/ a stream. This wouldn't be possible with an allocator as a template argument.

Member

You don't have to reference a global argument with the latest Thrust device_vector -- the constructor has an allocator parameter. The same is true of std::vector: https://en.cppreference.com/w/cpp/container/vector/vector (see constructor #2 for example)

vector( size_type count, const T& value, const Allocator& alloc = Allocator());

void foo( ML::cumlHandle* handle, .., cudaStream_t stream )
{
...
ML::device_buffer<T> temp( handle->getDeviceAllocator(), 0 )
Contributor

Is the implication of calling it buffer instead of vector that the memory is not default-initialized? I.e., a std::vector and thrust::device_vector are default initialized, which oftentimes is unnecessary and you can save some performance by avoiding the initialization.

Contributor Author

@jirikraus jirikraus Feb 28, 2019

The intention of calling it buffer instead of vector is to make clear that this is a simple container enabling:

  • asynchronous RAII
  • use of the proposed allocator interface

and not to offer the interface of a std::vector or thrust::device_vector, as that would duplicate quite some code and come with a significant need for testing.

void foo( ML::cumlHandle* handle, .. )
{
ML::thrustAllocatorAdapter alloc( handle->getDeviceAllocator(), handle->getStream() );
auto execution_policy = thrust::cuda::par(alloc).on(handle->getStream());
Contributor

It's possible to do this in a one-liner, we do this in RMM.

https://github.com/rapidsai/rmm/blob/branch-0.6/include/rmm/thrust_rmm_allocator.h#L94

Note that the use of a unique_ptr is a workaround for a deficiency in Thrust's execution policies being unable to bind to temporaries, but that will change in the next release of Thrust.

Contributor Author

Thanks for pointing this out. This is a great idea. I just tried it, but to my understanding the implementation of exec_policy requires C++14 and I could not find an elegant way to express it with C++11 (which cuML is currently at). So I added an implementation of exec_policy only conditionally and updated the developer guide mentioning the new possibilities with C++14. Let me know if I am missing something and we can make it work with C++11.

Contributor

@jrhemstad jrhemstad Feb 28, 2019

C++14 is only required for the auto return type.

It's possible to support C++11 using trailing return type: https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.3.0/com.ibm.zos.v2r3.cbclx01/trailing_return.htm

It would just be a little more verbose, something like...

using T = decltype(thrust::cuda::par(rmm_allocator<char>{}));
inline auto exec_policy(cudaStream_t stream = 0) -> std::unique_ptr<T, std::function<void(T*)>>;

But again, the unique_ptr ugliness goes away w/ the update to Thrust.

inline auto exec_policy(cudaStream_t stream = 0) {
  return thrust::cuda::par(rmm_allocator<char>{stream}).on(stream);
}

And then usage is just:

thrust::sort(exec_policy(stream), ...);

Contributor Author

Thanks @jrhemstad I could make it work with C++11 and just pushed the related updates.


void cumlHandle_impl::destroyResources()
{
while ( !_cusparse_handles.empty() )
Contributor

This is purely an aesthetic design critique, but a repeated pattern like this usually tells me there's a need for a more generic abstraction.

Perhaps a for_each over the elements of each vector and then vector.clear().
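For one of the vectors that could look roughly like this (sketch; it assumes a CUSPARSE_CHECK-style error macro exists alongside CUDA_CHECK):

#include <algorithm>

std::for_each( _cusparse_handles.begin(), _cusparse_handles.end(),
               []( cusparseHandle_t h ) { CUSPARSE_CHECK( cusparseDestroy(h) ); } );
_cusparse_handles.clear();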

Contributor Author

I agree with your statement. However, given that I need to call a different destroy function for every vector I do not see how I would save a lot of code even using lambdas and for_each. However I am open to concrete suggestions :-)

Contributor

However I am open to concrete suggestions :-)

Fair enough!

Okay, so... this is a bit more of an invasive change, but hear me out because I think it's beneficial for more than just solving this problem.

So for your vectors of resources, instead of maintaining a vector of the "raw" resource, you could create a simple wrapper class for each resource. For example, streams:

struct stream_wrapper{

   stream_wrapper(){
      CUDA_CHECK( cudaStreamCreate(&stream) );
   }

   ~stream_wrapper(){
      CUDA_CHECK( cudaStreamDestroy(stream) );
   }

   // This allows using a `stream_wrapper` implicitly as if it were a cudaStream_t*,
   // i.e., the fact that it's a wrapper is invisible to the user
   operator cudaStream_t*() { return &stream; }

private:
  cudaStream_t stream;
};

And then... your vector of cudaStream_t would become a vector of stream_wrappers...

std::vector<stream_wrapper>           _streams;

And then in destroyResources you can simply do _streams.clear() and let the wrapper's destructor handle all of the resource specific cleanup!

You could then create similar wrappers for all of the other resources.

What do you think?

Contributor Author

I see how this would make the code safer. However it will also add quite some code. Since this is an implementation detail and we are primarily discussing the API here I would leave the destroy resources code as it currently is. Any objections?

Contributor

I see how this would make the code safer. However it will also add quite some code. Since this is an implementation detail and we are primarily discussing the API here I would leave the destroy resources code as it currently is. Any objections?

Absolutely just an implementation detail, don't let it block this PR.

Member

@jrhemstad can you file an issue on this one so that we don't miss it, please?

std::vector<cublasHandle_t> _cublas_handles;
std::vector<cusolverDnHandle_t> _cusolverDn_handles;
std::vector<cusparseHandle_t> _cusparse_handles;
std::shared_ptr<deviceAllocator> _deviceAllocator;
Contributor

I'm trying to understand the need for the allocators to be shared_ptrs instead of a unique_ptr member.

Shouldn't the handle be in charge of maintaining the lifetime of the allocator? That is, who else would have a reference to the allocator that requires using a shared_ptr?

Contributor Author

I have chosen a shared pointer as this allows the cuML user to reuse an allocator in other places in a safe way. Does that make sense?

Contributor

Yeah, I get that, but couldn't you also accomplish the same safety with a unique_ptr?

My default is to always prefer a unique_ptr to a shared_ptr unless I absolutely need a shared_ptr, and I don't understand well enough yet why a shared_ptr is required here.

My understanding is that there's a one-to-one relationship between an instance of a ML::handle and an instance of an allocator, and as such, the handle should have sole ownership over its corresponding allocator. Is that understanding correct?

Contributor Author

@jirikraus jirikraus Feb 28, 2019

From the point of view of anything that is implemented in cuML, yes. However not from the point of view of a cuML user. If setDeviceAllocator took a unique_ptr, a cuML user couldn't use a deviceAllocator that needs to outlive the cuML handle. Makes sense? Or am I missing something?
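In other words, the user-side reuse that motivates shared ownership looks roughly like this (sketch, reusing the earlier rmmAllocatorAdapter example):

// the cuML user keeps ownership and can hand the same allocator to several handles;
// it is only freed once the last owner goes away
std::shared_ptr<ML::rmmAllocatorAdapter> allocator( new ML::rmmAllocatorAdapter() );

ML::cumlHandle handleA;
ML::cumlHandle handleB;
handleA.setDeviceAllocator( allocator );
handleB.setDeviceAllocator( allocator );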

Contributor

If setDeviceAllocator took a unique_ptr, a cuML user couldn't use a deviceAllocator that needs to outlive the cuML handle. Makes sense? Or am I missing something?

Ah yes, so if someone wants to reuse the same allocator between two cuML API calls, using a unique_ptr would be difficult.

Thanks for explaining it to me, I believe a shared_ptr is absolutely appropriate here.


device_buffer& operator=(const device_buffer& other) = delete;

device_buffer(std::shared_ptr<deviceAllocator> allocator, size_type n = 0)
Contributor

It's a bit cumbersome to always have to pass an allocator as a parameter whenever I want to do an allocation with RAII.

It'd be nicer if it could be a template parameter such that you could have an alias for a device_buffer that always uses my_allocator.

Contributor Author

@jirikraus jirikraus Feb 28, 2019

I have thought about that quite a bit. The advantages of passing in an allocator as a runtime argument over a template parameter are:

  • Irrespective of the allocator used, the containers remain type compatible.
  • We can easily have multiple different allocator objects, e.g. one per device, one per thread, ...

With a template argument we would need to reference a global object. Does that make sense to you?

@teju85
Member

teju85 commented Feb 28, 2019

@jirikraus I don't see the nullptr updates in the latest changes?

@teju85
Member

teju85 commented Feb 28, 2019

ah.. never mind my previous comment. The conversations still show up with the old code, whereas the commits themselves have your changes. I got confused by this distinction.

@jirikraus
Contributor Author

I just added a new section to the proposed developer guide about Multi GPU Programming. Please take a look at it.

@teju85
Member

teju85 commented Feb 28, 2019

Multi GPU programming info LGTM.

```cpp
void foo( ML::cumlHandle* handle, ...)
{
cublasHandle_t cublasHandle = handle->getImpl()->getCublasHandle();
Member

@jirikraus all of our cu*Handle objects typically support a setStream method and IIRC they all have the default stream initialized in them? Maybe we should prohibit devs from calling setStream on these handles?

Member

Or maybe update our cumlHandle_impl::setStream to in turn also set streams for these handles appropriately?
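A sketch of that alternative (the _userStream member and the *_CHECK error macros are assumed; the thread below ends up preferring stream-taking wrappers instead):

void cumlHandle_impl::setStream( cudaStream_t stream )
{
    _userStream = stream;
    // propagate the user provided stream to all library handles owned by the impl
    for ( auto h : _cublas_handles )     CUBLAS_CHECK( cublasSetStream( h, stream ) );
    for ( auto h : _cusolverDn_handles ) CUSOLVER_CHECK( cusolverDnSetStream( h, stream ) );
    for ( auto h : _cusparse_handles )   CUSPARSE_CHECK( cusparseSetStream( h, stream ) );
}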

Contributor Author

That is a good point. setStream is not really meant for cuML developers; it is meant for cuML users. Therefore I think it is better to pass around a const ML::cumlHandle& inside cuML, which would prohibit changing the user provided stream. This also ties in with a comment from @jrhemstad regarding how to pass around the handle.

Member

Sorry Jiri, if my previous comments confused you. I meant the following kind of usage pattern. In other words, should we allow developers to perform cublasSetStream on the internal cublas handle, explicitly, for example?

cublasHandle_t cublasHandle = handle->getImpl()->getCublasHandle();
cudaStream_t someStreamPickedByMe;
cublasSetStream(cublasHandle, someStreamPickedByMe);

Member

On one hand, it sounds like a valid use-case if we want to allow devs to use internal streams and schedule cublas/cusolver calls on these streams.

On the other hand, these cublasSetStream kind of calls are persistent, easily causing false dependencies when the same cublasHandle gets called somewhere else in the pipeline!

Contributor Author

I also thought a bit more about this, and one possibility would be that we drop the assumption that e.g. the cublasHandle managed by handle has any stream associated with it. Instead one should assume that it is required to set the stream before every cuBLAS call. For cuML and ml-prims developers I think everybody should use the wrappers in cublas_wrappers.h, and we add a stream argument to the interface of those wrappers. E.g. for gemm

template <typename T>
cublasStatus_t cublasgemm(cublasHandle_t handle, cublasOperation_t transA,
                          cublasOperation_t transB, int m, int n, int k,
                          const T *alfa, const T *A, int lda, const T *B,
                          int ldb, const T *beta, T *C, int ldc, cudaStream_t stream);

which would have the following specialization for dgemm

template <>
inline cublasStatus_t
  cublasgemm(cublasHandle_t handle, cublasOperation_t transA,
             cublasOperation_t transB, int m, int n, int k, const double *alfa,
             const double *A, int lda, const double *B, int ldb,
             const double *beta, double *C, int ldc, cudaStream_t stream) {
  cublasStatus_t status = cublasSetStream(handle, stream);
  if ( CUBLAS_STATUS_SUCCESS != status )
    return status;
  return cublasDgemm(handle, transA, transB, m, n, k, alfa, A, lda, B, ldb,
                     beta, C, ldc);
}

Rationale for this idea is that changing streams is a cheap operation and after an operation is enqueued it is fine to change the stream. The only issue I see with this is potential races in a multi-threaded environment, e.g.
Thread A: sets stream to stream A on handle
Thread B: sets stream to stream B on handle
Thread B: launches Dgemm on handle in stream B
Thread A: launches Dgemm on handle in stream B (oops!)
We could either manage that by providing a cublasHandle_t wrapper containing a lock in addition to the handle, which could be used in the cuBLAS wrappers. Or simply define a cuML/ml-prims developer rule that the wrappers are not thread safe and that one handle should be created per thread? The latter would then require the cumlHandle_t to manage multiple cublasHandles, or that we disallow multi-threading for the cuML/ml-prims implementation. Considering the cuBLAS documentation, having one handle per thread is probably what we should prefer:

2.1.3. Thread Safety
The library is thread safe and its functions can be called from multiple host threads, even with the same handle. When multiple threads share the same handle, extreme care needs to be taken when the handle configuration is changed because that change will affect potentially subsequent CUBLAS calls in all threads. It is even more true for the destruction of the handle. So it is not recommended that multiple thread share the same CUBLAS handle.

What do you think?
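For completeness, the lock-holding wrapper alternative mentioned above could look roughly like this (hypothetical lockedCublasHandle / lockedDgemm; the discussion below settles on one handle per thread and a single-threaded implementation instead):

#include <mutex>

// the handle plus the lock protecting its mutable stream configuration
struct lockedCublasHandle {
    std::mutex mutex;
    cublasHandle_t handle;
};

inline cublasStatus_t lockedDgemm( lockedCublasHandle& h, cublasOperation_t transA,
                                   cublasOperation_t transB, int m, int n, int k,
                                   const double *alfa, const double *A, int lda,
                                   const double *B, int ldb, const double *beta,
                                   double *C, int ldc, cudaStream_t stream )
{
    // hold the lock while the stream is set and the call is enqueued
    std::lock_guard<std::mutex> guard( h.mutex );
    cublasStatus_t status = cublasSetStream( h.handle, stream );
    if ( CUBLAS_STATUS_SUCCESS != status )
        return status;
    return cublasDgemm( h.handle, transA, transB, m, n, k, alfa, A, lda, B, ldb,
                        beta, C, ldc );
}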

Member

My opinion (might be biased, because I like the OPG approach!) is that we should disallow the usage of the same cumlHandle across multiple threads.

Your cublas_wrappers idea is a much better one.

Implementation details:
If we allow stream as a default option in the above cublasgemm method you defined, then I think this change can be made within the branch-0.6 timeframe itself. Otherwise, it'll cause a huge amount of changes across existing ml-prims and cuML code (not to mention the rework needed for ongoing PRs!).

Member

Have filed issue #256 as our first step towards updating these *_wrapper header files. We shall update the existing methods to expose a stream as a default argument.

Will soon file another issue to remove the default-ness of stream argument across all ml-prims methods. We should not be launching work on default streams inside cuML or ml-prims!

Contributor Author

Agree. It makes sense that the implementation of cuML and ml-prims is single threaded; in this case we won't have the issues I described. I just added two sentences to the Thread safety section of DEVELOPER_GUIDE.md to document it and put it up for discussion.

@jirikraus jirikraus force-pushed the fea-ext-cuML-cpp-iface branch from af9e567 to 15fcf12 Compare February 28, 2019 13:28
@jirikraus
Contributor Author

Thanks @teju85 @jrhemstad for the thorough review and your good feedback. I currently have the following open items on my list:

  • How to pass around cumlHandle? I currently think this should be as a const reference. I will work on a change.
  • The name of cumlHandle. I agree that ML::cumlHandle is a duplication. I will update it to ML::handle.
  • How to handle internal streams?

Please correct me if I am wrong, but I think I either addressed your other comments or there is an open discussion regarding them.

Thanks

Jiri

@teju85
Member

teju85 commented Feb 28, 2019

passing around cumlHandle

Agreed. Better to pass around const-ref.

One other item is the cumlHandle_impl

I'm pretty sure we'd need almost all of the info inside the cumlHandle_impl object (stream, cublas, cusolver and soon cusparse handles) inside ml-prims. As such it also makes sense to migrate this impl class (please rename it to something meaningful) into ml-prims itself. And since the dependency is from cuML to ml-prims (and should never be the other way around), this migration should be fine. What say?

@jrhemstad
Contributor

@harrism should probably also review this for thoughts on allocator abstractions as well as for ideas about adopting something similar in cuDF.

@jirikraus jirikraus force-pushed the fea-ext-cuML-cpp-iface branch from 15fcf12 to 169c902 Compare February 28, 2019 20:59
@jirikraus
Contributor Author

One other item is the cumlHandle_impl

I'm pretty sure we'd need almost all of the info inside the cumlHandle_impl object (stream, cublas, cusolver and soon cusparse handles) inside ml-prims. As such it also makes sense to migrate this impl class (please rename it to something meaningful) into ml-prims itself. And since the dependency is from cuML to ml-prims (and should never be the other way around), this migration should be fine. What say?

I am not sure if cumlHandle_impl should be moved to ml-prims. I see that we want to avoid duplicating resources such as cublas or cusolver handles and that we also don't want to make the ml-prims API extremely verbose. However, I see potential issues with sharing these resources, especially CUDA streams, between a cuML and an ml-prims handle, as that could cause false dependencies. I would therefore suggest introducing a new ml-prims handle and moving the ownership of the allocators there. cumlHandle_impl would hold an ml-prims handle and reference the allocators in there. What do you think?

Contributor

@jrhemstad jrhemstad left a comment

You'll probably want to add a MPI communicator resource to the handle eventually, but I don't think that needs to block the current PR as it's mainly an API proposal and you've sufficiently demonstrated how requesting resources will work.

@teju85
Member

teju85 commented Mar 1, 2019

@jirikraus My main motivation behind writing that comment was to reduce duplication of resources (and thus code) between ml-prims and cuML. I'm ok with either full migration of the impl structure or a new handle in ml-prims with this impl struct being composed of it.

I think it's better addressed in a separate issue/PR.

@jirikraus jirikraus force-pushed the fea-ext-cuML-cpp-iface branch from 8cc4d3a to 694e956 Compare March 8, 2019 20:58
@jirikraus
Contributor Author

@harrism thanks a lot for your thorough review and good feedback. I have incorporated almost all of your suggestions for DEVELOPER_GUIDE.md and hope that I did not miss addressing any of your points. With the latest set of commits I changed the following:

  • Added an allocator adapter for STL containers, plus a section in the DEVELOPER_GUIDE.md highlighting it.
  • Dropped the multi GPU part of cumlHandle_impl
  • Updated DEVELOPER_GUIDE.md based on suggestions from @harrism plus a few additions
    • Added an empty section about Testing and Unit Testing.
    • Added a section about Documentation stating that APIs should be documented with doxygen syntax.
  • Added a first version of doxygen documentation to the public interface (for internal interfaces like cumlHandle_impl this is still missing). I would prefer if we can merge this as it is and add the missing documentation over time in a collaborative effort.

Regarding one remark to which I could not reply inline for some reason:

You don't have to reference a global argument with the latest Thrust device_vector -- the constructor has an allocator parameter. The same is true of std::vector:
https://en.cppreference.com/w/cpp/container/vector/vector (see constructor #2 for example)

vector( size_type count, const T& value, const Allocator& alloc = Allocator());

As you say, that would come with a dependency on the latest Thrust version for device_vector. Regarding the motivation for device_buffer and host_buffer in general I have added something to the developer guide.

@GPUtester
Contributor

Can one of the admins verify this patch?

@dantegd dantegd merged commit 49dfd22 into rapidsai:branch-0.6 Mar 12, 2019
Salonijain27 pushed a commit to Salonijain27/cuml that referenced this pull request Jan 22, 2020
[HOTFIX] [REVIEW] updated the ridge regression notebook.  Merging on @dantegd 's approval.