[REVIEW] Proposal for "proper" C/C++ API (issue #92) #247
Conversation
I really like the developer_guide idea. Thanks Jiri for putting it out!
Looks good, Jiri! I'm looking forward to using this. I am currently using cusparse in UMAP. It would be nice if the cusparse handle could be added to this API as well.
@cjnolet didn't realize that you were using cusparse. Can you add cusparse_wrappers, just like the cublas_wrappers and cusolver_wrappers, inside ml-prims, please?
Thanks for the review @cjnolet I just added cusparseHandle_t to cumlHandle_impl.
#pragma once

#include <rmm/rmm.h>
So, we expect RMM now to be installed before the cuML build, right?
No. IMHO cuML should not depend on RMM; we should try to keep the external dependencies of cuML minimal. This header is just provided as a convenience for users of cuML. It should not be used inside of cuML. E.g. if a user of cuML wants to use RMM for all device allocations, code like this could be used:
#include <iostream>
#include <memory>

#include <cuML.hpp>
#include <rmm/rmm.h>
#include <common/rmmAllocatorAdapter.hpp>

void foo() {
    rmmOptions_t rmmOptions;
    rmmOptions.allocation_mode = PoolAllocation;
    rmmOptions.initial_pool_size = 0;
    rmmOptions.enable_logging = false;
    rmmError_t rmmStatus = rmmInitialize(&rmmOptions);
    if ( RMM_SUCCESS != rmmStatus )
        std::cerr<<"WARN: Could not initialize RMM: "<<rmmGetErrorString(rmmStatus)<<std::endl;
    std::unique_ptr<ML::cumlHandle> cumlHandle( new ML::cumlHandle() );
    std::shared_ptr<ML::rmmAllocatorAdapter> allocator( new ML::rmmAllocatorAdapter() );
    cumlHandle->setDeviceAllocator( allocator );
    cudaStream_t stream;
    cudaStreamCreate( &stream );
    cumlHandle->setStream( stream );
    // ... run cuML algorithms with cumlHandle ...
    cudaStreamDestroy( stream );
    if ( rmmIsInitialized( NULL ) )
        rmmFinalize();
}
@jirikraus thinking about it, could we put that foo function you just wrote (or something similar) somewhere as a convenience function for creating an RMM-enabled cumlHandle? From the Python library I believe I will always call with RMM as the memory manager for the time being.
Ah! Makes sense now. Thanks Jiri.
@dantegd maybe adding this as a code-block in the doxygen comment would be better for now?
Yeah, probably will end up adding it for the cython call, but in doxygen it's fine for now.
@teju85 I agree this should go into a code-block in the doxygen comment. However, before we have concluded the discussion on the API I don't think it makes sense to invest much in documentation, which is the reason I have not added documentation yet.
@dantegd I think this code should only be in the documentation and examples. It needs to be written by the user of cuML, as that is the one who knows if RMM is already initialized and how they want to initialize it. E.g. if cuML gets called from a Python script which used cuDF to prepare data, RMM was already initialized before. If we exclude RMM init and shutdown the function would only provide
std::shared_ptr<ML::rmmAllocatorAdapter> allocator( new ML::rmmAllocatorAdapter() );
cumlHandle->setDeviceAllocator( allocator );
which IMHO is not worth a separate helper function. Do you agree?
I agree with @jirikraus here. It is not worth such a helper function.
Sure Jiri. Makes sense to document this once we have finalized the interfaces.
cuML/src/common/device_buffer.hpp
Outdated
if ( n > 0 )
{
    _data = static_cast<value_type*>(_allocator->allocate( _capacity*sizeof(value_type), 0 ));
    CUDA_CHECK( cudaStreamSynchronize( 0 ) );
I'm a bit confused here. So effectively we are sync'ing on the default stream every time we create a device_buffer?
Yes, because there is no other stream available in the constructor to pass into the allocator and to synchronize on. I have chosen this approach because I think it would be confusing if a constructor were asynchronous. I would expect that an object is fully usable after construction. If a user-provided stream should be used for the allocation and the synchronization should be avoided, the technique described in the developer guide using resize and release could be used. One alternative I can think of is forcing users of device_buffer and host_buffer to use resize by not offering a constructor that takes a size. What do you think?
Yeah... you don't really want an asynchronous constructor.
But I don't like that if I want asynchronous allocation/free I can't use RAII.
How about an API that returns a std::future to an instance of a device_buffer allocated on a stream? This still allows for asynchronous memory allocation/free while keeping RAII.
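A minimal sketch of that future-based idea, assuming device_buffer gains a move constructor; make_device_buffer_async is a hypothetical name, not part of the proposal:

#include <future>

// Returns a future that becomes ready once the buffer is safe to use, so the
// allocation and synchronization happen off the calling thread (sketch only).
template <typename T>
std::future<ML::device_buffer<T>> make_device_buffer_async(
    std::shared_ptr<ML::deviceAllocator> allocator, std::size_t n, cudaStream_t stream)
{
    return std::async(std::launch::async, [allocator, n, stream]() {
        ML::device_buffer<T> buf(allocator, n);      // allocate via the allocator
        CUDA_CHECK(cudaStreamSynchronize(stream));   // ready only when usable
        return buf;                                  // moved into the future
    });
}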
Makes sense. Thanks for the clarification
I get your point @jrhemstad regarding std::future, however that does not change the issue of the constructor, right? It would replace the constructor with a factory. Also, you still have RAII when it is done like this:
template<typename T>
void foo( ML::cumlHandle* handle, .., cudaStream_t stream )
{
    ...
    ML::device_buffer<T> temp( handle->getDeviceAllocator(), 0 );
    temp.resize(n, stream);
    kernelA<<<grid, block, 0, stream>>>(..., temp.data(), ...);
    kernelB<<<grid, block, 0, stream>>>(..., temp.data(), ...);
    temp.release(stream);
}
in this case the pointer is still owned by temp. I.e. if the call to release does not happen for some reason (e.g. because a developer forgets to call it) the destructor of temp will make sure that the memory is released when temp goes out of scope. Does that make sense to you?
I just pushed an update for device_buffer and host_buffer. Now the constructor takes a stream as an argument and the device_buffer and host_buffer are tied to that stream. The associated stream can be changed by calling resize, reserve or release.
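For illustration, usage with the updated, stream-tied constructor might then look like this (exact parameter order assumed from the description above; kernelA is a placeholder):

void bar(ML::cumlHandle* handle, std::size_t n, cudaStream_t stream)
{
    // the buffer is tied to `stream`; no default-stream synchronization needed
    ML::device_buffer<float> temp(handle->getDeviceAllocator(), stream, n);
    kernelA<<<256, 128, 0, stream>>>(temp.data(), n);
    temp.release(stream);  // frees on the same stream the buffer is tied to
}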
Nice. This now looks very aligned with our idea of not having to use the default stream in cuML or ml-prims!
Lots of good stuff here. You've given me several ideas that I'd like to adapt to cuDF.
cuML/DEVELOPER_GUIDE.md
Outdated
template<typename T>
void foo(ML::cumlHandle* handle, cudaStream_t stream, ... )
{
    T* temp_h = handle->getDeviceAllocator()->allocate(n*sizeof(T), stream);
I think the spirit of what you're suggesting here is good, but the API is a little verbose.
Maybe something more like:
T* p = handle->device_allocate(n*sizeof(T), stream);
Also, what about more specialized allocations, like page-locked memory or UVM? Would there be extra allocators in handle for this usage? Or maybe an enum passed into device_allocate that describes the kind of allocation you want.
@jrhemstad IIUC, page locked memory is what is being proposed with the host_allocator concept below.
Liked your suggestion about reducing verbosity.
Interesting idea regarding the specialized allocators for Unified Memory and pinned host memory. However, before we add these kinds of things I would prefer to see the use case first. Currently I know that we need temporary host memory and temporary device memory. The interface is sufficient to support that and already gives the user of cuML quite some flexibility, e.g. they could pass in an allocator that allocates pinned host memory or one that allocates pageable host memory (the current default is pinned host memory). Similarly for temporary device memory they could use Unified Memory, pinned host memory or plain device memory.
Regarding the less verbose API: I actually had that in an earlier version I tried, but preferred a cleaner interface that does not duplicate functionality between classes. Also, the ML::device_buffer and ML::host_buffer should be preferred for allocations IMHO. A verbose interface makes the alternative less attractive. However I do not have a very strong opinion on this one, so I am open to providing a device_allocate and host_allocate in cumlHandle.
> I would prefer to see the use case first.

I have no idea what the use-case may be :) This is my first time looking at cuML code. That said, we can never predict the future, so it's safest to assume that if something can happen (like wanting to use UVM), it probably will at some point, and the design should be able to handle that.

> Similarly for temporary device memory they could use Unified Memory, pinned host memory or plain device memory.

This would require the user defining their own custom allocator then? I think that's fine, just making sure I understand.

> Also, the ML::device_buffer and ML::host_buffer should be preferred for allocations IMHO. A verbose interface makes the alternative less attractive.

100% agreed that RAII constructs should be preferred. I had a sneaking suspicion that you left the malloc/free-like API verbose to discourage people from using it :)
>> I would prefer to see the use case first.
>
> I have no idea what the use-case may be :) This is my first time looking at cuML code. That said, we can never predict the future, so it's safest to assume that if something can happen (like wanting to use UVM), it probably will at some point, and the design should be able to handle that.

I would say the current design is flexible enough for that. Once the need comes up we can add allocation routines to the proposed allocator interfaces with default implementations, e.g. for managed memory calling through to cudaMallocManaged(). Since we won't need to make allocations with a new type of memory before that happens, it won't require touching old cuML code. And cuML users affected by the default implementation can easily override it. Does that seem reasonable to you?
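As a sketch of what such a later extension could look like (allocateManaged is a hypothetical addition, not part of the current proposal):

class deviceAllocator {
public:
    virtual void* allocate(std::size_t n, cudaStream_t stream) = 0;
    virtual void deallocate(void* p, std::size_t n, cudaStream_t stream) = 0;
    // hypothetical future routine with a default implementation that
    // cuML users can override in their own allocator
    virtual void* allocateManaged(std::size_t n, cudaStream_t) {
        void* p = nullptr;
        CUDA_CHECK(cudaMallocManaged(&p, n));
        return p;
    }
    virtual ~deviceAllocator() {}
};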
>> Similarly for temporary device memory they could use Unified Memory, pinned host memory or plain device memory.
>
> This would require the user defining their own custom allocator then? I think that's fine, just making sure I understand.

Yes. An example of a custom allocator that falls back to cub::CachingDeviceAllocator is:
class cachingDeviceAllocator : public ML::deviceAllocator {
public:
    cachingDeviceAllocator()
        : _allocator(8, 3, cub::CachingDeviceAllocator::INVALID_BIN, cub::CachingDeviceAllocator::INVALID_SIZE)
    {}
    virtual void* allocate( std::size_t n, cudaStream_t stream ) {
        void* ptr = 0;
        _allocator.DeviceAllocate( &ptr, n, stream );
        return ptr;
    }
    virtual void deallocate( void* p, std::size_t, cudaStream_t ) {
        _allocator.DeviceFree(p);
    }
private:
    cub::CachingDeviceAllocator _allocator;
};
That would be passed to cuML like this:
std::shared_ptr<cachingDeviceAllocator> allocator( new cachingDeviceAllocator() );
cumlHandle->setDeviceAllocator( allocator );
Let's just keep the current verbose API as it is. Depending on which one of these interfaces (device_buffer or allocate calls) folks end up preferring, we can then revisit the API restructuring.
    handle->getHostAllocator()->deallocate(temp_h, n*sizeof(T), stream);
}
```
Small host memory heap allocations, e.g. as internally done by STL containers, are fine, e.g. an `std::vector` managing only a handful of integers.
If you're providing a custom allocator for host memory, it's straightforward to provide an allocator for std::vector and similar containers. You could then even provide an alias like cuml::vector that is a std::vector using your custom allocator.
We do this in cuDF with rmm::device_vector -> thrust::device_vector using RMM for the allocator.
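A sketch of what that alias could look like on top of the proposed hostAllocator interface (all names here are illustrative, not part of the proposal):

#include <memory>
#include <vector>

template <typename T>
struct host_std_allocator {
    using value_type = T;
    explicit host_std_allocator(std::shared_ptr<ML::hostAllocator> a) : _a(std::move(a)) {}
    template <typename U>
    host_std_allocator(const host_std_allocator<U>& o) : _a(o._a) {}
    T* allocate(std::size_t n) {
        return static_cast<T*>(_a->allocate(n * sizeof(T), 0));
    }
    void deallocate(T* p, std::size_t n) { _a->deallocate(p, n * sizeof(T), 0); }
    std::shared_ptr<ML::hostAllocator> _a;  // public to keep the sketch short
};

template <typename T, typename U>
bool operator==(const host_std_allocator<T>& a, const host_std_allocator<U>& b) {
    return a._a == b._a;  // equal iff they forward to the same hostAllocator
}
template <typename T, typename U>
bool operator!=(const host_std_allocator<T>& a, const host_std_allocator<U>& b) {
    return !(a == b);
}

template <typename T>
using cuml_vector = std::vector<T, host_std_allocator<T>>;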
I have thought about that quite a bit. The advantages of passing in an allocator as a runtime argument over a template parameter are:
- Irrespective of the used allocator the containers remain type compatible.
- We can easily have multiple different allocator objects, e.g. one per device, one per thread, ...
With a template argument we would need to reference a global object. Does that make sense to you?
All good points.
I suppose since you are providing a host allocator and host_buffer, there's little reason for someone to use a std::vector.
Another point in favor of "allocators as parameters" vs "template parameters" is that it allows stateful allocators, e.g., an allocator w/ a stream. This wouldn't be possible with an allocator as a template argument.
You don't have to reference a global argument with the latest Thrust device_vector -- the constructor has an allocator parameter. The same is true of std::vector: https://en.cppreference.com/w/cpp/container/vector/vector (see constructor #2 for example)
vector( size_type count, const T& value, const Allocator& alloc = Allocator());
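e.g., with a hypothetical stateful allocator type carrying a stream, that constructor allows:

my_allocator<int> alloc(stream);                      // stateful, assumed type
std::vector<int, my_allocator<int>> v(n, 0, alloc);   // constructor #2 above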
cuML/DEVELOPER_GUIDE.md
Outdated
void foo( ML::cumlHandle* handle, .., cudaStream_t stream )
{
    ...
    ML::device_buffer<T> temp( handle->getDeviceAllocator(), 0 )
Is the implication of calling it buffer instead of vector that the memory is not default-initialized? I.e., a std::vector and thrust::device_vector are default-initialized, which oftentimes is unnecessary, and you can save some performance by avoiding the initialization.
The intention of calling it buffer instead of vector is to make clear that this is a simple container enabling:
- asynchronous RAII
- use of the proposed allocator interface
and not to offer the interface of a std::vector or thrust::device_vector, as that would duplicate quite some code and would come with a significant need for testing.
cuML/DEVELOPER_GUIDE.md
Outdated
void foo( ML::cumlHandle* handle, .. )
{
    ML::thrustAllocatorAdapter alloc( handle->getDeviceAllocator(), handle->getStream() );
    auto execution_policy = thrust::cuda::par(alloc).on(handle->getStream());
It's possible to do this in a one-liner, we do this in RMM.
https://github.com/rapidsai/rmm/blob/branch-0.6/include/rmm/thrust_rmm_allocator.h#L94
Note that the use of a unique_ptr is a workaround for a deficiency in Thrust's execution policies being unable to bind to temporaries, but that will change in the next release of Thrust.
Thanks for pointing this out. This is a great idea. I just tried it, but to my understanding the implementation of exec_policy requires C++14 and I could not find an elegant way to express it with C++11 (which cuML is currently at). So I added an implementation of exec_policy only conditionally and updated the developer guide mentioning the new possibilities with C++14. Let me know if I am missing something and we can make it work with C++11.
C++14 is only required for the auto return type.
It's possible to support C++11 using a trailing return type: https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.3.0/com.ibm.zos.v2r3.cbclx01/trailing_return.htm
It would just be a little more verbose, something like...
using T = decltype(thrust::cuda::par(rmm_allocator<char>{}));
inline auto exec_policy(cudaStream_t stream = 0) -> std::unique_ptr<T, std::function<void(T*)>>;
But again, the unique_ptr ugliness goes away w/ the update to Thrust.
inline auto exec_policy(cudaStream_t stream = 0) {
return thrust::cuda::par(rmm_allocator<char>{stream}).on(stream);
}
And then usage is just:
thrust::sort(exec_policy(stream), ...);
Thanks @jrhemstad I could make it work with C++11 and just pushed the related updates.
cuML/src/common/cumlHandle.cpp
Outdated
void cumlHandle_impl::destroyResources()
{
    while ( !_cusparse_handles.empty() )
This is purely an aesthetic design critique, but a repeated pattern like this usually tells me there's a need for a more generic abstraction.
Perhaps a for_each over the elements of each vector and then vector.clear().
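e.g., a rough sketch of that pattern for one of the handle vectors:

std::for_each(_cusparse_handles.begin(), _cusparse_handles.end(),
              [](cusparseHandle_t h) { cusparseDestroy(h); });  // destroy each handle
_cusparse_handles.clear();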
I agree with your statement. However, given that I need to call a different destroy function for every vector, I do not see how I would save a lot of code even using lambdas and for_each. However I am open to concrete suggestions :-)
> However I am open to concrete suggestions :-)
Fair enough!
Okay, so... this is a bit more of an invasive change, but hear me out because I think it's beneficial for more than just solving this problem.
So for your vectors of resources, instead of maintaining a vector of the "raw" resource, you could create a simple wrapper class for each resource. For example, streams:
struct stream_wrapper{
    stream_wrapper(){
        CUDA_CHECK( cudaStreamCreate(&stream) );
    }
    ~stream_wrapper(){
        CUDA_CHECK( cudaStreamDestroy(stream) );
    }
    // This allows using a `stream_wrapper` implicitly as if it were a cudaStream_t,
    // i.e., the fact that it's a wrapper is invisible to the user
    operator cudaStream_t() { return stream; }
private:
    cudaStream_t stream;
};
And then... your vector of cudaStream_t would become a vector of stream_wrappers...
std::vector<stream_wrapper> _streams;
And then in destroyResources you can simply do _streams.clear() and let the wrapper's destructor handle all of the resource-specific cleanup!
You could then create similar wrappers for all of the other resources.
What do you think?
I see how this would make the code safer. However, it will also add quite some code. Since this is an implementation detail and we are primarily discussing the API here, I would leave the destroy-resources code as it currently is. Any objections?
> I see how this would make the code safer. However, it will also add quite some code. Since this is an implementation detail and we are primarily discussing the API here, I would leave the destroy-resources code as it currently is. Any objections?

Absolutely just an implementation detail, don't let it block this PR.
@jrhemstad can you file an issue on this one so that we don't miss this, please?
std::vector<cublasHandle_t> _cublas_handles;
std::vector<cusolverDnHandle_t> _cusolverDn_handles;
std::vector<cusparseHandle_t> _cusparse_handles;
std::shared_ptr<deviceAllocator> _deviceAllocator;
I'm trying to understand the need for the allocators to be shared_ptrs instead of a unique_ptr member.
Shouldn't the handle be in charge of maintaining the lifetime of the allocator? That is, who else would have a reference to the allocator that requires using a shared_ptr?
I have chosen a shared pointer as this allows the cuML user to reuse an allocator in other places in a safe way. Does that make sense?
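e.g., a sketch of the sharing this enables (handleA/handleB are hypothetical handles; types from the proposal):

std::shared_ptr<ML::deviceAllocator> alloc(new ML::rmmAllocatorAdapter());
handleA.setDeviceAllocator(alloc);
handleB.setDeviceAllocator(alloc);       // both handles share ownership
void* p = alloc->allocate(n, stream);    // and the user can keep using it directly
alloc->deallocate(p, n, stream);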
Yeah, I get that, but couldn't you also accomplish the same safety with a unique_ptr?
My default is to always prefer a unique_ptr to a shared_ptr unless I absolutely need a shared_ptr, and I don't understand well enough yet why a shared_ptr is required here.
My understanding is that there's a one-to-one relationship between an instance of a ML::handle and an instance of an allocator, and as such, the handle should have sole ownership over its corresponding allocator. Is that understanding correct?
From the point of view of anything that is implemented in cuML, yes. However, not from the point of view of a cuML user. If setDeviceAllocator took a unique_ptr, a cuML user couldn't use a deviceAllocator that needs to outlive the cuML handle. Makes sense? Or am I missing something?
> If setDeviceAllocator took a unique_ptr, a cuML user couldn't use a deviceAllocator that needs to outlive the cuML handle. Makes sense? Or am I missing something?

Ah yes, so if someone wants to reuse the same allocator between two cuML API calls, using a unique_ptr would be difficult.
Thanks for explaining it to me, I believe a shared_ptr is absolutely appropriate here.
cuML/src/common/device_buffer.hpp
Outdated
device_buffer& operator=(const device_buffer& other) = delete;

device_buffer(std::shared_ptr<deviceAllocator> allocator, size_type n = 0)
It's a bit cumbersome to always have to pass an allocator as a parameter whenever I want to do an allocation with RAII.
It'd be nicer if it could be a template parameter such that you could have an alias for a device_buffer that always uses my_allocator.
I have thought about that quite a bit. The advantages of passing in an allocator as a runtime argument over a template parameter are:
- Irrespective of the used allocator the containers remain type compatible.
- We can easily have multiple different allocator objects, e.g. one per device, one per thread, ...
With a template argument we would need to reference a global object. Does that make sense to you?
@jirikraus I don't see the …
ah.. never mind my previous comment. The conversations still show up with the old code, whereas the commits themselves have your changes. I got confused by this distinction.
I just added a new section to the proposed developer guide about Multi GPU Programming. Please take a look at it.
Multi GPU programming info LGTM.
cuML/DEVELOPER_GUIDE.md
Outdated
```cpp
void foo( ML::cumlHandle* handle, ...)
{
    cublasHandle_t cublasHandle = handle->getImpl()->getCublasHandle();
@jirikraus all of our cu*Handle objects typically support a setStream method, and IIRC they all have a default stream initialized in them? Maybe we should prohibit devs from calling setStream on these handles?
Or maybe update our cumlHandle_impl::setStream to in turn also set streams for these handles appropriately?
That is a good point. setStream is not really meant for cuML developers; it is meant for cuML users. Therefore I think it is better to pass around const ML::cumlHandle& in cuML, which would prohibit changing the user-provided stream. This also ties in with a comment from @jrhemstad regarding how to pass around the handle.
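e.g., a sketch of the signature this implies for internal functions:

// a const reference cannot be used to call the non-const setStream,
// so the user-provided stream stays untouched (fooImpl is a placeholder)
void fooImpl(const ML::cumlHandle& handle, cudaStream_t stream /*, ... */);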
Sorry Jiri if my previous comments confused you. I meant the following kind of usage pattern. In other words, should we allow developers to perform cublasSetStream on the internal cublas handle, explicitly, for example?
cublasHandle_t cublasHandle = handle->getImpl()->getCublasHandle();
cudaStream_t someStreamPickedByMe;
cublasSetStream(cublasHandle, someStreamPickedByMe);
On one hand, it sounds like a valid use-case if we want to allow devs to use internal streams and schedule cublas/cusolver calls on these streams.
On the other hand, these cublasSetStream kind of calls are persistent, easily causing false dependencies when the same cublasHandle gets called somewhere else in the pipeline!
I also thought a bit more about this, and one possibility would be that we drop the assumption that e.g. the cublasHandle managed by handle has any stream associated with it. Instead one should assume that it is required to set the stream before every cuBLAS call. For cuML and ml-prims developers I think everybody should use the wrappers in cublas_wrappers.h, and we add a stream argument to the interface of those wrappers. E.g. for gemm:
template <typename T>
cublasStatus_t cublasgemm(cublasHandle_t handle, cublasOperation_t transA,
cublasOperation_t transB, int m, int n, int k,
const T *alfa, const T *A, int lda, const T *B,
int ldb, const T *beta, T *C, int ldc, cudaStream_t stream);
which would have the following specialization for dgemm
template <>
inline cublasStatus_t
cublasgemm(cublasHandle_t handle, cublasOperation_t transA,
cublasOperation_t transB, int m, int n, int k, const double *alfa,
const double *A, int lda, const double *B, int ldb,
const double *beta, double *C, int ldc, cudaStream_t stream) {
cublasStatus_t status = cublasSetStream(handle, stream);
if ( CUBLAS_STATUS_SUCCESS != status )
return status;
return cublasDgemm(handle, transA, transB, m, n, k, alfa, A, lda, B, ldb,
beta, C, ldc);
}
The rationale for this idea is that changing streams is a cheap operation, and after an operation is enqueued it is fine to change the stream. The only issue I see with this is potential races in a multi-threaded environment, e.g.
Thread A: sets stream to stream A on handle
Thread B: sets stream to stream B on handle
Thread A: launches Dgemm on handle in stream B (oops!)
Thread B: launches Dgemm on handle in stream B
We could either manage that by providing a cublasHandle_t wrapper containing a lock in addition to the handle, which could be used in the cuBLAS wrappers. Or simply define a cuML/ml-prims developer rule that the wrappers are not thread safe and that one handle should be created per thread? The latter would then require the cumlHandle_t to manage multiple cublasHandles, or that we disallow multi-threading in cuML and ml-prims implementations. Considering the cuBLAS documentation, having one handle per thread is probably what we should prefer:
> 2.1.3. Thread Safety
> The library is thread safe and its functions can be called from multiple host threads, even with the same handle. When multiple threads share the same handle, extreme care needs to be taken when the handle configuration is changed because that change will affect potentially subsequent CUBLAS calls in all threads. It is even more true for the destruction of the handle. So it is not recommended that multiple thread share the same CUBLAS handle.
What do you think?
My opinion (might be biased, because I like the OPG approach!) is that we should disallow the usage of the same cumlHandle across multiple threads.
Your cublas_wrappers idea is a much better one.
Implementation details:
If we allow stream as a default argument in the cublasgemm method you defined above, then I think this change can be made within the branch-0.6 timeframe itself. Else, it'll cause a huge amount of changes across existing ml-prims and cuML code (not to mention the rework needed for ongoing PRs!).
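A sketch of that transitional default (the default would later be removed again, per the follow-up below):

template <typename T>
cublasStatus_t cublasgemm(cublasHandle_t handle, cublasOperation_t transA,
                          cublasOperation_t transB, int m, int n, int k,
                          const T *alfa, const T *A, int lda, const T *B,
                          int ldb, const T *beta, T *C, int ldc,
                          cudaStream_t stream = 0);  // default only for migration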
Have filed issue #256 as our first step towards updating these *_wrapper header files. We shall update the existing methods to expose a stream as a default argument.
Will soon file another issue to remove the default-ness of the stream argument across all ml-prims methods. We should not be launching work on default streams inside cuML or ml-prims!
Agree. It makes sense that the implementation of cuML and ml-prims is single threaded; in that case we won't have the issues I described. I just added two sentences to the Thread safety section of DEVELOPER_GUIDE.md to document it and put it up for discussion.
Thanks @teju85 @jrhemstad for the thorough review and your good feedback. I currently have the following open items on my list: …
Please correct me if I am wrong, but I think I have either addressed your other comments or there is an open discussion regarding them. Thanks, Jiri
Agreed. Better to pass around const-ref. One other item is the … I'm pretty sure we'd almost need all of the info inside the …
@harrism should probably also review this for thoughts on allocator abstractions as well as for ideas about adopting something similar in cuDF.
I am not sure if cumlHandle_impl should be moved to ml-prims. I see that we want to avoid duplicating resources such as cublas or cusolver handles, and that we also don't want to make the ml-prims API extremely verbose. However, I see potential issues with sharing these resources, especially CUDA streams, between a cuML and an ml-prims handle, as that could cause false dependencies. I therefore would suggest introducing a new ml-prims handle and moving the ownership of the allocators there. cumlHandle_impl would hold an ml-prims handle and reference the allocators in there. What do you think?
You'll probably want to add an MPI communicator resource to the handle eventually, but I don't think that needs to block the current PR, as it's mainly an API proposal and you've sufficiently demonstrated how requesting resources will work.
@jirikraus My main motivation behind writing that comment was to reduce duplication of resources (and thus code) between ml-prims and cuML. I'm ok with either a full migration of the impl structure, or a new handle in ml-prims with this impl struct being composed of it. I think it's better addressed in a separate issue/PR.
@harrism thanks a lot for your thorough review and good feedback. I have incorporated almost all of your suggestions.
Regarding one remark to which I could not reply inline for some reason: as you say, that would come with a dependency on the latest Thrust version for device_vector. Regarding the motivation for device_buffer and host_buffer in general, I have added something to the developer guide.
Can one of the admins verify this patch?
This is a proposal for an improved C and C++ API. When reviewing this I suggest starting with cuML/DEVELOPER_GUIDE.md as that explains the concepts used.