[REVIEW] Proposal for "proper" C/C++ API (issue #92) #247

Merged: 27 commits, Mar 12, 2019
Commits
25997e8
Proposal for "proper" C/C++ API (issue #92)
jirikraus Feb 14, 2019
5271ff2
Addressed teju85's comments on cuML/DEVELOPER_GUIDE.md
jirikraus Feb 27, 2019
5bf55f5
Added cusparseHandle_t to cumlHandle_impl needed by UMAP
jirikraus Feb 27, 2019
74e6039
Fixed comments and used nullptr instead of 0.
jirikraus Feb 27, 2019
2f20e45
Added stream to host_buffer and device_buffer to avoid the default st…
jirikraus Feb 28, 2019
84ad2e2
Fixed include of std::vector.
jirikraus Feb 28, 2019
8099e38
Added Multi GPU section to developer guide proposal.
jirikraus Feb 28, 2019
e9f6771
Switched from typedef to using.
jirikraus Feb 28, 2019
2f273e5
Moved allocator interface to ml-prims
jirikraus Feb 28, 2019
28ab359
Added thrust execution policy convenience function (idea from RMM)
jirikraus Feb 28, 2019
4a8ba29
Added internal streams to cumlHandle
jirikraus Feb 28, 2019
8ba1c8d
Streamlined implementation of resize in device_buffer and host_buffer
jirikraus Mar 1, 2019
16c41a5
Change namespace of deviceAllocator and hostAllocator from ML to MLCo…
jirikraus Mar 1, 2019
1fcf45a
Switched to using const ML::cumlHandle_impl& instead of ML::cumlHandle*
jirikraus Mar 1, 2019
e2f662f
Added some content related to thread safety.
jirikraus Mar 1, 2019
1286ef1
Moved device_buffer and host_buffer to ml-prims
jirikraus Mar 1, 2019
b139e5e
Fixed typo in DEVELOPER_GUIDE.md and added cross reference to CONTRIB…
jirikraus Mar 1, 2019
3ebd4b0
Fixed DEVELOPER_GUIDE.md formatting and added C APIs section.
jirikraus Mar 5, 2019
d90f276
Extended C API proposal.
jirikraus Mar 5, 2019
0854765
Added buffer_base as base class for device_buffer and host_buffer to …
jirikraus Mar 5, 2019
effef57
Dropped multi GPU part of cumlHandle_impl
jirikraus Mar 8, 2019
81d2637
Added allocator adapter for STL containers.
jirikraus Mar 8, 2019
ff4fdc2
Update cuML/DEVELOPER_GUIDE.md
harrism Mar 8, 2019
c784628
ML::exec_policy -> ML::thrust_exec_policy
jirikraus Mar 8, 2019
d0eef1e
Updated cuML/DEVELOPER_GUIDE.md based on feedback from Mark Harris.
jirikraus Mar 8, 2019
694e956
Added first version of doxygen documentation to the public part of th…
jirikraus Mar 8, 2019
dd2e2a2
Merge branch 'branch-0.6' into fea-ext-cuML-cpp-iface
dantegd Mar 12, 2019
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -1,6 +1,8 @@
# cuML 0.6.0 (Date TBD)

## New Features

- PR #247: Added "proper" CUDA API to cuML
- PR #235: NearestNeighbors MG Support
- PR #261: UMAP Algorithm
- PR #290: NearestNeighbors numpy MG Support
12 changes: 7 additions & 5 deletions CONTRIBUTING.md
@@ -27,11 +27,13 @@ into three categories:
2. Find an issue to work on. The best way is to look for the [good first issue](https://github.com/rapidsai/cuml/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22)
or [help wanted](https://github.com/rapidsai/cuml/issues?q=is%3Aissue+is%3Aopen+label%3A%22help+wanted%22) labels
3. Comment on the issue saying you are going to work on it
4. Code! Make sure to update unit tests!
5. When done, [create your pull request](https://github.com/rapidsai/cuml/compare)
6. Verify that CI passes all [status checks](https://help.github.com/articles/about-status-checks/). Fix if needed
7. Wait for other developers to review your code and update code as needed
8. Once reviewed and approved, a RAPIDS developer will merge your pull request
4. Get familiar with the developer guide relevant to you:
* For C++ developers, it is available here: [DEVELOPER_GUIDE.md](cuML/DEVELOPER_GUIDE.md)
5. Code! Make sure to update unit tests!
6. When done, [create your pull request](https://github.com/rapidsai/cuml/compare)
7. Verify that CI passes all [status checks](https://help.github.com/articles/about-status-checks/). Fix if needed
8. Wait for other developers to review your code and update code as needed
9. Once reviewed and approved, a RAPIDS developer will merge your pull request

Remember, if you are unsure about anything, don't hesitate to comment on issues
and ask for clarifications!
7 changes: 6 additions & 1 deletion cuML/CMakeLists.txt
@@ -188,6 +188,8 @@ add_library(cuml SHARED
src/glm/glm.cu
src/knn/knn.cu
src/kalman_filter/lkf_py.cu
src/common/cumlHandle.cpp
src/common/cuml_api.cpp
src/umap/umap.cu
)

@@ -219,10 +221,12 @@ target_link_libraries(ml_test
${CUDA_cublas_LIBRARY}
${CUDA_curand_LIBRARY}
${CUDA_cusolver_LIBRARY}
${CUDA_cusparse_LIBRARY}
${CUDA_CUDART_LIBRARY}
gpufaisslib
${CUDA_cusparse_LIBRARY}
${CUDA_nvgraph_LIBRARY}
gpufaisslib
gpufaisslib
faisslib
${BLAS_LIBRARIES}
cuml
@@ -240,6 +244,7 @@ target_link_libraries(ml_mg_test
${CUDA_cublas_LIBRARY}
${CUDA_curand_LIBRARY}
${CUDA_cusolver_LIBRARY}
${CUDA_cusparse_LIBRARY}
${CUDA_CUDART_LIBRARY}
${CUDA_cusparse_LIBRARY}
${CUDA_nvgraph_LIBRARY}
160 changes: 160 additions & 0 deletions cuML/DEVELOPER_GUIDE.md
@@ -0,0 +1,160 @@
# cuML developer guide
This document summarizes rules and best practices for contributions to the cuML C++ component of rapidsai/cuml. This is a living document; contributions such as clarifications, fixes, and issue reports are highly welcome.

## General
Please start by reading [CONTRIBUTING.md](../CONTRIBUTING.md).

## Thread safety
cuML is thread safe: its functions can be called from multiple host threads as long as each thread uses a different handle.

The implementation of cuML itself is single threaded.
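As a minimal illustration of this threading model (a sketch only; `ML::algo` stands in for any cuML entry point, as in the examples later in this guide), each host thread owns its own handle and stream:
```cpp
#include <thread>

void call_cuml_from_two_threads(const double* d_input_a, const double* d_input_b)
{
    auto worker = [](const double* d_input) {
        ML::cumlHandle handle;                       // one handle per host thread
        cudaStream_t stream;
        CUDA_CHECK( cudaStreamCreate(&stream) );
        handle.setStream(stream);
        // ML::algo(handle, d_input, ...);           // hypothetical cuML call
        CUDA_CHECK( cudaStreamSynchronize(stream) );
        CUDA_CHECK( cudaStreamDestroy(stream) );
    };
    std::thread t_a(worker, d_input_a);
    std::thread t_b(worker, d_input_b);
    t_a.join();
    t_b.join();
}
```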

## Coding style

## Error handling
Call CUDA APIs via the provided helper macros `CUDA_CHECK`, `CUBLAS_CHECK` and `CUSOLVER_CHECK`. These macros check the return value of the wrapped API call and throw an exception when it is not successful. If you need to avoid an exception, e.g. inside a destructor, use `CUDA_CHECK_NO_THROW`, `CUBLAS_CHECK_NO_THROW` and `CUSOLVER_CHECK_NO_THROW` (currently not available, see https://github.com/rapidsai/cuml/issues/229). These macros log the error but do not throw an exception.
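A minimal sketch of the intended pattern (note that the `*_NO_THROW` variants are not yet available, as mentioned above):
```cpp
void copy_async(void* dst, const void* src, size_t bytes, cudaStream_t stream)
{
    // Throws an exception if the copy cannot be enqueued.
    CUDA_CHECK( cudaMemcpyAsync(dst, src, bytes, cudaMemcpyDefault, stream) );
}

class scopedEvent {
public:
    scopedEvent()  { CUDA_CHECK( cudaEventCreate(&_event) ); }
    ~scopedEvent() {
        // Once available, CUDA_CHECK_NO_THROW should be used here instead,
        // since destructors must not throw.
        cudaEventDestroy(_event);
    }
    cudaEvent_t get() const { return _event; }
private:
    cudaEvent_t _event;
};
```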

## Logging
This section will be added once https://github.com/rapidsai/cuml/issues/100 is addressed.

## Documentation
All external interfaces need complete [doxygen](http://www.doxygen.nl) API documentation. This is also recommended for internal interfaces.
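For example, a minimal doxygen block for a public function could look like this (the function itself is hypothetical):
```cpp
/**
 * @brief Computes the column-wise mean of a dense matrix (hypothetical example).
 *
 * @param[in]  handle  cumlHandle providing the resources to use
 * @param[in]  input   device pointer to the input matrix of size n_rows * n_cols
 * @param[in]  n_rows  number of rows
 * @param[in]  n_cols  number of columns
 * @param[out] output  device pointer to the n_cols resulting means
 */
void colMeans(const ML::cumlHandle& handle, const float* input,
              int n_rows, int n_cols, float* output);
```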

## Testing and Unit Testing
TODO: Add this

## Device and Host memory allocations
To enable `libcuml.so` users to control how memory for temporary data is allocated, allocate device memory using the allocator provided:
```cpp
template<typename T>
void foo(const ML::cumlHandle_impl& h, cudaStream_t stream, ... )
{
T* temp_h = static_cast<T*>(h.getDeviceAllocator()->allocate(n*sizeof(T), stream));
...
h.getDeviceAllocator()->deallocate(temp_h, n*sizeof(T), stream);
}
```
The same rule applies to larger amounts of host heap memory:
```cpp
template<typename T>
void foo(const ML::cumlHandle_impl& h, cudaStream_t stream, ... )
{
T* temp_h = static_cast<T*>(h.getHostAllocator()->allocate(n*sizeof(T), stream));
...
h.getHostAllocator()->deallocate(temp_h, n*sizeof(T), stream);
}
```
Small host memory heap allocations, e.g. as done internally by STL containers, are fine; for example, an `std::vector` managing only a handful of integers.
Review comment (Contributor):

If you're providing a custom allocator for host memory, it's straightforward to provide an allocator for std::vector and similar containers. You could then even provide an alias like cuml::vector that is a std::vector using your custom allocator.

We do this in cuDF with rmm::device_vector -> thrust::device_vector using RMM for the allocator.

Review comment (Contributor Author):

I have thought about that quite a bit. The advantages of passing in an allocator as a runtime argument rather than a template parameter are:

  • Irrespective of the allocator used, the containers remain type compatible.
  • We can easily have multiple different allocator objects, e.g. one per device, one per thread, ...

With a template argument we would need to reference a global object. Does that make sense to you?

Review comment (Contributor):

All good points.

I suppose since you are providing a host allocator and host_buffer, there's little reason for someone to use a std::vector.

Another point in favor of "allocators as parameters" vs "template parameters" is that it allows stateful allocators, e.g., an allocator w/ a stream. This wouldn't be possible with an allocator as a template argument.
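For illustration, the stateful-allocator point can be sketched with the `device_buffer` constructor used elsewhere in this proposal (the allocator and stream names here are hypothetical):
```cpp
// Two buffers of the same C++ type, each bound to a different stateful
// allocator object; with the allocator as a template parameter the two
// buffers would have different types.
MLCommon::device_buffer<float> buf_a(allocator_for_device_0, stream_a, n);
MLCommon::device_buffer<float> buf_b(allocator_for_device_1, stream_b, n);
```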

Review comment (Member):

You don't have to reference a global argument with the latest Thrust device_vector -- the constructor has an allocator parameter. The same is true of std::vector: https://en.cppreference.com/w/cpp/container/vector/vector (see constructor #2 for example)

vector( size_type count, const T& value, const Allocator& alloc = Allocator());

Both the host and the device allocators may allow asynchronous, stream-ordered allocation and deallocation. This can provide significant performance benefits, so a stream always needs to be specified when allocating or deallocating (see [Asynchronous operations and stream ordering](#asynchronous-operations-and-stream-ordering)). `ML::deviceAllocator` allocates memory on the current device, while `ML::hostAllocator` allocates host memory. A user of cuML can write customized allocators and pass them into cuML. If a cuML user does not provide custom allocators, default allocators will be used. For `ML::deviceAllocator` the default is to use `cudaMalloc`/`cudaFree`. For `ML::hostAllocator` the default is to use `cudaMallocHost`/`cudaFreeHost`.
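For example, a user-side device allocator could look roughly like this (a sketch only; the exact `ML::deviceAllocator` base-class signature is assumed to match the `allocate`/`deallocate` calls shown in this guide):
```cpp
class trackingDeviceAllocator : public ML::deviceAllocator {
public:
    // Assumed interface: allocate/deallocate take a size in bytes and a stream,
    // matching the calls used throughout this guide.
    void* allocate(size_t n, cudaStream_t stream) override {
        void* ptr = nullptr;
        CUDA_CHECK( cudaMalloc(&ptr, n) );  // same default behaviour as cuML, plus bookkeeping
        _bytesInUse += n;
        return ptr;
    }
    void deallocate(void* ptr, size_t n, cudaStream_t stream) override {
        _bytesInUse -= n;
        cudaFree(ptr);  // deliberately not throwing on the deallocation path
    }
private:
    size_t _bytesInUse = 0;
};
```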
There are two simple container classes compatible with the allocator interface: `MLCommon::device_buffer`, available in `ml-prims/src/common/device_buffer.hpp`, and `MLCommon::host_buffer`, available in `ml-prims/src/common/host_buffer.hpp`. These make it possible to follow the [RAII idiom](https://en.wikipedia.org/wiki/Resource_acquisition_is_initialization) to avoid resource leaks and write exception-safe code. They also allow asynchronous allocation and deallocation using the `resize` and `release` member functions:
Review comment (Member):

Recommend you comment on why these are recommended over just using thrust::device_vector and std::vector or thrust::host_vector with your custom allocator. If there isn't a good reason to not use them, why create a new class?

Review comment (Contributor Author):

Agree that should be clarified in the guide. I will add something.

```cpp
template<typename T>
void foo(const ML::cumlHandle_impl& h, ..., cudaStream_t stream )
Review comment (Member):

I especially now like the idea of writing our algos as: void cumlAlgo(const cumlHandle_impl& h, ...). This avoids a ton of getImpl calls downstream that would otherwise have been necessary!

{
...
MLCommon::device_buffer<T> temp( h.getDeviceAllocator(), stream, 0 );

temp.resize(n, stream);
kernelA<<<grid, block, 0, stream>>>(..., temp.data(), ...);
kernelB<<<grid, block, 0, stream>>>(..., temp.data(), ...);
temp.release(stream);
}
```
The motivation for `MLCommon::host_buffer` and `MLCommon::device_buffer` over `std::vector` or `thrust::device_vector` (which would require Thrust 1.9.4 or later) is to enable exception-safe, asynchronous, stream-ordered allocation and deallocation with an explicit interface, while avoiding the overhead of implicitly initializing the underlying allocation.
To use `ML::hostAllocator` with an STL container, the header `src/common/allocatorAdapter.hpp` provides `ML::stdAllocatorAdapter`:
```cpp
template<typename T>
void foo(const ML::cumlHandle_impl& h, ..., cudaStream_t stream )
{
...
std::vector<T, ML::stdAllocatorAdapter<T> > temp( n, val, ML::stdAllocatorAdapter<T>(h.getHostAllocator(), stream) );
...
}
```
If Thrust 1.9.4 or later becomes available for use in cuML, a similar allocator can be provided for `thrust::device_vector`.
### Using Thrust [AllocationsThrust]
To ensure that thrust algorithms allocate temporary memory via the provided device memory allocator, use the `ML::thrustAllocatorAdapter` available in `src/common/allocatorAdapter.hpp` with the `thrust::cuda::par` execution policy:
```cpp
void foo(const ML::cumlHandle_impl& h, ..., cudaStream_t stream )
{
ML::thrustAllocatorAdapter alloc( h.getDeviceAllocator(), stream );
auto execution_policy = thrust::cuda::par(alloc).on(stream);
thrust::for_each(execution_policy, ... );
}
```
The header `src/common/allocatorAdapter.hpp` also provides a helper function to create an execution policy:
```cpp
void foo(const ML::cumlHandle_impl& h, ... , cudaStream_t stream )
{
auto execution_policy = ML::thrust_exec_policy(h.getDeviceAllocator(),stream);
thrust::for_each(execution_policy->on(stream), ... );
}
```

## Asynchronous operations and stream ordering
All ML algorithms should be as asynchronous as possible, avoiding the use of the default stream (also known as the NULL or `0` stream). Implementations that require only one CUDA stream should use the stream from `ML::cumlHandle_impl`:
```cpp
void foo(const ML::cumlHandle_impl& h, ...)
{
cudaStream_t stream = h.getStream();
}
```
When multiple streams are needed, e.g. to manage a pipeline, use the internal streams available in `ML::cumlHandle_impl` (see [CUDA Resources](#cuda-resources)). If multiple streams are used, all operations still must be ordered according to `ML::cumlHandle::getStream()`. Before any operation in any of the internal CUDA streams is started, all previous work in `ML::cumlHandle::getStream()` must have completed. Any work enqueued in `ML::cumlHandle::getStream()` after a cuML function returns should not start before all work enqueued in the internal streams has completed. E.g. if a cuML algorithm is called like this:
```cpp
void foo(double* const srcdata, double* const result)
{
ML::cumlHandle cumlHandle;

cudaStream_t stream;
CUDA_CHECK( cudaStreamCreate( &stream ) );
cumlHandle.setStream( stream );

...

CUDA_CHECK( cudaMemcpyAsync( srcdata, h_srcdata.data(), n*sizeof(double), cudaMemcpyHostToDevice, stream ) );

ML::algo(cumlHandle, dopredict, srcdata, result, ... );

CUDA_CHECK( cudaMemcpyAsync( h_result.data(), result, m*sizeof(double), cudaMemcpyDeviceToHost, stream ) );

...
}
```
No work in any stream should start in `ML::algo` before the `cudaMemcpyAsync` issued in `stream` before the call to `ML::algo` has completed, and all work in all streams used inside `ML::algo` should be complete before the `cudaMemcpyAsync` issued in `stream` after the call to `ML::algo` starts.

This can be ensured by introducing inter-stream dependencies with CUDA events and `cudaStreamWaitEvent`. For convenience, the header `cumlHandle.hpp` provides the class `ML::detail::streamSyncer`, which in its constructor makes all internal CUDA streams of `ML::cumlHandle_impl` wait on `ML::cumlHandle::getStream()`, and in its destructor makes `ML::cumlHandle::getStream()` wait on all work enqueued in the internal CUDA streams. Here is an example:
```cpp
void cumlAlgo(const ML::cumlHandle_impl& h, ...)
{
ML::detail::streamSyncer _(h);
Review comment (Contributor):

This is a really cool use of RAII. Love it.

Review comment (Member):

👍

}
```
This ensures the stream ordering behavior described above.
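For reference, the ordering that `ML::detail::streamSyncer` provides corresponds roughly to the following manual event-based pattern (a sketch of the technique, not the actual implementation):
```cpp
void orderInternalStreamsManually(const ML::cumlHandle_impl& h)
{
    cudaEvent_t event;
    CUDA_CHECK( cudaEventCreateWithFlags(&event, cudaEventDisableTiming) );

    // Let every internal stream wait for the work already enqueued in the user stream.
    CUDA_CHECK( cudaEventRecord(event, h.getStream()) );
    for (int i = 0; i < h.getNumInternalStreams(); ++i) {
        CUDA_CHECK( cudaStreamWaitEvent(h.getInternalStream(i), event, 0) );
    }

    // ... enqueue work in the internal streams ...

    // Let the user stream wait for all work enqueued in the internal streams.
    for (int i = 0; i < h.getNumInternalStreams(); ++i) {
        CUDA_CHECK( cudaEventRecord(event, h.getInternalStream(i)) );
        CUDA_CHECK( cudaStreamWaitEvent(h.getStream(), event, 0) );
    }
    CUDA_CHECK( cudaEventDestroy(event) );
}
```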

### Using Thrust
To ensure that thrust algorithms are executed in the intended stream, the `thrust::cuda::par` execution policy should be used (see [Using Thrust](#using-thrust-allocationsthrust) in [Device and Host memory allocations](#device-and-host-memory-allocations)).

## CUDA Resources
Do not create reusable CUDA resources directly in implementations of ML algorithms. Instead, use the existing resources in `ML::cumlHandle_impl` to avoid constant creation and deletion of reusable resources such as CUDA streams, CUDA events or library handles. Please file a feature request if a resource handle is missing in `ML::cumlHandle_impl`.
The resources can be obtained like this:
```cpp
void foo(const ML::cumlHandle_impl& h, ...)
{
cublasHandle_t cublasHandle = h.getCublasHandle();
const int num_streams = h.getNumInternalStreams();
const int stream_idx = ...
cudaStream_t stream = h.getInternalStream(stream_idx);
...
}
```

## Multi GPU

The multi GPU paradigm of cuML is **O**ne **P**rocess per **G**PU (OPG). Each algorithm should be implemented in a way that it can run with a single GPU without any dependency on a communication library. A multi GPU implementation can assume the following (a sketch follows the list):
* The user of cuML has initialized MPI and created a communicator that can be used by the ML algorithm.
* All processes in the MPI communicator call into the ML algorithm cooperatively.
* The MPI in use is CUDA-aware, i.e. device pointers can be passed directly to MPI.
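Under these assumptions, a cooperative step could look roughly like the following sketch (illustrative only; function and variable names are hypothetical):
```cpp
#include <mpi.h>

// Combine per-GPU partial sums across all ranks. Because the MPI in use is
// assumed to be CUDA-aware, the device pointer is passed to MPI directly.
void allreducePartials(const ML::cumlHandle_impl& h, double* d_partials, int n, MPI_Comm comm)
{
    // Make sure the partial results are complete before MPI reads the buffer.
    CUDA_CHECK( cudaStreamSynchronize(h.getStream()) );
    MPI_Allreduce(MPI_IN_PLACE, d_partials, n, MPI_DOUBLE, MPI_SUM, comm);
}
```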

## C APIs

ML algorithms implemented in cuML should have C++ APIs that are easy to wrap in C. Use only C-compatible types or objects that can be passed as opaque handles (like `cumlHandle_t`). Using templates is fine if they can be instantiated from a specialized C++ function with `extern "C"` linkage.
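As an illustration, such a wrapper could look like the following sketch (all names besides `cumlHandle_t` are hypothetical, and error handling is reduced to a plain int return code):
```cpp
// C++ side: a templated implementation with a regular C++ API.
namespace ML {
template <typename T>
void algo(const cumlHandle& handle, const T* input, T* output, int n);
}  // namespace ML

// C side: a wrapper with C linkage that instantiates the template for float.
extern "C" int cumlAlgoFloat(cumlHandle_t handle, const float* input, float* output, int n)
{
    try {
        // lookupHandle() stands in for however the opaque handle is translated
        // back into the C++ handle object.
        ML::algo(lookupHandle(handle), input, output, n);
        return 0;   // success
    } catch (...) {
        return 1;   // any failure is mapped to a non-zero error code
    }
}
```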
179 changes: 179 additions & 0 deletions cuML/src/common/allocatorAdapter.hpp
@@ -0,0 +1,179 @@
/*
* Copyright (c) 2018-2019, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#pragma once

#include <limits>

#include <thrust/system/cuda/execution_policy.h>

#include "../../../ml-prims/src/utils.h"

#include "../cuML.hpp"

namespace ML {

template<typename T>
class stdAllocatorAdapter
{
public:
using size_type = std::size_t;
using value_type = T;
using pointer = value_type*;
using const_pointer = const value_type*;
using reference = value_type&;
using const_reference = const value_type&;
using difference_type = std::ptrdiff_t;

template<typename U>
struct rebind
{
typedef stdAllocatorAdapter<U> other;
};

stdAllocatorAdapter() = delete;

stdAllocatorAdapter(const stdAllocatorAdapter& other) = default;

template<typename U>
stdAllocatorAdapter(stdAllocatorAdapter<U> const& other)
: _allocator(other._allocator), _stream(other._stream)
{}

stdAllocatorAdapter& operator=(const stdAllocatorAdapter& other) = default;

stdAllocatorAdapter(std::shared_ptr<hostAllocator> allocator, cudaStream_t stream)
: _allocator(allocator), _stream(stream)
{}

~stdAllocatorAdapter () {}

inline pointer address(reference ref) const
{
return &ref;
}
inline const_pointer address(const_reference ref) const
{
return &ref;
}

pointer allocate(size_type size, typename std::allocator<void>::const_pointer = 0)
{
return static_cast<pointer>(_allocator->allocate( size*sizeof(value_type), _stream ));
}
void deallocate(pointer ptr, size_type size) {
_allocator->deallocate(ptr, size*sizeof(value_type), _stream);
}

inline size_type max_size() const
{
return std::numeric_limits<size_type>::max() / sizeof(value_type);
}

void construct(pointer ptr, const value_type& t) const
{
new(ptr) value_type(t);
}
void destroy(pointer ptr) const
{
ptr->~value_type();
}

bool operator==(const stdAllocatorAdapter&) const
{
return true;
}
bool operator!=(const stdAllocatorAdapter& other) const
{
return !operator==(other);
}

private:
std::shared_ptr<hostAllocator> _allocator;
cudaStream_t _stream = 0;
};

/**
* @todo: Complete doxygen documentation
* @code{.cpp}
* void foo( const cumlHandle_impl& h, ... , cudaStream_t stream )
* {
* auto execution_policy = ML::thrust_exec_policy(h.getDeviceAllocator(),stream);
* thrust::for_each(execution_policy->on(stream), ... );
* }
* @endcode
*/
class thrustAllocatorAdapter
{
public:
using value_type = char;

thrustAllocatorAdapter() = delete;

thrustAllocatorAdapter(std::shared_ptr<deviceAllocator> allocator, cudaStream_t stream)
: _allocator(allocator), _stream(stream)
{}

~thrustAllocatorAdapter() {}

char* allocate(const size_t size)
{
return static_cast<char*>(_allocator->allocate( size, _stream ));
}

void deallocate(char* ptr, const size_t size)
{
_allocator->deallocate( ptr, size, _stream );
}

private:
std::shared_ptr<deviceAllocator> _allocator;
cudaStream_t _stream = 0;
};

namespace
{
thrustAllocatorAdapter _decltypeHelper{0,0};
}

/**
* @brief Returns a unique_ptr to a Thrust CUDA execution policy that uses the
* passed in allocator for temporary memory allocation.
*
* @param[in] allocator The allocator to use
* @param[in] stream The stream that the allocator will use
*
* @returns A Thrust execution policy that will use allocator for temporary memory
* allocation.
*/
inline auto thrust_exec_policy(std::shared_ptr<deviceAllocator> allocator, cudaStream_t stream) -> std::unique_ptr<decltype(thrust::cuda::par(_decltypeHelper)),std::function<void(decltype(thrust::cuda::par(_decltypeHelper))*)> >
{
thrustAllocatorAdapter * alloc{nullptr};

alloc = new thrustAllocatorAdapter(allocator, stream);

using T = decltype(thrust::cuda::par(*alloc));

auto deleter = [alloc](T* pointer) {
delete alloc;
delete pointer;
};

std::unique_ptr<T, decltype(deleter)> policy{new T(*alloc), deleter};
return policy;
}

Review comment (Member):

Consider adding:

Suggested change
inline auto exec_policy(const cumlHandle_impl& h, cudaStream_t stream) -> std::unique_ptr<decltype(thrust::cuda::par(_decltypeHelper)),std::function<void(decltype(thrust::cuda::par(_decltypeHelper))*)> >
{
return exec_policy(h.getDeviceAllocator(), stream);
}

Enables your thrust:: algorithm calls to be one-liners. Your users will be happier.

Review comment (Contributor Author):

I see your point, but I am not convinced that we should add this function. Regarding passing in const cumlHandle_impl& h instead of std::shared_ptr<deviceAllocator> allocator: does my argument made above, that the execution policy does not depend on cumlHandle_impl, convince you?

Review comment (Member):

I don't see an argument about this above. Above what?

Review comment (Contributor Author):

Sorry, I assumed you were reading top to bottom; GitHub probably decided to hide part of the conversation. I was referring to this reply made above:

Reply begin:

Regarding passing in ML::cumlHandle_impl&: This is partly just my preference to be explicit, because the execution policy only needs the allocator and nothing else from the handle. Besides that, it avoids introducing a dependency of ML::exec_policy that is not really needed.
Regarding the one-liner: I agree that this would be nicer. @jrhemstad said that this unfortunately requires Thrust 1.9.4 or newer, as the Thrust execution policies with older Thrust versions can't be constructed from temporaries. The current workaround with the smart pointer also does not allow this because the pointer goes out of scope before the object is used. To my understanding, another nice thing with Thrust 1.9.4+ is that one would not need to repeat the stream. We could provide a helper function that could be used like this:

thrust::for_each(ML::exec_policy(alloc, stream), ... );

Reply end

Review comment (Contributor Author):

To be even more explicit, I probably should have written: "Regarding passing in ML::cumlHandle_impl& instead of std::shared_ptr<deviceAllocator> allocator: This is partly just my preference to be explicit, because the execution policy only needs the allocator and nothing else from the handle."

} // end namespace ML