Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Hiding implementation details for comms #409

Merged
merged 91 commits into from
Feb 1, 2022
Merged
Show file tree
Hide file tree
Changes from 6 commits
Commits
Show all changes
91 commits
Select commit Hold shift + click to select a range
07644ec
add option to build faiss shared libs
trxcllnt Sep 30, 2021
f3e1f09
add FAISS::FAISS to the list of global targets
trxcllnt Sep 30, 2021
e038644
update cuco
trxcllnt Sep 30, 2021
3fa9c94
create a faissTargets export-set
trxcllnt Sep 30, 2021
fc132ef
Merge branch 'fix/build-shared-faiss' into fix/node-rapids-21.10
trxcllnt Sep 30, 2021
85572d0
add faiss to raft-exports
trxcllnt Sep 30, 2021
b19a4b6
revert
trxcllnt Sep 30, 2021
6c74570
Merge branch 'fix/build-shared-faiss' into fix/node-rapids-21.10
trxcllnt Sep 30, 2021
14785ff
create a faiss-exports export-set
trxcllnt Sep 30, 2021
25b62e6
mark faiss to be found as part of resolving raft-exports dependencies
trxcllnt Sep 30, 2021
76edf36
create faiss-exports export set
trxcllnt Sep 30, 2021
92851b3
create faiss-exports export set
trxcllnt Sep 30, 2021
8e266ef
export faiss::faiss target to build export set
trxcllnt Oct 5, 2021
9d535fb
Merge branch 'fix/node-rapids-21.10' into fix/build-shared-faiss
trxcllnt Oct 5, 2021
7c990b1
Merge branch 'branch-21.10' of github.com:rapidsai/raft into fix/buil…
trxcllnt Oct 5, 2021
3a20c64
Merge branch 'branch-21.12' of github.com:rapidsai/raft into fix/buil…
trxcllnt Oct 5, 2021
83e3aa6
fix faiss GLOBAL_TARGETS
trxcllnt Oct 6, 2021
2773edc
update rapids-cmake version
trxcllnt Oct 6, 2021
3ff1c4d
fix naming for generated Findfaiss.cmake module so we find conda-inst…
trxcllnt Oct 6, 2021
cbb6f2d
ensure faiss::faiss target is available in build and install export sets
trxcllnt Oct 7, 2021
713a872
add gtest to raft-exports, remove dead code
trxcllnt Oct 7, 2021
ceb4d8b
update CMake faiss variable name
trxcllnt Oct 12, 2021
1d3ffab
make -v set cmake --log-level=VERBOSE
trxcllnt Oct 12, 2021
b48e393
update cuco hash to the commit that uses rapids-cmake v21.12
trxcllnt Oct 13, 2021
dc076ae
Merge branch 'branch-21.12' of github.com:rapidsai/raft into fix/buil…
trxcllnt Oct 14, 2021
e824384
Merge branch 'branch-21.12' of github.com:rapidsai/raft into fix/buil…
trxcllnt Nov 10, 2021
078e33c
Merge branch 'branch-21.12' of github.com:rapidsai/raft into fix/buil…
trxcllnt Nov 11, 2021
30dafb6
Merge branch 'branch-22.02' of github.com:rapidsai/raft into fix/buil…
trxcllnt Nov 18, 2021
aedc6d7
update cuco version
trxcllnt Nov 19, 2021
245ea3d
Merge branch 'branch-22.02' of github.com:rapidsai/raft into fix/buil…
trxcllnt Nov 20, 2021
2b99f1a
Merge branch 'branch-22.02' of github.com:rapidsai/raft into fix/buil…
trxcllnt Dec 1, 2021
bb4b7c7
move faiss into separate raft-faiss-exports export set, only include …
trxcllnt Dec 1, 2021
e798ab1
check for faiss in raft_FIND_COMPONENTS for build side export set too
trxcllnt Dec 2, 2021
2d5898d
pass EXCLUDE_FROM_ALL in get_faiss.cmake
trxcllnt Dec 2, 2021
2cb3c3a
enable CUDA language in code_string
trxcllnt Dec 2, 2021
23912f5
pass TRUE
trxcllnt Dec 2, 2021
fc306f6
drop CUDA from the list of global languages
trxcllnt Dec 2, 2021
61c263a
link raft to faiss if faiss component is requested
trxcllnt Dec 2, 2021
3b8cdf7
HIding implementation details for comms
cjnolet Dec 7, 2021
9e80074
Removing unused include
cjnolet Dec 7, 2021
a14e034
Fixing test
cjnolet Dec 7, 2021
8b62532
Fixing style
cjnolet Dec 7, 2021
ddfcee9
Exposing `initialize_mpi_comms` for cugrpah and cuml
cjnolet Dec 7, 2021
21a2c88
ixing style
cjnolet Dec 7, 2021
2bf24cb
Updating copyrights
cjnolet Dec 7, 2021
41bd51b
link raft to faiss if faiss component requested
trxcllnt Dec 8, 2021
4a95f2e
Merge branch 'branch-22.02' of github.com:rapidsai/raft into fix/buil…
trxcllnt Dec 15, 2021
082eea9
Merge branch 'branch-22.02' of github.com:rapidsai/raft into fix/buil…
trxcllnt Dec 16, 2021
ce8a005
fix FAISS::FAISS -> faiss::faiss
trxcllnt Dec 16, 2021
c39685a
Simplify raft component CMake logic, and allow compilation without FAISS
robertmaynard Dec 20, 2021
af46a09
Add RAFT_ENABLE_NN_COMPONENT CMake build option
robertmaynard Dec 20, 2021
1e9b9bd
Adding option to turn off buildign all shared libs
cjnolet Dec 20, 2021
7fc712c
Refactor get_faiss to export faiss::faiss build target
trxcllnt Dec 20, 2021
a85b085
RAFT_ENABLE_NN_COMPONENT is a dependent option of RAFT_BUILD_SHARED_LIBS
robertmaynard Dec 20, 2021
260a727
Using lowercase faiss target
cjnolet Dec 20, 2021
bb4a7f3
Changing RAFT_BUILD_SHARED_LIBS -> RAFT_COMPILE_LIBRARIES
cjnolet Dec 20, 2021
8c02d2c
Passing deps transitively
cjnolet Dec 21, 2021
ebb7c1d
Moving faiss back to test target. We need to figure out how to pass
cjnolet Dec 21, 2021
c1e9c2d
Another round of cmake cleanups
robertmaynard Dec 22, 2021
fef8446
Merge "Add option to build faiss shared libs" into refactor_cmake_raf…
robertmaynard Dec 22, 2021
52e8635
Refactor again now that I understand the libs are an optimization
robertmaynard Dec 22, 2021
293302f
Move `nn` and `distance` to CMake components
robertmaynard Jan 4, 2022
61e5b3c
Make searching for faiss a controllable option
robertmaynard Jan 6, 2022
3701daf
Merge branch 'branch-22.02' into refactor_cmake_raft_target_logic
robertmaynard Jan 6, 2022
60c81a7
fix typo
trxcllnt Jan 13, 2022
55301e7
Merge branch 'branch-22.02' of github.com:rapidsai/raft into refactor…
trxcllnt Jan 13, 2022
d20e5ef
add GTest targets to install side of raft-exports export set
trxcllnt Jan 14, 2022
1594f7f
install gtest if not already installed
trxcllnt Jan 14, 2022
1c74204
revert get_gtest.cmake changes
trxcllnt Jan 14, 2022
0bc750b
remove raft_ prefix from global targets list
trxcllnt Jan 17, 2022
a3c0370
revert previous change, move target_link_libraries below add_library …
trxcllnt Jan 17, 2022
ea209c1
fix _lib name to add prefix
trxcllnt Jan 18, 2022
7093ad5
add missing PKG_ prefix
trxcllnt Jan 18, 2022
f613e39
guard creating alias targets when added as a submodule via CPM
trxcllnt Jan 18, 2022
9a5aa77
Merge branch 'branch-22.02' into imp-2202-hide_comms_impl
cjnolet Jan 18, 2022
97a3d95
ucp_helper from detail
cjnolet Jan 18, 2022
32bcbb6
Using includes from detail
cjnolet Jan 18, 2022
b33b46a
Removing syntax error from bad merge
cjnolet Jan 19, 2022
b2e0b01
update cuco hash
trxcllnt Jan 19, 2022
ad90d9b
Merge branch 'branch-22.02' of github.com:rapidsai/raft into refactor…
trxcllnt Jan 19, 2022
e31b04c
Merge branch 'branch-22.02' into imp-2202-hide_comms_impl
cjnolet Jan 19, 2022
902657c
Replace the use of RMM's CUDA Python bindings with those from CUDA-Py…
shwina Jan 19, 2022
d7f0074
Add cuda-python to dev env
shwina Jan 19, 2022
c7666ad
Remove older comment about cudaStream_t attribute
shwina Jan 19, 2022
8446e2f
Style
shwina Jan 19, 2022
4de51ed
0 -> cudaSuccess
shwina Jan 19, 2022
9d1bdc5
Add raft_export while rapids-cmake adds a export(COMPONENT) feature
robertmaynard Jan 19, 2022
4f2d14b
Merge remote-tracking branch 'ashwin/replace-rmm-cuda-python-bindings…
cjnolet Jan 19, 2022
ea5e0a4
Merge branch 'refactor_cmake_raft_target_logic' into imp-2202-hide_co…
cjnolet Jan 19, 2022
234c825
Merge branch 'branch-22.04' into imp-2202-hide_comms_impl
cjnolet Jan 25, 2022
93d9a57
Adding get_thrust back
cjnolet Jan 25, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
40 changes: 30 additions & 10 deletions cpp/include/raft/comms/comms.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -38,63 +38,83 @@ enum class status_t {
};

template <typename value_t>
constexpr datatype_t get_type();
constexpr datatype_t

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you put additional space here? Or is this clang-format's job?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't add any additional spaces. This must have been clang-format.

get_type();

template <>
constexpr datatype_t get_type<char>()
constexpr datatype_t

get_type<char>()
{
return datatype_t::CHAR;
}

template <>
constexpr datatype_t get_type<uint8_t>()
constexpr datatype_t

get_type<uint8_t>()
{
return datatype_t::UINT8;
}

template <>
constexpr datatype_t get_type<int>()
constexpr datatype_t

get_type<int>()
{
return datatype_t::INT32;
}

template <>
constexpr datatype_t get_type<uint32_t>()
constexpr datatype_t

get_type<uint32_t>()
{
return datatype_t::UINT32;
}

template <>
constexpr datatype_t get_type<int64_t>()
constexpr datatype_t

get_type<int64_t>()
{
return datatype_t::INT64;
}

template <>
constexpr datatype_t get_type<uint64_t>()
constexpr datatype_t

get_type<uint64_t>()
{
return datatype_t::UINT64;
}

template <>
constexpr datatype_t get_type<float>()
constexpr datatype_t

get_type<float>()
{
return datatype_t::FLOAT32;
}

template <>
constexpr datatype_t get_type<double>()
constexpr datatype_t

get_type<double>()
{
return datatype_t::FLOAT64;
}

class comms_iface {
public:
virtual int get_size() const = 0;

virtual int get_rank() const = 0;

virtual std::unique_ptr<comms_iface> comm_split(int color, int key) const = 0;
virtual void barrier() const = 0;

virtual void barrier() const = 0;

virtual status_t sync_stream(cudaStream_t stream) const = 0;

Expand Down
168 changes: 168 additions & 0 deletions cpp/include/raft/comms/comms_test.hpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,168 @@
/*
* Copyright (c) 2020, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we really place a test file under the main include directory? (i.e. include/raft)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@seunghwak, that's a great point and I'm open to suggestions on this one. In short, even though the file contains test functions, I've exposed them through the public API primarily for the hand-off to the cython layer.

I guess on one hand it could be argued that the functions are internal to RAFT so maybe they could be treated as implementation details. On the other hand, they are public to the C/C++ API of the comms package so we are trying to make explicit the points which are exposed outside of the comms C/C++ layer. If we were only using gtest for these functions, I think they'd just be implemented there, but these functions are also intended to enable some real-time tests for system debugging.

Maybe it would be more explicit if we put them in a raft/comms/test directory? For now, it would be the only file in there unless maybe we split the current file into smaller pieces. Anyways, suggestions are welcome here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but these functions are also intended to enable some real-time tests for system debugging.

If this is the case, I am not opposed to placing these files in the include/raft(/comms). include/raft/comms/test is preferable if there are many files under this directory and we need further grouping but right at this moment, it sounds like an overkill.

If the test files are purely for internal testing, I prefer something like include/raft_test but I guess this is irrelevant.


#pragma once

#include <raft/comms/comms.hpp>
#include <raft/comms/detail/test.hpp>

#include <raft/handle.hpp>

namespace raft {
namespace comms {

/**
* @brief A simple sanity check that NCCL is able to perform a collective operation
*
* @param[in] handle the raft handle to use. This is expected to already have an
* initialized comms instance.
* @param[in] root the root rank id
*/
bool test_collective_allreduce(const handle_t& handle, int root)
{
return detail::test_collective_allreduce(handle, root);
}

/**
* @brief A simple sanity check that NCCL is able to perform a collective operation
*
* @param[in] handle the raft handle to use. This is expected to already have an
* initialized comms instance.
* @param[in] root the root rank id
*/
bool test_collective_broadcast(const handle_t& handle, int root)
{
return detail::test_collective_broadcast(handle, root);
}

/**
* @brief A simple sanity check that NCCL is able to perform a collective reduce
*
* @param[in] handle the raft handle to use. This is expected to already have an
* initialized comms instance.
* @param[in] root the root rank id
*/
bool test_collective_reduce(const handle_t& handle, int root)
{
return detail::test_collective_reduce(handle, root);
}

/**
* @brief A simple sanity check that NCCL is able to perform a collective allgather
*
* @param[in] handle the raft handle to use. This is expected to already have an
* initialized comms instance.
* @param[in] root the root rank id
*/
bool test_collective_allgather(const handle_t& handle, int root)
{
return detail::test_collective_allgather(handle, root);
}

/**
* @brief A simple sanity check that NCCL is able to perform a collective gather
*
* @param[in] handle the raft handle to use. This is expected to already have an
* initialized comms instance.
* @param[in] root the root rank id
*/
bool test_collective_gather(const handle_t& handle, int root)
{
return detail::test_collective_gather(handle, root);
}

/**
* @brief A simple sanity check that NCCL is able to perform a collective gatherv
*
* @param[in] handle the raft handle to use. This is expected to already have an
* initialized comms instance.
* @param[in] root the root rank id
*/
bool test_collective_gatherv(const handle_t& handle, int root)
{
return detail::test_collective_gatherv(handle, root);
}

/**
* @brief A simple sanity check that NCCL is able to perform a collective reducescatter
*
* @param[in] handle the raft handle to use. This is expected to already have an
* initialized comms instance.
* @param[in] root the root rank id
*/
bool test_collective_reducescatter(const handle_t& handle, int root)
{
return detail::test_collective_reducescatter(handle, root);
}

/**
* A simple sanity check that UCX is able to send messages between all ranks
*
* @param[in] h the raft handle to use. This is expected to already have an
* initialized comms instance.
* @param[in] numTrials number of iterations of all-to-all messaging to perform
*/
bool test_pointToPoint_simple_send_recv(const handle_t& h, int numTrials)
{
return detail::test_pointToPoint_simple_send_recv(h, numTrials);
}

/**
* A simple sanity check that device is able to send OR receive.
*
* @param h the raft handle to use. This is expected to already have an
* initialized comms instance.
* @param numTrials number of iterations of send or receive messaging to perform
*/
bool test_pointToPoint_device_send_or_recv(const handle_t& h, int numTrials)
{
return detail::test_pointToPoint_device_send_or_recv(h, numTrials);
}

/**
* A simple sanity check that device is able to send and receive at the same time.
*
* @param h the raft handle to use. This is expected to already have an
* initialized comms instance.
* @param numTrials number of iterations of send or receive messaging to perform
*/
bool test_pointToPoint_device_sendrecv(const handle_t& h, int numTrials)
{
return detail::test_pointToPoint_device_sendrecv(h, numTrials);
}

/**
* A simple sanity check that device is able to perform multiple concurrent sends and receives.
*
* @param h the raft handle to use. This is expected to already have an
* initialized comms instance.
* @param numTrials number of iterations of send or receive messaging to perform
*/
bool test_pointToPoint_device_multicast_sendrecv(const handle_t& h, int numTrials)
{
return detail::test_pointToPoint_device_multicast_sendrecv(h, numTrials);
}

/**
* A simple test that the comms can be split into 2 separate subcommunicators
*
* @param h the raft handle to use. This is expected to already have an
* initialized comms instance.
* @param n_colors number of different colors to test
*/
bool test_commsplit(const handle_t& h, int n_colors) { return detail::test_commsplit(h, n_colors); }
} // namespace comms
}; // namespace raft
Loading