[HOTFIX] Remove -g from cython compile commands #321

Merged: 41 commits, Sep 16, 2021

Commits
7fc5a22
DOC v21.08 Updates
raydouglass May 19, 2021
bd40f7a
Merge remote-tracking branch 'upstream/branch-21.06' into branch-21.08
ajschmidt8 May 24, 2021
8183ac9
Merge pull request #236 from ajschmidt8/branch-21.08-merge-21.06
ajschmidt8 May 24, 2021
e9606cb
Merge pull request #238 from rapidsai/branch-21.06
GPUtester May 27, 2021
587cff1
Merge pull request #239 from rapidsai/branch-21.06
GPUtester May 27, 2021
6a0605b
Merge pull request #242 from rapidsai/branch-21.06
GPUtester May 29, 2021
5ac095f
Merge pull request #244 from rapidsai/branch-21.06
GPUtester Jun 2, 2021
7f7a443
Merge pull request #245 from rapidsai/branch-21.06
GPUtester Jun 2, 2021
f9be523
Merge pull request #247 from rapidsai/branch-21.06
GPUtester Jun 2, 2021
130d661
Merge pull request #249 from rapidsai/branch-21.06
GPUtester Jun 2, 2021
3e30601
Merge pull request #256 from rapidsai/branch-21.06
GPUtester Jun 7, 2021
fedab76
Update UCX-Py version to 0.21 (#255)
pentschev Jun 7, 2021
f1ea3e0
removing divye from codeowners (#257)
divyegala Jun 7, 2021
21cd7b0
Fix mst knn test build failure due to RMM device_buffer change (#253)
mdoijade Jun 8, 2021
f9b3c49
Update get_rmm.cmake to better support CalVer (#258)
harrism Jun 8, 2021
f65ed02
Pass stream to device_scalar::value() calls. (#259)
harrism Jun 8, 2021
1dc7423
Merge pull request #262 from rapidsai/branch-21.06
GPUtester Jun 8, 2021
1fb6e7c
Merge pull request #267 from rapidsai/branch-21.06
GPUtester Jun 9, 2021
1c1b4a0
Move FAISS ANN wrappers to RAFT (#265)
cjnolet Jun 9, 2021
6c02b59
Revert "pin dask versions in CI (#260)" (#264)
ajschmidt8 Jun 10, 2021
73417b2
Move ANN to RAFT (additional updates) (#270)
cjnolet Jun 10, 2021
926a9c6
Add Grid stride pairwise dist and fused L2 NN kernels (#250)
mdoijade Jun 11, 2021
c5a87f5
Merge remote-tracking branch 'upstream/branch-21.06' into branch-21.08
ajschmidt8 Jun 11, 2021
b963e68
Merge pull request #274 from ajschmidt8/branch-21.08-merge-21.06
ajschmidt8 Jun 11, 2021
bfaa6a0
expose epsilon parameter to allow precision to be specified (#275)
ChuckHastings Jun 15, 2021
806b7fa
Use nested include in destination of install headers to avoid docker permission issues (#263)
dantegd Jun 18, 2021
b266d54
Add lds and sts inline ptx instructions to force vector instruction generation (#273)
mdoijade Jun 21, 2021
2ba5d76
Use the 21.08 branch of rapids-cmake as rmm requires it (#278)
robertmaynard Jun 22, 2021
caa44e6
Sparse semirings cleanup + hash table & batching strategies (#269)
divyegala Jun 23, 2021
03e666e
Add `cuco::cuco` to list of linked libraries (#279)
trxcllnt Jun 24, 2021
4f959fc
Const raft handle in sparse bfknn (#280)
cjnolet Jun 25, 2021
eabaafe
Use `NVIDIA/cuCollections` repo again (#284)
trxcllnt Jul 1, 2021
82061e0
Add chebyshev, canberra, minkowski and hellinger distance metrics (#276)
mdoijade Jul 6, 2021
22a16dd
Always add faiss library alias if it's missing (#287)
trxcllnt Jul 9, 2021
f94780c
Remaining sparse semiring distances (#261)
cjnolet Jul 12, 2021
35411a0
Overlap epilog compute with ldg of next grid stride in pairwise distance & fusedL2NN kernels (#292)
mdoijade Jul 15, 2021
14ff641
Fix support for different input and output types in linalg::reduce (#296)
Nyrio Jul 21, 2021
c527774
Pin max `dask` & `distributed` versions (#301)
galipremsagar Jul 27, 2021
a3af389
Pinning cuco to a specific commit hash for release (#304)
rlratzel Jul 28, 2021
d66067f
update changelog
raydouglass Aug 4, 2021
955fa5e
Remove -g from cython compile commands (#317)
trxcllnt Aug 25, 2021
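Note: the setup.py change that gives this hotfix its title is not among the file diffs rendered below. As a hedged illustration only, a common way to keep debug symbols out of Cython extension builds is to filter -g from the compiler flags that distutils inherits from the Python build configuration. The sketch below is hypothetical and is not the actual patch from #317:

# Hypothetical sketch -- not the actual setup.py diff from #317/#321.
from distutils import sysconfig

def strip_debug_flags() -> None:
    """Remove -g from the cached distutils compiler flags."""
    config_vars = sysconfig.get_config_vars()  # module-level mutable cache
    for key in ("CFLAGS", "OPT"):
        flags = config_vars.get(key)
        if flags:
            config_vars[key] = " ".join(f for f in flags.split() if f != "-g")

strip_debug_flags()  # call before setup()/cythonize() compiles the extensions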
6 changes: 3 additions & 3 deletions .github/CODEOWNERS
@@ -1,11 +1,11 @@
#cpp code owners
cpp/ @divyegala @rapidsai/cuml-cpp-codeowners @rapidsai/cugraph-cpp-codeowners
cpp/ @rapidsai/cuml-cpp-codeowners @rapidsai/cugraph-cpp-codeowners

#python code owners
python/ @divyegala @rapidsai/cuml-python-codeowners @rapidsai/cugraph-python-codeowners
python/ @rapidsai/cuml-python-codeowners @rapidsai/cugraph-python-codeowners

#cmake code owners
**/CMakeLists.txt @divyegala @rapidsai/cuml-cmake-codeowners @rapidsai/cugraph-cmake-codeowners
**/CMakeLists.txt @rapidsai/cuml-cmake-codeowners @rapidsai/cugraph-cmake-codeowners
**/cmake/ @rapidsai/cuml-cmake-codeowners @rapidsai/cugraph-cmake-codeowners
python/setup.py @rapidsai/cuml-cmake-codeowners @rapidsai/cugraph-cmake-codeowners
build.sh @rapidsai/cuml-cmake-codeowners @rapidsai/cugraph-cmake-codeowners
41 changes: 41 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,44 @@
# raft 21.08.00 (4 Aug 2021)

## 🚨 Breaking Changes

- expose epsilon parameter to allow precision to be specified ([#275](https://github.com/rapidsai/raft/pull/275)) [@ChuckHastings](https://github.com/ChuckHastings)

## 🐛 Bug Fixes

- Fix support for different input and output types in linalg::reduce ([#296](https://github.com/rapidsai/raft/pull/296)) [@Nyrio](https://github.com/Nyrio)
- Const raft handle in sparse bfknn ([#280](https://github.com/rapidsai/raft/pull/280)) [@cjnolet](https://github.com/cjnolet)
- Add `cuco::cuco` to list of linked libraries ([#279](https://github.com/rapidsai/raft/pull/279)) [@trxcllnt](https://github.com/trxcllnt)
- Use nested include in destination of install headers to avoid docker permission issues ([#263](https://github.com/rapidsai/raft/pull/263)) [@dantegd](https://github.com/dantegd)
- Update UCX-Py version to 0.21 ([#255](https://github.com/rapidsai/raft/pull/255)) [@pentschev](https://github.com/pentschev)
- Fix mst knn test build failure due to RMM device_buffer change ([#253](https://github.com/rapidsai/raft/pull/253)) [@mdoijade](https://github.com/mdoijade)

## 🚀 New Features

- Add chebyshev, canberra, minkowski and hellinger distance metrics ([#276](https://github.com/rapidsai/raft/pull/276)) [@mdoijade](https://github.com/mdoijade)
- Move FAISS ANN wrappers to RAFT ([#265](https://github.com/rapidsai/raft/pull/265)) [@cjnolet](https://github.com/cjnolet)
- Remaining sparse semiring distances ([#261](https://github.com/rapidsai/raft/pull/261)) [@cjnolet](https://github.com/cjnolet)
- removing divye from codeowners ([#257](https://github.com/rapidsai/raft/pull/257)) [@divyegala](https://github.com/divyegala)

## 🛠️ Improvements

- Pinning cuco to a specific commit hash for release ([#304](https://github.com/rapidsai/raft/pull/304)) [@rlratzel](https://github.com/rlratzel)
- Pin max `dask` & `distributed` versions ([#301](https://github.com/rapidsai/raft/pull/301)) [@galipremsagar](https://github.com/galipremsagar)
- Overlap epilog compute with ldg of next grid stride in pairwise distance & fusedL2NN kernels ([#292](https://github.com/rapidsai/raft/pull/292)) [@mdoijade](https://github.com/mdoijade)
- Always add faiss library alias if it's missing ([#287](https://github.com/rapidsai/raft/pull/287)) [@trxcllnt](https://github.com/trxcllnt)
- Use `NVIDIA/cuCollections` repo again ([#284](https://github.com/rapidsai/raft/pull/284)) [@trxcllnt](https://github.com/trxcllnt)
- Use the 21.08 branch of rapids-cmake as rmm requires it ([#278](https://github.com/rapidsai/raft/pull/278)) [@robertmaynard](https://github.com/robertmaynard)
- expose epsilon parameter to allow precision to be specified ([#275](https://github.com/rapidsai/raft/pull/275)) [@ChuckHastings](https://github.com/ChuckHastings)
- Fix `21.08` forward-merge conflicts ([#274](https://github.com/rapidsai/raft/pull/274)) [@ajschmidt8](https://github.com/ajschmidt8)
- Add lds and sts inline ptx instructions to force vector instruction generation ([#273](https://github.com/rapidsai/raft/pull/273)) [@mdoijade](https://github.com/mdoijade)
- Move ANN to RAFT (additional updates) ([#270](https://github.com/rapidsai/raft/pull/270)) [@cjnolet](https://github.com/cjnolet)
- Sparse semirings cleanup + hash table & batching strategies ([#269](https://github.com/rapidsai/raft/pull/269)) [@divyegala](https://github.com/divyegala)
- Revert "pin dask versions in CI ([#260)" (#264](https://github.com/rapidsai/raft/pull/260)" (#264)) [@ajschmidt8](https://github.com/ajschmidt8)
- Pass stream to device_scalar::value() calls. ([#259](https://github.com/rapidsai/raft/pull/259)) [@harrism](https://github.com/harrism)
- Update get_rmm.cmake to better support CalVer ([#258](https://github.com/rapidsai/raft/pull/258)) [@harrism](https://github.com/harrism)
- Add Grid stride pairwise dist and fused L2 NN kernels ([#250](https://github.com/rapidsai/raft/pull/250)) [@mdoijade](https://github.com/mdoijade)
- Fix merge conflicts ([#236](https://github.com/rapidsai/raft/pull/236)) [@ajschmidt8](https://github.com/ajschmidt8)

# raft 21.06.00 (9 Jun 2021)

## 🐛 Bug Fixes
6 changes: 3 additions & 3 deletions ci/gpu/build.sh
@@ -51,16 +51,16 @@ gpuci_conda_retry install -c conda-forge -c rapidsai -c rapidsai-nightly -c nvid
"rmm=${MINOR_VERSION}" \
"dask-cudf=${MINOR_VERSION}" \
"dask-cuda=${MINOR_VERSION}" \
"ucx-py=0.20.*" \
"ucx-py=0.21.*" \
"rapids-build-env=${MINOR_VERSION}.*" \
"rapids-notebook-env=${MINOR_VERSION}.*" \
"rapids-doc-env=${MINOR_VERSION}.*"

# Install the master version of dask, distributed, and dask-ml
gpuci_logger "Install the master version of dask and distributed"
set -x
pip install "git+https://github.com/dask/distributed.git@2021.05.1" --upgrade --no-deps
pip install "git+https://github.com/dask/dask.git@2021.05.1" --upgrade --no-deps
pip install "git+https://github.com/dask/distributed.git@2021.07.1" --upgrade --no-deps
pip install "git+https://github.com/dask/dask.git@2021.07.1" --upgrade --no-deps
set +x


6 changes: 3 additions & 3 deletions ci/local/old-gpubuild.sh
@@ -56,7 +56,7 @@ gpuci_conda_retry install -c conda-forge -c rapidsai -c rapidsai-nightly -c nvid
"distributed>=2.12.0" \
"dask-cudf=${MINOR_VERSION}" \
"dask-cuda=${MINOR_VERSION}" \
"ucx-py=0.20.*"
"ucx-py=0.21.*"

if [ "$RUN_CUML_LIBCUML_TESTS" = "ON" ] || [ "$RUN_CUML_PRIMS_TESTS" = "ON" ] || [ "$RUN_CUML_PYTHON_TESTS" = "ON" ]; then
gpuci_conda_retry install -c conda-forge -c rapidsai -c rapidsai-nightly -c nvidia \
@@ -81,8 +81,8 @@ fi

# Install the master version of dask, distributed, and dask-ml
set -x
pip install "git+https://github.com/dask/distributed.git@2021.05.1" --upgrade --no-deps
pip install "git+https://github.com/dask/dask.git@2021.05.1" --upgrade --no-deps
pip install "git+https://github.com/dask/distributed.git@2021.07.1" --upgrade --no-deps
pip install "git+https://github.com/dask/dask.git@2021.07.1" --upgrade --no-deps
set +x


17 changes: 13 additions & 4 deletions cpp/CMakeLists.txt
@@ -19,7 +19,7 @@ include(FetchContent)
FetchContent_Declare(
rapids-cmake
GIT_REPOSITORY https://github.com/rapidsai/rapids-cmake.git
GIT_TAG origin/branch-21.06
GIT_TAG origin/branch-21.08
)
FetchContent_MakeAvailable(rapids-cmake)
include(rapids-cmake)
@@ -30,7 +30,7 @@ include(rapids-find)

rapids_cuda_init_architectures(RAFT)

project(RAFT VERSION 21.06.00 LANGUAGES CXX CUDA)
project(RAFT VERSION 21.08.00 LANGUAGES CXX CUDA)

##############################################################################
# - build type ---------------------------------------------------------------
@@ -104,6 +104,7 @@ endif(NOT DISABLE_OPENMP OR NOT ${DISABLE_OPENMP})
# add third party dependencies using CPM
rapids_cpm_init()

include(cmake/thirdparty/get_thrust.cmake)
include(cmake/thirdparty/get_rmm.cmake)
include(cmake/thirdparty/get_cuco.cmake)

@@ -117,6 +118,8 @@ endif()
##############################################################################
# - install targets-----------------------------------------------------------

include(CPack)

add_library(raft INTERFACE)
add_library(raft::raft ALIAS raft)
target_include_directories(raft INTERFACE "$<BUILD_INTERFACE:${RAFT_SOURCE_DIR}/include>"
@@ -130,6 +133,7 @@ INTERFACE
CUDA::cudart
CUDA::cusparse
rmm::rmm
cuco::cuco
)

target_compile_features(raft INTERFACE cxx_std_17 $<BUILD_INTERFACE:cuda_std_17>)
@@ -140,10 +144,15 @@ install(TARGETS raft
)

include(GNUInstallDirs)
install(DIRECTORY include/
DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}
install(DIRECTORY include/raft/
DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}/raft
)

# Temporary install of raft.hpp while the file is removed
install(FILES include/raft.hpp
DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}/raft
)

##############################################################################
# - install export -----------------------------------------------------------
set(doc_string
14 changes: 8 additions & 6 deletions cpp/cmake/thirdparty/get_cuco.cmake
@@ -16,20 +16,22 @@

function(find_and_configure_cuco VERSION)

if(TARGET cuco::cuco)
return()
endif()

rapids_cpm_find(cuco ${VERSION}
GLOBAL_TARGETS cuco cuco::cuco
GLOBAL_TARGETS cuco::cuco
BUILD_EXPORT_SET raft-exports
INSTALL_EXPORT_SET raft-exports
CPM_ARGS
GIT_REPOSITORY https://github.com/NVIDIA/cuCollections.git
GIT_TAG 0b672bbde7c85a79df4d7ca5f82e15e5b4a57700
GIT_TAG b1fea0cbe4c384160740af00f7c8760846539abb
OPTIONS "BUILD_TESTS OFF"
"BUILD_BENCHMARKS OFF"
"BUILD_EXAMPLES OFF"
)

if(NOT TARGET cuco::cuco)
add_library(cuco::cuco ALIAS cuco)
endif()

endfunction()

find_and_configure_cuco(0.0.1)
5 changes: 4 additions & 1 deletion cpp/cmake/thirdparty/get_faiss.cmake
@@ -40,7 +40,10 @@ function(find_and_configure_faiss)

if(FAISS_ADDED)
set(FAISS_GPU_HEADERS ${FAISS_SOURCE_DIR} PARENT_SCOPE)
add_library(FAISS::FAISS ALIAS faiss)
endif()

if(TARGET faiss AND NOT TARGET FAISS::FAISS)
add_library(FAISS::FAISS ALIAS faiss)
endif()

endfunction()
10 changes: 8 additions & 2 deletions cpp/cmake/thirdparty/get_rmm.cmake
@@ -20,13 +20,19 @@ function(find_and_configure_rmm VERSION)
return()
endif()

if(${VERSION} MATCHES [=[([0-9]+)\.([0-9]+)\.([0-9]+)]=])
set(MAJOR_AND_MINOR "${CMAKE_MATCH_1}.${CMAKE_MATCH_2}")
else()
set(MAJOR_AND_MINOR "${VERSION}")
endif()

rapids_cpm_find(rmm ${VERSION}
GLOBAL_TARGETS rmm::rmm
BUILD_EXPORT_SET raft-exports
INSTALL_EXPORT_SET raft-exports
CPM_ARGS
GIT_REPOSITORY https://github.com/rapidsai/rmm.git
GIT_TAG branch-${VERSION}
GIT_TAG branch-${MAJOR_AND_MINOR}
GIT_SHALLOW TRUE
OPTIONS "BUILD_TESTS OFF"
"BUILD_BENCHMARKS OFF"
@@ -36,6 +42,6 @@ function(find_and_configure_rmm VERSION)

endfunction()

set(RAFT_MIN_VERSION_rmm "${RAFT_VERSION_MAJOR}.${RAFT_VERSION_MINOR}")
set(RAFT_MIN_VERSION_rmm "${RAFT_VERSION_MAJOR}.${RAFT_VERSION_MINOR}.00")

find_and_configure_rmm(${RAFT_MIN_VERSION_rmm})
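For illustration, the CalVer handling added above maps a full RAPIDS version such as 21.08.00 to the release branch name branch-21.08, while a bare 21.08 passes through unchanged. A minimal standalone sketch of that mapping, assuming it is run with `cmake -P` (not part of the PR):

# Illustration of the CalVer regex above; not part of the PR diff.
set(VERSION "21.08.00")
if(${VERSION} MATCHES [=[([0-9]+)\.([0-9]+)\.([0-9]+)]=])
  set(MAJOR_AND_MINOR "${CMAKE_MATCH_1}.${CMAKE_MATCH_2}")
else()
  set(MAJOR_AND_MINOR "${VERSION}")
endif()
message(STATUS "GIT_TAG resolves to: branch-${MAJOR_AND_MINOR}")  # branch-21.08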
30 changes: 30 additions & 0 deletions cpp/cmake/thirdparty/get_thrust.cmake
@@ -0,0 +1,30 @@
# =============================================================================
# Copyright (c) 2021, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
# in compliance with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software distributed under the License
# is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
# or implied. See the License for the specific language governing permissions and limitations under
# the License.
# =============================================================================

# Use CPM to find or clone thrust
function(find_and_configure_thrust VERSION)

rapids_cpm_find(
Thrust ${VERSION}
BUILD_EXPORT_SET raft-exports
INSTALL_EXPORT_SET raft-exports
CPM_ARGS
GIT_REPOSITORY https://github.com/NVIDIA/thrust.git
GIT_TAG ${VERSION}
GIT_SHALLOW TRUE
OPTIONS "THRUST_INSTALL OFF")

endfunction()

find_and_configure_thrust(1.12.0)
101 changes: 68 additions & 33 deletions cpp/include/raft/common/device_loads_stores.cuh
@@ -24,60 +24,95 @@ namespace raft {
* @defgroup SmemStores Shared memory store operations
* @{
* @brief Stores to shared memory (both vectorized and non-vectorized forms)
* @param[out] addr shared memory address
 * requires the given shmem pointer to be aligned to the vector
 * length; e.g., for float4 lds/sts the shmem pointer must be
 * aligned to 16 bytes, otherwise the access may silently fail or
 * raise a runtime error.
* @param[out] addr shared memory address (should be aligned to vector size)
* @param[in] x data to be stored at this address
*/
DI void sts(float* addr, const float& x) { *addr = x; }
DI void sts(float* addr, const float (&x)[1]) { *addr = x[0]; }
DI void sts(float* addr, const float& x) {
auto s1 = __cvta_generic_to_shared(reinterpret_cast<float*>(addr));
asm volatile("st.shared.f32 [%0], {%1};" : : "l"(s1), "f"(x));
}
DI void sts(float* addr, const float (&x)[1]) {
auto s1 = __cvta_generic_to_shared(reinterpret_cast<float*>(addr));
asm volatile("st.shared.f32 [%0], {%1};" : : "l"(s1), "f"(x[0]));
}
DI void sts(float* addr, const float (&x)[2]) {
float2 v2 = make_float2(x[0], x[1]);
auto* s2 = reinterpret_cast<float2*>(addr);
*s2 = v2;
auto s2 = __cvta_generic_to_shared(reinterpret_cast<float2*>(addr));
asm volatile("st.shared.v2.f32 [%0], {%1, %2};"
:
: "l"(s2), "f"(x[0]), "f"(x[1]));
}
DI void sts(float* addr, const float (&x)[4]) {
float4 v4 = make_float4(x[0], x[1], x[2], x[3]);
auto* s4 = reinterpret_cast<float4*>(addr);
*s4 = v4;
auto s4 = __cvta_generic_to_shared(reinterpret_cast<float4*>(addr));
asm volatile("st.shared.v4.f32 [%0], {%1, %2, %3, %4};"
:
: "l"(s4), "f"(x[0]), "f"(x[1]), "f"(x[2]), "f"(x[3]));
}

DI void sts(double* addr, const double& x) {
auto s1 = __cvta_generic_to_shared(reinterpret_cast<double*>(addr));
asm volatile("st.shared.f64 [%0], {%1};" : : "l"(s1), "d"(x));
}
DI void sts(double* addr, const double (&x)[1]) {
auto s1 = __cvta_generic_to_shared(reinterpret_cast<double*>(addr));
asm volatile("st.shared.f64 [%0], {%1};" : : "l"(s1), "d"(x[0]));
}
DI void sts(double* addr, const double& x) { *addr = x; }
DI void sts(double* addr, const double (&x)[1]) { *addr = x[0]; }
DI void sts(double* addr, const double (&x)[2]) {
double2 v2 = make_double2(x[0], x[1]);
auto* s2 = reinterpret_cast<double2*>(addr);
*s2 = v2;
auto s2 = __cvta_generic_to_shared(reinterpret_cast<double2*>(addr));
asm volatile("st.shared.v2.f64 [%0], {%1, %2};"
:
: "l"(s2), "d"(x[0]), "d"(x[1]));
}
/** @} */

/**
* @defgroup SmemLoads Shared memory load operations
* @{
* @brief Loads from shared memory (both vectorized and non-vectorized forms)
 * requires the given shmem pointer to be aligned to the vector
 * length; e.g., for float4 lds/sts the shmem pointer must be
 * aligned to 16 bytes, otherwise the access may silently fail or
 * raise a runtime error.
* @param[out] x the data to be loaded
* @param[in] addr shared memory address from where to load
* (should be aligned to vector size)
*/
DI void lds(float& x, float* addr) { x = *addr; }
DI void lds(float (&x)[1], float* addr) { x[0] = *addr; }
DI void lds(float& x, float* addr) {
auto s1 = __cvta_generic_to_shared(reinterpret_cast<float*>(addr));
asm volatile("ld.shared.f32 {%0}, [%1];" : "=f"(x) : "l"(s1));
}
DI void lds(float (&x)[1], float* addr) {
auto s1 = __cvta_generic_to_shared(reinterpret_cast<float*>(addr));
asm volatile("ld.shared.f32 {%0}, [%1];" : "=f"(x[0]) : "l"(s1));
}
DI void lds(float (&x)[2], float* addr) {
auto* s2 = reinterpret_cast<float2*>(addr);
auto v2 = *s2;
x[0] = v2.x;
x[1] = v2.y;
auto s2 = __cvta_generic_to_shared(reinterpret_cast<float2*>(addr));
asm volatile("ld.shared.v2.f32 {%0, %1}, [%2];"
: "=f"(x[0]), "=f"(x[1])
: "l"(s2));
}
DI void lds(float (&x)[4], float* addr) {
auto* s4 = reinterpret_cast<float4*>(addr);
auto v4 = *s4;
x[0] = v4.x;
x[1] = v4.y;
x[2] = v4.z;
x[3] = v4.w;
}
DI void lds(double& x, double* addr) { x = *addr; }
DI void lds(double (&x)[1], double* addr) { x[0] = *addr; }
auto s4 = __cvta_generic_to_shared(reinterpret_cast<float4*>(addr));
asm volatile("ld.shared.v4.f32 {%0, %1, %2, %3}, [%4];"
: "=f"(x[0]), "=f"(x[1]), "=f"(x[2]), "=f"(x[3])
: "l"(s4));
}
DI void lds(double& x, double* addr) {
auto s1 = __cvta_generic_to_shared(reinterpret_cast<double*>(addr));
asm volatile("ld.shared.f64 {%0}, [%1];" : "=d"(x) : "l"(s1));
}
DI void lds(double (&x)[1], double* addr) {
auto s1 = __cvta_generic_to_shared(reinterpret_cast<double*>(addr));
asm volatile("ld.shared.f64 {%0}, [%1];" : "=d"(x[0]) : "l"(s1));
}
DI void lds(double (&x)[2], double* addr) {
auto* s2 = reinterpret_cast<double2*>(addr);
auto v2 = *s2;
x[0] = v2.x;
x[1] = v2.y;
auto s2 = __cvta_generic_to_shared(reinterpret_cast<double2*>(addr));
asm volatile("ld.shared.v2.f64 {%0, %1}, [%2];"
: "=d"(x[0]), "=d"(x[1])
: "l"(s2));
}
/** @} */
