Adding HeterogeneousCore/AlpakaUtilities and changes into HeterogeneousCore/AlpakaInterface #40932

Closed
wants to merge 4 commits
201 changes: 201 additions & 0 deletions HeterogeneousCore/AlpakaInterface/interface/workdivision.h
@@ -302,6 +302,207 @@ namespace cms::alpakatools {
const Vec extent_;
};

Review comment on workdivision.h

Contributor:

@borzari @nothingface0, could you point me to the use cases for

    element_index_range_in_block
    element_index_range_in_block_truncated
    element_index_range_in_grid
    element_index_range_in_grid_truncated

and

    for_each_element_in_block
    for_each_element_in_block_strided
    for_each_element_in_grid_strided

?

Regarding the latter, do you prefer the lambda approach (passing a function object or lambda to the for_each_element_in_... call) or the range-loop approach (getting the index via a range loop, as with elements_with_stride)?

@nothingface0 (Contributor), Mar 6, 2023:

element_index_range_in_block
- workdivision.h: here and here; it is then used by element_index_range_in_block_truncated and element_index_range_in_grid.
- gpuClusterChargeCut.h: here
- gpuPixelRecHits.h: here
- gpuFishbone.h: here
- gpuPixelDoubletsAlgos.h: here

element_index_range_in_block_truncated
- workdivision.h: here; see for_each_element_in_block for implicit uses.

element_index_range_in_grid
- workdivision.h: inside for_each_element_in_grid_strided, here.
- gpuFishbone.h: here
- gpuPixelDoubletsAlgos.h: here

element_index_range_in_grid_truncated
- workdivision.h: only used from an overloaded function here; not used anywhere else, so it can be deleted.

for_each_element_in_block
- workdivision.h: called from an overloaded function here.
- radixSort.h: here, here, here, here, here.

for_each_element_in_block_strided
- workdivision.h: called from an overloaded function here.
- radixSort.h: here, here, here, here, here, here, here, here, here, here, here, and here.

for_each_element_in_grid_strided
- workdivision.h: called from an overloaded function here.
- HistoContainer.h: here, here and here.

Comments

Since we did not have time to look in depth at the patatrack-standalone algorithms, we did not try to unify the different ways of looping over the elements.
That said, personally speaking, the range approach (elements_with_stride) seems more intuitive, and it is also cleaner in the code.

A visual overview of the functions that depend on these functions is shown below for easier comprehension:

[diagram: workdivision_usage.drawio]

/*********************************************
* RANGE COMPUTATION
********************************************/

/*
* Computes the range of element indexes local to the block.
* Warning: the max index is not truncated by the max number of elements of interest.
*/
template <typename TAcc>
ALPAKA_FN_ACC std::pair<Idx, Idx> element_index_range_in_block(const TAcc& acc,
const Idx elementIdxShift,
const unsigned int dimIndex = 0u) {
// Take into account the thread index in block.
const Idx threadIdxLocal(alpaka::getIdx<alpaka::Block, alpaka::Threads>(acc)[dimIndex]);
const Idx threadDimension(alpaka::getWorkDiv<alpaka::Thread, alpaka::Elems>(acc)[dimIndex]);

// Compute the element indexes in the block.
// Relevant for the CPU case only: for GPU, threadDimension == 1, and
// elementIdx == firstElementIdx == threadIdx + elementIdxShift.
const Idx firstElementIdxLocal = threadIdxLocal * threadDimension;
const Idx firstElementIdx = firstElementIdxLocal + elementIdxShift; // Add the shift!
const Idx endElementIdxUncut = firstElementIdx + threadDimension;

// Return element indexes, shifted by elementIdxShift.
return {firstElementIdx, endElementIdxUncut};
}

/*
* Computes the range of element indexes local to the block,
* truncated by the max number of elements of interest.
*/
template <typename TAcc>
ALPAKA_FN_ACC std::pair<Idx, Idx> element_index_range_in_block_truncated(const TAcc& acc,
const Idx maxNumberOfElements,
const Idx elementIdxShift,
const unsigned int dimIndex = 0u) {
auto [firstElementIdxLocal, endElementIdxLocal] = element_index_range_in_block(acc, elementIdxShift, dimIndex);

// Truncate
endElementIdxLocal = std::min(endElementIdxLocal, maxNumberOfElements);

// Return element indexes, shifted by elementIdxShift, and truncated by maxNumberOfElements.
return {firstElementIdxLocal, endElementIdxLocal};
}

/*
* Computes the range of element indexes in the grid.
* Warning: the max index is not truncated by the max number of elements of interest.
*/
template <typename TAcc>
ALPAKA_FN_ACC std::pair<Idx, Idx> element_index_range_in_grid(const TAcc& acc,
Idx elementIdxShift,
const unsigned int dimIndex = 0u) {
// Take into account the block index in grid.
const Idx blockIdxInGrid(alpaka::getIdx<alpaka::Grid, alpaka::Blocks>(acc)[dimIndex]);
const Idx blockDimension(alpaka::getWorkDiv<alpaka::Block, alpaka::Elems>(acc)[dimIndex]);

// Shift to get global indices in grid (instead of local to the block)
elementIdxShift += blockIdxInGrid * blockDimension;

// Return element indexes, shifted by elementIdxShift.
return element_index_range_in_block(acc, elementIdxShift, dimIndex);
}

/*
* Loop over all (CPU) elements.
* The element loop is meaningful in the CPU case only: in the GPU case,
* elementIdx = firstElementIdx = threadIdx + shift.
* Indexes are local to the BLOCK.
*/
template <typename TAcc, typename Func>
ALPAKA_FN_ACC void for_each_element_in_block(const TAcc& acc,
const Idx maxNumberOfElements,
const Idx elementIdxShift,
const Func func,
const unsigned int dimIndex = 0) {
const auto& [firstElementIdx, endElementIdx] =
element_index_range_in_block_truncated(acc, maxNumberOfElements, elementIdxShift, dimIndex);

for (Idx elementIdx = firstElementIdx; elementIdx < endElementIdx; ++elementIdx) {
func(elementIdx);
}
}

/*
* Overload for elementIdxShift = 0
*/
template <typename TAcc, typename Func>
ALPAKA_FN_ACC void for_each_element_in_block(const TAcc& acc,
const Idx maxNumberOfElements,
const Func func,
const unsigned int dimIndex = 0) {
const Idx elementIdxShift = 0;
for_each_element_in_block(acc, maxNumberOfElements, elementIdxShift, func, dimIndex);
}

/**************************************************************
* LOOP ON ALL ELEMENTS WITH ONE LOOP
**************************************************************/

/*
* If the input index i has reached the end of the current per-thread window, stride the index;
* otherwise, do nothing.
* NB 1: This helper is a trick to keep a single loop (as in the legacy code) instead of the two
* nested loops (as in the other Alpaka helpers, e.g. 'for_each_element_in_block_strided')
* required by the additional loop over elements in the Alpaka model.
* This makes it possible to keep 'continue' and 'break' statements as-is from the legacy code,
* and hence avoids a lot of legacy code reshuffling.
* NB 2: Modifies i, firstElementIdx and endElementIdx.
*/
ALPAKA_FN_ACC ALPAKA_FN_INLINE bool next_valid_element_index_strided(
Idx& i, Idx& firstElementIdx, Idx& endElementIdx, const Idx stride, const Idx maxNumberOfElements) {
bool isNextStrideElementValid = true;
if (i == endElementIdx) {
firstElementIdx += stride;
endElementIdx += stride;
i = firstElementIdx;
if (i >= maxNumberOfElements) {
isNextStrideElementValid = false;
}
}
return isNextStrideElementValid;
}

template <typename TAcc, typename Func>
ALPAKA_FN_ACC void for_each_element_in_block_strided(const TAcc& acc,
const Idx maxNumberOfElements,
const Idx elementIdxShift,
const Func func,
const unsigned int dimIndex = 0) {
// Get thread / element indices in block.
const auto& [firstElementIdxNoStride, endElementIdxNoStride] =
element_index_range_in_block(acc, elementIdxShift, dimIndex);

// Stride = block size.
const Idx blockDimension(alpaka::getWorkDiv<alpaka::Block, alpaka::Elems>(acc)[dimIndex]);

// Strided access.
for (Idx threadIdx = firstElementIdxNoStride, endElementIdx = endElementIdxNoStride;
threadIdx < maxNumberOfElements;
threadIdx += blockDimension, endElementIdx += blockDimension) {
// (CPU) Loop on all elements.
if (endElementIdx > maxNumberOfElements) {
endElementIdx = maxNumberOfElements;
}
for (Idx i = threadIdx; i < endElementIdx; ++i) {
func(i);
}
}
}

/*
* Overload for elementIdxShift = 0
*/
template <typename TAcc, typename Func>
ALPAKA_FN_ACC void for_each_element_in_block_strided(const TAcc& acc,
const Idx maxNumberOfElements,
const Func func,
const unsigned int dimIndex = 0) {
const Idx elementIdxShift = 0;
for_each_element_in_block_strided(acc, maxNumberOfElements, elementIdxShift, func, dimIndex);
}

template <typename TAcc, typename Func>
ALPAKA_FN_ACC void for_each_element_in_grid_strided(const TAcc& acc,
const Idx maxNumberOfElements,
const Idx elementIdxShift,
const Func func,
const unsigned int dimIndex = 0) {
// Get thread / element indices in block.
const auto& [firstElementIdxNoStride, endElementIdxNoStride] =
element_index_range_in_grid(acc, elementIdxShift, dimIndex);

// Stride = grid size.
const Idx gridDimension(alpaka::getWorkDiv<alpaka::Grid, alpaka::Elems>(acc)[dimIndex]);

// Strided access.
for (Idx threadIdx = firstElementIdxNoStride, endElementIdx = endElementIdxNoStride;
threadIdx < maxNumberOfElements;
threadIdx += gridDimension, endElementIdx += gridDimension) {
// (CPU) Loop on all elements.
if (endElementIdx > maxNumberOfElements) {
endElementIdx = maxNumberOfElements;
}
for (Idx i = threadIdx; i < endElementIdx; ++i) {
func(i);
}
}
}

/*
* Overload for elementIdxShift = 0
*/
template <typename TAcc, typename Func>
ALPAKA_FN_ACC void for_each_element_in_grid_strided(const TAcc& acc,
const Idx maxNumberOfElements,
const Func func,
const unsigned int dimIndex = 0) {
const Idx elementIdxShift = 0;
for_each_element_in_grid_strided(acc, maxNumberOfElements, elementIdxShift, func, dimIndex);
}

} // namespace cms::alpakatools

#endif // HeterogeneousCore_AlpakaInterface_interface_workdivision_h
6 changes: 6 additions & 0 deletions HeterogeneousCore/AlpakaUtilities/BuildFile.xml
@@ -0,0 +1,6 @@
<use name="HeterogeneousCore/AlpakaInterface"/>
<export>
<lib name="1"/>
</export>


53 changes: 53 additions & 0 deletions HeterogeneousCore/AlpakaUtilities/interface/AtomicPairCounter.h
@@ -0,0 +1,53 @@
#ifndef AlpakaCore_AtomicPairCounter_h
#define AlpakaCore_AtomicPairCounter_h

#include <cstdint>

#include <alpaka/alpaka.hpp>

namespace cms::alpakatools {

class AtomicPairCounter {
public:
using c_type = unsigned long long int;

ALPAKA_FN_HOST_ACC AtomicPairCounter() {}
ALPAKA_FN_HOST_ACC AtomicPairCounter(c_type i) { counter.ac = i; }

ALPAKA_FN_HOST_ACC AtomicPairCounter& operator=(c_type i) {
counter.ac = i;
return *this;
}

struct Counters {
uint32_t n; // in a "one to many" association: the number of "ones"
uint32_t m; // in a "one to many" association: the total number of associations
};

union Atomic2 {
Counters counters;
c_type ac;
};

static constexpr c_type incr = 1UL << 32;

ALPAKA_FN_ACC Counters get() const { return counter.counters; }

// Increment n by 1 and m by i; return the previous value.
template <typename TAcc>
ALPAKA_FN_ACC ALPAKA_FN_INLINE Counters add(const TAcc& acc, uint32_t i) {
c_type c = i;
c += incr;

Atomic2 ret;
ret.ac = alpaka::atomicAdd(acc, &counter.ac, c, alpaka::hierarchy::Blocks{});
return ret.counters;
}

private:
Atomic2 counter;
};

} // namespace cms::alpakatools

#endif // AlpakaCore_AtomicPairCounter_h