Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add AlpakaCore/HistoContainer.h + HistoContainer_t, OneHistoContainer_t and OneToManyAssoc_t tests #165

Merged
merged 32 commits into from
Feb 2, 2021

Conversation

ghugo83
Copy link
Contributor

@ghugo83 ghugo83 commented Jan 27, 2021

Run smoothly and results are consistent with 'pure CUDA' versions.

#include "AlpakaCore/alpakastdAlgorithm.h"
#include "AlpakaCore/prefixScan.h"

using namespace ALPAKA_ACCELERATOR_NAMESPACE;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not do using namespace ... in headers.

Which reminds me that one using namespace alpaka_common; slipped through in #160, apparently I even forgot to comment about it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha yes will remove

assert(ih < int(nh));
h->count(acc, v[i], ih);
}
endElementIdx += gridDimension;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a thought, I'd find it a bit clearer if this increment would be moved in the outer for-loop statement

for (uint32_t threadIdx = firstElementIdxNoStride[0u]; threadIdx < nt; threadIdx += gridDimension, endElementIdx += gridDimension) {

I was very confused first, partially because I was missing this increment of endElementIdx.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes sure. Anyway will probably need to add an additional helper, as the present one is not adapted for the += gridDimension stride.


for (uint32_t i = firstElementIdxGlobal[0u]; i < endElementIdxGlobal[0u]; ++i) {
h->off[i] = 0;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CUDA implementation uses memset. I suppose we could use alpaka::mem::view::set, but would that work only with buffers from alpaka::mem::buf::alloc() (i.e. not with bare pointers)?

Copy link
Contributor Author

@ghugo83 ghugo83 Jan 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes at first I had just tried a alpaka::mem::view::set on the raw pointer (equivalent to the poff in the CUDA version), but that was not working. Also with the additional complication that we do not want to set h to null, but only h->off.
Since ideally we want to keep the same interface as for CUDA (passing around a pointer to Histo as argument of fillManyFromVector), this was the only way to access h->off: on the device, which in a way, makes sense.


const int num_items = Histo::totbins();

auto psum_dBuf = alpaka::mem::buf::alloc<uint32_t, Idx>(device, Vec1::all(num_items));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not using HistoContainer::psws like in CUDA implementation?

Copy link
Contributor Author

@ghugo83 ghugo83 Jan 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically the prefixScan have slightly different interfaces in the CUDA and ALPAKA versions. In the Alpaka version, there is no psws argument for doing a prefixScan.
I had looked at that specific point, and from what I remember, psws also exists in the Alpaka implementation of prefixScan, but as an internal variable (ws or sth).

The issue is that here we are actually interested in storing the ws value to Histo (even if I had looked up, and at least in the pixeltrack-standalone, myHisto->psws seems to never be used eventually).
To preserve the same Histo interface, I have, by safety, decided to set it later on (it actually very simply corresponds to the number of blocks). It is what storePrefixScanWorkingSpace is doing.
I have compared the Histo off and psws data members between CUDA and Alpaka versions and they are identical.

Copy link
Contributor Author

@ghugo83 ghugo83 Jan 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess in ideal it would be better to have the exact same interface for a prefixScan with CUDA and Alpaka, but maybe there are reasons I miss.
In any case was planning to do a review of prefixScan for my understanding:

  • There are calls to alpaka::wait::wait(queue); in between kernels, which I do not understand and maybe are not needed (we should only need those after a copy from device to host, when interested in what has been copied, or before a host function returns; well, whenever the host needs info from device).
  • I see multiplications of the number of elements by sizeof(type), while with Alpaka only the number of elements should be specified (the multiplication with sizeof(type) is already done internally). Hence I guess the tests are done with 'too many' elements.
  • I would like to check the SERIAL and TBB cases and handling of elements to be more confident about it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the Alpaka version, there is no psws argument for doing a prefixScan.

Ok, so the situation is a bit more complicated.

The CUDA version takes pc, that is used for the "synchronization point" to know which of the threads belong to the "last block". Alpaka version does not need that because that synchronization is handled with splitting the code in two kernels.

The Alpaka version takes psum (well the ...FirstStep doesn't seem to really need it), which in CUDA is handled as extern __shared__ T psum[] (does Alpaka support dynamic size array in shared memory?).

In both cases the actual value of pc/psum is irrelevant outside of the kernels, so maybe we could re-purpose HistoContainer::ppsws for the psum here? I think we are not interested in the actual content of psws, it seems to be used only as temporary workspace (and e.g. in Kokkos version HistoContainer does not have psws member at all).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case was planning to do a review of prefixScan for my understanding:

Sounds like it could be a good idea.

  • There are calls to alpaka::wait::wait(queue); in between kernels, which I do not understand and maybe are not needed (we should only need those after a copy from device to host, when interested in what has been copied, or before a host function returns; well, whenever the host needs info from device).

I agree, we should not have alpaka::wait::wait() calls within these helper functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are not interested in the actual content of psws, it seems to be used only as temporary workspace

Yes I had only made a grep within the pixeltrack-standalone project, and there the Histogram data member called psws was never used.
Just doing a git grep in CMSSW, it appears it is never used there either.
I tend to think it is easier to just systematically keep the same interfaces, as it often eventually save issues dowsntream, but true that here, we can indeed also skip it, and simply just remove psws data member from Histo class.

Copy link
Collaborator

@makortel makortel Jan 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can indeed also skip it, and simply just remove psws data member from Histo class.

Note that in https://github.com/cms-patatrack/pixeltrack-standalone/pull/165/files#r565772291 I suggested to keep it and use it for the psum argument of the multiBlockPrefixScan*(). Then you could avoid this memory allocation.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd still like to avoid this memory allocation (either by keeping Histo::psws or trying dynamic shared memory allocation in multiBlockPrefixScanSecondStep.

Copy link
Contributor Author

@ghugo83 ghugo83 Feb 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd still like to avoid this memory allocation (either by keeping Histo::psws or trying dynamic shared memory allocation in multiBlockPrefixScanSecondStep.

Yes doing that now.
Basically, since pc / psum are used internally only in PrefixScan, I do not really see why we should store psum inside Histo (even in place of psws). Additionally, psws was just a pointer to an uint32_t, whereas psum should be an array of num_items elements.

That's why I have just removed psws from Histo.
I thinkpsum should be addressed internally in PrefixScan, by indeed dynamic shared memory allocation, as for CUDA version.
I was thinking doing a prefixScan related PR, also modifying the prefixScan call sites, but ok this point can also be addressed in this PR. Doing a commit ~now :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, psws could not be used for psum because that is used as an array (whereas pc is used as an atomic counter).

queue,
alpaka::kernel::createTaskKernel<Acc1>(
workDivWith1Block, multiBlockPrefixScanSecondStep<uint32_t>(), poff, poff, psum_d, num_items, nblocks));
alpaka::wait::wait(queue);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this wait() really needed or could it be left for the caller to decide whether to block or not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that was a painful segfault to fix. Basically here, workdiv and kernels are defined in host function launchFinalize scope.
If that function returns before the device work is completed, we run into segfaults / spurious results.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I presume a clean way would be to only have one host function instead of intricated host functions in which device work is defined.
But since here this is just a test where we do not care about perf, and want to keep the code similar to the CUDA versions, I just added those.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds strange. I would have naively expected alpaka::queue::enqueue() etc to copy the work division. Or could it be the allocated buffer that must stay alive until the work finishes (that I believe)?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By quick look through Alpaka code they seem to copy both the work division and the functor within the call alpaka::kernel::createTaskKernel() call. This would need to be investigated further (because I'd really want these wait()s to go), but that can be done after this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds strange. I would have naively expected alpaka::queue::enqueue() etc to copy the work division. Or could it be the allocated buffer that must stay alive until the work finishes (that I believe)?

Yes, so looking at the execution task class (here for example with GPU CUDA backend):
http://alpaka-group.github.io/alpaka/TaskKernelGpuUniformCudaHipRt_8hpp_source.html#l00134

Now, a raw pointer is passed as argument (poff), and was obtained from the Alpaka-equivalent to the get() (getPtrNative) from a reference-counting buffer handle.
So, if the handle is reference-counted to 0 before the device work finishes, we end up with a dangling pointer, and we are still screwed.
However, that is not the case, because the owning pointer is h_d, and is defined out of the scope of this function.

So evth should be fine.

Trying to run without the alpaka::wait::wait(queue), I still sometimes get a segfault with ./HistoContainer_t.tbb. It is a spurious segfault, which makes it hard to know whether sth fixes it or not. It seems to be somehow stemming from a synchronization issue indeed, but the issue is not compulsory (should not) be here.
Hence will remove the alpaka::wait::wait(queue).


alpaka::queue::enqueue(queue,
alpaka::kernel::createTaskKernel<Acc1>(workDiv, fillFromVector(), h, nh, v, offsets));
alpaka::wait::wait(queue);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this wait() really needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as before: end of host function scope.


alpaka::mem::view::copy(queue, v_d, v_buf, N);

alpaka::mem::view::set(queue, h_d, 0, 1u);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a corresponding memset in CUDA code.

Copy link
Contributor Author

@ghugo83 ghugo83 Jan 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes true, I have added it because while using h_d several times in a loop, we want to be sure that the next iteration starts with sth clean.
This was also causing segfault.

Copy link
Contributor Author

@ghugo83 ghugo83 Jan 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In practice, I would just declare h_d inside the loop (and in general obviously just keep variable declarations as close as possible to when they are used).
But here I would avoid changing too much the code in addition to the port to Alpaka, to be able to keep the portability 'portable'.


alpaka::queue::enqueue(
queue,
alpaka::kernel::createTaskKernel<Acc1>(workDiv, setZeroBins(), alpaka::mem::view::getPtrNative(hist_dbuf)));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In CUDA all of these are in a single kernel, separated by __syncthreads(). Why are they split here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes here I have followed what was also done in the Kokkos version, since in this test did not care about perf.
Can also push a one-kernel version.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you followed Kokkos version. I don't remember why we split the code there. Probably doesn't matter much (for a unit test).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For memo: In the end, I have made the OneHistoContainer test based on a single kernel version, to be closest as possible to CUDA test version.

alpaka::mem::view::copy(queue, v_dbuf, tr_hbuf, N);

auto a_dbuf = alpaka::mem::buf::alloc<Assoc, Idx>(device, 1u);
alpaka::mem::view::set(queue, a_dbuf, 0, 1u);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a corresponding memset in CUDA code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes just ended up being cleaner.


uint32_t endElementIdx = endElementIdxNoStride[0u];
for (uint32_t threadIdx = firstElementIdxNoStride[0u]; threadIdx < nt; threadIdx += gridDimension) {
for (uint32_t i = threadIdx; i < std::min(endElementIdx, nt); ++i) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this pattern

const auto gridDim = ...;
const auto [firstNoStride, endNoStride] = ...;
for (auto threadIdx = firstNoStride, endElementIdx = endNoStride; threadIdx < MAX; threadIdx += gridDimension, endElementIdx += gridDimension) {
  for (auto i = threadIdx; i < std::min(endElementIdx, MAX); ++i ) {
    <body>
  }
}

repeats in almost(?) every kernel in this PR. That's 4+ lines of repetitive and error prone code. I'm wondering if it would be worth to abstract that along

// the name should be more descriptive...
template <typename T_Acc, typename N, typename Func>
void for_each_element(const T_Acc& acc, const N nitems, Func func) {
  const auto gridDim = ...;
  const auto [firstNoStride, endNoStride] = ...;
  for (auto threadIdx = firstNoStride, endElementIdx = endNoStride; threadIdx < nitems; threadIdx += gridDimension, endElementIdx += gridDimension) {
    for (auto i = threadIdx; i < std::min(endElementIdx, nitems); ++i ) {
      func(i);
    }
  }
}

that could be called here along

const uint32_t nt = offsets[nh];
for_each_element(acc, nt, [&](uint32_t i) {
  auto off = alpaka_std::upper_bound(offsets, offsets + nh + 1, i);
  ...
}

? I know this starts to look like we would be building our own abstraction layer on top of Alpaka, but to me the boilerplace calls for something.

Written that I'm fine if this is left for a future PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes absolutely, was thinking marking this as a to-do comment for this PR.
Maybe in an additional PR indeed is better.

…All workdiv / function object / arguments info are copied anyway, and the owning pointer to the histogram is defined outside the function scope
…pend ALPAKA_ACCELERATOR_NAMESPACE when needed. Could also place entire callers within ALPAKA_ACCELERATOR_NAMESPACE namespace.
@ghugo83
Copy link
Contributor Author

ghugo83 commented Jan 28, 2021

Ok just removed the using namespace ALPAKA_ACCELERATOR_NAMESPACE; and prepended it when needed.
That commit looks a bit ugly though somehow, if not ok let me know, I could also just include the caller functions within a ALPAKA_ACCELERATOR_NAMESPACE namespace

@ghugo83
Copy link
Contributor Author

ghugo83 commented Jan 28, 2021

As a memo 2 points will be addressed outside of this PR:

  • Add an helper function to directly handle thread/element indices in the += gridDimension stride case.
  • Dig a bit more inside prefixScan.

@@ -8,10 +8,10 @@
#include "AlpakaCore/alpakaWorkDivHelper.h"
#include "AlpakaCore/HistoContainer.h"

using namespace ALPAKA_ACCELERATOR_NAMESPACE;

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, but actually here (and in other unit test files) the using namespace ... is ok. It's just in headers where it causes problems (although it does make the namespace of the various types more clear in source files as well).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes but at this stage, found it more consistent to remove it everywhere, rather than having a mix of 'using namespace' and ALPAKA_ACCELERATOR_NAMESPACE::

@makortel
Copy link
Collaborator

That commit looks a bit ugly though somehow, if not ok let me know, I could also just include the caller functions within a ALPAKA_ACCELERATOR_NAMESPACE namespace

Good question. In the Kokkos port (which has a similar structure) the aim was to either put functions in the ALPAKA_ACCELERATOR_NAMESPACE or use template parameters to distinguish with different Accelerators. Although looking at e.g. launchFinalize(), both approaches feel a bit unappealing.

@makortel
Copy link
Collaborator

That commit looks a bit ugly though somehow, if not ok let me know, I could also just include the caller functions within a ALPAKA_ACCELERATOR_NAMESPACE namespace

Good question. In the Kokkos port (which has a similar structure) the aim was to either put functions in the ALPAKA_ACCELERATOR_NAMESPACE or use template parameters to distinguish with different Accelerators. Although looking at e.g. launchFinalize(), both approaches feel a bit unappealing.

I think the template approach would actually become feasible (with look&feel close to Kokkos version) if we'd have only one Accelerator type for each backend (#144).

@ghugo83
Copy link
Contributor Author

ghugo83 commented Jan 29, 2021

Yes usually I would not have used 'using namespace', but here the fact that the namespace is acc-dependent feels a bit weird.
Or maybe this namespace could be renamed to sth shorter?
Like alpaka_acc::

I think the template approach would actually become feasible (with look&feel close to Kokkos version) if we'd have only one Accelerator type for each backend (#144).

Ok. Or why not templating on each possible Acc type?
But fully agree, not sure it is nicer than just using a namespace.

…g to call element_global_index_range for each possible max number of elements.
…elper function in cms::alpakatools, but this is already nicer.
…which already makes things clearer. NB: TO DO: add a dedicated helper function.
@ghugo83
Copy link
Contributor Author

ghugo83 commented Jan 29, 2021

Ok for the OneHistoContainer test, just added the 1-kernel version instead, as it makes it easier to debug and compare with CUDA tests.

@makortel
Copy link
Collaborator

Yes usually I would not have used 'using namespace', but here the fact that the namespace is acc-dependent feels a bit weird.
Or maybe this namespace could be renamed to sth shorter?
Like alpaka_acc::

The ALPAKA_ACCELERATOR_NAMESPACE is a macro, that is set based on compiler arguments. I think it is good to keep it clear everywhere that it is a macro.

In the original idea (and what happens in Kokkos version) the ALPAKA_ACCELERATOR_NAMESPACE would have to written at most once per file along

namespace ALPAKA_ACCELERATOR_NAMESPACE {
  namespace cms::alpakatools {
  }
}

(which we could do here too)

I think the template approach would actually become feasible (with look&feel close to Kokkos version) if we'd have only one Accelerator type for each backend (#144).

Ok. Or why not templating on each possible Acc type?

I'm not really sure what you mean but I'll guess anyway. The annoyance stems from the function deciding the dimension of an index, and the caller of the function not having to know that. One option would be for the caller to give all the Acc1, Acc2, Acc3 types as three template parameters, but to me that would looks cumbersome as a general pattern. (if you meant something different, please elaborate)

Another option that came to my mind would be to define our own "tag type" for each accelerator, and then a traits class template that could be used to get all the AccN, DevAccN, PltfAccN based on a single template argument. But given other motivations to use only single Accelerator type per backend (#144), I'm not very enthusiastic on this option.

ALPAKA_FN_ACC std::pair<Vec<T_Dim>, Vec<T_Dim>> element_global_index_range(const T_Acc& acc,
const Vec<T_Dim>& maxNumberOfElements) {
template <typename T_Acc, typename T_Dim = alpaka::dim::Dim<T_Acc>>
ALPAKA_FN_ACC std::pair<Vec<T_Dim>, Vec<T_Dim>> element_global_index_range_uncut(const T_Acc& acc) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would name this function as element_global_index_range, and the other something along element_global_index_range_max (preferring to tell what a function does rather than what it does not do). But a minor point for a prototype.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok no pb
yes was to insist on the risky version (the one which has not truncation with max), but yes the other way round can be more elegant. Will change it.


const int num_items = Histo::totbins();

auto psum_dBuf = alpaka::mem::buf::alloc<uint32_t, Idx>(device, Vec1::all(num_items));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd still like to avoid this memory allocation (either by keeping Histo::psws or trying dynamic shared memory allocation in multiBlockPrefixScanSecondStep.

…nstead of global memory. This changes the prefixSCan interface (closer to CUDA version), hence the call sites. Important: To be noted is that in any case, the amount of memory needed was not num_items * sizeof(T), only num_blocks * sizeof(T) is sufficient.
…and element_global_index_range_truncated to compute range truncated by max number of elements of interest.
@ghugo83
Copy link
Contributor Author

ghugo83 commented Feb 1, 2021

but to me that would looks cumbersome as a general pattern

Yes you are right.

The ALPAKA_ACCELERATOR_NAMESPACE is a macro, that is set based on compiler arguments. I think it is good to keep it clear everywhere that it is a macro.

Yes just ended up having ALPAKA_ACCELERATOR_NAMESPACE::Queue, etc, in the end.

template <typename Histo>
ALPAKA_FN_HOST ALPAKA_FN_INLINE __attribute__((always_inline)) void launchFinalize(
Histo *__restrict__ h,
const ALPAKA_ACCELERATOR_NAMESPACE::DevAcc1 &device,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After removing the memory allocation, the device parameter is not needed anymore

Suggested change
const ALPAKA_ACCELERATOR_NAMESPACE::DevAcc1 &device,

uint32_t const *__restrict__ offsets,
uint32_t totSize,
unsigned int nthreads,
const ALPAKA_ACCELERATOR_NAMESPACE::DevAcc1 &device,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

device parameter can be removed from here as well.

Suggested change
const ALPAKA_ACCELERATOR_NAMESPACE::DevAcc1 &device,

alpaka::queue::enqueue(queue,
alpaka::kernel::createTaskKernel<ALPAKA_ACCELERATOR_NAMESPACE::Acc1>(
workDiv, countFromVector(), h, nh, v, offsets));
launchFinalize(h, device, queue);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following removal of device.

Suggested change
launchFinalize(h, device, queue);
launchFinalize(h, queue);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha yes true, this argument is not necessary anymore now, thanks. I just removed it.

@makortel
Copy link
Collaborator

makortel commented Feb 2, 2021

I created two issues #168 and #169 to remind some of the things discussed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants