Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add {Copy,Move}ToDeviceCache<T> class templates and moveToDeviceAsync function template #43969

Merged
merged 9 commits into from
Dec 18, 2024

Conversation

makortel
Copy link
Contributor

@makortel makortel commented Feb 14, 2024

PR description:

Resolves cms-sw/framework-team#790

CopyToDeviceCache<T>

The primary purpose of this PR is to add CopyToDeviceCache<T> class template, that allows copying "arbitrary data" to all devices of a backend within an EDProducer (e.g. in a constructor or initializeGlobalCache(). The use case is to avoid (mis)using EventSetup for the purposes of copying EDProducer configuration parameters to the devices, following the discussion in #43130 (comment). The CopyToDeviceCache relies on the specializations CopyToDevice<T> to deduce the corresponding device-side data type, and to do the actual copy. The copy is done synchronously in the object constructor.

For testing purposes I wanted to use Alpaka buffers, so I added partial specializations of CopyToDevice<T> for 0- and 1-dimensional Alpaka buffers.

moveToDeviceAsync()

CopyToDeviceCache<T> copies the object also for the host backends. In order to get some feeling what it would mean to avoid that copy, I explored first similar shortcut for the simpler case of doing it for one device e.g. in EDProducer::produce() body (effectively addressing #43796 (comment)). I took the approach of exploring if the "move concept" could be used (which came up as an idea in some earlier discussion with @fwyzard). This lead to moveToDeviceAsync() function. For host backends it just moves the argument object, and for non-host backends it uses the CopyToDevice<T> to deduce the corresponding device-side type, and to copy the data. Also in the non-host backend case the argument object is moved-from, so any use of the argument object in the caller will result a "use after move" (whatever that behavior then is for the object).

I think the result is not too bad. However, given that Alpaka buffers behave like std::shared_ptr, I decided to require the host-side type to be non-copyable in order to be even remotely sure that the data of the moved-in host object would not be used (mostly written into) in the host code concurrently to the asynchronous copy to the device. I was concerned that otherwise the following kind of mistake would be too easy to do

HostBuffer h_buf;
fill(h_buf);
HostBuffer h_buf2 = h_buf; // h_buf2 points to the same host memory as h_buf
DeviceBuffer d_buf = moveToDeviceAsync(queue, std::move(h_buf));
alpaka::wait(queue);
fill(h_buf2); // on host backend would modify the same memory as h_buf!

The function can be easily used with PortableHostCollection<T> and PortableHostObject<T> as they explicitly disable copying (for similar reasons).

MoveToDeviceCache<T>

MoveToDeviceCache<T> combines the "copy to all devices" aspect of CopyToDeviceCache<T> with the move semantics and "T must be non-copyable" requirement of moveToDeviceAsync().

Given that the only difference of MoveToDeviceCache<T> and CopyToDeviceCache<T> is the behavior of the constructor, an alternative could be to have a single class, where the desired behavior would be selected e.g. with an explicit tag argument. Theoretically one could implement copy and move constructors such that the object is "moved" when possible and otherwise copied, but I though maybe we'd want to explicitly specify the desired behavior (copy vs move on host) to catch cases where a data type becomes accidentally copyable (leading to slower code path on host).

Allow contained object of PortableHostObject<T> to be initialized in the constructor

While crafting tests for the earlier cases I came to wonder if it would make sense to allow the contained object of PortableHostObject<T> (i.e. T) to be initialized directly in the PortableHostObject<T> constructor, rather than always having to first create the PortableHostObject<T>, and then fill the content. This could be particularly useful when constructing MoveToDeviceCache<PortableHostObject<T>> in the EDProducer initialization list. I added additional constructors that use placement new, and added a requirement that the T must be trivially destructible.

PR validation:

Unit tests run.

@cmsbuild
Copy link
Contributor

cmsbuild commented Feb 14, 2024

cms-bot internal usage

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-43969/38872

  • This PR adds an extra 40KB to repository

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @makortel (Matti Kortelainen) for master.

It involves the following packages:

  • DataFormats/Portable (heterogeneous)
  • HeterogeneousCore/AlpakaCore (heterogeneous)
  • HeterogeneousCore/AlpakaInterface (heterogeneous)
  • HeterogeneousCore/AlpakaTest (heterogeneous)

@fwyzard, @makortel, @cmsbuild can you please review it and eventually sign? Thanks.
@rovere, @missirol this is something you requested to watch as well.
@rappoccio, @antoniovilela, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@makortel
Copy link
Contributor Author

enable gpu

@makortel
Copy link
Contributor Author

@cmsbuild, please test

1 similar comment
@makortel
Copy link
Contributor Author

@cmsbuild, please test

@makortel
Copy link
Contributor Author

@fwyzard Any thoughts?

@@ -182,6 +182,24 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {
output[i] = {x, input[i].y(), input[i].z(), input[i].id(), input[i].flags(), input[i].m()};
}
}

template <typename TAcc, typename = std::enable_if_t<alpaka::isAccelerator<TAcc>>>
ALPAKA_FN_ACC void operator()(TAcc const& acc,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(independently of this PR)

For kernels that are defined in ALPAKA_ACCELERATOR_NAMESPACE, I came to wonder the necessity or usefulness of defining these (member) functions as templates over TAcc. The kernel itself tends to have some kind of assumption on the dimensions, and the caller uses one of the Acc<N>D explicitly with alpaka::exec(). In this particular case the argument could be directly

Suggested change
ALPAKA_FN_ACC void operator()(TAcc const& acc,
ALPAKA_FN_ACC void operator()(Acc1D const& acc,

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-53fbcd/37467/summary.html
COMMIT: 11a42bf
CMSSW: CMSSW_14_1_X_2024-02-14-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/43969/37467/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 48 differences found in the comparisons
  • DQMHistoTests: Total files compared: 3
  • DQMHistoTests: Total histograms compared: 39740
  • DQMHistoTests: Total failures: 2128
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 37612
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 2 files compared)
  • Checked 8 log files, 10 edm output root files, 3 DQM output files
  • TriggerResults: no differences found

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-43969/43065

@cmsbuild
Copy link
Contributor

Pull request #43969 was updated. @cmsbuild, @fwyzard, @makortel can you please check and sign again.

@makortel
Copy link
Contributor Author

@cmsbuild, please test

@fwyzard
Copy link
Contributor

fwyzard commented Dec 18, 2024

+heterogeneous

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs after it passes the integration tests. This pull request will now be reviewed by the release team before it's merged. @antoniovilela, @mandrenguyen, @sextonkennedy, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@cmsbuild
Copy link
Contributor

+1

Size: This PR adds an extra 32KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-53fbcd/43513/summary.html
COMMIT: 25fe7a6
CMSSW: CMSSW_15_0_X_2024-12-17-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/43969/43513/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 24 differences found in the comparisons
  • DQMHistoTests: Total files compared: 7
  • DQMHistoTests: Total histograms compared: 53058
  • DQMHistoTests: Total failures: 906
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 52152
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 6 files compared)
  • Checked 24 log files, 30 edm output root files, 7 DQM output files
  • TriggerResults: no differences found

@mandrenguyen
Copy link
Contributor

+1

@cmsbuild cmsbuild merged commit c7b3ed7 into cms-sw:master Dec 18, 2024
14 checks passed
@makortel makortel deleted the alpakaDeviceCache branch December 18, 2024 15:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Finalize utility class to copy a host side object to all devices
5 participants