Add debugging capabilities to the CachingAllocator #45341

Merged
merged 2 commits into from
Jul 1, 2024

fwyzard
Contributor

@fwyzard fwyzard commented Jun 28, 2024

PR description:

Extend the alpaka CachingAllocator to optionally fill, with a configurable value, every memory block that is allocated, cached for re-use, re-used, or deallocated.

Extend the AlpakaService to configure the host and device CachingAllocators.

Add a simple test to load the AlpakaService.
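The four fill points listed above can be illustrated with a toy model. This is a minimal sketch in plain Python, not the alpaka CachingAllocator implementation: the class `ToyCachingAllocator`, its methods, and the use of `bytearray` as a stand-in for device memory are all hypothetical. Only the option names and the example fill values are taken from this PR description. How the real allocator combines the flags by default is not modelled here.

```python
# Toy model only: NOT the alpaka CachingAllocator implementation.
from dataclasses import dataclass, field


@dataclass
class FillConfig:
    # Option names and example values mirror those in this PR description.
    fillAllocations: bool = False
    fillAllocationValue: int = 0xA5
    fillReallocations: bool = False
    fillReallocationValue: int = 0x69
    fillDeallocations: bool = False
    fillDeallocationValue: int = 0x5A
    fillCaches: bool = False
    fillCacheValue: int = 0x96


@dataclass
class ToyCachingAllocator:
    config: FillConfig
    cache: dict = field(default_factory=dict)  # block size -> list of cached blocks

    @staticmethod
    def _fill(block: bytearray, value: int) -> None:
        # Overwrite every byte of the block in place.
        block[:] = bytes([value]) * len(block)

    def allocate(self, size: int) -> bytearray:
        cached = self.cache.get(size)
        if cached:
            block = cached.pop()  # re-use a cached block
            if self.config.fillReallocations:
                self._fill(block, self.config.fillReallocationValue)
        else:
            block = bytearray(size)  # fresh allocation
            if self.config.fillAllocations:
                self._fill(block, self.config.fillAllocationValue)
        return block

    def free(self, block: bytearray) -> None:
        # A returned block is cached for re-use instead of being released.
        if self.config.fillCaches:
            self._fill(block, self.config.fillCacheValue)
        self.cache.setdefault(len(block), []).append(block)

    def release_cache(self) -> list:
        # Release every cached block; return them so the fill can be inspected.
        released = [b for blocks in self.cache.values() for b in blocks]
        for block in released:
            if self.config.fillDeallocations:
                self._fill(block, self.config.fillDeallocationValue)
        self.cache.clear()
        return released
```

With all four flags enabled, a block passes through a distinct byte pattern at each stage of its life, so a stale pointer reading from it shows which stage the block was in when it was read.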


To fill NVIDIA GPU memory with 0xA5 on every allocation or reuse, you can now use

process.AlpakaServiceCudaAsync.deviceAllocator.fillAllocations = True

To fill NVIDIA GPU memory with 0x5A on every deallocation or caching, you can now use

process.AlpakaServiceCudaAsync.deviceAllocator.fillDeallocations = True
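When debugging, it helps to know what these byte patterns look like once a kernel or module reads them back as typed data. The following sketch uses only the Python standard library; the helper name `interpret_fill` is hypothetical, and little-endian byte order is an assumption (it does not matter for a repeated byte, but `struct` needs one to be specified).

```python
import struct


def interpret_fill(value: int) -> dict:
    """Show how four copies of a fill byte read as common 32-bit types."""
    raw = bytes([value]) * 4
    return {
        "hex": raw.hex(),
        "uint32": struct.unpack("<I", raw)[0],
        "int32": struct.unpack("<i", raw)[0],
        "float32": struct.unpack("<f", raw)[0],
    }


# The two default fill values mentioned above.
for fill in (0xA5, 0x5A):
    print(hex(fill), interpret_fill(fill))
```

Read as a 32-bit float, a 0xA5 fill gives a tiny negative number and a 0x5A fill gives a very large positive one, so values of either kind in an output product suggest a read from uninitialised or already released memory.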

To use different values and combinations for allocation, deallocation, caching, and reuse, the full options are

process.AlpakaServiceCudaAsync.deviceAllocator.fillAllocations = True
process.AlpakaServiceCudaAsync.deviceAllocator.fillAllocationValue = 0xA5
process.AlpakaServiceCudaAsync.deviceAllocator.fillReallocations = True
process.AlpakaServiceCudaAsync.deviceAllocator.fillReallocationValue = 0x69
process.AlpakaServiceCudaAsync.deviceAllocator.fillDeallocations = True
process.AlpakaServiceCudaAsync.deviceAllocator.fillDeallocationValue = 0x5A
process.AlpakaServiceCudaAsync.deviceAllocator.fillCaches = True
process.AlpakaServiceCudaAsync.deviceAllocator.fillCacheValue = 0x96

To do the same for the pinned host memory used in the GPU transfers, process.AlpakaServiceCudaAsync.hostAllocator accepts the same options.

To do the same for AMD GPUs, replace AlpakaServiceCudaAsync with AlpakaServiceROCmAsync.

To do the same for the CPU memory used by the alpaka modules running on the host, replace AlpakaServiceCudaAsync with AlpakaServiceSerialSync.
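The substitutions described above (a different service name, or `hostAllocator` instead of `deviceAllocator`) can be wrapped in a small helper. This is a sketch under stated assumptions: `enable_fill` is a hypothetical function and not part of CMSSW, and the `SimpleNamespace` object only stands in for a real `cms.Process` so that the example runs without a CMSSW environment. With a real process, it assumes the service has already been loaded and that the named allocator parameter set exists on it.

```python
from types import SimpleNamespace

# Service names taken from the PR description.
SERVICES = (
    "AlpakaServiceCudaAsync",
    "AlpakaServiceROCmAsync",
    "AlpakaServiceSerialSync",
)


def enable_fill(process, service, allocator="deviceAllocator",
                allocation=0xA5, reallocation=0x69,
                deallocation=0x5A, cache=0x96):
    """Hypothetical helper: switch on all four fill options for one allocator."""
    if service not in SERVICES:
        raise ValueError(f"unknown service: {service}")
    pset = getattr(getattr(process, service), allocator)
    pset.fillAllocations = True
    pset.fillAllocationValue = allocation
    pset.fillReallocations = True
    pset.fillReallocationValue = reallocation
    pset.fillDeallocations = True
    pset.fillDeallocationValue = deallocation
    pset.fillCaches = True
    pset.fillCacheValue = cache
    return pset


# Stand-in for a cms.Process with the CUDA service loaded (assumption).
process = SimpleNamespace(
    AlpakaServiceCudaAsync=SimpleNamespace(
        deviceAllocator=SimpleNamespace(),
        hostAllocator=SimpleNamespace(),
    )
)

# Fill the pinned host memory used in the GPU transfers.
enable_fill(process, "AlpakaServiceCudaAsync", allocator="hostAllocator")
```

Keeping the service and allocator names as arguments makes it easy to apply the same debugging setup to each backend in turn while chasing a memory error.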

PR validation:

The new unit tests pass.

If this PR is a backport, please specify the original PR and why you need to backport it. If this PR will be backported, please specify the release cycle the backport is meant for:

To be backported to 14.0.x for data taking.

fwyzard added 2 commits June 28, 2024 16:34
Extend the CachingAllocator to optionally fill with a configurable value all
memory blocks that are: allocated, cached for re-use, re-used, or deallocated.

Extend the AlpakaService to configure the host and device CachingAllocators.
@fwyzard
Contributor Author

fwyzard commented Jun 28, 2024

enable gpu

@fwyzard
Contributor Author

fwyzard commented Jun 28, 2024

please test

@cmsbuild
Contributor

cmsbuild commented Jun 28, 2024

cms-bot internal usage

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-45341/40754

@cmsbuild
Contributor

A new Pull Request was created by @fwyzard for master.

It involves the following packages:

  • HeterogeneousCore/AlpakaInterface (heterogeneous)
  • HeterogeneousCore/AlpakaServices (heterogeneous)

@fwyzard, @makortel can you please review it and eventually sign? Thanks.
@makortel, @missirol, @rovere this is something you requested to watch as well.
@antoniovilela, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@fwyzard
Contributor Author

fwyzard commented Jun 28, 2024

+heterogeneous

Self-signed because @makortel is still away. I'm happy to address any comments and accept any suggestions to improve the system when he comes back.

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs after it passes the integration tests. This pull request will now be reviewed by the release team before it's merged. @rappoccio, @sextonkennedy, @antoniovilela (and backports should be raised in the release meeting by the corresponding L2)

@cmsbuild
Contributor

-1

Failed Tests: HeaderConsistency
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-753c9a/40148/summary.html
COMMIT: 660603a
CMSSW: CMSSW_14_1_X_2024-06-28-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/45341/40148/install.sh to create a dev area with all the needed externals and cmssw changes.

  • DAS Queries: The DAS query tests failed, see the summary page for details.

Comparison Summary

Summary:

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 3
  • DQMHistoTests: Total histograms compared: 39744
  • DQMHistoTests: Total failures: 18
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 39726
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 2 files compared)
  • Checked 8 log files, 10 edm output root files, 3 DQM output files
  • TriggerResults: no differences found

@mmusich
Contributor

mmusich commented Jul 1, 2024

ignore tests-rejected with external-failure

@rappoccio
Contributor

+1

@rappoccio
Contributor

merge

@cmsbuild cmsbuild merged commit 919a242 into cms-sw:master Jul 1, 2024
13 of 14 checks passed