Add debugging capabilities to the CachingAllocator [14.0.x] #45342

Merged

@fwyzard fwyzard commented Jun 28, 2024

PR description:

Extend the alpaka CachingAllocator to optionally fill memory blocks with a configurable value whenever they are allocated, cached for re-use, re-used, or deallocated.

Extend the AlpakaService to configure the host and device CachingAllocators.

Add a simple test to load the AlpakaService.
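
To illustrate why filling blocks helps debugging, here is a minimal toy model in Python. It is not the alpaka CachingAllocator: the class names, method names, and the fallback rules (reuse falls back to the allocation value, caching falls back to the deallocation value) are assumptions based on the usage described below. Only the default byte values are taken from this PR.

```python
# Toy model only: NOT the alpaka CachingAllocator.
# FillConfig and ToyCachingAllocator are hypothetical names for illustration.
from dataclasses import dataclass


@dataclass
class FillConfig:
    fill_allocations: bool = False
    fill_allocation_value: int = 0xA5
    fill_reallocations: bool = False
    fill_reallocation_value: int = 0x69
    fill_deallocations: bool = False
    fill_deallocation_value: int = 0x5A
    fill_caches: bool = False
    fill_cache_value: int = 0x96


class ToyCachingAllocator:
    def __init__(self, config):
        self.config = config
        self.cache = {}  # block size -> list of cached blocks

    @staticmethod
    def _fill(block, value):
        # Overwrite every byte of the block in place.
        block[:] = bytes([value]) * len(block)

    def allocate(self, size):
        cfg = self.config
        cached = self.cache.get(size)
        if cached:
            # Re-use a cached block.
            block = cached.pop()
            if cfg.fill_reallocations:
                self._fill(block, cfg.fill_reallocation_value)
            elif cfg.fill_allocations:
                self._fill(block, cfg.fill_allocation_value)
            return block
        # No cached block of this size: make a fresh one.
        block = bytearray(size)
        if cfg.fill_allocations:
            self._fill(block, cfg.fill_allocation_value)
        return block

    def free(self, block):
        # Return the block to the cache instead of releasing it.
        cfg = self.config
        if cfg.fill_caches:
            self._fill(block, cfg.fill_cache_value)
        elif cfg.fill_deallocations:
            self._fill(block, cfg.fill_deallocation_value)
        self.cache.setdefault(len(block), []).append(block)

    def release_cache(self):
        # Actually drop all cached blocks, filling them first if requested.
        cfg = self.config
        for blocks in self.cache.values():
            for block in blocks:
                if cfg.fill_deallocations:
                    self._fill(block, cfg.fill_deallocation_value)
        self.cache.clear()


allocator = ToyCachingAllocator(FillConfig(fill_allocations=True, fill_deallocations=True))
a = allocator.allocate(8)
fresh = bytes(a)           # contents right after allocation
a[:4] = b"\x01\x02\x03\x04"
allocator.free(a)
stale = bytes(a)           # what a dangling reference would read after free
b = allocator.allocate(8)  # same underlying block, handed out again
reused = bytes(b)
```

In this sketch a read through a stale reference returns the deallocation pattern rather than plausible old data, and a read of a block that was never written returns the allocation pattern. That is the kind of symptom the fill options are meant to make visible.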


To fill NVIDIA GPU memory with 0xA5 at every allocation or reuse, you can now use:

process.AlpakaServiceCudaAsync.deviceAllocator.fillAllocations = True

To fill NVIDIA GPU memory with 0x5A at every deallocation or caching, you can now use:

process.AlpakaServiceCudaAsync.deviceAllocator.fillDeallocations = True

To use different values and combinations for allocation, deallocation, caching, and reuse, the full set of options is:

process.AlpakaServiceCudaAsync.deviceAllocator.fillAllocations = True
process.AlpakaServiceCudaAsync.deviceAllocator.fillAllocationValue = 0xA5
process.AlpakaServiceCudaAsync.deviceAllocator.fillReallocations = True
process.AlpakaServiceCudaAsync.deviceAllocator.fillReallocationValue = 0x69
process.AlpakaServiceCudaAsync.deviceAllocator.fillDeallocations = True
process.AlpakaServiceCudaAsync.deviceAllocator.fillDeallocationValue = 0x5A
process.AlpakaServiceCudaAsync.deviceAllocator.fillCaches = True
process.AlpakaServiceCudaAsync.deviceAllocator.fillCacheValue = 0x96
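
When inspecting a buffer dump, it can help to recognise these byte patterns and to know what they look like as 32-bit values. The following is a small standalone sketch using only the Python standard library. The helper names are hypothetical, and the byte-to-event mapping assumes the values listed above are in use.

```python
import struct

# Fill bytes quoted in the PR description, mapped to the event they mark.
FILL_BYTES = {
    0xA5: "allocation",
    0x69: "reallocation (reuse)",
    0x5A: "deallocation",
    0x96: "caching",
}


def classify_fill(buffer):
    """Return the event name if every byte equals one known fill byte, else None."""
    if not buffer:
        return None
    first = buffer[0]
    if first in FILL_BYTES and buffer.count(first) == len(buffer):
        return FILL_BYTES[first]
    return None


def as_int32(fill_byte):
    """Interpret four copies of the fill byte as a little-endian signed 32-bit integer."""
    return struct.unpack("<i", bytes([fill_byte]) * 4)[0]


def as_float32(fill_byte):
    """Interpret four copies of the fill byte as a little-endian 32-bit float."""
    return struct.unpack("<f", bytes([fill_byte]) * 4)[0]


for value, event in FILL_BYTES.items():
    print(f"0x{value:02X} ({event}): int32={as_int32(value)}, float32={as_float32(value)!r}")
```

Because all four bytes of each word are identical, byte order does not change the result. A suspicious constant showing up in an integer or float product is then a hint that the memory was read in the state named by `classify_fill`.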

To do the same for the pinned host memory used in the GPU transfers, process.AlpakaServiceCudaAsync.hostAllocator accepts the same options.

To do the same for AMD GPUs, replace AlpakaServiceCudaAsync with AlpakaServiceROCmAsync.

To do the same for the CPU memory used by the alpaka modules running on the host, replace AlpakaServiceCudaAsync with AlpakaServiceSerialSync.
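
To switch the fills on for every backend at once, a loop over the service and allocator names can be used. The sketch below is an assumption-laden illustration: `enable_fill_debugging` is a hypothetical helper, and the `process` object is a `SimpleNamespace` stand-in so the example runs without CMSSW. In a real configuration `process` would be the `cms.Process`, with the services already loaded.

```python
from types import SimpleNamespace

# Service and allocator names as given in the PR description.
ALPAKA_SERVICES = ("AlpakaServiceCudaAsync", "AlpakaServiceROCmAsync", "AlpakaServiceSerialSync")
ALLOCATORS = ("deviceAllocator", "hostAllocator")


def enable_fill_debugging(process, services=ALPAKA_SERVICES, allocators=ALLOCATORS):
    """Enable fillAllocations and fillDeallocations wherever the allocator exists.

    Hypothetical helper: services or allocators missing from `process` are skipped.
    Returns the dotted names of the allocators that were configured.
    """
    configured = []
    for service_name in services:
        service = getattr(process, service_name, None)
        if service is None:
            continue
        for allocator_name in allocators:
            allocator = getattr(service, allocator_name, None)
            if allocator is None:
                continue
            allocator.fillAllocations = True
            allocator.fillDeallocations = True
            configured.append(f"{service_name}.{allocator_name}")
    return configured


# Stand-in for a cms.Process with only the CUDA service loaded (demonstration only).
process = SimpleNamespace(
    AlpakaServiceCudaAsync=SimpleNamespace(
        deviceAllocator=SimpleNamespace(),
        hostAllocator=SimpleNamespace(),
    )
)
configured = enable_fill_debugging(process)
print(configured)
```

Skipping missing services keeps the same helper usable in jobs that load only some of the backends.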

PR validation:

The new unit tests pass.

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

Backport of #45341 to 14.0.x for data taking.

fwyzard added 2 commits June 28, 2024 16:39
Extend the CachingAllocator to optionally fill with a configurable value all
memory blocks that are: allocated, cached for re-use, re-used, or deallocated.

Extend the AlpakaService to configure the host and device CachingAllocators.
@fwyzard
Contributor Author

fwyzard commented Jun 28, 2024

backport #45341

@fwyzard
Contributor Author

fwyzard commented Jun 28, 2024

enable gpu

@fwyzard
Contributor Author

fwyzard commented Jun 28, 2024

please test

@cmsbuild
Contributor

cmsbuild commented Jun 28, 2024

A new Pull Request was created by @fwyzard for CMSSW_14_0_X.

It involves the following packages:

  • HeterogeneousCore/AlpakaInterface (heterogeneous)
  • HeterogeneousCore/AlpakaServices (heterogeneous)

@fwyzard, @makortel can you please review it and eventually sign? Thanks.
@makortel, @missirol, @rovere this is something you requested to watch as well.
@antoniovilela, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@cmsbuild
Contributor

cmsbuild commented Jun 28, 2024

cms-bot internal usage

@fwyzard
Contributor Author

fwyzard commented Jun 28, 2024

+heterogeneous

Self-signed because @makortel is still away. I'm happy to address any comments and accept any suggestions to improve the system when he comes back.

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next CMSSW_14_0_X IBs after it passes the integration tests and once validation in the development release cycle CMSSW_14_1_X is complete. This pull request will now be reviewed by the release team before it's merged. @sextonkennedy, @antoniovilela, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-102098/40147/summary.html
COMMIT: 3830ca7
CMSSW: CMSSW_14_0_X_2024-06-28-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/45342/40147/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 48 differences found in the comparisons
  • DQMHistoTests: Total files compared: 3
  • DQMHistoTests: Total histograms compared: 39744
  • DQMHistoTests: Total failures: 1496
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 38248
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 2 files compared)
  • Checked 8 log files, 10 edm output root files, 3 DQM output files
  • TriggerResults: no differences found

@rappoccio
Contributor

+1

@cmsbuild cmsbuild merged commit 683d65b into cms-sw:CMSSW_14_0_X Jul 2, 2024
13 checks passed