[BUG] OOM errors in STATISTICS_TEST and TRACKING_TEST #1486

Closed
bdice opened this issue Feb 27, 2024 · 0 comments · Fixed by #1487
Labels: bug (Something isn't working), cpp (Pertains to C++ code), tests (Related to unit tests)


bdice commented Feb 27, 2024

Describe the bug
Recently, @harrism, @nvdbaranec, and @KyleFromNVIDIA have each reported STATISTICS_TEST and TRACKING_TEST failing with out-of-memory errors.

I am including several examples copied from the logs.
https://github.com/rapidsai/rmm/actions/runs/8054518004/job/21999601232?pr=1479
https://github.com/rapidsai/rmm/actions/runs/8056918096/job/22007169303?pr=1469

[ RUN      ] StatisticsTest.AllFreed
unknown file: Failure
C++ exception with description "std::bad_alloc: out_of_memory: CUDA error at: /opt/conda/conda-bld/work/include/rmm/mr/device/cuda_memory_resource.hpp:60: cudaErrorMemoryAllocation out of memory" thrown in the test body.

[  FAILED  ] StatisticsTest.AllFreed (2340 ms)
[ RUN      ] StatisticsTest.PeakAllocations
unknown file: Failure
C++ exception with description "std::bad_alloc: out_of_memory: CUDA error at: /opt/conda/conda-bld/work/include/rmm/mr/device/cuda_memory_resource.hpp:60: cudaErrorMemoryAllocation out of memory" thrown in the test body.

[  FAILED  ] StatisticsTest.PeakAllocations (16 ms)
[ RUN      ] StatisticsTest.PeakAllocations
unknown file: Failure
C++ exception with description "std::bad_alloc: out_of_memory: CUDA error at: /opt/conda/conda-bld/work/include/rmm/mr/device/cuda_memory_resource.hpp:60: cudaErrorMemoryAllocation out of memory" thrown in the test body.

[  FAILED  ] StatisticsTest.PeakAllocations (1115 ms)
[ RUN      ] StatisticsTest.MultiTracking
unknown file: Failure
C++ exception with description "std::bad_alloc: out_of_memory: CUDA error at: /opt/conda/conda-bld/work/include/rmm/mr/device/cuda_memory_resource.hpp:60: cudaErrorMemoryAllocation out of memory" thrown in the test body.

[  FAILED  ] StatisticsTest.MultiTracking (1 ms)
[ RUN      ] TrackingTest.AllFreed
unknown file: Failure
C++ exception with description "std::bad_alloc: out_of_memory: CUDA error at: /opt/conda/conda-bld/work/include/rmm/mr/device/cuda_memory_resource.hpp:60: cudaErrorMemoryAllocation out of memory" thrown in the test body.
[ RUN      ] TrackingTest.AllFreed
unknown file: Failure
C++ exception with description "std::bad_alloc: out_of_memory: CUDA error at: /opt/conda/conda-bld/work/include/rmm/mr/device/cuda_memory_resource.hpp:60: cudaErrorMemoryAllocation out of memory" thrown in the test body.

[  FAILED  ] TrackingTest.AllFreed (4556 ms)
[ RUN      ] TrackingTest.AllocationsLeftWithStacks
unknown file: Failure
C++ exception with description "std::bad_alloc: out_of_memory: CUDA error at: /opt/conda/conda-bld/work/include/rmm/mr/device/cuda_memory_resource.hpp:60: cudaErrorMemoryAllocation out of memory" thrown in the test body.

[  FAILED  ] TrackingTest.AllocationsLeftWithStacks (1 ms)

Across the logs I saw, the list of failing tests includes:

StatisticsTest.AllFreed
StatisticsTest.MultiTracking
StatisticsTest.PeakAllocations
TrackingTest.AllFreed
TrackingTest.AllocationsLeftWithoutStacks
TrackingTest.AllocationsLeftWithStacks
TrackingTest.MultiTracking

Expected behavior
No OOM errors in the test suite.

Additional context
I will open a PR to serialize the execution of these tests.
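
For readers unfamiliar with serializing tests under CTest, here is a minimal sketch of one generic way to do it. It is not necessarily what the PR does. It assumes tests named `STATISTICS_TEST` and `TRACKING_TEST` have already been registered with `add_test()` in the same `CMakeLists.txt`; the lock name `gpu_memory` is a hypothetical label chosen for illustration.

```cmake
# Sketch only: assumes STATISTICS_TEST and TRACKING_TEST were registered
# with add_test() earlier in this CMakeLists.txt.

# Option 1: CTest never runs a RUN_SERIAL test concurrently with any other
# test, even when invoked with `ctest -j N`.
set_tests_properties(STATISTICS_TEST TRACKING_TEST
                     PROPERTIES RUN_SERIAL TRUE)

# Option 2 (alternative, left commented out): tests that name the same
# RESOURCE_LOCK never overlap with each other, but may still run alongside
# tests that do not hold the lock. "gpu_memory" is an arbitrary name.
# set_tests_properties(STATISTICS_TEST TRACKING_TEST
#                      PROPERTIES RESOURCE_LOCK "gpu_memory")
```

`RUN_SERIAL` is the stricter choice: it keeps every other test off the GPU while these two run, at the cost of some parallelism. `RESOURCE_LOCK` only helps if every memory-hungry test is tagged with the same lock.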

@bdice added the bug and "? - Needs Triage" labels, then removed the "? - Needs Triage" label Feb 27, 2024
@bdice self-assigned this Feb 27, 2024
rapids-bot bot pushed a commit that referenced this issue Feb 27, 2024
…1487)

There have been out-of-memory errors reported in `STATISTICS_TEST` and `TRACKING_TEST`. This PR serializes the execution of those tests, in an attempt to avoid the reported failures.

Closes #1486.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)
  - Robert Maynard (https://github.com/robertmaynard)

URL: #1487
@harrism added the tests and cpp labels Feb 27, 2024
@harrism moved this from Todo to Review in RMM Project Board Feb 27, 2024
@harrism moved this from Review to Done in RMM Project Board Feb 27, 2024