[FEA] PyTorch and RMM sharing memory pool #501
Comments
There was some internal discussion about a related issue that plagued the 27 HF implementation, and it was suggested that a path forward could be:
Another idea that came up was using RMM within PyTorch, either through an external memory allocator (as was done with CuPy and Numba) or possibly even through direct usage (as has recently been done with XGBoost). Have filed this as issue pytorch/pytorch#43144.
On this usage pattern, it's worth looking at how CuPy did something similar. xref: pytorch/pytorch#33860
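For reference, a minimal sketch of the external-allocator pattern the comments refer to, assuming a recent RMM that exposes `rmm.allocators.cupy.rmm_cupy_allocator` (older releases exposed it as `rmm.rmm_cupy_allocator`); the pool size is illustrative only:

```python
# Sketch: route CuPy's allocations through an RMM pool so CuPy and
# RAPIDS libraries draw from the same pool of GPU memory.
import cupy
import rmm
from rmm.allocators.cupy import rmm_cupy_allocator

# Make a pooled memory resource the current RMM resource (1 GiB initial pool).
rmm.reinitialize(pool_allocator=True, initial_pool_size=2**30)

# Tell CuPy to allocate from RMM instead of its own memory pool.
cupy.cuda.set_allocator(rmm_cupy_allocator)

x = cupy.arange(1000)  # now backed by the RMM pool
```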
This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.
This was closed by #1168. Can we close this?
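For context, PR #1168 added a PyTorch-facing allocator to RMM. A minimal sketch of how it can be wired up, assuming RMM >= 23.02 and PyTorch >= 2.0 (which provides `torch.cuda.memory.change_current_allocator`):

```python
# Sketch: have PyTorch allocate CUDA memory through RMM so that PyTorch
# and RAPIDS libraries (cuDF, CuPy, ...) share one memory pool.
import rmm
import torch
from rmm.allocators.torch import rmm_torch_allocator

# Use a pooled memory resource for all RMM allocations.
rmm.reinitialize(pool_allocator=True)

# Must be called before PyTorch allocates any CUDA memory.
torch.cuda.memory.change_current_allocator(rmm_torch_allocator)

t = torch.zeros(1024, device="cuda")  # served from the RMM pool
```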
Is your feature request related to a problem? Please describe.
Currently I'm running a streamz workflow that uses PyTorch. I notice that I continue to encounter errors like the one below, where PyTorch is not able to allocate enough memory.
I'm wondering if PyTorch and RMM are competing for memory and, if so, whether there's a recommended way to manage this.
Describe the solution you'd like
If possible, for PyTorch and RMM to use the same memory pool, or a recommended method to resolve this type of memory issue.
Describe alternatives you've considered
None
Additional context
The streamz workflow end-to-end can be found here. In short summary, it first initializes a streamz workflow that uses Dask to read in data from Kafka. It then processes that data using cyBERT inferencing, which can be found here. cyBERT uses cudf for data pre-processing steps and a BERT model for inferencing. The processed data is then published back to Kafka.