cupy streams and parla memory leak #121
Open · milindasf opened this issue Jul 31, 2023 · 2 comments
Labels: bug (Something isn't working), help wanted (Extra attention is needed)
Comments

@milindasf (Collaborator)

    task_space = TaskSpace("T")
    a=[None] * num_gpu
    for i in range(num_gpu):
        @spawn(task_space[i], placement=[gpu(i)], vcus=0.0)
        def t1():
            with cp.cuda.Device(i):
                a[i] = cp.zeros(recieve_partitions[i], dtype = sbuff.dtype)
                for j in range(num_gpu):
                    #with cp.cuda.Stream(non_blocking=True) as stream:
                        #a[i][roffsets[i,j] : roffsets[i,j] + rcounts[i,j]] = cp.asarray(sbuff.blockview[j][soffsets[j, i] : soffsets[j, i] + scounts[j, i]])
                    dst = a[i][roffsets[i,j] : roffsets[i,j] + rcounts[i,j]]
                    src = sbuff.blockview[j][soffsets[j, i] : soffsets[j, i] + scounts[j, i]]
                    dst.data.copy_from_async(src.data, src.nbytes)
                
                cp.cuda.runtime.deviceSynchronize()
    
    @spawn(task_space[num_gpu], placement=[cpu], vcus=0.0, dependencies=task_space[0:num_gpu])
    def t2():
        out[0] = xp.array(a, axis=0)
    await task_space[num_gpu]
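
For reference, the commented-out stream path above can be restored roughly as follows. This is only a sketch reusing the variable names from the snippet: each incoming copy is issued on its own non-blocking stream, and the streams are synchronized before they go out of scope.

    with cp.cuda.Device(i):
        # one non-blocking stream per source GPU for the incoming copies
        copy_streams = [cp.cuda.Stream(non_blocking=True) for _ in range(num_gpu)]
        for j in range(num_gpu):
            with copy_streams[j]:
                dst = a[i][roffsets[i, j] : roffsets[i, j] + rcounts[i, j]]
                src = sbuff.blockview[j][soffsets[j, i] : soffsets[j, i] + scounts[j, i]]
                # copy_from_async picks up the current (per-copy) stream
                dst.data.copy_from_async(src.data, src.nbytes)
        # wait for the copies before dropping the stream objects
        for s in copy_streams:
            s.synchronize()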

(ParlaDemo) c196-011rtx$ python sample_sort.py -r 10 -w 5 -n 1000000 -gpu 2 -check 1 -m parla
USE_PYTHON_RUNAHEAD: True
CUPY_ENABLED: True
PREINIT_THREADS: True
DEFAULT SYNC: 0
Namespace(n=1000000, gpu=2, warm_up=5, runs=10, mode='parla', check=1)
[warmup] sample sort passed : True
[warmup] sample sort passed : True
[warmup] sample sort passed : True
[warmup] sample sort passed : True
[warmup] sample sort passed : True
Exception ignored in: <object repr() failed>
Traceback (most recent call last):
File "cupy/cuda/stream.pyx", line 481, in cupy.cuda.stream.Stream.del
AttributeError: 'Stream' object has no attribute 'ptr'
Traceback (most recent call last):
File "/home1/03727/tg830270/Research/parla-experimental/miniapps/samplesort/sample_sort.py", line 427, in
y = sort_func(x)
^^^^^^^^^^^^
File "/home1/03727/tg830270/Research/parla-experimental/miniapps/samplesort/sample_sort.py", line 281, in parla_sample_sort
with Parla():
^^^^^^^
File "/home1/03727/tg830270/miniconda3/envs/ParlaDemo/lib/python3.11/site-packages/parla/init.py", line 61, in init
self._device_manager = DeviceManager(dev_config_file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "device_manager.pyx", line 151, in parla.cython.device_manager.PyDeviceManager.init
File "device_manager.pyx", line 90, in parla.cython.device_manager.StreamPool.init
File "device_manager.pyx", line 92, in parla.cython.device_manager.StreamPool.init
File "device.pyx", line 501, in parla.cython.device.CupyStream.init
File "device.pyx", line 503, in parla.cython.device.CupyStream.init
File "cupy/cuda/stream.pyx", line 471, in cupy.cuda.stream.Stream.init
File "cupy_backends/cuda/api/runtime.pyx", line 840, in cupy_backends.cuda.api.runtime.streamCreateWithFlags
File "cupy_backends/cuda/api/runtime.pyx", line 144, in cupy_backends.cuda.api.runtime.check_status
cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorMemoryAllocation: out of memory
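
To narrow down where the memory goes, a small helper like the one below can be called after every repetition. It is a sketch (not part of sample_sort.py) and only uses CuPy's public memory-pool and runtime APIs:

    import cupy as cp

    def report_gpu_memory(num_gpu):
        # Print CuPy pool usage and raw device free/total memory per GPU.
        for i in range(num_gpu):
            with cp.cuda.Device(i):
                pool = cp.get_default_memory_pool()
                free, total = cp.cuda.runtime.memGetInfo()
                print(f"GPU {i}: pool used={pool.used_bytes()} B, "
                      f"pool held={pool.total_bytes()} B, "
                      f"device free={free} / {total} B")

If "pool held" keeps growing while "pool used" returns to a baseline, the leak is cached blocks inside CuPy's pool; if the raw device free memory keeps shrinking independently of the pool, something outside the pool (streams or other driver allocations) is leaking.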

@wlruys added the "help wanted" and "bug" labels on Jul 31, 2023
@wlruys (Contributor) commented Jul 31, 2023

The streams fall out of scope when the device manager is destroyed:

self.stream_pool = StreamPool(self.get_devices(DeviceType.CUDA))

References to the device_manager are held by (1) the Scheduler and (2) the Parla() context.
Both are destroyed at Parla().__exit__:

del self._device_manager
core.py_write_log(self.logfile)

This runs unless the program crashes or is interrupted, so the streams should be deleted between runtime starts and stops.

My best guess is that another memory leak is happening elsewhere.
Once enough memory has leaked, the next time with Parla() is entered it attempts to create 8*num_gpus streams, cannot find space, and crashes.
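
One quick way to test that guess (a sketch, not Parla API; it only calls CuPy's public pool functions) is to hand all cached-but-unused blocks back to the driver between runs. If the next with Parla() still fails to create its streams, the leaked memory is genuinely held live rather than merely cached:

    import cupy as cp

    def release_cupy_pools(num_gpu):
        # Return cached (currently unused) blocks to the driver so the next
        # Parla() start has room to create its 8 * num_gpu streams.
        for i in range(num_gpu):
            with cp.cuda.Device(i):
                cp.get_default_memory_pool().free_all_blocks()
        cp.get_default_pinned_memory_pool().free_all_blocks()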

@wlruys (Contributor) commented Jul 31, 2023

A reference is also held by:

del self._device_manager
core.py_write_log(self.logfile)

But this is cleaned up on the scheduler thread when the runtime finishes (and on worker threads whenever they complete their work).
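
To confirm that the device manager (and the stream pool it owns) really is released once the runtime finishes, something like the sketch below can be used, assuming the DeviceManager class accepts weak references; _device_manager is internal and only touched here for illustration:

    import gc
    import weakref

    from parla import Parla

    p = Parla()                               # device manager is created in __init__
    dm_ref = weakref.ref(p._device_manager)
    with p:
        pass                                  # run the application tasks here
    gc.collect()
    print("device manager released:", dm_ref() is None)

If dm_ref() is still not None across runs, some thread is still holding the reference described above.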
