cupy streams and parla memory leak #121
Open · milindasf opened this issue Jul 31, 2023 · 2 comments
Labels: bug (Something isn't working), help wanted (Extra attention is needed)
Comments

@milindasf (Collaborator)

    task_space = TaskSpace("T")
    a=[None] * num_gpu
    for i in range(num_gpu):
        @spawn(task_space[i], placement=[gpu(i)], vcus=0.0)
        def t1():
            with cp.cuda.Device(i):
                a[i] = cp.zeros(recieve_partitions[i], dtype = sbuff.dtype)
                for j in range(num_gpu):
                    #with cp.cuda.Stream(non_blocking=True) as stream:
                        #a[i][roffsets[i,j] : roffsets[i,j] + rcounts[i,j]] = cp.asarray(sbuff.blockview[j][soffsets[j, i] : soffsets[j, i] + scounts[j, i]])
                    dst = a[i][roffsets[i,j] : roffsets[i,j] + rcounts[i,j]]
                    src = sbuff.blockview[j][soffsets[j, i] : soffsets[j, i] + scounts[j, i]]
                    dst.data.copy_from_async(src.data, src.nbytes)
                
                cp.cuda.runtime.deviceSynchronize()
    
    @spawn(task_space[num_gpu], placement=[cpu], vcus=0.0, dependencies=task_space[0:num_gpu])
    def t2():
        out[0] = xp.array(a, axis=0)
    await task_space[num_gpu]
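
For reference, the commented-out stream path above can be restored roughly as follows. This is only a sketch reusing the variable names from the snippet: each incoming copy is issued on its own non-blocking stream, and the streams are synchronized before they go out of scope.

    with cp.cuda.Device(i):
        # one non-blocking stream per source GPU for the incoming copies
        copy_streams = [cp.cuda.Stream(non_blocking=True) for _ in range(num_gpu)]
        for j in range(num_gpu):
            with copy_streams[j]:
                dst = a[i][roffsets[i, j] : roffsets[i, j] + rcounts[i, j]]
                src = sbuff.blockview[j][soffsets[j, i] : soffsets[j, i] + scounts[j, i]]
                # copy_from_async picks up the current (per-copy) stream
                dst.data.copy_from_async(src.data, src.nbytes)
        # wait for the copies before dropping the stream objects
        for s in copy_streams:
            s.synchronize()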

(ParlaDemo) c196-011rtx$ python sample_sort.py -r 10 -w 5 -n 1000000 -gpu 2 -check 1 -m parla
USE_PYTHON_RUNAHEAD: True
CUPY_ENABLED: True
PREINIT_THREADS: True
DEFAULT SYNC: 0
Namespace(n=1000000, gpu=2, warm_up=5, runs=10, mode='parla', check=1)
[warmup] sample sort passed : True
[warmup] sample sort passed : True
[warmup] sample sort passed : True
[warmup] sample sort passed : True
[warmup] sample sort passed : True
Exception ignored in: <object repr() failed>
Traceback (most recent call last):
File "cupy/cuda/stream.pyx", line 481, in cupy.cuda.stream.Stream.del
AttributeError: 'Stream' object has no attribute 'ptr'
Traceback (most recent call last):
File "/home1/03727/tg830270/Research/parla-experimental/miniapps/samplesort/sample_sort.py", line 427, in
y = sort_func(x)
^^^^^^^^^^^^
File "/home1/03727/tg830270/Research/parla-experimental/miniapps/samplesort/sample_sort.py", line 281, in parla_sample_sort
with Parla():
^^^^^^^
File "/home1/03727/tg830270/miniconda3/envs/ParlaDemo/lib/python3.11/site-packages/parla/init.py", line 61, in init
self._device_manager = DeviceManager(dev_config_file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "device_manager.pyx", line 151, in parla.cython.device_manager.PyDeviceManager.init
File "device_manager.pyx", line 90, in parla.cython.device_manager.StreamPool.init
File "device_manager.pyx", line 92, in parla.cython.device_manager.StreamPool.init
File "device.pyx", line 501, in parla.cython.device.CupyStream.init
File "device.pyx", line 503, in parla.cython.device.CupyStream.init
File "cupy/cuda/stream.pyx", line 471, in cupy.cuda.stream.Stream.init
File "cupy_backends/cuda/api/runtime.pyx", line 840, in cupy_backends.cuda.api.runtime.streamCreateWithFlags
File "cupy_backends/cuda/api/runtime.pyx", line 144, in cupy_backends.cuda.api.runtime.check_status
cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorMemoryAllocation: out of memory
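
To narrow down where the memory goes, a small helper like the one below can be called after every repetition. It is a sketch (not part of sample_sort.py) and only uses CuPy's public memory-pool and runtime APIs:

    import cupy as cp

    def report_gpu_memory(num_gpu):
        # Print CuPy pool usage and raw device free/total memory per GPU.
        for i in range(num_gpu):
            with cp.cuda.Device(i):
                pool = cp.get_default_memory_pool()
                free, total = cp.cuda.runtime.memGetInfo()
                print(f"GPU {i}: pool used={pool.used_bytes()} B, "
                      f"pool held={pool.total_bytes()} B, "
                      f"device free={free} / {total} B")

If "pool held" keeps growing while "pool used" returns to a baseline, the leak is cached blocks inside CuPy's pool; if the raw device free memory keeps shrinking independently of the pool, something outside the pool (streams or other driver allocations) is leaking.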

@wlruys added the "help wanted" and "bug" labels on Jul 31, 2023
@wlruys (Contributor) commented Jul 31, 2023

The streams fall out of scope when the device manager is destroyed:

self.stream_pool = StreamPool(self.get_devices(DeviceType.CUDA))

References to the device_manager are held by (1) the Scheduler and (2) the Parla() context.
Both are destroyed at Parla().__exit__:

del self._device_manager
core.py_write_log(self.logfile)

This runs unless the program crashes or is interrupted, so the streams should be deleted between runtime starts and stops.

My best guess is that another memory leak is happening elsewhere.
Once enough memory has leaked, the next time with Parla() is entered it attempts to create 8*num_gpus streams, cannot find space, and crashes.
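
One quick way to test that guess (a sketch, not Parla API; it only calls CuPy's public pool functions) is to hand all cached-but-unused blocks back to the driver between runs. If the next with Parla() still fails to create its streams, the leaked memory is genuinely held live rather than merely cached:

    import cupy as cp

    def release_cupy_pools(num_gpu):
        # Return cached (currently unused) blocks to the driver so the next
        # Parla() start has room to create its 8 * num_gpu streams.
        for i in range(num_gpu):
            with cp.cuda.Device(i):
                cp.get_default_memory_pool().free_all_blocks()
        cp.get_default_pinned_memory_pool().free_all_blocks()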

@wlruys (Contributor) commented Jul 31, 2023

A reference is also held by:

del self._device_manager
core.py_write_log(self.logfile)

But this is cleaned up on the scheduler thread when the runtime finishes (and on worker threads whenever they complete their work).
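
To confirm that the device manager (and the stream pool it owns) really is released once the runtime finishes, something like the sketch below can be used, assuming the DeviceManager class accepts weak references; _device_manager is internal and only touched here for illustration:

    import gc
    import weakref

    from parla import Parla

    p = Parla()                               # device manager is created in __init__
    dm_ref = weakref.ref(p._device_manager)
    with p:
        pass                                  # run the application tasks here
    gc.collect()
    print("device manager released:", dm_ref() is None)

If dm_ref() is still not None across runs, some thread is still holding the reference described above.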
