[WIP] Shared memory for NBLAST #87

schlegelp · 2022-04-01T13:38:13Z

With larger NBLASTs, memory consumption becomes an issue. During the NBLAST each child process holds only the scores it works on in memory. On completion, however, that data has to be sent to the parent process which takes time and causes a memory spike (I'm pretty sure it makes a copy). Subsequently, we need to stitch the scores into one big matrix which again takes time and memory.

This PR is work in progress but the basic idea is this:

Use multiprocessing.shared_memory.SharedMemory to reserve some memory
The parent process wraps that reserved memory buffer as numpy array
Each NBLASTing child process is given a reference to the memory buffer and writes the scores it works on directly to that buffer. Because the writes are non-overlapping, we don't have to bother with a queue.

Advantages:

no data sent back from the child processes
scores only ever exist once
no annoying post-NBLAST stitching required (this will also be super useful for nblast_smart)

Disadvantages:

The memory management is a bit more annoying because unlike your run-of-the-mill numpy array the SharedMemory needs to be explicitly released by calling SharedMemory.close(). If one doesn't do this before exiting the session, there is a UserWarning. For now I solved this by hooking into atexit but I'm pretty sure that if the user simply deletes the array/dataframe (i.e. something like del scores) the memory is not freed. That bit needs a bit of exploration - perhaps we can implement some custom __del__ method or something.

I have so far only implemented this for navis.nblast - all other functions in this branch still use the old approach. Some quick and dirty benchmarks show some decent speed up, in particular for large NBLASTs. Haven't done proper memory profiling yet.

@clbarnes any thoughts on this? In particular on the somewhat opaque issue with memory not being freed.

Code reviews welcome (the whole NBLAST module is becoming fairly complex)!

- creates scores array from shared memory buffer - fills array with scores

clbarnes · 2022-04-01T15:10:11Z

Could we use a SharedMemoryManager to handle the cleanup? The context manager form should make it fairly convenient. Doing a single copy of the shared numpy array at the end (to get it out of the shared buffer before it's unlinked) should be a negligible overhead.

I'd probably write a wrapper class around that with a copy method.

schlegelp · 2022-04-01T15:33:32Z

Yes context manager + making a copy of the array would solve that issue. I was thinking mostly of scenarios where the NBLAST score matrix is almost larger than the available memory in which case a copy might tip it over the edge.

Btw: here is a benchmark of NBLASTing the 25k hemibrain neurons against each other. I did cheat by only running a small fraction of the pairwise comparisons (hence this test finished after ~5mins) but the array sizes should have been the same as in a real NBLAST.

Not sure how much I trust the profiler though: these drops in memory consumption are rather odd - are those spin dumps?

clbarnes · 2022-04-01T15:49:26Z

If you don't have the memory to copy the score matrix in memory then you can just do whatever you want to it inside the context manager, I suppose. If you're that tight on memory then you definitely want to be able to manually free the memory (by jumping out of a context manager) rather than waiting for GC to get it in atexit, I think.

clbarnes · 2022-04-01T16:33:13Z

This looks like it would be a good fit: https://pypi.org/project/shared-ndarray2/ , but it's py3.8+ only.

Those memory drops might be the garbage collection running, I guess?

clbarnes · 2022-04-01T16:43:55Z

You could also avoid the copies all being done at the end by writing them in to the dataframe as they come out:

    scores = pd.DataFrame(np.zeros((len(query_dps), len(target_dps)),
                                    dtype=this.dtype),
                          index=query_dps.id, columns=target_dps.id)

    with ProcessPoolExecutor(max_workers=n_cores) as pool:
        # Each nblaster is passed to its own process
        futures = [pool.submit(this.multi_query_target,
                               q_idx=this.queries,
                               t_idx=this.targets,
                               scores=scores) for this in nblasters]
        for f in concurrent.futures.as_completed(futures):
            res = f.result()
            scores.loc[res.index, res.columns] = res.values

I have a suspicion pool.map may be a little better than a list of pool.submit, because it can start pulling results out before the last tasks have even been submitted (although I don't know if it's actually implemented that way).

schlegelp · 2022-04-02T11:51:45Z

Had a look at shared-ndarray2: this would reduce the boiler plate a little but under the hood it does the same as the current code, I think. More importantly: what I hadn't realised is that, multiprocessing.shared_memory (which shared-ndarray2 is also using) is new in Python 3.8. So this PR would require us to bump the minimum Python version (or keep the current implementation) as backup.

clbarnes · 2022-04-04T09:32:20Z

I've voiced my opinions on bumping the minimum version 😅 It doesn't break existing versions of navis, which people are welcome to keep using on their old python versions (this can be noted in the docs); it just means that new improvements and features in navis can make use of new improvements and features in python, rather than the library having to work around limitations which have already been solved in the language. Most of the scientific ecosystem does seem to follow NEP29 so it's not an outlandish stance for navis to take.

schlegelp · 2022-04-04T14:37:45Z

For what it's worth: the main hold up was reticulate catching up and it is now at least using 3.8 - I think they also made it easy to install newer versions with e.g. pyenv. Happy to bump to 3.8 but to be honest except for the shared memory, 3.8 doesn't really add much :D

clbarnes · 2022-04-04T15:34:17Z

Just raised a PR on reticulate rstudio/reticulate#1185 which (I think) explicitly runs tests against python 3.8, 3.9, and 3.10; passes for me https://github.com/clbarnes/reticulate/actions/runs/2090867132 !

schlegelp · 2022-04-04T18:16:29Z

Nice! Wait... so what are they running their tests against now?

clbarnes · 2022-04-04T18:21:25Z

Seems to be whatever python happens to be lying around in the CI environment. Fortunately a certain subset of the python ABI doesn't change very quickly so I wasn't expecting any issues with the newer versions, but it just didn't look like they were tested explicitly.

clbarnes · 2022-04-06T11:56:56Z

Reticulate now explicitly tests against 3.8-10!

schlegelp · 2024-08-16T14:36:02Z

Note to self: we might be better off using multiprocessing.sharedctypes.RawArray. It's basically the same thing as SharedMemory without the need to unlink/close - but also without any guard rails...

Minimal example:

>>> import os 
>>> import psutil
>>> import ctypes
>>> import numpy as np
>>> from multiprocessing.sharedctypes import RawArray
>>> from multiprocessing import Process

>>> # Initialize the RawArray
>>> ARRAY_SHAPE = (10_000, 10_000)
>>> ARRAY_SIZE = int(np.prod(ARRAY_SHAPE))
>>> shared_array = RawArray(ctypes.c_double, ARRAY_SIZE)

>>> # Wrap shared array as numpy array in parent thread
>>> shared_array_np = np.ndarray(ARRAY_SHAPE, dtype=np.float64, buffer=shared_array)

>>> def test_write(buff):
>>>    shared_array = np.ndarray(ARRAY_SHAPE, dtype=np.float64, buffer=buff)
>>>    shared_array[0, :10] = 1  # write a few values

>>> # Pass the shared array to the child process
>>> p = Process(target=test_write, args=(shared_array,))
>>> p.run()
>>> del p   # delete process to remove reference to shared_array
>>> print(shared_array_np)
array([[1., 1., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

>>> # Check memory footprint of parent process
>>> mem_info = psutil.Process(os.getpid()).memory_full_info()
>>> print(f"Resident Set Size: {mem_info.rss / 1e9:.2f}")
Resident Set Size: 0.84Gb

>>> # Importantly, we can delete the buffer variable and the numpy array is just fine 
>>> del shared_array 
>>> import gc
>>> gc.collect()
>>> print(shared_array_np)
array([[1., 1., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

>>> # Memory is also still reserved
>>> mem_info = psutil.Process(os.getpid()).memory_full_info()
>>> print(f"Resident Set Size: {mem_info.rss / 1e9:.2f}")
Resident Set Size: 0.84Gb

>>> # Only after actually deleting the `shared_array_np` is the memory released 
>>> del shared_array_np
>>> mem_info = psutil.Process(os.getpid()).memory_full_info()
>>> print(f"Resident Set Size: {mem_info.rss / 1e9:.2f}")
Resident Set Size: 0.04Gb

Note: when running the above in iPython/Jupyter, memory might not be released right away because of a reference to the array/buffer in the history (check globals()).

schlegelp added 5 commits April 1, 2022 14:09

teach base NBLASTER to accept an array to write scores to

862920a

new class SharedBlaster:

ba03a22

- creates scores array from shared memory buffer - fills array with scores

put precision parsing into extra function

c36b288

make navis.nblast work with shared memory

37661ce

fix docstring

a42e61e

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Shared memory for NBLAST #87

[WIP] Shared memory for NBLAST #87

schlegelp commented Apr 1, 2022 •

edited

Loading

clbarnes commented Apr 1, 2022 •

edited

Loading

schlegelp commented Apr 1, 2022

clbarnes commented Apr 1, 2022 •

edited

Loading

clbarnes commented Apr 1, 2022 •

edited

Loading

clbarnes commented Apr 1, 2022 •

edited

Loading

schlegelp commented Apr 2, 2022 •

edited

Loading

clbarnes commented Apr 4, 2022 •

edited

Loading

schlegelp commented Apr 4, 2022

clbarnes commented Apr 4, 2022

schlegelp commented Apr 4, 2022

clbarnes commented Apr 4, 2022

clbarnes commented Apr 6, 2022

schlegelp commented Aug 16, 2024 •

edited

Loading

[WIP] Shared memory for NBLAST #87

Are you sure you want to change the base?

[WIP] Shared memory for NBLAST #87

Conversation

schlegelp commented Apr 1, 2022 • edited Loading

Advantages:

Disadvantages:

clbarnes commented Apr 1, 2022 • edited Loading

schlegelp commented Apr 1, 2022

clbarnes commented Apr 1, 2022 • edited Loading

clbarnes commented Apr 1, 2022 • edited Loading

clbarnes commented Apr 1, 2022 • edited Loading

schlegelp commented Apr 2, 2022 • edited Loading

clbarnes commented Apr 4, 2022 • edited Loading

schlegelp commented Apr 4, 2022

clbarnes commented Apr 4, 2022

schlegelp commented Apr 4, 2022

clbarnes commented Apr 4, 2022

clbarnes commented Apr 6, 2022

schlegelp commented Aug 16, 2024 • edited Loading

schlegelp commented Apr 1, 2022 •

edited

Loading

clbarnes commented Apr 1, 2022 •

edited

Loading

clbarnes commented Apr 1, 2022 •

edited

Loading

clbarnes commented Apr 1, 2022 •

edited

Loading

clbarnes commented Apr 1, 2022 •

edited

Loading

schlegelp commented Apr 2, 2022 •

edited

Loading

clbarnes commented Apr 4, 2022 •

edited

Loading

schlegelp commented Aug 16, 2024 •

edited

Loading