
Possible performance bug with predict nodes and shared memory #131

Open
trivoldus28 opened this issue Oct 10, 2020 · 0 comments

Turns out that the default value for max_shared_memory
https://github.com/funkey/gunpowder/blob/e523b49ca846a9fd46ab6fc0dd1040cc4a4d53b4/gunpowder/tensorflow/nodes/predict.py#L71
does not allocate 1 GB but rather 4 GB, because ctypes.c_float is used as the value type when creating the RawArray:
https://github.com/funkey/gunpowder/blob/e523b49ca846a9fd46ab6fc0dd1040cc4a4d53b4/gunpowder/tensorflow/nodes/predict.py#L90
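For reference, a minimal sketch of the arithmetic (assuming the default is meant to be read as "1 GB", i.e. 1024**3; the RawArray size argument counts elements, not bytes):

```python
import ctypes

# Assumption for this sketch: the default max_shared_memory is intended
# to mean "1 GB", i.e. 1024**3.
max_shared_memory = 1024 ** 3

# RawArray(ctypes.c_float, n) allocates n *elements*, i.e.
# n * sizeof(c_float) bytes, not n bytes.
bytes_allocated = max_shared_memory * ctypes.sizeof(ctypes.c_float)

print(ctypes.sizeof(ctypes.c_float))  # 4 bytes per element
print(bytes_allocated / 1024 ** 3)    # 4.0 GiB, not 1 GiB
```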

With 4 GB for each input and output array, each predict worker allocates 8 GB of shared memory, so a job with 4 workers should need at least 32 GB. Now, here is the performance bug: I did not know about the shared memory requirement and have always run my inference pipeline with 4 workers and only 8 GB of memory (to minimize my resource usage accounting :)). You'd think gunpowder would run out of memory and be killed, but it does not! It turns out that the Python multiprocessing package creates a temp file and mmaps it whenever a RawArray is created: src. So no matter how many workers are run or how much shared memory is allocated, Python will happily chug along, albeit with possible slowdowns from memory being swapped to and from disk.
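A quick way to see this behavior (a sketch, assuming Linux and a CPython 3.x version where RawArray allocations are backed by a temporary file as in the linked src; exact paths and behavior vary across Python versions):

```python
import ctypes
import os
from multiprocessing.sharedctypes import RawArray
from multiprocessing.util import get_temp_dir

# Allocate "1 GB" worth of c_float, i.e. 4 GiB of shared memory. Because the
# pages are backed by an mmap'ed temporary file rather than anonymous memory,
# this succeeds even on a machine without 4 GiB of free RAM; the OS pages the
# data to and from the backing file. Note: zero-initialization may take a while.
arr = RawArray(ctypes.c_float, 1024 ** 3)
print("allocated", ctypes.sizeof(arr) / 1024 ** 3, "GiB of shared memory")

# Where the backing files live is version dependent (e.g. under /tmp/pymp-*
# or /dev/shm); the file-backed mapping shows up in the process's memory map.
print("multiprocessing temp dir:", get_temp_dir())
os.system("grep pym /proc/%d/maps" % os.getpid())
```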

The most immediate slowdown is during initialization, when the array is set to zero. I have seen my inference jobs take more than 15 minutes to initialize all of the RawArrays to disk tempfiles (vs. less than 1 minute when there is enough memory).

The second-order bug shows up at runtime, when data is paged out from main memory to disk. In my inference jobs, once past initialization, 8 GB was actually enough for four workers, but I can imagine scenarios where not enough memory is requested for the job and data gets paged out on every transfer. I don't know exactly what mechanism the OS uses to decide when to page something out of a memory-mapped file, but we should probably avoid this scenario at all times because it can be an opaque performance bug.

My recommendations are:

  1. At the very least, the max_shared_memory argument should be made more transparent to the user, e.g. something like shared_memory_per_worker_GB, from which the appropriate max_shared_memory is calculated (see the sketch below).
  2. The default for max_shared_memory should be decreased substantially. I'm guessing most production jobs won't transfer more than a few hundred MB at a time, so the default could be capped at something like 64 MB or 128 MB, and users with more demanding setups can increase it accordingly.
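To make recommendation 1 concrete, a rough sketch of the conversion (the parameter and helper names here are hypothetical, not gunpowder's actual API):

```python
import ctypes

def max_shared_memory_from_budget(shared_memory_per_worker_GB=0.25):
    """Hypothetical helper: convert a user-facing per-worker budget in GB
    into the element count that RawArray(ctypes.c_float, ...) expects.

    Each predict worker allocates one input and one output RawArray of
    max_shared_memory elements, so the per-worker budget covers two arrays.
    """
    bytes_per_worker = int(shared_memory_per_worker_GB * 1024 ** 3)
    bytes_per_array = bytes_per_worker // 2
    return bytes_per_array // ctypes.sizeof(ctypes.c_float)

# A 0.25 GB per-worker budget yields two 128 MB arrays (recommendation 2),
# i.e. 33554432 c_float elements per array.
print(max_shared_memory_from_budget(0.25))
```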