
[QST] dask-cuda memory management #89

Closed
matthieubulte opened this issue Jul 3, 2019 · 6 comments

Comments

@matthieubulte
Contributor

matthieubulte commented Jul 3, 2019

Hello. First of all thank you very much for the great work on this project, it's really very useful!

I'm currently exploring different options for fitting some simple (and later more complex) statistical models on GPUs, and one of the issues I ran into concerns memory management.

When using Dask + CuPy, I often ran into OOM errors on the GPU. This can probably be explained by the fact that Dask has no idea how much memory is available on the GPU and thus how many chunks it can load at once. This is quite an issue because it would require manually handling the loading/unloading of chunks, which isn't very fun.

I came across this project and realized that you have implemented a two-level cache system to avoid filling the GPU's memory, which is exactly what I'm looking for. However, you also mentioned something in #43 that confused me a little bit:

If using Yarn for one-worker-per-gpu workloads then I would probably just not use dask-cuda-worker, and would instead ask for a single GPU in your Yarn [...]

How would this solve the memory issue? If one worker loads too much data on the GPU by creating too many cupy arrays, then the problem will still be there.

As a more general question: while I understand this is still very much in development (and I would be very happy to contribute if necessary), are there some general guidelines for using Dask with GPUs?

And a last question: It seems to me that the only thing I currently need from this project is the memory management offered by the CUDA worker. Would it be reasonable to use only this part of the library without the whole CUDA cluster? This would be useful for deployment on a Yarn cluster, for instance.

Thanks!

@pentschev
Member

I came across this project and realized that you have implemented a two-level cache system to avoid filling the GPU's memory, which is exactly what I'm looking for.

That is correct. We also have an open PR (#35) for an active memory monitor (just as in Dask distributed); for the time being, we've decided on the cache system only, for simplicity of the code, since we're still not absolutely certain that an active memory monitor would provide a huge benefit. Would you care to test that and provide feedback on whether it helps/solves your issue? Having some feedback would be great in helping us decide how to continue with this.

However, you also mentioned something in #43 that confused me a little bit:

If using Yarn for one-worker-per-gpu workloads then I would probably just not use dask-cuda-worker, and would instead ask for a single GPU in your Yarn [...]

How would this solve the memory issue? If one worker loads too much data on the GPU by creating too many cupy arrays, then the problem will still be there.

I think he meant it in a different situation. If you have multiple workers sharing a single GPU, they would share its memory, making the problem likely worse. Feel free to correct me on this one @mrocklin.

As a more general question: while I understand this is still very much in development (and I would be very happy to contribute if necessary), are there some general guidelines for using Dask with GPUs?

Unfortunately, we don't have guidelines yet, in part because we're still uncertain about the ways people would use dask-cuda, but feel free to give us more details or share code; this would also help us understand what users would like to do. What we do have are some blog posts, but we don't claim they necessarily point to best practices or provide stable APIs:

https://blog.dask.org/2019/01/03/dask-array-gpus-first-steps
https://blog.dask.org/2019/03/18/dask-nep18
https://blog.dask.org/2019/04/09/numba-stencil

And a last question: It seems to me that the only thing I currently need from this project is the memory management offered by the CUDA worker. Would it be reasonable to use only this part of the library without the whole CUDA cluster? This would be useful for deployment on a Yarn cluster, for instance.

Are you asking if you can use only the memory manager without the CUDA worker/cluster created here? If that's your question, I'm not experienced with Yarn, but I don't think you can just use that particular part of the repository with Yarn, at least not without porting the code to Yarn.

@matthieubulte
Contributor Author

Thank you for this long answer, @pentschev!

That is correct. We also have an open PR (#35) for an active memory monitor (just as in Dask distributed); for the time being, we've decided on the cache system only, for simplicity of the code, since we're still not absolutely certain that an active memory monitor would provide a huge benefit. Would you care to test that and provide feedback on whether it helps/solves your issue? Having some feedback would be great in helping us decide how to continue with this.

Yes, I saw this PR. To my understanding this is already partially merged and usable in 9.0, correct? Assuming that is the case, I have been using it, and it was exactly what I needed in terms of stability. The only issue I had with it is that data loading was very slow; I'll get to that in the second part of my comment.

I think he meant it in a different situation. If you have multiple workers sharing a single GPU, they would share its memory, making the problem likely worse. Feel free to correct me on this one @mrocklin.

Thanks for the clarification.

Unfortunately, we don't have guidelines yet, in part because we're still uncertain about the ways people would use dask-cuda, but feel free to give us more details or share code; this would also help us understand what users would like to do. What we do have are some blog posts, but we don't claim they necessarily point to best practices or provide stable APIs:

Thanks. I had already seen these resources; they were very useful for better understanding what's going on under the hood!

Are you asking if you can use only the memory manager without the CUDA worker/cluster created here? If that's your question, I'm not experienced with Yarn, but I don't think you can just use that particular part of the repository with Yarn, at least not without porting the code to Yarn.

Sorry if my question wasn't clear there. What I really need from this repository is the CUDAWorker which -- thanks to the multi-level cache -- allows me to write stable code independent of the capabilities of the GPU(s) I'm using. However, it's not clear to me how to use the CUDAWorker in distributed. I wouldn't mind having to put some work into it, but I haven't used Dask enough to be able to properly estimate the effort.

Going back to my situation and problem, let me give you some context. I'm currently looking for a good way to use GPUs to speed up fitting various statistical models. The end goal is a fast implementation of generalized linear mixed models capable of handling large-ish datasets (~10^9 observations) that can be shared with people with very different technical backgrounds. Right now I'm still looking into available tools and trying to build some understanding, but Dask + CuPy looks really good!

My testing code right now is simply generating a random Dask array of 160GB on the CPU and summing its elements on the GPU. The reason why I'm generating the data on the CPU is that in a more realistic scenario, data will have to be loaded by some custom logic on the CPU. The exact code I'm currently looking at is:

import cupy as cp
import dask.array as da
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

n = 10**8
d = 200

# One worker for the single GPU, spilling device memory to host past ~7 GB
cluster = LocalCUDACluster(ip='0.0.0.0', n_workers=1, device_memory_limit='7000 MiB')
client = Client(cluster)

# Generate the data on the CPU, move each chunk to the GPU, then reduce
da.random.random((n, d), chunks=(10**5, d))\
      .map_blocks(cp.asarray)\
      .sum().compute()

Running this on a g3s.xlarge EC2 instance [1] takes roughly 4 min 35 s. Looking at the task graph, we see that most of the time (>95%) is spent generating the data and sending it to the GPU. I can't exactly read from the graph how much is due to generation and how much is due to data transfer. Furthermore, I tried playing with the chunk size as suggested in #57, but I couldn't observe any difference.

I then tried running the same code with a LocalCluster with two workers (still against 1 GPU) thinking that having two workers generating and loading the data in parallel might speed up the computation, and it did indeed bring the computation down to 2min10s.

My last test was to use 4 workers to see if this reasoning scales up, and unfortunately it doesn't; I guess it was obvious that I would eventually hit some physical limit.

Anyways, the problem here is that the LocalCluster option is pretty bad, since it requires each "user" to tune their chunk size to avoid any OOM. I'm thus currently trying to figure out what would be the best option to get the transparent memory management offered by the CUDAWorker without the resulting loss in performance. Do you have any experience to share about this kind of problem?

Otherwise I would be very happy to continue working on that on my own and keep discussing this matter with you!

[1] https://aws.amazon.com/ec2/instance-types/

@pentschev
Member

Sorry if my question wasn't clear there. What I really need from this repository is the CUDAWorker which -- thanks to the multi-level cache -- allows me to write stable code independent of the capabilities of the GPU(s) I'm using. However, it's not clear to me how to use the CUDAWorker in distributed. I wouldn't mind having to put some work into it, but I haven't used Dask enough to be able to properly estimate the effort.

CUDAWorker is only introduced in #35, which hasn't been merged yet. As I've mentioned before, we currently support the multi-level cache; for that case we don't need a CUDAWorker, we only pass a different data= argument to distributed.Worker, which handles the memory spilling for GPUs too. We only need a CUDAWorker for an active memory monitor, which is the fundamental difference between what's already in branch-0.8 and #35. In either case, you would no longer launch clusters/workers with LocalCluster or the command-line worker dask-worker, but would replace them with LocalCUDACluster or dask-cuda-worker, respectively.
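
For concreteness, here is a rough sketch of that idea, not an official recipe: it assumes DeviceHostFile can be imported from dask_cuda.device_host_file, that a scheduler is already running at tcp://127.0.0.1:8786, and that the argument names match your dask-cuda/distributed versions (details may differ between releases):

# Rough sketch: a plain distributed.Worker with GPU spilling via data=
import asyncio

from distributed import Worker
from dask_cuda.device_host_file import DeviceHostFile


async def run_worker(scheduler_address="tcp://127.0.0.1:8786"):
    # MutableMapping that spills chunks device -> host -> disk
    gpu_aware_storage = DeviceHostFile(
        device_memory_limit=7_000_000_000,  # bytes of device memory before spilling to host
        memory_limit=24_000_000_000,        # bytes of host memory before spilling to disk
    )
    # The worker picks up the spilling behaviour purely through the data=
    # argument; no special worker class is needed for the cache.
    worker = await Worker(scheduler_address, data=gpu_aware_storage)
    await worker.finished()


asyncio.run(run_worker())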

The reason why I'm generating the data on the CPU is that in a more realistic scenario, data will have to be loaded by some custom logic on the CPU.

A bit unrelated, but this depends on your data. If for example you're working with Pandas DataFrames, you might want to check cuDF, which will allow you to load data directly (and faster) to the GPU.
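
For example, a hypothetical sketch (the file name data.csv and the column x are just placeholders, and cudf/dask_cudf need to be installed):

import cudf
import dask_cudf

# Load tabular data straight into GPU memory with cuDF...
gdf = cudf.read_csv("data.csv")
# ...or as a partitioned, GPU-backed Dask collection with dask_cudf
ddf = dask_cudf.read_csv("data.csv")

print(gdf.head())
print(ddf.x.sum().compute())  # reduce a (placeholder) column on the GPU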

Furthermore, I tried playing with the chunk size as suggested in #57 but I couldn't observe any difference.

That is a bit of a different case: it uses cuDF only, no CuPy involved. There the issue is that cuDF uses too much memory, and dask-cuda has no control over that since it's not exposed on the Python side, so that memory can't be spilled and ends up crashing the application. Chunk sizes still play a role even with CuPy; obviously, we can't have chunks larger than the entire GPU memory, for example.
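
As a rough back-of-the-envelope check, using the chunk shape from the snippet above:

import numpy as np

# Memory footprint of one (10**5, 200) float64 chunk
chunk_shape = (10**5, 200)
bytes_per_chunk = np.prod(chunk_shape) * 8  # 8 bytes per float64 element
print(bytes_per_chunk / 1e6, "MB")          # 160.0 MB, well below the 7000 MiB device_memory_limit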

I then tried running the same code with a LocalCluster with two workers (still against 1 GPU) thinking that having two workers generating and loading the data in parallel might speed up the computation, and it did indeed bring the computation down to 2min10s.

My last test was to use 4 workers to see if this reasoning scales up, and unfortunately it doesn't; I guess it was obvious that I would eventually hit some physical limit.

In this case, you were most likely bound by CPU computation (e.g., Dask graph creation), which was probably overcome with two workers, but this may vary heavily with the nature of the computation and also the number of tasks you have.

Anyways, the problem here is that the LocalCluster option is pretty bad, since it requires each "user" to tune their chunk size to avoid any OOM. I'm thus currently trying to figure out what would be the best option to get the transparent memory management offered by the CUDAWorker without the resulting loss in performance. Do you have any experience to share about this kind of problem?

Just to be clear, are you really talking about LocalCluster, or did you mean LocalCUDACluster? Note that while the former may also work with CuPy, the more advanced CUDA-related features (such as device memory spilling) are only available in the latter. Also, when you say "chunksize", are you referring to the chunk size of your Dask arrays or to the device_memory_limit argument?

@matthieubulte
Contributor Author

Thank you for your comments.

CUDAWorker is only introduced in #35, which hasn't been merged yet. As I've mentioned before, we currently support the multi-level cache; for that case we don't need a CUDAWorker, we only pass a different data= argument to distributed.Worker, which handles the memory spilling for GPUs too. We only need a CUDAWorker for an active memory monitor, which is the fundamental difference between what's already in branch-0.8 and #35. In either case, you would no longer launch clusters/workers with LocalCluster or the command-line worker dask-worker, but would replace them with LocalCUDACluster or dask-cuda-worker, respectively.

Great, thanks. This distinction wasn't clear to me.

A bit unrelated, but this depends on your data. If for example you're working with Pandas DataFrames, you might want to check cuDF, which will allow you to load data directly (and faster) to the GPU.

Ah, good to know as well, thank you!

Just to be clear, are you really talking about LocalCluster, or did you mean LocalCUDACluster?

No, I really meant using a LocalCluster with multiple workers talking to one GPU, but this might not be necessary given your other comments.

Note that while the former may also work with CuPy, the more advanced CUDA-related features (such as device memory spilling) are only available in the latter.

Good to know, thanks.

Also, when you say "chunksize", are you referring to the chunk size of your Dask arrays or to the device_memory_limit argument?

I really mean the chunk size of my Dask array.

@pentschev
Member

I really mean the chunk size of my Dask array.

This is more of a Dask problem in itself, and we don't intend to customize things in that direction; this library is supposed to be an extension of distributed, and the way Dask is used will remain the same. As an alternative, you can use automatic chunking, but to achieve the best performance, some chunk size tweaking may be required by the user.
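
For example, a minimal sketch of automatic chunking (the 128 MiB target is just an illustrative value):

import dask
import dask.array as da

# Cap the per-chunk size via configuration and let Dask pick the chunk shape
with dask.config.set({"array.chunk-size": "128MiB"}):
    x = da.random.random((10**8, 200), chunks="auto")
    print(x.chunks[0][0], x.chunks[1][0])  # inspect the chunk sizes Dask chose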

@matthieubulte
Contributor Author

Ok, thank you very much for your answers. I will get back to work and see how far I get!
