[QST] dask-cuda memory management #89
That is correct. We do have an open PR (#35) for an active memory monitor as well (just as in Dask distributed), but for the time being we've opted for the cache system only, for simplicity of code, since we're still not absolutely certain that an active memory monitor would provide a huge benefit. Would you care to test that and provide feedback on whether it helps/solves your issue? Having some feedback would be great in helping us decide how to continue with this.
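For reference, here is a minimal sketch (not from the original comment) of how the cache system is enabled from the user's side; the `device_memory_limit` keyword mirrors the snippet later in this thread, while the `memory_limit` value and the exact threshold behaviour are assumptions that may differ between dask-cuda versions:

```python
# Minimal sketch of the two-level cache: data spills from GPU to host once
# device_memory_limit is exceeded, and from host to disk once memory_limit
# is exceeded.  The concrete limits below are illustrative only.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster(
    n_workers=1,
    device_memory_limit="7000 MiB",  # per-worker GPU memory before spilling to host
    memory_limit="24 GiB",           # per-worker host memory before spilling to disk
)
client = Client(cluster)
```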
I think he meant it in a different situation. If you have multiple workers sharing a single GPU, they would share its memory, making the problem likely worse. Feel free to correct me on this one @mrocklin.
Unfortunately, we don't have guidelines yet, in part because we're still uncertain about the ways people would use dask-cuda, but feel free to give us more details/share code; this would also help us understand what users would like to do. What we do have are some blog posts, but we don't claim they necessarily point to best practices or provide stable APIs: https://blog.dask.org/2019/01/03/dask-array-gpus-first-steps
Are you asking if you can use only the memory manager without the CUDA worker/cluster created here? If that's your question, I'm not experienced with Yarn, but I don't think you can just use that particular part of the repository with Yarn, at least not without porting the code to Yarn.
Thank you for this long answer @pentschev !
Yes, I saw this PR. To my understanding it is already partially merged and usable in 9.0, correct? Assuming that is the case, I have been using it, and it is exactly what I need in terms of stability. The only issue I had with it is that the data loading was very slow; I'll get to that in the second part of my comment.
Thanks for the clarification.
Thanks. I saw these resources already, they were very useful to better understand what's going on under the hood!
Sorry if my question wasn't clear there. What I really need from this repository is the

Going back to my situation and problem, let me give you some context. I'm currently looking for a good way to use GPUs to speed up fitting various statistical models. The end goal is a fast implementation of generalized linear mixed models capable of handling large-ish datasets (~10^9 observations) that can be shared with people with very different technical backgrounds. Right now I'm still looking into available tools and trying to build some understanding, but Dask + CuPy looks really good!

My testing code right now simply generates a random 160 GB Dask array on the CPU and sums its elements on the GPU. The reason I'm generating the data on the CPU is that, in a more realistic scenario, the data will have to be loaded by some custom logic on the CPU. The exact code I'm currently looking at is:

```python
import cupy as cp
import dask.array as da
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

n = 10**8
d = 200

cluster = LocalCUDACluster(ip='0.0.0.0', n_workers=1, device_memory_limit='7000 MiB')
client = Client(cluster)

da.random.random((n, d), chunks=(10**5, d))\
    .map_blocks(cp.asarray)\
    .sum().compute()
```

Running this on a

I then tried running the same code with a

My last test was to use 4 workers to see if this reasoning scales up, and unfortunately it doesn't; I guess it was obvious that I would eventually hit some physical limit. Anyways, the problem here is that the

Otherwise I would be very happy to continue working on that on my own and keep discussing this matter with you!

[1] https://aws.amazon.com/ec2/instance-types/
A bit unrelated, but this depends on your data. If for example you're working with Pandas DataFrames, you might want to check cuDF, which will allow you to load data directly (and faster) to the GPU.
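As an aside, a hedged illustration of the difference being pointed out (the file path here is a placeholder, not something from the thread):

```python
# Sketch: cuDF parses the file directly into GPU memory, avoiding the
# pandas round trip through host memory.
import cudf

gdf = cudf.read_csv("data.csv")  # placeholder path; loaded straight to the GPU

# CPU round trip for comparison:
# import pandas as pd
# gdf = cudf.from_pandas(pd.read_csv("data.csv"))
```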
That is a bit of a different case: it uses cuDF only, no CuPy involved. There the issue is that cuDF uses too much memory, and
In this case, you were most likely bound by CPU computation (e.g., Dask graph creation), which was probably overcome with two workers, but it can vary heavily with the nature of the computation and the number of tasks you have.
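To make the graph-creation cost concrete, here is a small illustrative sketch (not from the thread) using the array shape from the snippet above; larger chunks mean fewer blocks and therefore fewer tasks for the scheduler to build on the CPU:

```python
# Illustration: chunk size determines the number of blocks, and the task
# graph grows with the number of blocks Dask has to create and schedule.
import dask.array as da

n, d = 10**8, 200

small = da.random.random((n, d), chunks=(10**5, d))  # 1000 row-blocks
large = da.random.random((n, d), chunks=(10**6, d))  # 100 row-blocks

print(small.numblocks, len(small.__dask_graph__()))  # many more tasks
print(large.numblocks, len(large.__dask_graph__()))  # roughly 10x fewer tasks
```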
Just to be clear, are you really talking about
Thank you for your comments.
Great, thanks. This distinction wasn't clear to me.
Ah, good to know as well, thank you!
No, I really meant using a
Good to know, thanks.
I really mean the chunk size of my Dask array.
This is more of a Dask problem in itself, and we don't intend to customize things in that direction; this library is supposed to be an extension of
Ok, thank you very much for your answers. I will get back to work and see how far I get!
Hello. First of all thank you very much for the great work on this project, it's really very useful!
I'm currently exploring different options for fitting some simple (and later more complex) statistical models on GPUs, and one of the issues I was having concerned memory management.
When using Dask + CuPy, I often ran into OOM errors on the GPU. This can probably be explained by the fact that Dask has no idea how much memory is available on the GPU and thus how many chunks it can load at once. This is quite an issue because it would require manually handling the loading/unloading of chunks, which isn't very fun.
I came across this project and realized that you have implemented a two-level cache system to avoid filling the GPU's memory, which is exactly what I'm looking for. However, you also mentioned something else that confused me a little bit in #43
How would this solve the memory issue? If one worker loads too much data on the GPU by creating too many cupy arrays, then the problem will still be there.
As a more general question: while I understand this is still very much in development (and I would be very happy to contribute if necessary), are there some general guidelines for using Dask with GPUs?
And one last question: it seems to me that the only thing I currently need from this project is the memory management offered by the CUDA worker. Would it be reasonable to use only this part of the library without using the whole CUDA cluster? This would be useful for deployment on a YARN cluster, for instance.
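To illustrate the kind of reuse being asked about here, a purely hypothetical sketch: dask-cuda's spilling mapping (`DeviceHostFile`) plugged into a plain `distributed` Worker via its `data=` argument. The module path, constructor keywords, and limits are assumptions and may not match the version discussed in this issue:

```python
# Hypothetical sketch only -- reusing just the device/host spilling mapping
# with a plain distributed Worker, outside of LocalCUDACluster.
import asyncio

from distributed import Worker
from dask_cuda.device_host_file import DeviceHostFile  # assumed module path


async def run(scheduler_address: str) -> None:
    data = DeviceHostFile(
        device_memory_limit=7 * 1024**3,  # assumed keyword: GPU bytes before spilling to host
        memory_limit=24 * 1024**3,        # assumed keyword: host bytes before spilling to disk
    )
    async with Worker(scheduler_address, data=data) as worker:
        await worker.finished()


# asyncio.run(run("tcp://scheduler-address:8786"))
```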
Thanks!