-
-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Add large svd draft #38
Conversation
Cool, thanks @mrocklin. I took a scan through and looks good, I'll try to follow up with more detailed feedback. |
If you want to give it a try, see this notebook https://gist.github.com/mrocklin/3af8a428e18a3047ab501161ffe17cf9 It should run fine on the single-node GPU machine to which you have access today (sending the address by e-mail) |
No |
@quasiben @jakirkham @pentschev I suspect that you all are busy, but it would be fun to try this experiment again, now with the new and improved UCX integration. |
I gave this a whirl on a DGX2 (16GPUs) with NVLink enabled. Here's a performance report: In general things are looking ok -- I am a bit concerned with the NumPy transfers (don't know where they are coming from yet) and there is a fairly large gap in the middle of the workflow. But things worked! |
Probably metadata in UCX send/recv, which is mostly being addressed in dask/distributed#3732 . |
It would be nice to know how much was transferred (this is mostly a mental note to me to maybe improve diagnostics) |
Can I ask that you maybe also call |
Do you mean |
From the performance report it looks like we're spending essentially all of our time generating random numbers. There are a few things that we might experiment with here to improve the focus on SVD, and also give this example a bit more reality. cc'ing @alimanfoo, @eric-czech, and @jakirkham who I think might all care about this problem.
Anyway, those are some ideas. I'm actually pretty excited about trying them all. |
@quasiben I'm somewhat concerned about the 2GB/s bandwidth for cupy arrays. This seems low? I hope that once we remove the random number generation that this becomes our next bottleneck to squash. |
Oh indeed you did. My apologies. v is good. u is probably huge.
…On Wed, Apr 22, 2020 at 7:56 AM Benjamin Zaitlen ***@***.***> wrote:
Can I ask that you maybe also call v.compute(). That might slightly change
things.
Do you mean u.compute() ? I called v
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#38 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AACKZTFNO3ELCCMCEJ2NJ4DRN4ATVANCNFSM4H6J3EXQ>
.
|
FWIW in a real genomics problem the integers can take values 0, 1 or 2 (if working with a diploid organism like humans or mosquitoes). They are also not uniformly distributed. These may or may not make a different, depending on what you're testing. Biggest different I expect would be compressibility (real data will be much more compressible than uniform random). Give me a shout if you'd like some real data to play with, I can point you at some zarrs on GCS. |
Some zarr datasets on gcs would be great. We could probably download
locally.
…On Fri, Apr 24, 2020, 8:18 AM Alistair Miles ***@***.***> wrote:
1. We can mimic a genomics problem I think by using random integers
uniformly distributed between 0 and 3, stored as uint8.
FWIW in a real genomics problem the integers can take values 0, 1 or 2 (if
working with a diploid organism like humans or mosquitoes). They are also
not uniformly distributed. These may or may not make a different, depending
on what you're testing. Biggest different I expect would be compressibility
(real data will be much more compressible than uniform random). Give me a
shout if you'd like some real data to play with, I can point you at some
zarrs on GCS.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#38 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AACKZTCLYRVKJMBZ6QUJFQTROGUV3ANCNFSM4H6J3EXQ>
.
|
Here is an example using Zarr with dask array together to compress in-memory data. https://gist.github.com/mrocklin/ae18a1b51ccf90a2fa9872de23144ffa |
Here's a gist showing location of some genome variation data in zarr on GCS, and also the couple of preprocessing steps required before can be run through SVD: https://gist.github.com/alimanfoo/ef6d9e70750b5273f95cbf681d162de8 Hth :-) |
Btw this example is "tall and skinny" with ~10 million features and ~1,000 individuals, but the data are growing to become more square, e.g., biobank datasets are more like ~10 million features and ~1 million individuals. So don't optimise for the tall and skinny case. At some point would probably be useful to simulate a biobank-scale dataset. |
That's cunning :-) |
Yeah, I then followed it up by actually swapping out zarr with functions that do bit-twiddling def compress(x: np.ndarray) -> np.ndarray:
out = np.zeros_like(x, shape=(x.shape[0], x.shape[1] // 4))
out += x[:, 0::4]
out += x[:, 1::4] << 2
out += x[:, 2::4] << 4
out += x[:, 3::4] << 6
return out
def decompress(out: np.ndarray) -> np.ndarray:
back = np.zeros_like(out, shape=(out.shape[0], out.shape[1] * 4))
back[:, 0::4] = out & 0b00000011
back[:, 1::4] = (out & 0b00001100) >> 2
back[:, 2::4] = (out & 0b00110000) >> 4
back[:, 3::4] = (out & 0b11000000) >> 6
return back This is a bit faster, and it also works wtih CuPy (we don't have good compression algorithms with Zarr on GPU yet). This lets us store a dataset pretty compactly, and expand it on-demand. It didn't quite work, I suspect because of a known memory pressure issue with the dask scheduler (a few other groups have run into something similar, so I hope that there is good pressure to solve it). Speaking of CuPy, @quasiben and I were playing with a DGX-2 (16GPUs, half terrabyte of device memory) last night and solved something like a 10,000,000 by 20,000 svd in 10 seconds or so (if I recall correctly). That's after it was in RAM though. Probably the main cost will be collecting it from wherever it's stored.
That's good to know. So far it looks like we're not overly sensitive to chunking on the y-axis, assuming that we're comfortable with approximate results. |
@mrocklin do you have any thoughts on trying to do bitpacking like that with the dask API rather than as @alimanfoo mentioned experimenting with something like that in the past a bit but I'd love to know if you think doing lots of bit-fiddling at the dask level would be a bad practice. Apologies for a somewhat off-topic post but your example shook some thoughts loose so thanks for sharing it! |
In principle, yes, this could be made to work. Dask array isn't as smooth on in-place operations, so to make the packing process here as fast then we might want something like dask/dask#6123 . If you want to pack with map_blocks and then operate on the packed representation with the Dask array API though then yeah, I imagine that that would be easier. |
In general, the experiment with @quasiben was interesting in order to see how much of this data we could effectively operate on in a fixed amount of device memory. Things are very fast if we can stay in that RAM, but there are a lot of interesting complications to doing so. This was more for us to play with technology than to solve an actual science problem. This leads to a larger question of "what are the actual science problems here?" For example, you all may not even care about doing this quickly, but instead just care about being able to do it at all. I encourage folks to speak up about what would be motivating. |
I'd say for genomics the main driver is to be able to do it at all. If it can then be done in an interactive environment, even better, where "interactive time" is anything less than the time it takes to make a cup of tea. I.e., you don't need to leave your normal working environment (Jupyter notebook) or worry about any long-running tasks. Also probably relevant as analysis moves to the cloud is cost, i.e., it is not an expensive computation. Sorry this is a bit vague, but hopefully useful. Main point is we are not running SVDs over and over again all day, so this is not something where there is strong pressure to run as fast and cheaply as possible. But making it possible and straightforward to setup and run large SVDs is valuable. |
@alimanfoo can you make tea in 20 seconds ? In my last experiment I was able to push to 400GB of in memory data (10_000_000, 30_000). SVD on this size was also not greatly impacted. I measured a compute time of 19.3 seconds. This is probably close to the limit (as of now) of the amount of data we can naively perform an SVD on. To my snarky tea comment before, and as Matt said earlier, reading from disk will be somewhat burdensome. It took roughly a minute to generate the data in memory -- this will mostly like be longer when reading from disk and push us into tea construction time. There are two things which could be done here to alleviate performance and memory constraints:
Another interesting thing to note is that after loading data into memory, SVD is a communication heavy process. I didn't expect that. A distributed SVD spends more time transferring than computing. Using a box like a DGX2 is hugely impactful here as we can leverage NVLink for communication between GPUs (NVLink has a 50GB/s bandwidth) Breakdown of actions for an SVD
I also wanted to note that working on these problems is also of general interest to NVIDIA. |
It would be fun, I think, to do this on a practical problem (like the zarr
dataset that @alimanfoo pointed to earlier) on both a consumer laptop and a
DGX-2, and quantify all of the costs involved (reading from disk, moving to
GPU, transfer time, compute time). That might give us a sense about what a
range of performance looks like, and where technology efforts would need to
focus in the future to improve things.
…On Tue, Apr 28, 2020 at 7:05 AM Benjamin Zaitlen ***@***.***> wrote:
@alimanfoo <https://github.com/alimanfoo> can you make tea in 20sec ?
In my last experiment I was able to push to 400GB of in memory data
(10_000_000, 30_000). SVD on this size was also not greatly impacted. I
measured a compute time of 19.3 seconds. This is probably close to the
limit (as of now) of the amount of data we can naively perform an SVD on.
To my snarky tea comment before, and as Matt said earlier, reading from
disk will be somewhat burdensome. It took roughly a minute to generate the
data in memory -- this will mostly like be longer when reading from disk
and push us into tea construction time.
There are two things which *could* be done here to alleviate performance
and memory constraints:
- Optimized GPU readers from disk
<https://devblogs.nvidia.com/gpudirect-storage/> which integrate
seamlessly with Zarr or other formats
- Increase GPU RAM (don't know how hard this is)
Another interesting thing to note is that after loading data into memory,
SVD is a communication heavy process. I didn't expect that. A distributed
SVD spends more time transferring than computing. Using a box like a DGX2
is hugely impactful here as we can leverage NVLink for communication
between GPUs (NVLink has a 50GB/s bandwidth)
Breakdown of actions for an SVD
defaultdict(int,
{'compute': 55.52175450325012,
'transfer': 141.12734627723694,
'disk-read': 0.00934290885925293})
I also wanted to note that working on these problems is also of general
interest to NVIDIA.
ClaraGenomics would be an example of NVIDIA in the genetics space:
https://github.com/clara-genomics/Compute4COVID
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#38 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AACKZTEAFD4YM7TFC7EHFKLRO3PEJANCNFSM4H6J3EXQ>
.
|
Nope :-) Just to elaborate a little, obviously interactive is ideal, but tea break would be OK too. Another way to think about it might be, imagine a researcher who has access to a pangeo-like or saturncloud-like research computing environment, and can fire up notebook servers and dask-kubernetes clusters as needed based on standard node types offered by public cloud providers. They need to run an SVD over a large dataset, and ideally it's (a) convenient to fire up the necessary compute resources and setup the computation, (2) doesn't take too long to run, and (iii) cloud costs are small enough to be negligible, i.e., don't require any special accounting. Hth. |
Like this? Notebook: https://nbviewer.jupyter.org/urls/gist.githubusercontent.com/mrocklin/0c4e97b9164c55d85ce457fce0b87d41/raw/90a07e2f6fd1563aa65eb83f190e4c853494a4dd/coiled-compressed-svd.ipynb |
Added quasiben#1 with a bit more explanation of the genomics. |
Thanks @alimanfoo ! I added them directly here. Matt gave me write access and thought it best to centralize now that we can |
@mrocklin when you have some time can you take a pass over this ? |
_posts/2019-07-02-very-large-svds.md
Outdated
On A DGX2 we can calculate an SVD on a 200GB Dask array between 10 and15 seconds: [SVD Multi-GPU Notebook](https://gist.github.com/quasiben/98ee254920837313946f621e103d41f4) | ||
|
||
To see this run, we recommend viewing | ||
[the attached screencast](https://youtu.be/4X5yky2lvEw) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My computer may be borked? Can someone else check if this has audio?
Thanks @quasiben I took a look. Some high level thoughts:
|
@mrocklin thanks for the review. I just recently upgraded to Catalina and I think that changed various app permissions. I'll re-record and upload. I can also update the post with your suggestions as well. Thanks |
_posts/2019-07-02-very-large-svds.md
Outdated
layout: post | ||
title: Very Large SVDs | ||
tagline: Dask + CuPy + Zarr + Genomics | ||
author: Alistair Miles (Oxford Big Data Institute), Ben Zaitlen (NVIDIA), Matthew Rocklin (Coiled) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
author: Alistair Miles (Oxford Big Data Institute), Ben Zaitlen (NVIDIA), Matthew Rocklin (Coiled) | |
author: Ben Zaitlen (NVIDIA), Matthew Rocklin (Coiled), Alistair Miles (Oxford University) |
I don't deserve to be first author here :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should do it alphabetically by first name. I believe that happens to coincide with the current ordering of names
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would feel a bit strange for me, my contribution to this nice work is very minor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed the author order
This is looking really good. @mrocklin left a few small comments on the latest commits |
Co-authored-by: Benjamin Zaitlen <[email protected]>
Read through the post again and I'm good with posting. @alimanfoo ? |
LGTM :-) |
I will plan to merge this and tweet about it tomorrow morning US Eastern time |
cc @alimanfoo
I need to put in actual code samples and probably do a screencast, but in the mean time if you'd be willing to look through this post for the general structure I'd welcome the collaboration.