NVIDIA GPU direct storage #51

Open
weiji14 opened this issue Sep 10, 2022 · 4 comments

@weiji14
Contributor

weiji14 commented Sep 10, 2022

Hi there,

I was wondering whether it's possible to enable NVIDIA GPU Direct Storage (GDS) on the Microsoft Planetary Computer. This could allow reading Zarr files from cloud storage directly into GPU memory, and we'd be excited to have a demo use case running (xref xarray-contrib/xbatcher#87).

Packages that need to be installed:

References:

Might need to check if the Azure cluster supports GPU direct storage first, but if it does, I can open PRs to add these to the PyTorch and/or TensorFlow containers 😄

@TomAugspurger

> Might need to check if the Azure cluster supports GPU direct storage first

Yeah, is there an easy way to verify this? Maybe @quasiben has an idea of what hardware/networking combination might work?

@weiji14
Contributor Author

weiji14 commented Sep 16, 2022

There's this script https://github.com/rapidsai/kvikio/blob/29c52f76035002d91f301895250c0ff14f18f50a/python/benchmarks/single-node-io.py to check for GDS compatibility. You might need to install a few other packages to fix ImportErrors, but the gist is:

wget https://raw.githubusercontent.com/rapidsai/kvikio/29c52f76035002d91f301895250c0ff14f18f50a/python/benchmarks/single-node-io.py
python single-node-io.py
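Separately from the ImportError fixes, installing pynvml lets the script fill in the GPU name, total memory and BAR1 size, which otherwise print as "Unknown (install pynvml)". A minimal sketch of pulling the same fields yourself, assuming the pynvml bindings to NVIDIA's NVML library are installed; `gpu_summary` is a hypothetical helper, not part of the kvikio script:

```python
def gpu_summary():
    """Return name, total memory and BAR1 size for each GPU, or None if NVML is unavailable."""
    try:
        import pynvml  # optional dependency; hypothetical usage, not from the kvikio script
    except ImportError:
        return None  # the benchmark prints "Unknown (install pynvml)" in this case
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return None  # e.g. no NVIDIA driver or GPU visible to this process
    try:
        gpus = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):  # older pynvml releases return bytes
                name = name.decode()
            gpus.append({
                "name": name,
                "memory_total_bytes": pynvml.nvmlDeviceGetMemoryInfo(handle).total,
                "bar1_total_bytes": pynvml.nvmlDeviceGetBAR1MemoryInfo(handle).bar1Total,
            })
        return gpus
    finally:
        pynvml.nvmlShutdown()  # release NVML even if a device query fails


if __name__ == "__main__":
    print(gpu_summary())
```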

These are the results I got on the Microsoft Planetary Computer PyTorch container (copied from xarray-contrib/xbatcher#87 (comment)):

----------------------------------
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
   WARNING - KvikIO compat mode   
      libcufile.so not used       
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
GPU               | Unknown (install pynvml)
GPU Memory Total  | Unknown (install pynvml)
BAR1 Memory Total | Unknown (install pynvml)
GDS driver        | N/A (Compatibility Mode)
GDS config.json   | /etc/cufile.json
----------------------------------
nbytes            | 10485760 bytes (10.00 MiB)
4K aligned        | True
pre-reg-buf       | True
diretory          | /tmp/tmp9a8nd5kz
nthreads          | 1
nruns             | 1
==================================
cufile read       |   4.28 GiB/s
cufile write      |  92.59 MiB/s
posix read        |   1.23 GiB/s
posix write       |   1.24 GiB/s
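The warning at the top matters when reading these numbers: KvikIO ran in compat mode and libcufile.so was not used, so the "cufile" rows do not measure GPU Direct Storage. A small sketch that makes this check explicit and normalises every throughput row to MiB/s; the row format is assumed from the output above (it is not a documented kvikio interface), and `parse_report`, `to_mib_per_s` and `summarize` are hypothetical helpers:

```python
import re

# Conversion factors to MiB/s for the units the report prints.
TO_MIB = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}


def parse_report(text: str) -> dict:
    """Collect every 'key | value' row of the benchmark report into a dict."""
    fields = {}
    for line in text.splitlines():
        if "|" in line:
            key, _, value = line.partition("|")
            fields[key.strip()] = value.strip()
    return fields


def to_mib_per_s(value: str) -> float:
    """Convert a throughput string such as '4.28 GiB/s' to MiB/s."""
    match = re.fullmatch(r"([\d.]+)\s*(KiB|MiB|GiB)/s", value)
    if match is None:
        raise ValueError(f"unrecognised throughput: {value!r}")
    return float(match.group(1)) * TO_MIB[match.group(2)]


def summarize(text: str) -> dict:
    """Flag compat mode and convert all throughput rows to MiB/s."""
    fields = parse_report(text)
    return {
        # 'N/A (Compatibility Mode)' accompanies the "libcufile.so not used" warning
        "compat_mode": "Compatibility Mode" in fields.get("GDS driver", ""),
        "mib_per_s": {k: to_mib_per_s(v) for k, v in fields.items() if v.endswith("/s")},
    }


if __name__ == "__main__":
    report = """\
GDS driver        | N/A (Compatibility Mode)
cufile read       |   4.28 GiB/s
cufile write      |  92.59 MiB/s
posix read        |   1.23 GiB/s
posix write       |   1.24 GiB/s
"""
    summary = summarize(report)
    print(summary["compat_mode"])  # True
    for name, speed in summary["mib_per_s"].items():
        print(f"{name:12s} {speed:8.1f} MiB/s")
```

On the report above this flags compat mode and gives roughly 4383 MiB/s for cufile read versus roughly 1260 MiB/s for posix read, a comparison between KvikIO's fallback path and plain POSIX I/O rather than a GDS measurement.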

I don't have sudo permissions, but if you have time, maybe try sudo apt install nvidia-gds on the staging container and see if NVIDIA GPU Direct Storage is supported?
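After installing nvidia-gds, a quick user-space sanity check could look for the two pieces the benchmark depends on: the nvidia_fs kernel module and the cuFile library (libcufile.so). This is only a heuristic sketch with hypothetical helper names; because containers share the host kernel, nvidia_fs has to be loaded on the host, and finding both pieces still doesn't prove GDS works with the underlying Azure storage:

```python
import ctypes.util
from pathlib import Path


def nvidia_fs_loaded(modules_file: str = "/proc/modules") -> bool:
    """True if the nvidia_fs kernel module is listed in /proc/modules."""
    try:
        lines = Path(modules_file).read_text().splitlines()
    except OSError:  # file missing or unreadable (non-Linux, sandboxed, ...)
        return False
    # Each line of /proc/modules starts with the module name.
    return any(line.split()[0] == "nvidia_fs" for line in lines if line.strip())


def gds_prerequisites() -> dict:
    """Report whether the GDS kernel module and the cuFile library are visible."""
    return {
        "nvidia_fs_loaded": nvidia_fs_loaded(),
        # find_library returns a library name string, or None if cufile isn't found
        "libcufile": ctypes.util.find_library("cufile"),
    }


if __name__ == "__main__":
    print(gds_prerequisites())
```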

@quasiben

Unfortunately, I don't think GDS is supported on cloud infrastructure (even with mounted NVMe), but the GDS team is working on it. @cnewburn, can you comment with additional thoughts?

@quasiben

I spoke with the GDS team, and they are working on addressing this issue. We expect this to be available in the next CUDA release.
