how to use multiple GPUs without horovod #1906
Hi, data-parallel acceleration is currently supported only with Horovod + TensorFlow 1.x and random sampling of collocation points. "Horovod also supports PyTorch, TensorFlow 2.x, paving the way for multiple backend acceleration" (link). You could either implement Horovod for the PyTorch backend or directly use PyTorch's DistributedDataParallel. Since Horovod for TF 1.x is already implemented, it might be easier to port it to Horovod + PyTorch; the principles behind the two approaches are similar.
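For reference, a minimal sketch of the plain-PyTorch DDP route mentioned above (the network, data, and hyperparameters are placeholders, not DeepXDE code; launch with something like `torchrun --nproc_per_node=2 script.py`):

```python
# Minimal DistributedDataParallel sketch: one process per GPU.
# Everything here (network, data, loop length) is a placeholder.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")  # torchrun sets the env:// rendezvous vars
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"

    model = torch.nn.Linear(2, 1).to(device)  # placeholder network
    ddp_model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.Adam(ddp_model.parameters(), lr=1e-3)
    x = torch.randn(64, 2, device=device)  # placeholder batch
    y = torch.zeros(64, 1, device=device)
    for _ in range(10):
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(ddp_model(x), y)
        loss.backward()  # gradients are all-reduced across ranks here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```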
Thanks a lot for your reply. Happy new year. I tried torch.nn.parallel.DistributedDataParallel but couldn't get it to work with deepxde. However, I solved it with fourier-deeponet-fwi. Thanks a lot!
Thanks. I'll try Horovod later.
Hello, could you share how you trained deepnet with multiple GPUs under PyTorch?
I used two approaches. The first follows the deepxde documentation (https://deepxde.readthedocs.io/en/latest/user/parallel.html): install Horovod in an Anaconda environment, then train with `$ horovodrun -np 2 -H localhost:2 python script.py`. The second skips deepxde and trains with torch.nn.parallel.DistributedDataParallel (there are many tutorials online); you can either adapt the training part of deepxde's code or write your own.
Thank you for the reply. One question: with DDP, PyTorch requires a distributed sampler, but deepxde does not use the Dataset class from the torch library. Did you run into this problem? I'd like to ask how you solved it.
As I recall, I used torch.utils.data.DistributedSampler for data sharding and wrote my own class inheriting from torch.utils.data.Dataset to load the data (a minimal sketch of that setup is below). I also modified how the model loads data: deepxde reads inputs as a tuple, which I changed to multiple inputs.
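A minimal sketch of the sharding setup described above; the dataset contents and batch size are placeholder assumptions, and it presumes the process group has already been initialized (e.g. via torchrun):

```python
# Sketch: shard data across ranks with DistributedSampler and a custom Dataset.
# Placeholder data; requires torch.distributed to be initialized already.
import torch
from torch.utils.data import Dataset, DataLoader, DistributedSampler

class PointDataset(Dataset):
    """Placeholder dataset wrapping (inputs, targets) tensors."""
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

dataset = PointDataset(torch.randn(1024, 2), torch.randn(1024, 1))
sampler = DistributedSampler(dataset, shuffle=True)  # each rank sees one shard
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(5):
    sampler.set_epoch(epoch)  # reshuffle the shards each epoch
    for xb, yb in loader:
        pass  # training step on this rank's shard goes here
```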
Thank you for clearing that up. I suspected as much; compared with tf, the PyTorch implementation is much more involved.
Sorry, I'm not familiar with tf, but I'm glad I could help.
Dear Lu Lu:
I have a problem. I am working on a school server on which I failed to install Horovod. However, my model and data are too large to fit on a single GPU, and I don't know how deepxde loads data onto the GPU. Using torch.nn.parallel.DistributedDataParallel(model) causes the error "unexpected keyword argument 'lr'", among other problems. I would like to know whether there is any other solution for using multiple GPUs without Horovod. Thanks!