
how to use multiple GPUs without horovod #1906

Open

flashal1 opened this issue Dec 5, 2024 · 10 comments

Comments

@flashal1

flashal1 commented Dec 5, 2024

Dear Lu Lu,
I have a problem. I am working on a school server on which I cannot install Horovod, but my model and data are too large to fit on a single GPU, and I don't know how DeepXDE loads data onto the GPU. Wrapping the model with torch.nn.parallel.DistributedDataParallel(model) causes an "unexpected keyword argument 'lr'" error, among other problems. Is there any other way to use multiple GPUs without Horovod? Thanks!

@lululxvi
Owner

torch.nn.parallel.DistributedDataParallel should also work, perhaps with some code modifications, but I am not familiar with it.

@pescap
Contributor

pescap commented Jan 22, 2025

Hi, data-parallel acceleration is currently supported only with Horovod + TensorFlow 1.x and random sampling of collocation points. "Horovod also supports PyTorch, TensorFlow 2.x, paving the way for multiple backend acceleration" (link).

You could either implement Horovod for the PyTorch backend or directly use PyTorch's DistributedDataParallel, depending on your preferences.

Since Horovod for TF 1.x is already implemented, it might be easier to port it to Horovod + PyTorch. However, the principles behind DistributedDataParallel seem very similar to those of Horovod.
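For reference, here is a minimal sketch of what the Horovod + PyTorch wiring usually looks like outside of DeepXDE (the network, loss, and script name below are placeholders, not DeepXDE's API):

```python
# Minimal Horovod + PyTorch data-parallel sketch (illustrative; not DeepXDE's API).
# Launch with: horovodrun -np 2 -H localhost:2 python train_hvd.py
import torch
import horovod.torch as hvd

hvd.init()                                  # one process per GPU
torch.cuda.set_device(hvd.local_rank())     # pin each process to its own GPU

model = torch.nn.Linear(2, 1).cuda()        # placeholder for your network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Average gradients across workers and start all workers from the same state.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for step in range(1000):
    x = torch.rand(1024, 2, device="cuda")  # each worker samples its own collocation points
    loss = ((model(x) - 1.0) ** 2).mean()   # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Conceptually this mirrors the existing TF 1.x path: each worker draws its own random collocation points and gradients are averaged across workers.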

@flashal1
Author

flashal1 commented Feb 1, 2025

Thanks a lot for your reply, and happy New Year. I tried torch.nn.parallel.DistributedDataParallel but could not get it to work with DeepXDE. However, I solved it with fourier-deeponet-fwi. Thanks a lot!

@flashal1
Author

flashal1 commented Feb 1, 2025

Thanks. I'll try Horovod later

@Chic-J

Chic-J commented Feb 11, 2025

Hello, could you please share how to train deepnet with multiple GPUs under PyTorch?

@flashal1
Author

I used two approaches. The first follows the DeepXDE documentation (https://deepxde.readthedocs.io/en/latest/user/parallel.html): install Horovod in the Anaconda environment and then train with $ horovodrun -np 2 -H localhost:2 python script.py. The second does not use DeepXDE: train with torch.nn.parallel.DistributedDataParallel (there are many tutorials online); you can either adapt the training part of DeepXDE's code or write it yourself.
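For anyone reading later, a rough sketch of the second approach with plain PyTorch DDP (the network, loss, and script name are placeholders; this replaces DeepXDE's training loop rather than plugging into it):

```python
# Minimal DistributedDataParallel sketch (illustrative; replaces DeepXDE's training loop).
# Launch with: torchrun --nproc_per_node=2 train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")          # torchrun sets RANK/WORLD_SIZE/LOCAL_RANK
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

net = torch.nn.Linear(2, 1).cuda()               # placeholder for your network (an nn.Module)
ddp_net = DDP(net, device_ids=[local_rank])      # wrap the bare nn.Module, not dde.Model
optimizer = torch.optim.Adam(ddp_net.parameters(), lr=1e-3)

for step in range(1000):
    x = torch.rand(1024, 2, device="cuda")       # each rank trains on its own shard/points
    loss = ((ddp_net(x) - 1.0) ** 2).mean()      # placeholder loss
    optimizer.zero_grad()
    loss.backward()                              # DDP averages gradients across ranks here
    optimizer.step()

dist.destroy_process_group()
```

Note that DDP expects a torch.nn.Module; wrapping DeepXDE's dde.Model wrapper instead of the underlying network (model.net) may be what triggers errors like the "unexpected keyword argument 'lr'" one mentioned above.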

@Chic-J

Chic-J commented Feb 13, 2025

Thank you for your reply. I'd like to ask: with DDP, PyTorch requires a distributed sampler, but DeepXDE does not use the Dataset class from the torch library. Have you run into this problem? I'd like to ask how you solved it.

@flashal1
Author

As I recall, I used torch.utils.data.DistributedSampler for data sharding and wrote my own class inheriting from torch.utils.data.Dataset to load the data. I also modified how the model loads data: DeepXDE reads it as a tuple, and I changed it to multiple inputs.
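Roughly, the sharding part described above can look like this (a sketch with made-up tensor data; the multi-input change to the network itself is not shown):

```python
# Sketch of data sharding with a custom Dataset + DistributedSampler (illustrative).
import torch
from torch.utils.data import Dataset, DataLoader, DistributedSampler

class PairDataset(Dataset):
    """Wraps pre-generated inputs/targets so each rank can load its own shard."""
    def __init__(self, inputs, targets):
        self.inputs = inputs      # e.g. inputs pre-generated as tensors
        self.targets = targets

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, idx):
        # Return separate tensors instead of DeepXDE's tuple-style batch.
        return self.inputs[idx], self.targets[idx]

# After dist.init_process_group(...) as in the DDP sketch above:
dataset = PairDataset(torch.rand(10000, 2), torch.rand(10000, 1))
sampler = DistributedSampler(dataset, shuffle=True)  # splits indices across ranks
loader = DataLoader(dataset, batch_size=256, sampler=sampler)

for epoch in range(10):
    sampler.set_epoch(epoch)      # reshuffle consistently across ranks each epoch
    for x, y in loader:
        x, y = x.cuda(), y.cuda()
        # forward/backward with the DDP-wrapped network goes here
```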

@Chic-J

Chic-J commented Feb 13, 2025

Thank you for clearing up my confusion. I thought so too; compared with TF, the PyTorch implementation is much more complicated.

@flashal1
Author

Sorry, I'm not familiar with TF, but I'm glad I could help.
