
how to use multiple GPUs without horovod #1906

Open

flashal1 opened this issue Dec 5, 2024 · 10 comments

Comments

@flashal1

flashal1 commented Dec 5, 2024

Dear Lu Lu,
I have a problem. I am working on a school server on which I cannot install Horovod, but my model and data are too large to fit on a single GPU, and I don't know how DeepXDE loads data onto the GPU. Wrapping the model with torch.nn.parallel.DistributedDataParallel(model) causes an "unexpected keyword argument 'lr'" error, among other problems. Is there any other way to use multiple GPUs without Horovod? Thanks!

@lululxvi
Owner

torch.nn.parallel.DistributedDataParallel should also work, perhaps with some code modifications, but I am not familiar with it.

@pescap
Contributor

pescap commented Jan 22, 2025

Hi, data-parallel acceleration is currently supported only with Horovod + TensorFlow 1.x and random sampling of collocation points. "Horovod also supports PyTorch, TensorFlow 2.x, paving the way for multiple backend acceleration" (link).

You could either implement Horovod for the PyTorch backend or directly use PyTorch's DistributedDataParallel, depending on your preferences.

Since Horovod for TF 1.x is already implemented, it might be easier to port it to Horovod + PyTorch. However, the principles behind DistributedDataParallel seem very similar to those of Horovod.
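For reference, here is a minimal sketch of what the Horovod + PyTorch wiring usually looks like outside of DeepXDE (the network, loss, and script name below are placeholders, not DeepXDE's API):

```python
# Minimal Horovod + PyTorch data-parallel sketch (illustrative; not DeepXDE's API).
# Launch with: horovodrun -np 2 -H localhost:2 python train_hvd.py
import torch
import horovod.torch as hvd

hvd.init()                                  # one process per GPU
torch.cuda.set_device(hvd.local_rank())     # pin each process to its own GPU

model = torch.nn.Linear(2, 1).cuda()        # placeholder for your network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Average gradients across workers and start all workers from the same state.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for step in range(1000):
    x = torch.rand(1024, 2, device="cuda")  # each worker samples its own collocation points
    loss = ((model(x) - 1.0) ** 2).mean()   # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Conceptually this mirrors the existing TF 1.x path: each worker draws its own random collocation points and gradients are averaged across workers.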

@flashal1
Author

flashal1 commented Feb 1, 2025

Thanks a lot for your reply, and happy New Year. I tried torch.nn.parallel.DistributedDataParallel but could not get it to work with DeepXDE. However, I solved it with fourier-deeponet-fwi. Thanks a lot!

@flashal1
Author

flashal1 commented Feb 1, 2025

Thanks. I'll try Horovod later

@Chic-J

Chic-J commented Feb 11, 2025

Hello, could you please share how to train deepnet with multiple GPUs under PyTorch?

@flashal1
Author

I used two approaches. The first follows the DeepXDE documentation (https://deepxde.readthedocs.io/en/latest/user/parallel.html): install Horovod in the Anaconda environment and then train with $ horovodrun -np 2 -H localhost:2 python script.py. The second does not use DeepXDE: train with torch.nn.parallel.DistributedDataParallel (there are many tutorials online); you can either adapt the training part of DeepXDE's code or write it yourself.
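For anyone reading later, a rough sketch of the second approach with plain PyTorch DDP (the network, loss, and script name are placeholders; this replaces DeepXDE's training loop rather than plugging into it):

```python
# Minimal DistributedDataParallel sketch (illustrative; replaces DeepXDE's training loop).
# Launch with: torchrun --nproc_per_node=2 train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")          # torchrun sets RANK/WORLD_SIZE/LOCAL_RANK
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

net = torch.nn.Linear(2, 1).cuda()               # placeholder for your network (an nn.Module)
ddp_net = DDP(net, device_ids=[local_rank])      # wrap the bare nn.Module, not dde.Model
optimizer = torch.optim.Adam(ddp_net.parameters(), lr=1e-3)

for step in range(1000):
    x = torch.rand(1024, 2, device="cuda")       # each rank trains on its own shard/points
    loss = ((ddp_net(x) - 1.0) ** 2).mean()      # placeholder loss
    optimizer.zero_grad()
    loss.backward()                              # DDP averages gradients across ranks here
    optimizer.step()

dist.destroy_process_group()
```

Note that DDP expects a torch.nn.Module; wrapping DeepXDE's dde.Model wrapper instead of the underlying network (model.net) may be what triggers errors like the "unexpected keyword argument 'lr'" one mentioned above.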

@Chic-J

Chic-J commented Feb 13, 2025

Thank you for your reply. I'd like to ask: with DDP, PyTorch requires a distributed sampler, but DeepXDE does not use the Dataset class from the torch library. Have you run into this problem? I'd like to ask how you solved it.

@flashal1
Author

As I recall, I used torch.utils.data.DistributedSampler for data sharding and wrote my own class inheriting from torch.utils.data.Dataset to load the data. I also modified how the model loads data: DeepXDE reads it as a tuple, and I changed it to multiple inputs.
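Roughly, the sharding part described above can look like this (a sketch with made-up tensor data; the multi-input change to the network itself is not shown):

```python
# Sketch of data sharding with a custom Dataset + DistributedSampler (illustrative).
import torch
from torch.utils.data import Dataset, DataLoader, DistributedSampler

class PairDataset(Dataset):
    """Wraps pre-generated inputs/targets so each rank can load its own shard."""
    def __init__(self, inputs, targets):
        self.inputs = inputs      # e.g. inputs pre-generated as tensors
        self.targets = targets

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, idx):
        # Return separate tensors instead of DeepXDE's tuple-style batch.
        return self.inputs[idx], self.targets[idx]

# After dist.init_process_group(...) as in the DDP sketch above:
dataset = PairDataset(torch.rand(10000, 2), torch.rand(10000, 1))
sampler = DistributedSampler(dataset, shuffle=True)  # splits indices across ranks
loader = DataLoader(dataset, batch_size=256, sampler=sampler)

for epoch in range(10):
    sampler.set_epoch(epoch)      # reshuffle consistently across ranks each epoch
    for x, y in loader:
        x, y = x.cuda(), y.cuda()
        # forward/backward with the DDP-wrapped network goes here
```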

@Chic-J

Chic-J commented Feb 13, 2025

Thank you for clearing up my confusion. I thought so too; compared with TF, the PyTorch implementation is much more complicated.

@flashal1
Author

Sorry, I'm not familiar with TF, but I'm glad I could help.
