reduce_op breaks on empty local tensor for some reduce operations #369
@ClaudiaComito Do you have any idea what an elegant solution might look like?
@TheSlimvReal thanks, and sorry I missed this message. I'll look into this.
Hi again @TheSlimvReal. I could reproduce the error, although not 100% (my system simply gets stuck, it doesn't throw an exception; possibly I just haven't waited long enough). After researching a bit, I'm not sure that we should do anything at all, apart from maybe raising a warning when chunking and an exception when calling reduce_op. At that point we already know that some nodes have no data and will run into trouble. I haven't been able to find a way to exclude nodes from MPI collective operations. I've been playing around with the option of using comm.Exscan instead of Allreduce. The result on each rank k would then be MPI.OP applied to all ranks from 0 to k, which is what we want. We could probably do this, but I'm not sure we should. Does anybody among the MPI experts have a good solution? @d1saster @coquelin77 @Cdebus ?
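To illustrate the prefix-reduction idea being discussed, here is a pure-Python simulation (not actual MPI code; ranks are simulated as list entries, and the neutral-element substitution for an empty rank is an assumption of this sketch):

```python
from itertools import accumulate
import operator

# Simulated local values on 4 ranks; rank 2 holds an empty local tensor.
# Substituting a neutral element (0 for sum) for the empty rank, an
# inclusive prefix reduction then gives each rank k the result of the
# operation applied to ranks 0..k -- the semantics sketched above.
local_values = [3, 5, None, 7]   # None marks an empty local tensor
neutral = 0                      # neutral element for sum
filled = [v if v is not None else neutral for v in local_values]
prefix = list(accumulate(filled, operator.add))
print(prefix)  # prefix[k] == sum over ranks 0..k -> [3, 8, 8, 15]
```

Rank 2 contributes nothing, yet every rank still receives a well-defined partial result.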
Yeah, I've noticed some buggy operations where there is no data on a process. However, it is my belief that this is something that, while not good, is okay. Since we are targeting large datasets for our analyses, I believe we can disregard these errors on a global scale and instead solve these problems within the functions which might create empty tensors.
Hi @coquelin77, I agree, although in practice the only way of dealing with it would be for the factories to throw an exception when the distribution leaves empty nodes. Or can anybody think of another way? I'm not worried about anybody wanting to run calculations on a 3x3 tensor on 4 nodes. I worry more about the cases when you're running your job on n nodes, and the tensor size after convolution, max pooling and whatnot ends up being n-1 along the split axis. But I'm not sure how frequent this is going to be.
Is this resolved by the introduction of the neutral element? |
Yes I think so! Thanks |
Description
Some of the torch reduce operations behave differently when passed an empty tensor. For example, torch.sum returns 0, but torch.max or torch.min throw an exception when an empty tensor is given as argument.

To Reproduce
Steps to reproduce the behavior:
Run with 4 processes:
All methods using the 'reduce_op' function
Some of the internally used torch functions do not work on empty local tensors.
RuntimeError: invalid argument 1: cannot perform reduction function max on tensor with no elements because the operation does not have an identity at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:189
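The same asymmetry can be seen with Python builtins, which mirror torch's behavior here (a pure-Python analogue, not the torch code itself): sum of an empty sequence is 0 because summation has an identity element, while max has none and raises:

```python
# Analogue of the torch behavior using Python builtins:
# sum() has an identity (0), so it handles empty input gracefully;
# max() has no identity and raises, just like torch.max on an empty tensor.
print(sum([]))  # 0

try:
    max([])
except ValueError as e:
    print("max failed:", e)
```

This is exactly the distinction the error message above points at: the reduction "does not have an identity".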
Expected behavior
The local functions should return a neutral value that does not affect the result (like 0 for sum). The problem is defining this neutral element for some of the other functions, and I cannot think of a general way to do so. The fact is that the MPI Allreduce breaks when not every process provides values.
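One way to sidestep this, as a sketch rather than the actual heat implementation (the neutral_elements table and the local_reduce helper below are hypothetical names), is to map each reduce operation to a neutral element that empty processes contribute instead of real data:

```python
import math

# Hypothetical table mapping each reduction to its neutral (identity) element.
neutral_elements = {
    "sum": 0.0,
    "prod": 1.0,
    "max": -math.inf,  # any real value beats -inf
    "min": math.inf,   # any real value beats +inf
}

def local_reduce(values, op):
    """Reduce a (possibly empty) local chunk, falling back to the
    neutral element so the global reduction result is unaffected."""
    if not values:
        return neutral_elements[op]
    fns = {"sum": sum, "prod": math.prod, "max": max, "min": min}
    return fns[op](values)

# A rank with data and an "empty" rank both contribute safely:
print(local_reduce([2.0, 5.0], "max"))  # 5.0
print(local_reduce([], "max"))          # -inf, neutral for max
```

For sum, prod, max, and min such identities exist; the open question raised above is whether every reduce operation admits one.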