-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Return device_count=0 in error case #285
Conversation
I think this is a reasonable change and won't break anything, @germasch what do you think? Regarding the implementation, you can use @td-mpcdf do you imagine writing code that would fallback to running on CPU if device count is 0? |
We mainly need it to print a meaningful error message that the user can understand and knows what to do. Currently, if you run a GPU built CUDA code on a CPU partition, you get some error message about insufficient driver version which is not that helpful for a user. With the proposed changes one could write something more meaningful. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs fallbacks and SYCL implementation. I can do the SYCL implementation as a separate PR if you don't have a machine with SYCL, just let me know.
include/gtensor/backend_cuda.h
Outdated
fprintf(stderr, "Did you start the job on a CPU partition?\n"); | ||
device_count = 0; | ||
break; | ||
case cudaSuccess: break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need a fallback case that calls gtGpuCheck(code)
for all other errors.
include/gtensor/backend_hip.h
Outdated
/* Set silently the return value to 0 */ | ||
device_count = 0; | ||
break; | ||
case hipSuccess: break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also needs fallback with gtGpuCheck(code)
.
I do not have a sycl system, so if you could do it, it would be nice. The fallback code has been added now. |
Clang-format wants to put gtGpuCheck on one line, but that breaks the macro :P. I'm trying to figure out the least disruptive solution here. I am thinking maybe changing to an if/elseif/else clause. |
I think part of the problem is that gtensor dies on errors in get device count, which does not give the application a chance to interpret the error and decide whether it should die or do something else. |
work around limitations in gtGpuCheck macro
} else if (code == cudaErrorInsufficientDriver) { | ||
fprintf(stderr, "Error in cudaGetDeviceCount: %d (%s)\n", code, | ||
cudaGetErrorString(code)); | ||
fprintf(stderr, "Did you start the job on a CPU partition?\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is kind of an app specific message, but given we have no way of the application defining it's own message, I guess I am ok with it. Really begs the question as to whether we should propagate certain errors up to the caller.
Well, I'd say as usual, if you makes your life too easy at first, it comes back to bite you eventually. I think I was kinda aware that we're not really putting any work into error handling, and actually a lot of time in HPC I think that's just fine (e.g., if you get an error from a kernel execution, it's not likely there's a way you can recover, so you might as well abort right there). But in a case like this, I agree that it's not gtensor which should make the decision what to do, it should be left to the application (GENE). I guess I'm not opposed to the current PR as a bandaid, but it'd be nice to handle this better in the future, which probably means finding a way to return an error code (potentially breaking the API). An alternate solution for the issue at hand might be to have some additional function that one could call first to make sure that CUDA (or whatever) is set up correctly? Actually, how about returning -1 when an error occurs, as opposed to 0 if there's just no actual device. That way, the application could at least be a bit more explicit about what went wrong. A potential problem to be mindful of is that someone might mean to run GENE on the GPU partition but due to some config / build issue, the CUDA versions don't match, and so GENE could end up using, e.g., just 8 CPU cores per node rather than the 8 GPUs it can't access. So I think it's important to still abort (in GENE, which I think it what you're suggesting), rather than just running on CPU, which would very inefficiently eat up computing time... --Kai |
This PR at least allows the application to raise some kind of helpful error to user if gpu count is 0 and it's trying to run in GPU mode. I think we merge this and can refine further in future. It could raise an exception or we could define gtensor error codes that standardize across the backends (hip and cuda are largely similar but SYCL tends to use exceptions). |
Regarding negative numbers as errors, I think that is reasonable, and we could have codes for negative numbers. However 0 for the special case of no device or no driver seems reasonable to me. My vote is to merge as is and refine in future PR. |
For me it seems to be more natural if the get_device_count function returns just 0 in the case of cudaErrorNoDevice and cudaErrorInsufficientDriver. The handling of this case should happen on the applications level, but at the moment the gtensor code aborts.