Return device_count=0 in error case #285

Merged (3 commits, Dec 8, 2023)
Conversation

@td-mpcdf (Contributor) commented Dec 1, 2023

To me it seems more natural for the get_device_count function to simply return 0 in the case of cudaErrorNoDevice and cudaErrorInsufficientDriver. Handling this case should happen at the application level, but at the moment the gtensor code aborts.
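A minimal sketch of the proposed behavior, assuming the CUDA backend and gtensor's gtGpuCheck error-handling macro mentioned below (the exact gtensor function name and signature may differ):

#include <cuda_runtime.h>

inline int get_device_count()
{
  int device_count;
  cudaError_t code = cudaGetDeviceCount(&device_count);
  if (code == cudaErrorNoDevice || code == cudaErrorInsufficientDriver) {
    // No usable GPU or driver: report 0 and let the application decide
    return 0;
  }
  gtGpuCheck(code); // standard error handling for any other failure
  return device_count;
}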

@bd4 (Contributor) commented Dec 1, 2023

I think this is a reasonable change and won't break anything. @germasch, what do you think?

Regarding the implementation, you can use gtGpuCheck(code) to trigger standard error handling if the code is not success or no-device (or anything else you want to special-case). This would also need to be implemented for SYCL so the behavior is consistent.

@td-mpcdf do you imagine writing code that would fall back to running on the CPU if the device count is 0?

@td-mpcdf (Contributor, Author) commented Dec 1, 2023

We mainly need it to print a meaningful error message that the user can understand and act on. Currently, if you run a GPU-built CUDA code on a CPU partition, you get an error message about an insufficient driver version, which is not that helpful to a user. With the proposed changes one could print something more meaningful.
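For illustration, application-level code along these lines becomes possible once the function reports 0 instead of aborting (the message text is just an example):

int device_count = get_device_count(); // the gtensor call discussed above
if (device_count == 0) {
  fprintf(stderr, "No usable GPU found. Did you submit this GPU build to a CPU partition?\n");
  exit(EXIT_FAILURE);
}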

@bd4 (Contributor) left a comment

Needs fallbacks and SYCL implementation. I can do the SYCL implementation as a separate PR if you don't have a machine with SYCL, just let me know.

fprintf(stderr, "Did you start the job on a CPU partition?\n");
device_count = 0;
break;
case cudaSuccess: break;
@bd4 (Contributor) commented:

Need a fallback case that calls gtGpuCheck(code) for all other errors.
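Roughly this shape (a sketch for the CUDA backend; the surrounding function is assumed from the diff context):

switch (code) {
case cudaErrorNoDevice:
case cudaErrorInsufficientDriver:
  fprintf(stderr, "Did you start the job on a CPU partition?\n");
  device_count = 0;
  break;
case cudaSuccess:
  break;
default:
  gtGpuCheck(code); // standard error handling for all other errors
  break;
}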

/* Silently set the return value to 0 */
device_count = 0;
break;
case hipSuccess: break;
@bd4 (Contributor) commented:

Also needs fallback with gtGpuCheck(code).
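That is, the HIP switch would gain the analogous default case, e.g. (sketch):

default:
  gtGpuCheck(code); // standard error handling for all other errors
  break;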

@td-mpcdf (Contributor, Author) commented Dec 8, 2023

Needs fallbacks and SYCL implementation. I can do the SYCL implementation as a separate PR if you don't have a machine with SYCL, just let me know.

I do not have a SYCL system, so if you could do it, that would be nice. The fallback code has been added now.

@bd4 (Contributor) commented Dec 8, 2023

Clang-format wants to put gtGpuCheck on one line, but that breaks the macro :P. I'm trying to figure out the least disruptive solution here. I'm thinking of maybe changing it to an if/else if/else clause.
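For reference, a sketch of the if/else form (assumed from the discussion; not necessarily the exact committed code):

if (code == cudaErrorNoDevice) {
  device_count = 0;
} else if (code == cudaErrorInsufficientDriver) {
  fprintf(stderr, "Error in cudaGetDeviceCount: %d (%s)\n", code,
          cudaGetErrorString(code));
  fprintf(stderr, "Did you start the job on a CPU partition?\n");
  device_count = 0;
} else if (code != cudaSuccess) {
  gtGpuCheck(code); // braced form keeps the macro on its own line
}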

@bd4 (Contributor) commented Dec 8, 2023

I think part of the problem is that gtensor dies on errors in get_device_count, which doesn't give the application a chance to interpret the error and decide whether it should die or do something else.

work around limitations in gtGpuCheck macro
} else if (code == cudaErrorInsufficientDriver) {
fprintf(stderr, "Error in cudaGetDeviceCount: %d (%s)\n", code,
cudaGetErrorString(code));
fprintf(stderr, "Did you start the job on a CPU partition?\n");
@bd4 (Contributor) commented:

This is kind of an app-specific message, but given that we have no way for the application to define its own message, I guess I am OK with it. It really begs the question of whether we should propagate certain errors up to the caller.

@germasch (Contributor) commented Dec 8, 2023

Well, I'd say, as usual: if you make your life too easy at first, it comes back to bite you eventually.

I think I was kinda aware that we're not really putting any work into error handling, and actually, a lot of the time in HPC I think that's just fine (e.g., if you get an error from a kernel execution, it's not likely there's a way you can recover, so you might as well abort right there).

But in a case like this, I agree that it's not gtensor that should make the decision about what to do; it should be left to the application (GENE). I guess I'm not opposed to the current PR as a band-aid, but it'd be nice to handle this better in the future, which probably means finding a way to return an error code (potentially breaking the API). An alternative solution for the issue at hand might be some additional function that one could call first to make sure that CUDA (or whatever backend) is set up correctly. Actually, how about returning -1 when an error occurs, as opposed to 0 if there's just no actual device? That way, the application could at least be a bit more explicit about what went wrong.
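As a sketch of that last idea (illustrative only; not what this PR implements):

#include <cuda_runtime.h>

inline int get_device_count()
{
  int device_count;
  cudaError_t code = cudaGetDeviceCount(&device_count);
  if (code == cudaErrorNoDevice) {
    return 0; // no GPU present, but the setup is otherwise fine
  }
  if (code != cudaSuccess) {
    return -1; // driver or configuration problem; the caller can report it
  }
  return device_count;
}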

A potential problem to be mindful of: someone might mean to run GENE on the GPU partition, but due to some config/build issue the CUDA versions don't match, and so GENE could end up using, e.g., just 8 CPU cores per node rather than the 8 GPUs it can't access. So I think it's important to still abort (in GENE, which I think is what you're suggesting), rather than just running on CPU, which would very inefficiently eat up computing time...

--Kai

@bd4 (Contributor) commented Dec 8, 2023

This PR at least allows the application to raise some kind of helpful error to the user if the GPU count is 0 and it's trying to run in GPU mode. I think we merge this and refine further in the future. It could raise an exception, or we could define gtensor error codes that are standardized across the backends (HIP and CUDA are largely similar, but SYCL tends to use exceptions).
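Purely as a sketch of that error-code idea (names are illustrative; nothing like this exists in gtensor yet):

// Hypothetical backend-agnostic status codes; each backend would map its
// native errors (cudaError_t, hipError_t, SYCL exceptions) onto these.
enum class gt_error {
  success = 0,         // call succeeded, count is valid
  no_device,           // no GPU present (e.g. cudaErrorNoDevice)
  insufficient_driver, // driver too old or missing
  other                // anything else; consult the backend-specific error
};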

@bd4 (Contributor) commented Dec 8, 2023

Regarding negative numbers as errors, I think that is reasonable, and we could have error codes for the negative numbers. However, 0 for the special case of no device or no driver seems reasonable to me. My vote is to merge as is and refine in a future PR.

@bd4 merged commit 473e685 into wdmapp:main on Dec 8, 2023. 18 checks passed.