If a disk quota check errors out, we shouldn't fail the submission #539
Comments
@marcmengel What do you think about this idea? If you agree to it, I don't think we need to bring it to the whole working group, since IMO it's really a bug and not a new feature request.
This is clearly a bug, or possibly 2 or 3 convolved :-(. If I'm parsing the output, I should be setting
So I want to back up here. Why was the decision made to solve #506 by checking disk space and quota instead of improving the error handling of OS errors? We should be looking to simplify whenever possible, not add more complexity. It's supposed to be "lite", right?
Checking disk space first is better, because filling up the disk and then noticing you've
Filesystems have quotas to protect shared resources, and the error handling should be able to unwind and clean up, leaving the system in the same state it was in before. That's where I'd like to see effort spent, instead of piling on complexity to try to foresee everything, where the complexity itself causes more issues.
So @retzkek, you'd rather we catch the error at jobsub_lite/lib/render_files.py (line 77 in fcdb272) and print out a helpful error message, rather than do any checks there at all? Am I understanding correctly?
Yes, and also go back through and clean up the files that were created before the error occurred (maybe that already happens higher up the stack?). And I'm not arguing against any checks, only against complex checks that substantially increase the scope, surface area, or potential side-effects of what the code is supposed to be doing.
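To make the "catch, clean up, and report" idea concrete, here is a rough sketch. It is not the actual code in render_files.py: `render_all`, its arguments, and the error message are all hypothetical stand-ins, and it assumes the failure surfaces as an `OSError` (which covers disk-full and quota-exceeded errors) while the files are being written.

```python
from pathlib import Path


def render_all(dest_dir, contents):
    """Write each {filename: text} pair under dest_dir.

    Hypothetical stand-in for the rendering step. On any OSError,
    remove the files created so far and raise a readable error.
    """
    created = []
    try:
        for name, text in contents.items():
            path = Path(dest_dir) / name
            # Record the path before writing so a partially written
            # file is also removed during cleanup.
            created.append(path)
            with open(path, "w") as f:
                f.write(text)
    except OSError as e:
        for path in created:
            try:
                path.unlink()
            except FileNotFoundError:
                pass  # never got created, nothing to remove
        raise RuntimeError(
            f"Unable to write submission files to {dest_dir}: {e}. "
            "Check available disk space and quota."
        ) from e
    return created
```

The point of the sketch is only that the unwind lives next to the write, so no separate up-front check is needed to leave the directory in its original state.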
In 1.6, we added a disk space check to jobsub_lite. Right now, if that check fails (that is, there's any error in CHECKING the quota), jobsub_lite errors out (see ServiceNow incident INC000001168697). This was the traceback from that incident:
I think this is a bug in the behavior, as the quota check could fail for a number of reasons (like the user submitting from the wrong directory) that would not necessarily cause the job submission itself to fail.
I propose that if running the quota check errors out in this way (NOT if the check itself runs properly and reports that there is insufficient space), we should catch that error and print a warning that there was an error running the check, but that we'll try to submit anyway.
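A minimal sketch of the proposed behavior, under stated assumptions: `check_space`, `run_check`, and `InsufficientSpaceError` are hypothetical names, not the existing jobsub_lite API, and the exception types a failed check would actually raise are a guess. The distinction it illustrates is that a check that cannot run only warns, while a check that runs and finds too little space still stops the submission.

```python
import subprocess
import sys


class InsufficientSpaceError(Exception):
    """The check ran properly and reported too little space."""


def check_space(path, needed_bytes, run_check):
    """Return True if the check ran and passed, False if it could not run.

    run_check is a hypothetical callable that returns the available bytes
    for path and raises if the quota/space check itself fails to run.
    """
    try:
        available = run_check(path)
    except (OSError, subprocess.SubprocessError, ValueError) as e:
        # The check itself failed: warn, but let the submission proceed.
        print(
            f"Warning: error running disk quota check for {path}: {e}. "
            "Will try to submit anyway.",
            file=sys.stderr,
        )
        return False
    if available < needed_bytes:
        # The check ran properly and found insufficient space: still fatal.
        raise InsufficientSpaceError(
            f"{path}: need {needed_bytes} bytes, only {available} available"
        )
    return True
```

A caller would treat a `False` return as "unknown" and continue on to submission.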