Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

If a disk quota check errors out, we shouldn't fail the submission #539

Closed
shreyb opened this issue Feb 26, 2024 · 7 comments · Fixed by #546
Closed

If a disk quota check errors out, we shouldn't fail the submission #539

shreyb opened this issue Feb 26, 2024 · 7 comments · Fixed by #546
Milestone

Comments

@shreyb
Copy link
Collaborator

shreyb commented Feb 26, 2024

In 1.6, we added a disk space check to jobsub_lite. Right now, if that check fails (that is, there's any error in CHECKING the quota), jobsub_lite errors out (see ServiceNow incident INC000001168697). This was the traceback from that incident:

Traceback (most recent call last):
  File "/opt/jobsub_lite/bin/jobsub_submit", line 224, in <module>
    main()
  File "/opt/jobsub_lite/lib/tracing.py", line 172, in wrapper
    res = func(*args, **kwargs)
  File "/opt/jobsub_lite/bin/jobsub_submit", line 189, in main
    set_extras_n_fix_units(varg, schedd_name, cred_set)
  File "/opt/jobsub_lite/lib/utils.py", line 261, in set_extras_n_fix_units
    if not check_space(args["submitdir"], need_blocks):
  File "/opt/jobsub_lite/lib/utils.py", line 512, in check_space
    fs, d_ok = check_space_df(path, min_kblocks, min_files, verbose)
  File "/opt/jobsub_lite/lib/utils.py", line 544, in check_space_df
    availcol = headers.index("Available")
ValueError: 'Available' is not in list

I think this is a bug in the behavior, as the quota check could fail due to a number of reasons (like the user submitting it from the wrong dir) that might not actually fail in the jobs getting submitted.

I propose that if the running of the quota check errors out in this way (NOT if the check itself runs properly and returns that there is insufficient space), we should catch that error and print a warning that there was an error running the check, but that we'll try to submit anyway.

@shreyb shreyb added this to the 1.7 milestone Feb 26, 2024
@shreyb shreyb changed the title If a disk quota check fails, we shouldn't fail the submission If a disk quota check errors out, we shouldn't fail the submission Feb 26, 2024
@shreyb
Copy link
Collaborator Author

shreyb commented Feb 26, 2024

@marcmengel What do you think about this idea? If you agree to it, I don't think we need to bring it to the whole working group, since IMO it's really a bug and not a new feature request.

@marcmengel
Copy link
Contributor

This is clearly a bug, or possibly 2 or 3 convolved :-(. If I'm parsing the output, I should be setting
LC_ALL=C or some such and I'm not, second, even so there could be some reason we don't
get any output from the command, and we should just admit we can't check here and ignore it...

@marcmengel marcmengel mentioned this issue Feb 28, 2024
@marcmengel marcmengel linked a pull request Feb 28, 2024 that will close this issue
@retzkek
Copy link
Contributor

retzkek commented Feb 28, 2024

So I want to back up here. Why was the decision made to solve #506 by checking disk space and quota instead of improving the error handling of os errors? We should be looking to simplify whenever possible, not add more complexity. It’s supposed to be “lite” right?

@marcmengel
Copy link
Contributor

Checking disk space first is better, because filling up the disk and then noticing you've
done so also breaks other things -- it's a shared resource. Of course, we should do both -- handle the os errors and check first...

@retzkek
Copy link
Contributor

retzkek commented Feb 28, 2024

Filesystems have quotas to protect shared resources, and the error handling should be able to unwind and clean up, leaving the system in the same state it was before. That's where I'd like to see effort spent, instead of piling on complexity to try to foresee every thing, where the complexity itself causes more issues.

@shreyb
Copy link
Collaborator Author

shreyb commented Feb 28, 2024

So @retzkek you'd rather us catch the OSError when trying to write files here, for example:

with open(rendered_file, "w", encoding="UTF-8") as of:

And print out a helpful error message rather do any checks there at all? Am I understanding correctly?

@retzkek
Copy link
Contributor

retzkek commented Feb 29, 2024

Yes, and also go back through and clean up the files that were created before the error occurred (maybe that already happens higher up the stack?).

And I'm not arguing against any checks, only against complex checks that substantially increase the scope, surface area, or potential side-effects of what the code is supposed to be doing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants