-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sync reopen_MOM_file existence inquiry #328
Conversation
Codecov Report
@@ Coverage Diff @@
## dev/gfdl #328 +/- ##
============================================
- Coverage 37.16% 37.16% -0.01%
============================================
Files 265 265
Lines 74505 74508 +3
Branches 13837 13839 +2
============================================
+ Hits 27690 27691 +1
- Misses 41720 41723 +3
+ Partials 5095 5094 -1
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@marshallward, I see how this new sync_PEs()
call is helpful or necessary if all of the PEs are calling reopen_MOM_file
, but wouldn't this create an unbalanced barrier (and cause the model to hang) if only a single PE were doing the writing and hence calling reopen_MOM_file
? I would anticipate that only having one PE call this routine would be perfectly sensible, if for example, only globally integrated data were being output. Perhaps this sync_PEs()
call is only necessary if thread /= SINGLE_FILE
?
Perhaps we should change the logic in the routines that call reopen_MOM_file so that only the root PE (or whichever PE actually does the writing to a single file) is making the calls to open files, etc. This would avoid the need for an extra barrier, and it could be more efficient by reducing resource contention.
Alternately, would it make sense to add an argument to reopen_MOM_file()
and create_MOM_file()
to indicate whether this is a single-PE call, and skip the sync if it is?
AFAIK, these functions ( When there is only a single file written by a single PE, such as We could make these functions more robust so that they could reside within blocks called by a single PE or an arbitrary subset of PEs (which is what I believe you are suggesting), but in that case I think we'd have to abandon the whole "open if present, create if absent" feature of this function, which yields some of the problems that you're raising. One alternative is to take the FMS approach and make the underlying netCDF layer more fault-tolerant to those scenarios. But that won't change the fact that the value of We could, as you suggest, use a flag to control this, but I'd think it just complicates and masks over the underlying problem. Regarding your question:
I think if this were the case, then it would have been detected by the regression suite, right? (Perhaps also worth noting that there is literally only one instance of this function.) |
I have tested whether the code would work if all of the (single file) netcdf IO calls from MOM_sum_output.F90 (the only place where Usually we seek the most parsimonious solution to a problem. In this case I think that which is most parsimonious is open for debate. Do we go for the fewest added or modified lines of code (the proposed change in this PR wins the code-golf prize), or do we try to minimize the number of stooges we are trying to send through a narrow door? |
Are you suggesting that it be used like this?
I suppose that would work, but it doesn't make In my opinion, the real problem is the attempt to do an |
d087110
to
bb43ac7
Compare
I edited the function to do something closer to your suggestion. It can now be called by either the root PE or a larger PE set containing the root. It now explicitly does not support threaded IO. I would suggest we leave it for now, and come back to this problem if the need arises. |
reopen_MOM_file includes an inquire test to determine whether we are attempting to reopen an existing file. If missing, it will attempt to create the missing file. The switch from FMS to netCDF I/O exposed a race condition here, where one rank may create the file, and another delayed rank may incorrectly identify this new file as already existing. This resulted in a segmentation fault. Unsure why this was not detected before; it could be that FMS was more resilient to the possibility of missing files. Regardless, the `exists` value was volatile and could lead to potential error. This patch introduces a temporary fix to the issue by checking the root PE and threading value. When threading is single-file, only the root PE participates in the existence test and file creation. This accounts for the case where either the root PE or any larger subset containing the root PE calls the function. It does not account for the more exotic case where a non-root PE many wish to create a file. If threading is set to MULTIPLE (i.e. IO domains) then an error is raised, since there's currently no safe way to implement an equivalent `inquire()` test. We can revisit this function when stronger controls around threaded IO are introduced. But for now, I believe that this is the best that we can do.
bb43ac7
to
e375aa7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the revised code safely covers all of the possible use cases and that is should neatly address the issue that prompted these changed. I am approving this, pending a successful pipeline test result once this rises to the top of the PR queue.
This passed testing (done manually) |
Revised write_energy so that only the root_PE attempts to open, reopen or write to a netcdf file. Although FMS can handle cases where multiple PEs make the same calls with internal logic, this change avoids requiring such internal (hidden) logical tests, and instead is more explicit on what is actually intended. This change is complementary to MOM6 dev/gfdl PR #328, which adds internal logic to handle the case where all PEs are making a call to reopen a single netcdf file. All answers and output are bitwise identical.
reopen_MOM_file includes an inquire test to determine whether we are attempting to reopen an existing file. If missing, it will attempt to instead create the file.
The switch from FMS to netCDF I/O exposed a race condition here, where one rank may create the file, and a another rank may incorrectly identify the file as existing. This resulted in a segmentation fault.
Unsure why this was not detected before, it could be that FMS was more resilient to the possibility of missing files.
Regardless, it makes sense to use a barrier after the inquire call to ensure that each rank is reporting the correct state. This patch does this with a sync_PEs call.