-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Oblique OBCs have problematic restarts in some cases #208
Comments
Yes, that seems correct to me. I might have even known it when we implemented it, just accepting it as being better than not saving any of that stuff at all. |
Northwest Atlantic (NWA) experiment setups of @andrew-c-ross use the oblique OBCs and they have restart issues that we have not been able to fix. Can I replace OBLIQUE with something else to just test if the restart issues are related to this?
|
@nikizadehgfdl Yes, try ORLANSKI instead. |
Added separate restart variables for some of the oblique OBC restart variables at north-south or east-west faces. Before this fix, some of the full 3-d arrays for restarts were being overwritten for oblique OBC segments that join at adjacent corners on the north-east side of tracer points, as is noted in github.com/NOAA-GFDL/issues/208. The discussion surrounding this issue confirms that there are cases using oblique OBCs (like some North-West Atlantic cases) that do not past the restart reproducibility tests. Some halo updates were also corrected, in accordance with the introduction of these new variables and their proper staggering locations. In a number of places, the proper case-sensitive horizontal indexing convention, as described in github.com/NOAA-GFDL/MOM6/wiki/Code-style-guide#soft-case-convention, is now being used. This commit also corrected the comments describing a number of the variables in the OBC_segment_type to make it clearer where they are discretized. This changes the names of oblique OBC-related variables in the restart files, but since the previous version did not reproduce across restarts, there does not seem to be any point in retaining those incorrect answers. All answers in the MOM6-examples test suite are bitwise identical, but this will change (fix) solutions with oblique OBCs and OBC_RAD_VEL_WT < 1.
I may have a fix for this problem at github.com/Hallberg-NOAA/MOM6/tree/oblique_OBC_restart_fix, but because there were not any cases that exhibit this bug in the MOM6-examples test suite, I would appreciate it if someone else could test this code fix against a case that does exhibit the restart issues with oblique OBCs. @nikizadehgfdl, or @andrew-c-ross, or @kshedstrom, would you be willing to test this for me? The restart file contents have changed, so the old restart files will not work. Instead I am just looking for a confirmation that two short runs with oblique OBCs give an equivalent answer to one longer run, with all of these run segments using the new code. |
I can test this in the Bering domain, but will have to rerun it with OBLIQUE instead of ORLANSKI. Do you want confirmation that the unmodified code has a restart problem? The dumbbell tests with OBLIQUE should be fine, having only EW OBCs (or NS if rotated). |
I will test NWA and report back ASAP. |
I'm having some trouble testing this because it also fails to reproduce when I switch to ORLANSKI. With Orlanski, a 2 day and 1day/restart/1day set of experiments fails with:
The same variables fail with OBLIQUE. With OBLIQUE and this new code, tres_y_001 (south boundary) also doesn't match; with the old code tres_y_002 (north boundary) doesn't match. I tried setting the tracer reservoir length scale to 0 since that has historically been troublesome, but it didn't change the failures. I will try turning off some other options. |
Comparing the restart files can tell you if runs are identical, but because differences in one field can spread quickly to all others, such a comparison does not help much in tracking down where in the code the differences emerge. To accomplish this, we use setting DEBUG=True, which writes out checksum values for many different variables at various points in each timestep. The first point where the output diverges is usually a good indicator of where in the code the differences arise. Moreover, it is straightforward to (temporarily?) add other checksum calls to the code to further narrow down where the differences arise. (Be warned, though, that setting DEBUG=True can lead to a very large volume of output, so it is best carried out on a short run with all the output to stdout piped to a file.) When debugging a run with restarts, it may be helpful to append the output of the second segment to the output from the first, after removing some of the initialization-related output from the second run. This can then be compared with the output of a single run that covers the entire period in the two run segments. |
I can confirm not getting identical solutions with OBLIQUE with your branch. It differs from the get-go in: The h fields match in the center, but not the corners, but that has always been the case, with differing nonsense values in the outside halo region. |
Which test cases are demonstrating these inconsistencies (I am assuming the pan-Arctic and NWAtlantic configurations), and are there any relatively small test cases that would be good candidates for debugging these problems? Also, just to narrow down what is going on, would it be possible to try short runs with these configurations with all of the OBC segments save one disabled, to determine whether this is a problem with individual OBC segments or whether this only arises with interactions between OBC segments? (Closing off OBC segments might lead to big adjustments, but hopefully the model will be able to hold together, perhaps with a shortened timestep, long enough to test the reproducibility of the restarts.) |
Here's a diff using the NWA12 domain and Orlanski (FLATHER,ORLANSKI,NUDGED), and tracer reservoir lengths set to 0: https://gist.github.com/andrew-c-ross/d4bba834b26f222b70986d5ad3cba920 I think NWA12 can run without open boundaries without too many problems, so I should be able to run experiments with that. However, it may take me some time to do because of a possible biogeochemical issue I'm currently working on. |
i was actually trying it with the Bering domain. All the Bering files are on gaea if you switch back to your JRA forcing. Sorry, I mostly run it from a January restart file, but I've started the transfer of 2011 files. /lustre/f2/dev/gfdl/Katherine.Hedstrom/ESMG/ESMG-configs/Bering What I saw last night in the debugger is that it somehow only copied over 3 of the 4 halo points at a tile boundary in u, I have no idea why and I only checked u. |
Gaea is now set up to run Bering for real. I failed to run dueling debuggers there though. I can have a 300 core license, but not two 12 core licenses, it seems. The DEBUG output shows again that it fails to match from the very beginning of the restart run. Why would only 3 of the 4 halo points have non-zero values? Where is that set? Would that cause the checksums to disagree? |
Turning off OBLIQUE_TAN changes how the runs diverge. Now they diverge at u pre-btstep accel PF[uv], both u and v, u not matching on the eastern side only, v on the northern side only. |
The pre-btstep accel was a red herring. u and v match at the end of the first go through step_mom_dyn_split_rk2, diverge in the second call at u before vertvisc: up. |
Turning on DEBUG_OBC causes the restart to match. The diff between with and without DEBUG_OBC shows up at: |
@Hallberg-NOAA I made a PR to your branch to fix btstep. However, there's still a dependence on h outside the domain in MOM_mixed_layer_restrat.F90. You can compare with and without DEBUG_OBC to see. In the Bering, things |
Added separate restart variables for some of the oblique OBC restart variables at north-south or east-west faces. Before this fix, some of the full 3-d arrays for restarts were being overwritten for oblique OBC segments that join at adjacent corners on the north-east side of tracer points, as is noted in github.com/NOAA-GFDL/issues/208. The discussion surrounding this issue confirms that there are cases using oblique OBCs (like some North-West Atlantic cases) that do not past the restart reproducibility tests. Some halo updates were also corrected, in accordance with the introduction of these new variables and their proper staggering locations. In a number of places, the proper case-sensitive horizontal indexing convention, as described in github.com/NOAA-GFDL/MOM6/wiki/Code-style-guide#soft-case-convention, is now being used. This commit also corrected the comments describing a number of the variables in the OBC_segment_type to make it clearer where they are discretized. This changes the names of oblique OBC-related variables in the restart files, but since the previous version did not reproduce across restarts, there does not seem to be any point in retaining those incorrect answers. All answers in the MOM6-examples test suite are bitwise identical, but this will change (fix) solutions with oblique OBCs and OBC_RAD_VEL_WT < 1.
Added two new arrays (OBCmaskCu and OBCmaskCv) to the ocean_grid_type and dyn_horgrid_type to mask out values over land or at open boundary condition points. Without open boundary conditions, these arrays are identical to mask2dCu and mask2dCv. These arrays are used in some of the lateral parameterization modules to zero out certain gradient-dependent fluxes at open boundary points. With these changes, the Bering test case solutions no longer exhibit any dependence on whether DEBUG_OBC is true or false or the value of OBC_SILLY_THICK, so this commit should help to address some of the issues discussed in github.com/NOAA-GFDL/issues/208. All answers are bitwise identical in the MOM6-examples test cases, but they change for some other tests that use open boundary conditions more extensively, and there are new arrays in some transparent types.
@kshedstrom, I merged in your PR and added some more changes to github.com/Hallberg-NOAA/MOM6/tree/oblique_OBC_restart_fix that I think address the changes to the Bering when DEBUG_OBC is enabled, but I would appreciate it if you would verify this in your version of the Bering test case. (I am using an older coupler, and had to drop some of the more recent coupler.nml options.) |
Yes, this does pass the perfect restart test for OBLIQUE obcs. However, there is still work to be done on OBLIQUE_TAN. |
Thanks for this update, @kshedstrom. After thoroughly inspecting the OBLIQUE_TAN related code and trying (and failing) to spot what might be going with the OBLIQUE_TAN OBCs, there are three questions that I think might be instructive:
|
1., 2. Yes, with OBC_RAD_VEL_WT = 1.0, it does reproduce across restarts and across process counts.
|
Added separate restart variables for some of the oblique OBC restart variables at north-south or east-west faces. Before this fix, some of the full 3-d arrays for restarts were being overwritten for oblique OBC segments that join at adjacent corners on the north-east side of tracer points, as is noted in github.com/NOAA-GFDL/issues/208. The discussion surrounding this issue confirms that there are cases using oblique OBCs (like some North-West Atlantic cases) that do not past the restart reproducibility tests. Some halo updates were also corrected, in accordance with the introduction of these new variables and their proper staggering locations. In a number of places, the proper case-sensitive horizontal indexing convention, as described in github.com/NOAA-GFDL/MOM6/wiki/Code-style-guide#soft-case-convention, is now being used. This commit also corrected the comments describing a number of the variables in the OBC_segment_type to make it clearer where they are discretized. This changes the names of oblique OBC-related variables in the restart files, but since the previous version did not reproduce across restarts, there does not seem to be any point in retaining those incorrect answers. All answers in the MOM6-examples test suite are bitwise identical, but this will change (fix) solutions with oblique OBCs and OBC_RAD_VEL_WT < 1.
Added two new arrays (OBCmaskCu and OBCmaskCv) to the ocean_grid_type and dyn_horgrid_type to mask out values over land or at open boundary condition points. Without open boundary conditions, these arrays are identical to mask2dCu and mask2dCv. These arrays are used in some of the lateral parameterization modules to zero out certain gradient-dependent fluxes at open boundary points. With these changes, the Bering test case solutions no longer exhibit any dependence on whether DEBUG_OBC is true or false or the value of OBC_SILLY_THICK, so this commit should help to address some of the issues discussed in github.com/NOAA-GFDL/issues/208. All answers are bitwise identical in the MOM6-examples test cases, but they change for some other tests that use open boundary conditions more extensively, and there are new arrays in some transparent types.
Sorry, my test does now reproduce with OBLIQUE_TAN. I must have done something wrong in that other test. |
Thank you so much for that update, @kshedstrom. The fact that PR #219 corrects the restarts with OBLIQUE_TAN as well is a tremendous relief. With this update, I think that it is clear that this issue can be closed once PR #219 is merged into the dev/gfdl branch of the MOM6 code base. |
Added separate restart variables for some of the oblique OBC restart variables at north-south or east-west faces. Before this fix, some of the full 3-d arrays for restarts were being overwritten for oblique OBC segments that join at adjacent corners on the north-east side of tracer points, as is noted in github.com/NOAA-GFDL/issues/208. The discussion surrounding this issue confirms that there are cases using oblique OBCs (like some North-West Atlantic cases) that do not past the restart reproducibility tests. Some halo updates were also corrected, in accordance with the introduction of these new variables and their proper staggering locations. In a number of places, the proper case-sensitive horizontal indexing convention, as described in github.com/NOAA-GFDL/MOM6/wiki/Code-style-guide#soft-case-convention, is now being used. This commit also corrected the comments describing a number of the variables in the OBC_segment_type to make it clearer where they are discretized. This changes the names of oblique OBC-related variables in the restart files, but since the previous version did not reproduce across restarts, there does not seem to be any point in retaining those incorrect answers. All answers in the MOM6-examples test suite are bitwise identical, but this will change (fix) solutions with oblique OBCs and OBC_RAD_VEL_WT < 1.
Added two new arrays (OBCmaskCu and OBCmaskCv) to the ocean_grid_type and dyn_horgrid_type to mask out values over land or at open boundary condition points. Without open boundary conditions, these arrays are identical to mask2dCu and mask2dCv. These arrays are used in some of the lateral parameterization modules to zero out certain gradient-dependent fluxes at open boundary points. With these changes, the Bering test case solutions no longer exhibit any dependence on whether DEBUG_OBC is true or false or the value of OBC_SILLY_THICK, so this commit should help to address some of the issues discussed in github.com/NOAA-GFDL/issues/208. All answers are bitwise identical in the MOM6-examples test cases, but they change for some other tests that use open boundary conditions more extensively, and there are new arrays in some transparent types.
Added separate restart variables for some of the oblique OBC restart variables at north-south or east-west faces. Before this fix, some of the full 3-d arrays for restarts were being overwritten for oblique OBC segments that join at adjacent corners on the north-east side of tracer points, as is noted in github.com//issues/208. The discussion surrounding this issue confirms that there are cases using oblique OBCs (like some North-West Atlantic cases) that do not past the restart reproducibility tests. Some halo updates were also corrected, in accordance with the introduction of these new variables and their proper staggering locations. In a number of places, the proper case-sensitive horizontal indexing convention, as described in github.com/NOAA-GFDL/MOM6/wiki/Code-style-guide#soft-case-convention, is now being used. This commit also corrected the comments describing a number of the variables in the OBC_segment_type to make it clearer where they are discretized. This changes the names of oblique OBC-related variables in the restart files, but since the previous version did not reproduce across restarts, there does not seem to be any point in retaining those incorrect answers. All answers in the MOM6-examples test suite are bitwise identical, but this will change (fix) solutions with oblique OBCs and OBC_RAD_VEL_WT < 1.
Added two new arrays (OBCmaskCu and OBCmaskCv) to the ocean_grid_type and dyn_horgrid_type to mask out values over land or at open boundary condition points. Without open boundary conditions, these arrays are identical to mask2dCu and mask2dCv. These arrays are used in some of the lateral parameterization modules to zero out certain gradient-dependent fluxes at open boundary points. With these changes, the Bering test case solutions no longer exhibit any dependence on whether DEBUG_OBC is true or false or the value of OBC_SILLY_THICK, so this commit should help to address some of the issues discussed in github.com//issues/208. All answers are bitwise identical in the MOM6-examples test cases, but they change for some other tests that use open boundary conditions more extensively, and there are new arrays in some transparent types.
With oblique open boundary conditions, both u-face (east-west) and v-face (north-south) segments have both
segment%rx_norm_obl
andsegment%ry_norm_obl
fields. In order to have reproducing restarts, these are copied into the 3-d arraysOBC%rx_oblique
andOBC%ry_oblique
. Unfortunately, OBCs at both faces can have the same indices, so there could be cases where both the east-west and north-south segments are active with the same global indices, so these restart fields for OBC segments would be over-writing each other if active OBC segments join in the north-east corner of a tracer cell. A similar situation applies for theOBC%cff_normal
fields that are also used for oblique OBCs.The appropriate solution here would seem to be to use separate restart fields for the oblique OBCs for u- and v-face OBC segments. This would add 3 more restart fields when oblique OBCs are used, but then the staggering of all variables would be unambiguous and cases with pairs of oblique OBCs joining in the northeast corner of tracer cells would reproduce across restarts.
I do not have any examples where this situation arises, but it may be because the oblique OBC are not very widely used.
Note that normal radiation open boundary conditions do not have this problem.
@kshedstrom or @MJHarrison-GFDL, I would particularly appreciate your thoughts on whether this diagnosis seems correct.
The text was updated successfully, but these errors were encountered: