Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Oblique OBCs have problematic restarts in some cases #208

Closed
Hallberg-NOAA opened this issue Sep 18, 2022 · 23 comments · Fixed by #219
Closed

Oblique OBCs have problematic restarts in some cases #208

Hallberg-NOAA opened this issue Sep 18, 2022 · 23 comments · Fixed by #219

Comments

@Hallberg-NOAA
Copy link
Member

With oblique open boundary conditions, both u-face (east-west) and v-face (north-south) segments have both segment%rx_norm_obl and segment%ry_norm_obl fields. In order to have reproducing restarts, these are copied into the 3-d arrays OBC%rx_oblique and OBC%ry_oblique. Unfortunately, OBCs at both faces can have the same indices, so there could be cases where both the east-west and north-south segments are active with the same global indices, so these restart fields for OBC segments would be over-writing each other if active OBC segments join in the north-east corner of a tracer cell. A similar situation applies for the OBC%cff_normal fields that are also used for oblique OBCs.

The appropriate solution here would seem to be to use separate restart fields for the oblique OBCs for u- and v-face OBC segments. This would add 3 more restart fields when oblique OBCs are used, but then the staggering of all variables would be unambiguous and cases with pairs of oblique OBCs joining in the northeast corner of tracer cells would reproduce across restarts.

I do not have any examples where this situation arises, but it may be because the oblique OBC are not very widely used.

Note that normal radiation open boundary conditions do not have this problem.

@kshedstrom or @MJHarrison-GFDL, I would particularly appreciate your thoughts on whether this diagnosis seems correct.

@kshedstrom
Copy link

Yes, that seems correct to me. I might have even known it when we implemented it, just accepting it as being better than not saving any of that stuff at all.

@nikizadehgfdl
Copy link

Northwest Atlantic (NWA) experiment setups of @andrew-c-ross use the oblique OBCs and they have restart issues that we have not been able to fix. Can I replace OBLIQUE with something else to just test if the restart issues are related to this?

OBC_SEGMENT_001 = "J=0,I=0:N,FLATHER,OBLIQUE,NUDGED" 

@kshedstrom
Copy link

@nikizadehgfdl Yes, try ORLANSKI instead.

Hallberg-NOAA added a commit to Hallberg-NOAA/MOM6 that referenced this issue Sep 19, 2022
  Added separate restart variables for some of the oblique OBC restart variables
at north-south or east-west faces.  Before this fix, some of the full 3-d arrays
for restarts were being overwritten for oblique OBC segments that join at
adjacent corners on the north-east side of tracer points, as is noted in
github.com/NOAA-GFDL/issues/208.  The discussion surrounding this issue
confirms that there are cases using oblique OBCs (like some North-West Atlantic
cases) that do not past the restart reproducibility tests.  Some halo updates
were also corrected, in accordance with the introduction of these new variables
and their proper staggering locations.  In a number of places, the proper
case-sensitive horizontal indexing convention, as described in
github.com/NOAA-GFDL/MOM6/wiki/Code-style-guide#soft-case-convention, is now
being used.  This commit also corrected the comments describing a number of the
variables in the OBC_segment_type to make it clearer where they are discretized.

  This changes the names of oblique OBC-related variables in the restart files,
but since the previous version did not reproduce across restarts, there does not
seem to be any point in retaining those incorrect answers.  All answers in the
MOM6-examples test suite are bitwise identical, but this will change (fix)
solutions with oblique OBCs and OBC_RAD_VEL_WT < 1.
@Hallberg-NOAA
Copy link
Member Author

I may have a fix for this problem at github.com/Hallberg-NOAA/MOM6/tree/oblique_OBC_restart_fix, but because there were not any cases that exhibit this bug in the MOM6-examples test suite, I would appreciate it if someone else could test this code fix against a case that does exhibit the restart issues with oblique OBCs.

@nikizadehgfdl, or @andrew-c-ross, or @kshedstrom, would you be willing to test this for me? The restart file contents have changed, so the old restart files will not work. Instead I am just looking for a confirmation that two short runs with oblique OBCs give an equivalent answer to one longer run, with all of these run segments using the new code.

@kshedstrom
Copy link

I can test this in the Bering domain, but will have to rerun it with OBLIQUE instead of ORLANSKI. Do you want confirmation that the unmodified code has a restart problem?

The dumbbell tests with OBLIQUE should be fine, having only EW OBCs (or NS if rotated).

@andrew-c-ross
Copy link

I will test NWA and report back ASAP.

@andrew-c-ross
Copy link

I'm having some trouble testing this because it also fails to reproduce when I switch to ORLANSKI. With Orlanski, a 2 day and 1day/restart/1day set of experiments fails with:

      Comparing ice_model.res.nc...
DIFFER : VARIABLE : flux_u : ATTRIBUTE : checksum : VALUES : "4D9242C5DB2789AB" <> "4E10805449F28086"
      Comparing input.nml...
WARNING : NONESSENTIAL FILES DIFFER : input.nml
      Comparing MOM.res_1.nc...
DIFFER : VARIABLE : diffu : ATTRIBUTE : checksum : VALUES : "B093914D35D8782B" <> "9E2943536525AFC8"
      Comparing MOM.res.nc...
DIFFER : VARIABLE : Temp : ATTRIBUTE : checksum : VALUES : "94D9EE69B4E1CCA4" <> "9157B520BEDF7165"

The same variables fail with OBLIQUE. With OBLIQUE and this new code, tres_y_001 (south boundary) also doesn't match; with the old code tres_y_002 (north boundary) doesn't match.

I tried setting the tracer reservoir length scale to 0 since that has historically been troublesome, but it didn't change the failures. I will try turning off some other options.

@Hallberg-NOAA
Copy link
Member Author

Hallberg-NOAA commented Sep 21, 2022

Comparing the restart files can tell you if runs are identical, but because differences in one field can spread quickly to all others, such a comparison does not help much in tracking down where in the code the differences emerge. To accomplish this, we use setting DEBUG=True, which writes out checksum values for many different variables at various points in each timestep. The first point where the output diverges is usually a good indicator of where in the code the differences arise. Moreover, it is straightforward to (temporarily?) add other checksum calls to the code to further narrow down where the differences arise.

(Be warned, though, that setting DEBUG=True can lead to a very large volume of output, so it is best carried out on a short run with all the output to stdout piped to a file.)

When debugging a run with restarts, it may be helpful to append the output of the second segment to the output from the first, after removing some of the initialization-related output from the second run. This can then be compared with the output of a single run that covers the entire period in the two run segments.

@kshedstrom
Copy link

I can confirm not getting identical solutions with OBLIQUE with your branch. It differs from the get-go in:
u Beginning of step_MOM [uv]
but not v Beginning of step_MOM [uv]

The h fields match in the center, but not the corners, but that has always been the case, with differing nonsense values in the outside halo region.

@Hallberg-NOAA
Copy link
Member Author

Which test cases are demonstrating these inconsistencies (I am assuming the pan-Arctic and NWAtlantic configurations), and are there any relatively small test cases that would be good candidates for debugging these problems?

Also, just to narrow down what is going on, would it be possible to try short runs with these configurations with all of the OBC segments save one disabled, to determine whether this is a problem with individual OBC segments or whether this only arises with interactions between OBC segments? (Closing off OBC segments might lead to big adjustments, but hopefully the model will be able to hold together, perhaps with a shortened timestep, long enough to test the reproducibility of the restarts.)

@andrew-c-ross
Copy link

Here's a diff using the NWA12 domain and Orlanski (FLATHER,ORLANSKI,NUDGED), and tracer reservoir lengths set to 0: https://gist.github.com/andrew-c-ross/d4bba834b26f222b70986d5ad3cba920

I think NWA12 can run without open boundaries without too many problems, so I should be able to run experiments with that. However, it may take me some time to do because of a possible biogeochemical issue I'm currently working on.

@kshedstrom
Copy link

i was actually trying it with the Bering domain. All the Bering files are on gaea if you switch back to your JRA forcing. Sorry, I mostly run it from a January restart file, but I've started the transfer of 2011 files.

/lustre/f2/dev/gfdl/Katherine.Hedstrom/ESMG/ESMG-configs/Bering

What I saw last night in the debugger is that it somehow only copied over 3 of the 4 halo points at a tile boundary in u, I have no idea why and I only checked u.

@kshedstrom
Copy link

Gaea is now set up to run Bering for real. I failed to run dueling debuggers there though. I can have a 300 core license, but not two 12 core licenses, it seems. The DEBUG output shows again that it fails to match from the very beginning of the restart run. Why would only 3 of the 4 halo points have non-zero values? Where is that set? Would that cause the checksums to disagree?

@kshedstrom
Copy link

Turning off OBLIQUE_TAN changes how the runs diverge. Now they diverge at u pre-btstep accel PF[uv], both u and v, u not matching on the eastern side only, v on the northern side only.

@kshedstrom
Copy link

The pre-btstep accel was a red herring. u and v match at the end of the first go through step_mom_dyn_split_rk2, diverge in the second call at u before vertvisc: up.

@kshedstrom
Copy link

Turning on DEBUG_OBC causes the restart to match. The diff between with and without DEBUG_OBC shows up at:
h-point: c= 138662915 pre-btstep accel pbce
the second time around. DEBUG_OBC sets the value of h just outside the OBCs to zero, also sets the pressure gradients to zero right on the boundary. Could it be that something depends on the value of pbce outside?

@kshedstrom
Copy link

@Hallberg-NOAA I made a PR to your branch to fix btstep. However, there's still a dependence on h outside the domain in MOM_mixed_layer_restrat.F90. You can compare with and without DEBUG_OBC to see. In the Bering, things
change at time 4200 s with h(:,5,1) being different everywhere after the call, not just outside the domain, not
just at the edges.

Hallberg-NOAA added a commit to Hallberg-NOAA/MOM6 that referenced this issue Oct 10, 2022
  Added separate restart variables for some of the oblique OBC restart variables
at north-south or east-west faces.  Before this fix, some of the full 3-d arrays
for restarts were being overwritten for oblique OBC segments that join at
adjacent corners on the north-east side of tracer points, as is noted in
github.com/NOAA-GFDL/issues/208.  The discussion surrounding this issue
confirms that there are cases using oblique OBCs (like some North-West Atlantic
cases) that do not past the restart reproducibility tests.  Some halo updates
were also corrected, in accordance with the introduction of these new variables
and their proper staggering locations.  In a number of places, the proper
case-sensitive horizontal indexing convention, as described in
github.com/NOAA-GFDL/MOM6/wiki/Code-style-guide#soft-case-convention, is now
being used.  This commit also corrected the comments describing a number of the
variables in the OBC_segment_type to make it clearer where they are discretized.

  This changes the names of oblique OBC-related variables in the restart files,
but since the previous version did not reproduce across restarts, there does not
seem to be any point in retaining those incorrect answers.  All answers in the
MOM6-examples test suite are bitwise identical, but this will change (fix)
solutions with oblique OBCs and OBC_RAD_VEL_WT < 1.
Hallberg-NOAA added a commit to Hallberg-NOAA/MOM6 that referenced this issue Oct 10, 2022
  Added two new arrays (OBCmaskCu and OBCmaskCv) to the ocean_grid_type and
dyn_horgrid_type to mask out values over land or at open boundary condition
points.  Without open boundary conditions, these arrays are identical to
mask2dCu and mask2dCv.  These arrays are used in some of the lateral
parameterization modules to zero out certain gradient-dependent fluxes at open
boundary points.  With these changes, the Bering test case solutions no longer
exhibit any dependence on whether DEBUG_OBC is true or false or the value of
OBC_SILLY_THICK, so this commit should help to address some of the issues
discussed in github.com/NOAA-GFDL/issues/208.  All answers are bitwise
identical in the MOM6-examples test cases, but they change for some other tests
that use open boundary conditions more extensively, and there are new arrays in
some transparent types.
@Hallberg-NOAA
Copy link
Member Author

@kshedstrom, I merged in your PR and added some more changes to github.com/Hallberg-NOAA/MOM6/tree/oblique_OBC_restart_fix that I think address the changes to the Bering when DEBUG_OBC is enabled, but I would appreciate it if you would verify this in your version of the Bering test case. (I am using an older coupler, and had to drop some of the more recent coupler.nml options.)

@kshedstrom
Copy link

Yes, this does pass the perfect restart test for OBLIQUE obcs. However, there is still work to be done on OBLIQUE_TAN.

@Hallberg-NOAA
Copy link
Member Author

Hallberg-NOAA commented Oct 12, 2022

Thanks for this update, @kshedstrom. After thoroughly inspecting the OBLIQUE_TAN related code and trying (and failing) to spot what might be going with the OBLIQUE_TAN OBCs, there are three questions that I think might be instructive:

  1. If we set OBC_RAD_VEL_WT = 1.0, do cases with OBLIQUE_TAN OBCs reproduce across restarts?

  2. Do the cases with OBLIQUE_TAN OBCs reproduce across different PE counts and layouts?

  3. Do you think that the improvements accruing from this branch as it stands are great enough to warrant a PR to dev/gfdl and hence to main at this time, or do you think that we should hold off until we have corrected all of the problems with OBLIQUE_TAN OBCs as well?

@kshedstrom
Copy link

1., 2. Yes, with OBC_RAD_VEL_WT = 1.0, it does reproduce across restarts and across process counts.

  1. I don't see any reason to wait to push your success so far.

Hallberg-NOAA added a commit to Hallberg-NOAA/MOM6 that referenced this issue Oct 13, 2022
  Added separate restart variables for some of the oblique OBC restart variables
at north-south or east-west faces.  Before this fix, some of the full 3-d arrays
for restarts were being overwritten for oblique OBC segments that join at
adjacent corners on the north-east side of tracer points, as is noted in
github.com/NOAA-GFDL/issues/208.  The discussion surrounding this issue
confirms that there are cases using oblique OBCs (like some North-West Atlantic
cases) that do not past the restart reproducibility tests.  Some halo updates
were also corrected, in accordance with the introduction of these new variables
and their proper staggering locations.  In a number of places, the proper
case-sensitive horizontal indexing convention, as described in
github.com/NOAA-GFDL/MOM6/wiki/Code-style-guide#soft-case-convention, is now
being used.  This commit also corrected the comments describing a number of the
variables in the OBC_segment_type to make it clearer where they are discretized.

  This changes the names of oblique OBC-related variables in the restart files,
but since the previous version did not reproduce across restarts, there does not
seem to be any point in retaining those incorrect answers.  All answers in the
MOM6-examples test suite are bitwise identical, but this will change (fix)
solutions with oblique OBCs and OBC_RAD_VEL_WT < 1.
Hallberg-NOAA added a commit to Hallberg-NOAA/MOM6 that referenced this issue Oct 13, 2022
  Added two new arrays (OBCmaskCu and OBCmaskCv) to the ocean_grid_type and
dyn_horgrid_type to mask out values over land or at open boundary condition
points.  Without open boundary conditions, these arrays are identical to
mask2dCu and mask2dCv.  These arrays are used in some of the lateral
parameterization modules to zero out certain gradient-dependent fluxes at open
boundary points.  With these changes, the Bering test case solutions no longer
exhibit any dependence on whether DEBUG_OBC is true or false or the value of
OBC_SILLY_THICK, so this commit should help to address some of the issues
discussed in github.com/NOAA-GFDL/issues/208.  All answers are bitwise
identical in the MOM6-examples test cases, but they change for some other tests
that use open boundary conditions more extensively, and there are new arrays in
some transparent types.
@kshedstrom
Copy link

Sorry, my test does now reproduce with OBLIQUE_TAN. I must have done something wrong in that other test.

@Hallberg-NOAA
Copy link
Member Author

Thank you so much for that update, @kshedstrom. The fact that PR #219 corrects the restarts with OBLIQUE_TAN as well is a tremendous relief. With this update, I think that it is clear that this issue can be closed once PR #219 is merged into the dev/gfdl branch of the MOM6 code base.

@Hallberg-NOAA Hallberg-NOAA linked a pull request Oct 14, 2022 that will close this issue
Hallberg-NOAA added a commit to Hallberg-NOAA/MOM6 that referenced this issue Oct 21, 2022
  Added separate restart variables for some of the oblique OBC restart variables
at north-south or east-west faces.  Before this fix, some of the full 3-d arrays
for restarts were being overwritten for oblique OBC segments that join at
adjacent corners on the north-east side of tracer points, as is noted in
github.com/NOAA-GFDL/issues/208.  The discussion surrounding this issue
confirms that there are cases using oblique OBCs (like some North-West Atlantic
cases) that do not past the restart reproducibility tests.  Some halo updates
were also corrected, in accordance with the introduction of these new variables
and their proper staggering locations.  In a number of places, the proper
case-sensitive horizontal indexing convention, as described in
github.com/NOAA-GFDL/MOM6/wiki/Code-style-guide#soft-case-convention, is now
being used.  This commit also corrected the comments describing a number of the
variables in the OBC_segment_type to make it clearer where they are discretized.

  This changes the names of oblique OBC-related variables in the restart files,
but since the previous version did not reproduce across restarts, there does not
seem to be any point in retaining those incorrect answers.  All answers in the
MOM6-examples test suite are bitwise identical, but this will change (fix)
solutions with oblique OBCs and OBC_RAD_VEL_WT < 1.
Hallberg-NOAA added a commit to Hallberg-NOAA/MOM6 that referenced this issue Oct 21, 2022
  Added two new arrays (OBCmaskCu and OBCmaskCv) to the ocean_grid_type and
dyn_horgrid_type to mask out values over land or at open boundary condition
points.  Without open boundary conditions, these arrays are identical to
mask2dCu and mask2dCv.  These arrays are used in some of the lateral
parameterization modules to zero out certain gradient-dependent fluxes at open
boundary points.  With these changes, the Bering test case solutions no longer
exhibit any dependence on whether DEBUG_OBC is true or false or the value of
OBC_SILLY_THICK, so this commit should help to address some of the issues
discussed in github.com/NOAA-GFDL/issues/208.  All answers are bitwise
identical in the MOM6-examples test cases, but they change for some other tests
that use open boundary conditions more extensively, and there are new arrays in
some transparent types.
marshallward pushed a commit that referenced this issue Oct 25, 2022
  Added separate restart variables for some of the oblique OBC restart variables
at north-south or east-west faces.  Before this fix, some of the full 3-d arrays
for restarts were being overwritten for oblique OBC segments that join at
adjacent corners on the north-east side of tracer points, as is noted in
github.com//issues/208.  The discussion surrounding this issue
confirms that there are cases using oblique OBCs (like some North-West Atlantic
cases) that do not past the restart reproducibility tests.  Some halo updates
were also corrected, in accordance with the introduction of these new variables
and their proper staggering locations.  In a number of places, the proper
case-sensitive horizontal indexing convention, as described in
github.com/NOAA-GFDL/MOM6/wiki/Code-style-guide#soft-case-convention, is now
being used.  This commit also corrected the comments describing a number of the
variables in the OBC_segment_type to make it clearer where they are discretized.

  This changes the names of oblique OBC-related variables in the restart files,
but since the previous version did not reproduce across restarts, there does not
seem to be any point in retaining those incorrect answers.  All answers in the
MOM6-examples test suite are bitwise identical, but this will change (fix)
solutions with oblique OBCs and OBC_RAD_VEL_WT < 1.
marshallward pushed a commit that referenced this issue Oct 25, 2022
  Added two new arrays (OBCmaskCu and OBCmaskCv) to the ocean_grid_type and
dyn_horgrid_type to mask out values over land or at open boundary condition
points.  Without open boundary conditions, these arrays are identical to
mask2dCu and mask2dCv.  These arrays are used in some of the lateral
parameterization modules to zero out certain gradient-dependent fluxes at open
boundary points.  With these changes, the Bering test case solutions no longer
exhibit any dependence on whether DEBUG_OBC is true or false or the value of
OBC_SILLY_THICK, so this commit should help to address some of the issues
discussed in github.com//issues/208.  All answers are bitwise
identical in the MOM6-examples test cases, but they change for some other tests
that use open boundary conditions more extensively, and there are new arrays in
some transparent types.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants