Intel20 compiler issue in ice_transport_remap #461

Open
apcraig opened this issue Jun 7, 2020 · 9 comments
Comments

@apcraig
Contributor

apcraig commented Jun 7, 2020

#460 includes what we believe is a workaround for a compiler bug in ice_transport_remap, but more analysis needs to be done. As highlighted in #460:

ice_transport_remap seems to have persistent seg fault issues, but they appear in different places; there's a comment about one of the omp directives seg faulting, and in the past, I've had to unroll a loop in the transport (I no longer remember which one) in order for optimization to not create a seg fault. Is there a particular (set of) variable(s) that need to be allocated, or is it really all of them? Is there a reason to not allocate here all the time? Would it help to move to a vector version of the transport (e.g. the new unstructured-grid code in MPAS)?

We need to test on other machines with the intel20 compiler, understand the problem better (whether a coding issue or simply a coding vulnerability), and try to figure out a more robust solution.

@phil-blain
Member

@apcraig have you tried running the model under Valgrind? Maybe that could help. On our Cray XC system we have a Cray-installed valgrind4hpc module that may be useful (from a quick Google search, it seems to be a standard install on most Cray systems).

@phil-blain
Member

Also, the Intel compilers can be installed on personal Linux computers for open-source contributors, so that might be a way to test the Intel 2020 compiler if it's hard to find another machine with it...

@apcraig
Contributor Author

apcraig commented Jul 1, 2020

Testing was carried out on other machines and the workaround in #460 was removed in #462. We believe the problem arises only on izumi due to some system issues on that particular machine. I have renamed the issue to reflect that.

@apcraig changed the title from "ice_transport_remap seg faults" to "izumi port problems due to izumi system issues" on Jul 1, 2020
@apcraig
Contributor Author

apcraig commented Sep 29, 2020

The Intel problem was reproduced on Orion using the same case. This truly does seem to be a compiler bug. We should consider implementing the workaround.

@apcraig changed the title from "izumi port problems due to izumi system issues" to "Intel20 compiler issue in ice_transport_remap" on Sep 29, 2020
@dabail10
Contributor

I thought initially that orion did not have the issue with Intel20? Do we know what changed? I see that cheyenne has not moved to intel 20 yet, so perhaps there will be a compiler fix in the next update to intel 20?

@jedwards4b
Contributor

Just reading through this - did you try adjusting the OMP_STACKSIZE variable?

@dabail10
Contributor

I just tried OMP_STACKSIZE of 256M and 1024M. Neither did anything for this.

@dabail10
Contributor

dabail10 commented Oct 6, 2020

We just talked about this at the CSEG meeting. There was a lot of pushback claiming it is the CICE code and not intel20. There are three tests that fail for CESM2 on izumi with intel 20. Two of these are failing at the same place in ice_transport_remap.F90 and this is the CICE5 code base. One of the tests is actually failing in POP. Jim Edwards has offered to help us debug this, but we need to come up with a reproducible case for this. In terms of izumi, our lab director has said that izumi is still an important tool for CGD and support will continue on this. They currently have someone from another lab helping with this until a replacement hire is made for Mark.

@apcraig
Contributor Author

apcraig commented Oct 6, 2020

Thanks for the info @dabail10. It certainly could be an issue with the CICE implementation. It's just odd that this code has existed for 10 years or more and has been run on probably hundreds of different compilers and compiler versions over that time, and none has had a problem until this version of intel20. I have spent some time debugging and have a workaround/fix in my back pocket that migrates the subroutine's static memory allocation to dynamic allocation. This seems to get rid of the error, although since I don't understand the underlying problem, I don't know whether it actually addresses it (if it exists). I'd be happy if someone else wants to take a look! @jedwards4b, would a simple standalone CICE6 case that fails be adequate? I'm also happy to show you my workaround and talk about my debugging efforts.
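
For readers unfamiliar with the kind of change described above (moving a routine's work arrays from static or automatic storage to dynamic allocation), here is a minimal sketch in C. It is only an analogue of the idea and is not the CICE Fortran code or the actual workaround; the array sizes `NX` and `NY` and the function names `sum_stack` and `sum_heap` are hypothetical, chosen purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical work-array dimensions; not taken from CICE. */
#define NX 64
#define NY 64

/* Variant 1: the work array is an automatic variable, so it lives on
 * the stack of whichever thread calls the function. */
double sum_stack(double fill)
{
    double work[NX][NY];
    double total = 0.0;

    for (size_t i = 0; i < NX; i++)
        for (size_t j = 0; j < NY; j++)
            work[i][j] = fill;

    for (size_t i = 0; i < NX; i++)
        for (size_t j = 0; j < NY; j++)
            total += work[i][j];

    return total;
}

/* Variant 2: the same computation, but the work array is allocated on
 * the heap and released before returning.
 * Returns 0 on success and -1 if the allocation fails. */
int sum_heap(double fill, double *out)
{
    assert(out != NULL);

    double (*work)[NY] = malloc(sizeof(double[NX][NY]));
    if (work == NULL)
        return -1;

    double total = 0.0;

    for (size_t i = 0; i < NX; i++)
        for (size_t j = 0; j < NY; j++)
            work[i][j] = fill;

    for (size_t i = 0; i < NX; i++)
        for (size_t j = 0; j < NY; j++)
            total += work[i][j];

    free(work);
    *out = total;
    return 0;
}
```

Both variants compute the same result; the only difference is where the work array lives. Whether a change of this kind fixes the underlying problem in ice_transport_remap, or only hides it, is exactly the open question in this issue.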
