fix issues in two phase build #72


jedwards4b
Contributor

This fixes the issue with the NCK test. There still appears to be an intermittent problem - it looks like the mkdir command in preview_namelist at line 137 occasionally fails. Is it because multiple threads arrive here at the same time?

@jgfouca
Contributor

jgfouca commented Feb 28, 2016

Hi @jedwards4b , I've been banging my head against this for several hours. The solution that had me very hopeful was to make an extra Case instance at the beginning of nck.build and then flush it at the end if sharedlib_only. Unfortunately, this failed along with everything else I tried.

Many thanks for this fix. Would you mind briefly explaining what the core problem was and why this fixes it?

jgfouca added a commit that referenced this pull request Feb 28, 2016
@jgfouca jgfouca merged commit 6790449 into ESMCI:jgfouca/more_conv_speed_improvements Feb 28, 2016
@jgfouca
Contributor

jgfouca commented Feb 28, 2016

Oh, and build.py already creates the sharedlibroot, so preview_namelists does not need to do it. I'll make that change. You are correct, it is not safe to have preview_namelists do anything to the shared area because they run in parallel.
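
As a minimal sketch of the race described here (not the CIME code itself): when several workers can reach a directory-creation step at the same time, a check-then-create sequence can fail in whichever worker loses the race. Assuming Python 3, `os.makedirs(..., exist_ok=True)` tolerates a directory that already exists. The `ensure_dir` helper and the temporary `sharedlibroot` path below are hypothetical names used only for illustration.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def ensure_dir(path):
    """Create path if it is missing; tolerate another worker creating it first."""
    # exist_ok=True suppresses FileExistsError when the directory already exists,
    # which is the error a check-then-mkdir sequence hits when two workers race.
    os.makedirs(path, exist_ok=True)
    return path


if __name__ == "__main__":
    # Hypothetical stand-in for the shared library root.
    root = os.path.join(tempfile.mkdtemp(), "sharedlibroot")
    # Many workers ask for the same directory at once; none should raise.
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(ensure_dir, [root] * 32))
    print(os.path.isdir(root), len(results))
```

Creating the shared directory once, before any parallel step starts, avoids the race entirely; the tolerant call above is only a fallback for code that cannot be moved.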

@jedwards4b
Contributor Author

The problem was that we were creating two build directories for the cpl/drv component: the build was going into cesm/obj while clean was targeting cpl/obj, so that component was not being properly cleaned before the second build. I'm not sure why it worked before we separated the build into two steps.

I changed it so that the build now goes into cpl/obj. The cesm/obj (actually $model/obj) directory is still being created somewhere, and we can get rid of it once we figure out where.
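
To illustrate the mismatch, here is a minimal sketch (not the actual CIME code) in which the build step and the clean step derive the object directory from one helper, so they cannot drift apart. The function names and the `<exeroot>/<component>/obj` layout are assumptions made for the example.

```python
import os


def component_objdir(exeroot, component):
    """Return the object directory for a component.

    Assumed layout for this sketch: <exeroot>/<component>/obj.
    """
    return os.path.join(exeroot, component, "obj")


def build_target(exeroot, component):
    # Build output goes wherever component_objdir says.
    return component_objdir(exeroot, component)


def clean_target(exeroot, component):
    # Clean removes the same directory that the build wrote to.
    return component_objdir(exeroot, component)
```

With both steps calling the same helper, a change of directory name (for example from `cesm` to `cpl`) only has to be made in one place.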


Jim Edwards

CESM Software Engineer
National Center for Atmospheric Research
Boulder, CO

@jgfouca
Contributor

jgfouca commented Feb 29, 2016

Hi @jedwards4b , I tried your fix and I'm still getting the same segfault that I was getting before. I'm gathering some additional info now.

@jgfouca
Contributor

jgfouca commented Feb 29, 2016

Hi @jedwards4b ,
I'm still stuck on this but I've created some tools that can help us debug this issue. One technique that I've used in the past with some success is to make a "baseline" test case, make some changes to the scripts, make another test case, then do a recursive diff on the two cases, hopefully identifying any changes in behavior of the scripts. This technique requires extra work in the sense that lots of things like timestamp and test-id differences can cause huge numbers of uninteresting diffs. I had a process of getting rid of these differences that I've formalized and implemented with a new "normalize_cases" tool that I've added to scripts/Tools. I've also added a "case_diff" tool for doing the recursive diff.

The process of using these new tools for this problem in particular is as follows:
```
% ./create_test NCK_Ld3.f45_g37_rx1.A -t jgf_works --no-build
% ./create_test NCK_Ld3.f45_g37_rx1.A -t jgf_broken --no-run
% * cd to test_root *
% cd NCK_Ld3.f45_g37_rx1.A.melvin_gnu.jgf_works
% ./case.test_build
% cd ..
% normalize_cases NCK_Ld3.f45_g37_rx1.A.melvin_gnu.jgf_broken NCK_Ld3.f45_g37_rx1.A.melvin_gnu.jgf_works
% case_diff NCK_Ld3.f45_g37_rx1.A.melvin_gnu.jgf_broken NCK_Ld3.f45_g37_rx1.A.melvin_gnu.jgf_works
```
The idea here is that if the cases have significant differences in XML, namelists, or build logs, then we've failed at getting the two-phase build to work like the one-phase build. I am seeing such diffs unfortunately.
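
The normalize-then-diff idea can be sketched roughly as follows. This is not the implementation of `normalize_cases` or `case_diff`; it only shows the technique on two strings. The timestamp pattern and the placeholder names are assumptions, and the real tools may strip different things.

```python
import difflib
import re

# Hypothetical pattern for run-specific noise (e.g. "2016-02-29 10:00:00").
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")


def normalize(text, test_id):
    """Replace timestamps and the test id with fixed placeholders."""
    text = TIMESTAMP.sub("<TIMESTAMP>", text)
    return text.replace(test_id, "<TESTID>")


def diff_texts(a, a_id, b, b_id):
    """Return unified-diff lines between two normalized texts."""
    a_lines = normalize(a, a_id).splitlines()
    b_lines = normalize(b, b_id).splitlines()
    return list(difflib.unified_diff(a_lines, b_lines, "a", "b", lineterm=""))


if __name__ == "__main__":
    works = "case jgf_works built 2016-02-29 10:00:00\nobj=cpl/obj"
    broken = "case jgf_broken built 2016-02-29 11:30:05\nobj=cesm/obj"
    # Only the obj= line should differ once ids and timestamps are masked.
    for line in diff_texts(works, "jgf_works", broken, "jgf_broken"):
        print(line)
```

After normalization, any diff lines that remain point at real behavioral differences rather than at test ids or timestamps.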

Let me know what you think! I've pushed these new tools to the same branch on which I'm doing the two-phase work.

jayeshkrishna pushed a commit that referenced this pull request Aug 16, 2016
now MPI_abort does not overwrite ret_val