-
Notifications
You must be signed in to change notification settings - Fork 119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix for conda on gaea #428
Conversation
ush/load_modules_run_task.sh
Outdated
if [ $machine = "gaea" ]; then | ||
conda deactivate | ||
conda activate ${SRW_ENV} | ||
else | ||
conda activate ${SRW_ENV} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We had a problem on Gaea and Jenkins, which is why this activate->deactivate->activate
route is used specifically for Gaea. Please take a look at this discussion:
#404 (comment)
Maybe something has changed since moving to Lua, but pretty sure it was working with Jenkins as it was a couple of days ago.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, Daniel. The deactivate worked for me when I tried it without first activating an environment, but I am looking into this further now. I think that we may not need the deactivate at all, actually. The problem I was trying to fix on my end was because I had an old conda init stanza in my .bash_profile that was confusing the new conda. I will see if something similar is messing up the Jenkins job.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mark-a-potts Maybe there is a way to make gaea consistent with other systems (one conda activate) once EPIC installs are used in PR #419 ? I also don't like the hack either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interestingly, the Jenkins CI pipeline for Gaea successfully ran through to completion for this PR. If it was possible to remove the conda deactivate
entirely, that would be great, but it might need to wait until after the updated miniconda PRs come in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MichaelLueken Indeed. I was under the impression that the conda deactivate
will be a no-op before conda activate $SRW
is executed, but it looks like it does deactivate the conda from the parent shell. I guess this is one less command for gaea but would be desirable to get rid of the conda deactivate
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have tested a little more this afternoon, and I can't quite figure out what is going on. When I load wflow_gaea by hand, the python path is right and conda is known to be a function, but when I run a step with rocoto, the load_modules_and_run.sh script does not know that conda is a function and so the conda activate line does not work without a conda deactivate first.
You are probably right @MichaelLueken, that we should test again once the miniconda PRs come through.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some more testing today and think I have found the problem, though I am not entirely sure why it is a problem. It seems that the extra module load for miniconda3 that happens in the modulefiles/tasks/gaea/*.lua files adds an extra level of conda stuff that messes things up. I removed all but the vx modulefile under gaea and everything seems to be working correctly without a need to deactivate the conda environment. I don't quite understand why the extra conda modules seem to be okay on other platforms, but not gaea, but I think this is a bit better of a solution than the deactivate workaround. I will re-open this and push the latest updates I have made.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this is happening on all systems now after transition to EPIC modules with @natalie-perlin 's PR?
#431 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that it is currently not activating conda in load_modules_run_task.sh
because SRW_ENV
is not set for it. So gaea is still an outlier compared to other Tier-1 systems. I wish there was a better solution that switches the python paths to */envs/regional_workflow/bin
instead of */bin
which is the root cause of the problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since it appears as though @natalie-perlin's PR #431 is having the same issue for all machines, merging this PR would likely break what Natalie has done. Would it be best to open a PR in Natalie's fork to include these changes in PR #431 and then ask her to make similar modifications to the local modulefiles for machines that aren't Gaea (since Mark has already done this for Gaea)?
…r of conda initialization
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mark-a-potts I have a question about the addition of the login enabling for ush/load_modules_run_task.sh
. Is making this script login enabled required? This appears to go against the code manager's guide for shell scripting in the SRW.
@@ -1,4 +1,4 @@ | |||
#!/bin/bash | |||
#!/bin/bash -l |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the bash script need to be login enabled in order for this work to function properly? According to the code manager's guide, all bash scripts must explicitly be #!/bin/bash
scripts (and not login enabled).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, good catch! I don't think that is necessary. I added it to see if I could figure out what was going on. I will remove it on my side and run a quick test to be sure.
@mark-a-potts All of the Jenkins tests successfully ran and completed on all machines. Now, we just need to figure out how to best get both these changes and @natalie-perlin's PR #431 in without adversely affecting each. |
@mark-a-potts @danielabdi-noaa @MichaelLueken : @EdwardSnyder-NOAA recently identified this same conda activate/deactivate issue with the workflow when we were testing things for an update to all the noaacloud module files (PR forthcoming from https://github.com/ulmononian/ufs-srweather-app/tree/feature/update_cloud_mods). we locally added an OR statement to include the noaacloud platform in this line https://github.com/ulmononian/ufs-srweather-app/blob/2f27a09f3a99cba6c582ebfd2607cf581f487847/ush/load_modules_run_task.sh#L187, but given the solution in this PR, i am wondering if we need to address the cloud module files in a similar way? |
Sorry, I have been stuck in meetings all afternoon. I will touch base with @natalie-perlin tomorrow and see if we can come up with a single solution. |
@ulmononian Good to know this behavior is present in noaacloud! I don't think removing the miniconda3 modulefile loading from the
All tasks finish even though miniconda3 module is not loaded by any of them.
I encounter So I don't think it is a good idea to remove miniconda3 from the task modulefiles. The philsophy behind task specific modulefiles is probably that each task loads the modules and packages it needs itself, not by the parent shell. |
@mark-a-potts Since PR #444 has been merged, is it safe to close this PR now? |
DESCRIPTION OF CHANGES:
This fixes a glitch with conda on gaea. In order to get the correct path set for python after activating the conda environment, "conda deactivate" must be run first.
Type of change
TESTS CONDUCTED:
ISSUE:
CHECKLIST
LABELS (optional):
A Code Manager needs to add the following labels to this PR: