Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix for conda on gaea #428

Closed

Conversation

mark-a-potts
Copy link
Collaborator

DESCRIPTION OF CHANGES:

This fixes a glitch with conda on gaea. In order to get the correct path set for python after activating the conda environment, "conda deactivate" must be run first.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

TESTS CONDUCTED:

  • hera.intel
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)

ISSUE:

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

LABELS (optional):

A Code Manager needs to add the following labels to this PR:

  • bug

@MichaelLueken MichaelLueken added bug Something isn't working run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests labels Oct 25, 2022
if [ $machine = "gaea" ]; then
conda deactivate
conda activate ${SRW_ENV}
else
conda activate ${SRW_ENV}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had a problem on Gaea and Jenkins, which is why this activate->deactivate->activate route is used specifically for Gaea. Please take a look at this discussion:
#404 (comment)
Maybe something has changed since moving to Lua, but pretty sure it was working with Jenkins as it was a couple of days ago.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Daniel. The deactivate worked for me when I tried it without first activating an environment, but I am looking into this further now. I think that we may not need the deactivate at all, actually. The problem I was trying to fix on my end was because I had an old conda init stanza in my .bash_profile that was confusing the new conda. I will see if something similar is messing up the Jenkins job.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mark-a-potts Maybe there is a way to make gaea consistent with other systems (one conda activate) once EPIC installs are used in PR #419 ? I also don't like the hack either.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly, the Jenkins CI pipeline for Gaea successfully ran through to completion for this PR. If it was possible to remove the conda deactivate entirely, that would be great, but it might need to wait until after the updated miniconda PRs come in.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MichaelLueken Indeed. I was under the impression that the conda deactivate will be a no-op before conda activate $SRW is executed, but it looks like it does deactivate the conda from the parent shell. I guess this is one less command for gaea but would be desirable to get rid of the conda deactivate as well.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tested a little more this afternoon, and I can't quite figure out what is going on. When I load wflow_gaea by hand, the python path is right and conda is known to be a function, but when I run a step with rocoto, the load_modules_and_run.sh script does not know that conda is a function and so the conda activate line does not work without a conda deactivate first.

You are probably right @MichaelLueken, that we should test again once the miniconda PRs come through.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some more testing today and think I have found the problem, though I am not entirely sure why it is a problem. It seems that the extra module load for miniconda3 that happens in the modulefiles/tasks/gaea/*.lua files adds an extra level of conda stuff that messes things up. I removed all but the vx modulefile under gaea and everything seems to be working correctly without a need to deactivate the conda environment. I don't quite understand why the extra conda modules seem to be okay on other platforms, but not gaea, but I think this is a bit better of a solution than the deactivate workaround. I will re-open this and push the latest updates I have made.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this is happening on all systems now after transition to EPIC modules with @natalie-perlin 's PR?
#431 (comment)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that it is currently not activating conda in load_modules_run_task.sh because SRW_ENV is not set for it. So gaea is still an outlier compared to other Tier-1 systems. I wish there was a better solution that switches the python paths to */envs/regional_workflow/bin instead of */bin which is the root cause of the problem.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it appears as though @natalie-perlin's PR #431 is having the same issue for all machines, merging this PR would likely break what Natalie has done. Would it be best to open a PR in Natalie's fork to include these changes in PR #431 and then ask her to make similar modifications to the local modulefiles for machines that aren't Gaea (since Mark has already done this for Gaea)?

Copy link
Collaborator

@MichaelLueken MichaelLueken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mark-a-potts I have a question about the addition of the login enabling for ush/load_modules_run_task.sh. Is making this script login enabled required? This appears to go against the code manager's guide for shell scripting in the SRW.

@@ -1,4 +1,4 @@
#!/bin/bash
#!/bin/bash -l
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the bash script need to be login enabled in order for this work to function properly? According to the code manager's guide, all bash scripts must explicitly be #!/bin/bash scripts (and not login enabled).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, good catch! I don't think that is necessary. I added it to see if I could figure out what was going on. I will remove it on my side and run a quick test to be sure.

@MichaelLueken
Copy link
Collaborator

@mark-a-potts All of the Jenkins tests successfully ran and completed on all machines. Now, we just need to figure out how to best get both these changes and @natalie-perlin's PR #431 in without adversely affecting each.

@ulmononian
Copy link
Collaborator

ulmononian commented Oct 26, 2022

@mark-a-potts @danielabdi-noaa @MichaelLueken : @EdwardSnyder-NOAA recently identified this same conda activate/deactivate issue with the workflow when we were testing things for an update to all the noaacloud module files (PR forthcoming from https://github.com/ulmononian/ufs-srweather-app/tree/feature/update_cloud_mods). we locally added an OR statement to include the noaacloud platform in this line https://github.com/ulmononian/ufs-srweather-app/blob/2f27a09f3a99cba6c582ebfd2607cf581f487847/ush/load_modules_run_task.sh#L187, but given the solution in this PR, i am wondering if we need to address the cloud module files in a similar way?

@mark-a-potts
Copy link
Collaborator Author

Sorry, I have been stuck in meetings all afternoon. I will touch base with @natalie-perlin tomorrow and see if we can come up with a single solution.

@danielabdi-noaa
Copy link
Collaborator

danielabdi-noaa commented Oct 26, 2022

@ulmononian Good to know this behavior is present in noaacloud! I don't think removing the miniconda3 modulefile loading from the modulefiles/tasks/machine like this PR does is a good idea. I believe the reason it is working here is because it is relying on the conda activate from the parent shell. Also both CIs (jenkins & github actions) do the same before launching tasks so the problem is masked away. I can reproduce this on Hera after removing miniconda3 modulefile loading from modulefiles/tasks/hera/miniconda_regional_worfklow.lua and running a test case

  • with conda activated from command linek
(regional_workflow) Daniel.Abdi@hfe02

All tasks finish even though miniconda3 module is not loaded by any of them.

  • conda not activated (prompt without regional_workflow)
Daniel.Abdi@hfe02

I encounter ModuleImportError on yaml simlar to what we saw on gaea before the fix.

So I don't think it is a good idea to remove miniconda3 from the task modulefiles. The philsophy behind task specific modulefiles is probably that each task loads the modules and packages it needs itself, not by the parent shell.
If we do want to rely on this behaviour, it is best to make the change to all task modulefiles -- it does simplify the code afterall
but we want to be sure we are not breaking something else.

@ulmononian ulmononian mentioned this pull request Oct 26, 2022
30 tasks
@MichaelLueken
Copy link
Collaborator

@mark-a-potts Since PR #444 has been merged, is it safe to close this PR now?

@MichaelLueken
Copy link
Collaborator

Closing PR #428 since the necessary changes have already been merged in via PR #444.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants