Issues with new EPIC modulefiles #458

Closed
danielabdi-noaa opened this issue Nov 7, 2022 · 12 comments
Labels
bug Something isn't working

Comments

@danielabdi-noaa
Collaborator

danielabdi-noaa commented Nov 7, 2022

Expected behavior

  • Test cases should run properly whether conda is activated externally or not
  • Any user should be able to load the modulefiles used by SRW on Orion. Currently, the modulefiles are not accessible to non-EPIC users

Current behavior

  • On Hera and Jet, running test cases with setup_WE2E_tests.sh fails if conda is already activated
  • On Orion, the wflow_orion module can no longer be loaded

Machines affected

Hera, Jet and Orion

Steps To Reproduce

a) On Orion, I can no longer load the wflow_orion module.

> ls /work/noaa/epic-ps/role-epic-ps/miniconda3/modulefiles
ls: cannot access /work/noaa/epic-ps/role-epic-ps/miniconda3/modulefiles: Permission denied

b) On Hera, if the user has already activated conda on the command line, running test cases fails

(regional_workflow) Daniel.Abdi@hfe01 WE2E $ ./setup_WE2E_tests.sh hera zrtrr intel custom                                                                                                                  
Modules based on Lua: Version 8.5.2  2021-05-12 12:44 -05:00
    by Robert McLay [email protected]


Currently Loaded Modules:
  1) rocoto/1.3.3   2) miniconda3/4.12.0   3) wflow_hera

 

ERROR:root:
Error: Missing python package required by the SRW app

ERROR:root:
*************************************************************************
FATAL ERROR:
The system does not meet minimum requirements for running the SRW app.
Instructions for setting up python environments can be found on the web:
https://github.com/ufs-community/ufs-srweather-app/wiki/Getting-Started
*************************************************************************

Traceback (most recent call last):
  File "/scratch2/BMC/gsd-hpcs/Daniel.Abdi/ufs-srweather-app/ush/check_python_version.py", line 43, in <module>
    check_python_version()
  File "/scratch2/BMC/gsd-hpcs/Daniel.Abdi/ufs-srweather-app/ush/check_python_version.py", line 15, in check_python_version
    import jinja2
ModuleNotFoundError: No module named 'jinja2'

Previously, it worked whether or not conda was activated.
This looks similar to the issue solved by --export=NONE, but this time we need to prevent the user's environment from being exported to the testing script.

@MichaelLueken
Collaborator

@danielabdi-noaa I'm able to replicate the same behavior you are encountering on Hera. The only way to submit the .cicd/scripts/srw_test.sh run script, while already having the regional_workflow conda environment loaded, is to use conda deactivate. Have you encountered this behavior on other machines following the merging of PR #444 into develop?
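The `conda deactivate` workaround can be turned into a pre-flight check. A minimal sketch, assuming only that conda exports `CONDA_DEFAULT_ENV` while an environment is active; the function name `conda_env_active` is hypothetical and is not part of the SRW scripts:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a conda environment appears to be active
# in the calling shell (conda exports CONDA_DEFAULT_ENV on activation).
conda_env_active() {
  [ -n "${CONDA_DEFAULT_ENV:-}" ]
}

# Simulate a shell where regional_workflow was activated beforehand.
CONDA_DEFAULT_ENV=regional_workflow
if conda_env_active; then with_env="active"; else with_env="inactive"; fi

# Simulate the state after `conda deactivate`.
unset CONDA_DEFAULT_ENV
if conda_env_active; then without_env="active"; else without_env="inactive"; fi

echo "with env: $with_env, after deactivate: $without_env"
```

A launcher could call such a check first and print a hint to run `conda deactivate` instead of failing later on a missing Python package.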

Looking at the old TCL modulefile for 4.5.12, I see:

module unload miniconda3

before anything else is done. I don't see this in the new Lua modulefile for 4.12.0. Could this be the reason that this worked before, but not after moving to the EPIC maintained stack?

While I'm unable to replicate what you are encountering on Orion, I think I have identified the issue. Looking at /work/noaa/epic-ps/role-epic-ps/ on Orion, I see:

drwxr-s--- 2 role-epic-ps epic-ps 4096 Nov  3 09:47 containers
drwxr-s--- 4 role-epic-ps epic-ps 4096 Oct  6 14:29 hpc-stack
drwxr-s--- 6 role-epic-ps epic-ps 4096 Aug 10 11:25 jenkins
drwxr-s--- 6 role-epic-ps epic-ps 4096 Oct  6 22:19 miniconda3
drwxr-s--- 4 role-epic-ps epic-ps 4096 Oct 18 09:56 sandbox

No permissions are set for users who aren't using the EPIC role account or don't have access to the EPIC account. This is also an issue at the epic-ps level:

drwxr-s--- 7 role-epic-ps epic-ps 4096 Oct 6 22:47 role-epic-ps

Tagging @natalie-perlin to see if she can change the permission of role-epic-ps and miniconda3 on Orion from:

drwxr-s--- 6 role-epic-ps epic-ps 4096 Oct 6 22:19 miniconda3

to

drwxr-sr-x 6 role-epic-ps epic-ps 4096 Oct 6 22:19 miniconda3

and to see if she knows why the use of setup_WE2E_tests.sh doesn't work while the regional_workflow conda environment is loaded on Hera.
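The listings above point at a general rule: other users can reach a directory only if every ancestor grants them access too. A minimal sketch of that check, assuming GNU `stat` and a throwaway tree created with `mktemp`; the function name `others_can_reach` is hypothetical, and ACLs and the setgid bit are ignored:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if "others" have r-x on every directory
# from $2 (the target) up to and including $1 (the top of the tree).
others_can_reach() {
  local top=$1 dir=$2 mode
  while :; do
    mode=$(stat -c '%a' "$dir") || return 2      # GNU stat: octal mode, e.g. 2750
    # The last octal digit holds the "others" bits; r (4) + x (1) = 5.
    if (( (8#${mode: -1} & 5) != 5 )); then
      echo "blocked at: $dir (mode $mode)"
      return 1
    fi
    if [ "$dir" = "$top" ] || [ "$dir" = "/" ]; then return 0; fi
    dir=$(dirname "$dir")
  done
}

# Throwaway tree that only mirrors the directory names discussed above.
demo=$(mktemp -d)
mkdir -p "$demo/role-epic-ps/miniconda3/modulefiles"
chmod -R 755 "$demo/role-epic-ps"

chmod 750 "$demo/role-epic-ps"                   # parent closed to others
if others_can_reach "$demo/role-epic-ps" "$demo/role-epic-ps/miniconda3/modulefiles"; then
  before=ok
else
  before=blocked
fi

chmod o+rx "$demo/role-epic-ps"                  # open the parent as well
if others_can_reach "$demo/role-epic-ps" "$demo/role-epic-ps/miniconda3/modulefiles"; then
  after=ok
else
  after=blocked
fi

echo "before: $before, after: $after"
rm -rf "$demo"
```

In this sketch the subdirectories are open the whole time, yet the target only becomes reachable once the parent is opened, which is why both role-epic-ps and miniconda3 are named in the request.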

@danielabdi-noaa
Collaborator Author

@MichaelLueken I just tried Jet and the same thing happens there too, so I have updated the list of affected machines. Note that this behaviour did not appear after the transition to Lua modulefiles, so I don't think it is related to TCL vs Lua modulefiles. If I go back one commit before #444, everything seems to work. The problem is exactly the same as before: it is using the wrong python3.
I added this

which python3
echo $PATH

to the run_WE2E script just before the Python check, and here is what I got:

/mnt/lfs4/HFIP/hfv3gfs/role.epic/miniconda3/4.12.0/bin/python3

The PATH has envs/regional_workflow/bin coming after the base miniconda3 bin directory, which is the problem.

/mnt/lfs4/HFIP/hfv3gfs/role.epic/miniconda3/4.12.0/condabin:/mnt/lfs4/HFIP/hfv3gfs/role.epic/miniconda3/4.12.0/bin:/apps/rocoto/1.3.3/bin:/apps/local/bin:/mnt/lfs4/HFIP/hfv3gfs/role.epic/miniconda3/4.12.0/envs/regional_workflow/bin:/lfs4/HFIP/hfv3gfs/nwprod/hpc-stack/libs/intel-2022.1.2/nccmp/1.8.9.0/bin:/lfs4/HFIP/hfv3gfs/nwprod/hpc-stack/libs/intel-2022.1.2/prod_util/1.2.2/bin: ....

I think the conda activate-deactivate trick may solve it but I wish there was a better solution.
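The ordering problem can be checked mechanically. A minimal sketch that reports where two directories sit in a PATH-like string; the function name `path_index` is hypothetical, and the sample value is just the first five entries of the PATH printed above:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the zero-based position of directory $2 within
# the colon-separated list $1, or -1 if it is absent.
path_index() {
  local list=$1 want=$2 i=0 entry
  local IFS=:
  for entry in $list; do
    if [ "$entry" = "$want" ]; then
      echo "$i"
      return 0
    fi
    i=$((i + 1))
  done
  echo -1
}

base=/mnt/lfs4/HFIP/hfv3gfs/role.epic/miniconda3/4.12.0
sample_path="$base/condabin:$base/bin:/apps/rocoto/1.3.3/bin:/apps/local/bin:$base/envs/regional_workflow/bin"

base_pos=$(path_index "$sample_path" "$base/bin")
env_pos=$(path_index "$sample_path" "$base/envs/regional_workflow/bin")

# The shell resolves python3 from the earliest matching PATH entry.
if [ "$env_pos" -gt "$base_pos" ]; then
  echo "base bin (position $base_pos) shadows regional_workflow bin (position $env_pos)"
fi
```

Because the base `bin` directory comes first, `which python3` resolves to the base interpreter, which lacks the packages installed in the regional_workflow environment.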

@MichaelLueken
Collaborator

@danielabdi-noaa Sorry about that. When I was speaking about Lua vs TCL, I meant the miniconda3 modulefile, not the wflow_* modulefiles. In /contrib/miniconda3/modulefiles/miniconda3/4.5.12, there is:

set ver  "4.5.12"
set base "/contrib/miniconda3/$ver"

set shell [module-info shelltype]

module unload miniconda3

Before miniconda3 is even loaded, there is an unload. I'm wondering if this is why there weren't issues on Hera and Jet previously: the modulefile ensured that any previously loaded miniconda3 was unloaded before the new one was loaded.

If this is done in /scratch1/NCEPDEV/nems/role.epic/miniconda3/modulefiles/miniconda3/4.12.0.lua, i.e. adding:

unload("miniconda3")

between:

conflict("intelpython")

local prefix = pathJoin("/scratch1/NCEPDEV/nems/role.epic",pkgName)

would the expected behavior return?

@danielabdi-noaa
Collaborator Author

@MichaelLueken I see. I haven't looked at the miniconda3 modulefile itself, but I agree that a modification there could potentially solve the problem. I quickly tried adding unload("miniconda3") before loading it in wflow_hera, but it doesn't seem to help.

@danielabdi-noaa
Collaborator Author

@MichaelLueken Modifying setup_WE2E_tests.sh as follows seems to work for me:

-#!/bin/bash -l
+#!/usr/bin/env bash
+[ -n "$HOME" ] && exec -c "$0" "$@"
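For context, `exec -c` is the bash builtin form that runs a command with an empty environment, and the `[ -n "$HOME" ]` test appears intended to stop the script from re-executing itself a second time once HOME is no longer inherited. The effect of a cleared environment can be seen with a minimal sketch that uses `env -i` as a stand-in, assuming `bash` is on the PATH; the variable names are only for illustration:

```shell
#!/usr/bin/env bash
# Simulate an interactive shell in which conda has been activated.
export CONDA_DEFAULT_ENV=regional_workflow

bash_bin=$(command -v bash)

# A normal child process inherits the exported variable.
inherited=$("$bash_bin" -c 'echo "${CONDA_DEFAULT_ENV:-unset}"')

# env -i starts the child with an empty environment, similar in effect to
# re-executing the script through `exec -c`.
clean=$(env -i "$bash_bin" -c 'echo "${CONDA_DEFAULT_ENV:-unset}"')

echo "inherited: $inherited"
echo "clean:     $clean"
```

The trade-off is that everything else in the caller's environment is dropped as well, so the script has to set up whatever it needs (modules, PATH) on its own.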

@danielabdi-noaa danielabdi-noaa mentioned this issue Nov 9, 2022
37 tasks
@natalie-perlin
Collaborator

@danielabdi-noaa @MichaelLueken -
It looks like the issue could be avoided if the same python or miniconda is used for building the SRW and running the SRW, such as suggested in PR-431.
It was later replaced by the approach in PR-444, where an earlier version of the python/miniconda/miniconda3 modules was loaded in the ./modulefiles/build_<machine>_<compiler> module, and the newer miniconda3/4.12.0 was only used in ./modulefiles/wflow_<machine> and in the modules for individual tasks, ./modulefiles/tasks//.

The main issue with this approach is that the build_<machine>_<compiler> modules are always loaded when running the model or a task (see lines 110-111 in ./ush/load_modules_run_task.sh). Therefore, the earlier version of either python or miniconda3 always needs to be unloaded before running a task that requires activating a regional_workflow environment built for miniconda3/4.12.0.

@natalie-perlin
Collaborator

@MichaelLueken - permissions adjusted on Orion using setfacl, setting both mask and r-x permissions for 'others', for directories
/work/noaa/epic-ps/role-epic-ps/miniconda3 and
/work/noaa/epic-ps/role-epic-ps/hpc-stack.
They show the following (running getfacl . )

# file: .
# owner: role-epic-ps
# group: epic-ps
# flags: -s-
user::rwx
group::r-x
mask::r-x
other::r-x

@danielabdi-noaa
Collaborator Author

@natalie-perlin The role-epic-ps directory itself still does not seem to have permissions for others to read.

drwxr-s---   7 role-epic-ps epic-ps 4.0K Oct  6 22:47 role-epic-ps

@natalie-perlin
Collaborator

@danielabdi-noaa -
Fixed.
drwxr-sr-x+ 7 role-epic-ps epic-ps 4096 Oct 6 22:47 role-epic-ps

@danielabdi-noaa
Collaborator Author

danielabdi-noaa commented Nov 9, 2022

@natalie-perlin Thanks! It works for me now. An odd message was printed when I tried to activate conda, but it is most likely related to Orion being under maintenance, so it will probably go away afterwards.

@natalie-perlin
Collaborator

@danielabdi-noaa - what was the message? Was there any relation to the module or virtual environment?..

@danielabdi-noaa
Collaborator Author

@natalie-perlin It went away after a while. Orion has been very slow these days, so I suspect that was the cause.
Here is the message I got, but I am sure it is not related to the modulefiles:

[09:35 dabdi@Orion-login-1 ufs-srweather-app] > module list
No modules loaded
[09:35 dabdi@Orion-login-1 ufs-srweather-app] > module use modulefiles
[09:35 dabdi@Orion-login-1 ufs-srweather-app] > module load build_orion_intel
[09:35 dabdi@Orion-login-1 ufs-srweather-app] > module load wflow_orion
Please do the following to activate conda:
       > conda activate regional_workflow
[09:36 dabdi@Orion-login-1 ufs-srweather-app] > conda activate regional_workflow
df: ‘/work/noaa/gsd-hpcs/Daniel.Abdi/ufs-srweather-app’: Cannot send after transport endpoint shutdown
-bash: [: -gt: unary operator expected
-bash: [: -gt: unary operator expected
(regional_workflow) [09:43 dabdi@Orion-login-1 ufs-srweather-app] > conda deactivate
^C^Z^Z^Z^Z



Fatal Python error: init_import_site: Failed to import the site module
Python runtime state: initialized
Traceback (most recent call last):
  File "/work/noaa/epic-ps/role-epic-ps/miniconda3/4.12.0/lib/python3.9/site.py", line 589, in <module>
    main()
  File "/work/noaa/epic-ps/role-epic-ps/miniconda3/4.12.0/lib/python3.9/site.py", line 576, in main
    known_paths = addsitepackages(known_paths)
  File "/work/noaa/epic-ps/role-epic-ps/miniconda3/4.12.0/lib/python3.9/site.py", line 359, in addsitepackages
    addsitedir(sitedir, known_paths)
  File "/work/noaa/epic-ps/role-epic-ps/miniconda3/4.12.0/lib/python3.9/site.py", line 208, in addsitedir
    addpackage(sitedir, name, known_paths)
  File "/work/noaa/epic-ps/role-epic-ps/miniconda3/4.12.0/lib/python3.9/site.py", line 169, in addpackage
    exec(line)
  File "<string>", line 1, in <module>
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 982, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 925, in _find_spec
  File "<frozen importlib._bootstrap_external>", line 1423, in find_spec
  File "<frozen importlib._bootstrap_external>", line 1395, in _get_spec
  File "<frozen importlib._bootstrap_external>", line 1526, in find_spec
  File "<frozen importlib._bootstrap_external>", line 1569, in _fill_cache
KeyboardInterrupt


3 participants