Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove ENV_INIT_SCRIPT and avoid dependency on system python #404

Merged

Conversation

danielabdi-noaa
Copy link
Collaborator

@danielabdi-noaa danielabdi-noaa commented Oct 7, 2022

DESCRIPTION OF CHANGES:

This PR addresses issues #401 and #402. The latter is addressed through the alternative solution described there.

Detailed changes:

  • removes ENV_INIT_SCRIPT_FP and apply the sourcing of /etc/profile directly in etc/lmod-setup.sh. There will only be one way to setup Lmod going forward.
  • Adds a central place to load worflow modules ush/load_modules_wflow.sh that is sourced by CIs run scripts
  • Activate conda on all systems besides Cheyenne so that we can use the python version in conda for crontab deletion.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

Run on Hera and Jet a test case successfully, but the most critical test is Gaea that does additional Lmod sourcing through scripts passed by ENV_INIT_SCRIPT_FP

  • hera.intel
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)

DEPENDENCIES:

DOCUMENTATION:

None

ISSUE:

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

LABELS (optional):

A Code Manager needs to add the following labels to this PR:

  • Work In Progress
  • bug
  • enhancement
  • documentation
  • release
  • high priority
  • run_ci
  • run_we2e_fundamental_tests
  • run_we2e_comprehensive_tests
  • Needs Cheyenne test
  • Needs Jet test
  • Needs Hera test
  • Needs Orion test
  • help wanted

CONTRIBUTORS (optional):

@danielabdi-noaa danielabdi-noaa added ci-hera-intel-WE Kicks off automated workflow test on hera with intel ci-jet-intel-WE Kicks off automated workflow test on jet with intel labels Oct 7, 2022
@venitahagerty venitahagerty removed ci-jet-intel-WE Kicks off automated workflow test on jet with intel ci-hera-intel-WE Kicks off automated workflow test on hera with intel labels Oct 7, 2022
@venitahagerty
Copy link
Collaborator

venitahagerty commented Oct 7, 2022

Machine: hera
Compiler: intel
Job: WE
Repo location: /scratch1/BMC/zrtrr/rrfs_ci/autoci/pr/1079855920/20221007042014/ufs-srweather-app
Build was Successful
Rocoto jobs started
Long term tracking will be done on 9 experiments
If test failed, please make changes and add the following label back:
ci-hera-intel-WE
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2
Experiment Succeeded on hera: nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_RRFS_v1beta
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR
All experiments completed

@venitahagerty
Copy link
Collaborator

venitahagerty commented Oct 7, 2022

Machine: jet
Compiler: intel
Job: WE
Repo location: /lfs1/BMC/nrtrr/rrfs_ci/autoci/pr/1079855920/20221007042011/ufs-srweather-app
Build was Successful
Rocoto jobs started
Long term tracking will be done on 9 experiments
If test failed, please make changes and add the following label back:
ci-jet-intel-WE
Experiment Succeeded on jet: nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
Experiment Succeeded on jet: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2
Experiment Succeeded on jet: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
Experiment Succeeded on jet: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_RRFS_v1beta
Experiment Succeeded on jet: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR

Copy link
Collaborator

@MichaelLueken MichaelLueken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danielabdi-noaa Should the TOPO_DIR and SFC_CLIMO_INPUT_DIR entries be moved to task_make_orog and task_make_sfc_climo, in line with the rest of the machine yaml files?

ush/machine/orion.yaml Show resolved Hide resolved
Copy link
Collaborator

@MichaelLueken MichaelLueken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for addressing my concerns. Approving this PR and launching Jenkins CI.

@MichaelLueken MichaelLueken added the run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests label Oct 10, 2022
@MichaelLueken
Copy link
Collaborator

@danielabdi-noaa Unfortunately, this PR is still failing to build on Gaea through the Jenkins CI pipeline. Any ideas as to why this might be happening?

@danielabdi-noaa
Copy link
Collaborator Author

@MichaelLueken I am testing build manually on gaea and it doesn't seem to fail for me. I thought it maybe an issue with set -u being set but doesn't seem to help for the CI. I will let you know when i found it more.

@MichaelLueken
Copy link
Collaborator

@danielabdi-noaa It seems as though your modifications at 15986db have corrected the issue with the build failing on Gaea. Since PR #407 has been merged, if the feature/remove_env_script is updated to the HEAD, do you think that the WE2E tests will pass on Gaea?

@danielabdi-noaa
Copy link
Collaborator Author

@MichaelLueken The issue on gaea in this PR seems to be with modulefiles. Depending on how conda activated yaml may not be present the python3 version. If you do, which python3 on gaea, you will see this:

python3 is /lustre/f2/dev/role.epic/contrib/miniconda3/4.12.0/envs/regional_workflow/bin/python3
python3 is /lustre/f2/dev/role.epic/contrib/miniconda3/4.12.0/bin/python3
python3 is /usr/bin/python3

The firs one is the one we need and has yaml. The second one has same python3 version but has not yaml package.
The last one is system python is an older version. If PATH contains the path to the first python first, it will work fine. However, if you have activated conda outside (command line you are using), and then load_modules_run_task.sh activates conda then somehow (need to figure out why and fix it) the second python version is listed first on PATH and it won't find yaml

Python 3.9.12 (main, Apr  5 2022, 06:56:58) 
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import yaml
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'yaml'
>>> 

I will try and figure out a way around this later.

@MichaelLueken MichaelLueken added bug Something isn't working and removed bug Something isn't working labels Oct 12, 2022
@danielabdi-noaa
Copy link
Collaborator Author

@MichaelLueken I thought the same but I don't it will work since there is an additional issue with modulefiles on gaea that this PR uncovers. The miniconda3 modulefile is written in Lua and is different from other systems that have miniconda3 modulefile written in TCL. It maybe be better to move to Lua format but i know in the past I've encountered issues with mixing TCL/Lua modulefiles. That is actually the reason why we can't put "conda activate" inside a modulefile. The Lua miniconda modulefile is putting both "bin" paths that contain two pythons and we may have to modify the Lua modulefile if I can't find a workaround for it.

@MichaelLueken
Copy link
Collaborator

@danielabdi-noaa I'm seeing some interesting behavior between the various RDHPCS machines with respect to what happens when attempting to load a task modulefile while having the modulefile/wflow_machine modulefile loaded first.

On Hera, using module load wflow_hera followed by conda activate regional_workflow will load the correct version of python3. When loading the task modulefile - module use $PWD/modulefiles/tasks/hera followed by module load get_extrn_ics.local, the conda environment is deactivated, allowing conda activate regional_workflow to properly activate the conda environment once again.

Obviously, this isn't the case with Gaea. On this machine, using module load wflow_gaea followed by conda activate regional_workflow will load the correct version of python3. However, when loading the task modulefile - module use $PWD/modulefiles/tasks/gaea followed by module load get_extrn_ics.local, the conda environment is not deactivated. However, it appears as though the proper conda environment should still be loaded with conda activate regional_workflow after the module load.

Would adding conda deactivate before loading the task modulefile in ush/load_modules_run_task.sh potentially correct the issue on Gaea?

@danielabdi-noaa
Copy link
Collaborator Author

danielabdi-noaa commented Oct 12, 2022

@MichaelLueken Yes, you've captured the essense of the problem. I have thought about conda deactivate but it will not work the first time it is tried in load_modules_run_task. What seems to work for me is to activate conda first, deactivate it, and then activate it again.

conda activate regional_worfkflow
conda deactivate
conda activate regional_workflow

The correct python3 path will be pre-pended to $PATH and we get the right python3 this way. On Hera and others the first conda activate works fine. This could be a side effect of Lmod/TCL modulefiles mixup.

@danielabdi-noaa danielabdi-noaa force-pushed the feature/remove_env_script branch from 15986db to 7dd9ecd Compare October 12, 2022 19:23
@danielabdi-noaa
Copy link
Collaborator Author

@MichaelLueken I've pushed this change now. Since I can not test it myself, please re-run jenkins when you get the chance?

@panll
Copy link
Collaborator

panll commented Oct 13, 2022

Looks good to me!

@MichaelLueken MichaelLueken merged commit 806105b into ufs-community:develop Oct 13, 2022
@danielabdi-noaa danielabdi-noaa mentioned this pull request Oct 25, 2022
21 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants