Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rerunning regression test by submitting job_card in the run directory fails #2247

Closed
DusanJovic-NOAA opened this issue Apr 22, 2024 · 12 comments
Labels
bug Something isn't working

Comments

@DusanJovic-NOAA
Copy link
Collaborator

Description

During the development and/or debugging I often need to keep rerunning the same test over and over again. The simplest way of doing this is to run the desired regression test once, save the run directory, move that directory to somewhere where it will not be purged (noscrub directory for example) and then submit the job card from that run directory. This workflow is currently broken after the last PR that cleaned up rt.sh.

To Reproduce:

  1. Run one regression test, for example ./rt.sh -n 'control_p8 intel' -k
  2. Move the run directory to different location, mv ../rt_xxxxxxxx/control_P8_intel /new/location/control_p8_intel
  3. Go to the run directory, cd /new/location/control_p8_intel
  4. Submit the job sbatch job_card

This will first fail with the error: MACHINE_ID: unbound variable. This happens because MACHINE_ID=hera is commented out in the job_card.

https://github.com/ufs-community/ufs-weather-model/blob/develop/tests/fv3_conf/fv3_slurm.IN_hera#L17-L23

After uncommenting that line and resubmitting job card it fails again with:

++ date +%s
+ date_s_start=1713801710
+ echo -n 1713801710,
+ set +x
Lmod has detected the following error: The following module(s) are unknown:
"modules.fv3"

Please check the spelling or version number. Also try "module spider ..."
It is also possible your cache file is out-of-date; it may help to try:
  $ module --ignore_cache load "modules.fv3"

Also make sure that all modulefiles written in TCL start with the string
#%Module

This error happens because in the job_card the module use location is hard-coded to the original run directory:

MACHINE_ID=hera
source ./module-setup.sh
module use "/scratch1/NCEPDEV/stmp2/Dusan.Jovic/FV3_RT/rt_1691993/control_p8_intel/modulefiles"
module load modules.fv3
module list

I am not submitting this job from that original location, I moved the entire run directory to a different location. I see these job_card template changes are made only on Hera, all other job cards are untouched. Why?

Additional context

Output

@DusanJovic-NOAA DusanJovic-NOAA added the bug Something isn't working label Apr 22, 2024
@BrianCurtis-NOAA
Copy link
Collaborator

@DusanJovic-NOAA looks like I made a change from $PWD/modulefiles to @[PWD]/modulefiles which should not have happened. If you change fv3_slurm.IN_hera to use $ instead of @[] for line 20 I think that would solve this issue. I have to fix an issue using -l and -n together too that I will try to bring this fix into.

@BrianCurtis-NOAA
Copy link
Collaborator

@DusanJovic-NOAA
Copy link
Collaborator Author

Ok, that seems to work. Why is MACHINE_ID exportred, previously it wasn't? And it isn't in the other job card templates. Also srun like now looks like:

srun --label -n "160" ./fv3.exe

Why is number of tasks (-n option) quoted string, it should be an integer. In the job card template , this line:

srun --label -n @[TASKS] ./fv3.exe

has been changed to:

srun --label -n "@[TASKS]" ./fv3.exe

@DusanJovic-NOAA
Copy link
Collaborator Author

OMP_NUM_THREADS is now also quoted string, it should be an integer:

export OMP_NUM_THREADS="@[THRD]"

Why did you quote all these at-parse-placeholders? Only in hera templates, in all other templates they are as before.

@BrianCurtis-NOAA
Copy link
Collaborator

Actually, thanks for bringing that up. The quoted strings using the at-parse variables do not need the quotes, you're right. The hera job card is a test for switching all of those fv3/compile IN files to bash. Having one in development helps me see what mistakes this might be. Bash likes quoted variables but with using the at-parse it's not needed as bash doesn't see it as anything.

I'll get those quotes removed in that branch.

@BrianCurtis-NOAA
Copy link
Collaborator

Ok, that seems to work. Why is MACHINE_ID exportred, previously it wasn't? And it isn't in the other job card templates.

Bash likes variables that are not used directly in the script to be exported to make sure they are used externally properly. If the export is meaningless and the variable is not used in the script, then it should be removed.

@DusanJovic-NOAA
Copy link
Collaborator Author

Unrelated to the job card template issues, but now by default build.sh runs the make command with VERBOSE option turned on. And I can not turn it off now. The following change in build.sh, from:

OMP_NUM_THREADS=1 make -j ${BUILD_JOBS:-4} VERBOSE=${BUILD_VERBOSE:-}

to:

OMP_NUM_THREADS=1 make -j "${BUILD_JOBS:-4}" "VERBOSE=${BUILD_VERBOSE:-1}"

broke that. Please remove that 1 at the end, by default BUILD_VERBOSE should be empty.

And again -j make option accepts integers not quoted strings (BUILD_JOBS).

@DusanJovic-NOAA
Copy link
Collaborator Author

Ok, that seems to work. Why is MACHINE_ID exportred, previously it wasn't? And it isn't in the other job card templates.

Bash likes variables that are not used directly in the script to be exported to make sure they are used externally properly. If the export is meaningless and the variable is not used in the script, then it should be removed.

But in this case MACHINE_ID is used in this script, it is used by module-setup.sh, which is sourced. So MACHINE_ID does not have to be exported. Please remove export. After all it isn't exported in any other job card template, and everything works without any issues.

@BrianCurtis-NOAA
Copy link
Collaborator

I've removed the quoted at-parse variables.
I've removed the BUILD_VERBOSE extra 1
make -j option does accept quoted strings
For MACHINE_ID being exported in the job_card: What shellcheck expects is that if a variable is set in a shell script file, that it's also used in that same shell script file. It seems logical to me to export set variables into the environment if the script does not use that variable, and it's what shellcheck wants to see. This is consistent now across all shell scripts that shellcheck lints. It seems good practice. I am open to potential issues that might arise by doing this, if you have any.

@DusanJovic-NOAA
Copy link
Collaborator Author

Then why is MACHINE_ID not exported in other job card templates?

@BrianCurtis-NOAA
Copy link
Collaborator

Then why is MACHINE_ID not exported in other job card templates?

I haven't converted those to shellcheck, yet. Hera was a test to see if it had issues with the move from /bin/sh to /bin/bash and the changes that shellcheck wanted. I tested as much as I could on Hera but was not sure with using the at-parse if it would have any negative impacts for other projects. With Hera getting the most usage externally, it seemed a great place to introduce the change. The intent is to convert them all over if Hera does not see major issues.

@BrianCurtis-NOAA
Copy link
Collaborator

These were fixed in #2241

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants