Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update WE2E documentation #241

Merged
merged 5 commits into from
May 3, 2022
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
349 changes: 259 additions & 90 deletions docs/UsersGuide/source/WE2Etests.rst
Original file line number Diff line number Diff line change
Expand Up @@ -3,120 +3,289 @@
================================
Workflow End-to-End (WE2E) Tests
================================
The SRW Application's experiment generation system contains a set of end-to-end tests that
The SRW App's experiment generation system contains a set of end-to-end tests that
exercise various configurations of that system as well as those of the pre-processing,
UFS Weather Model, and UPP post-processing codes. The script to run these tests is named
``run_experiments.sh`` and is located in the directory ``ufs-srweather-app/regional_workflow/tests``.
A complete list of the available tests can be found in ``baselines_list.txt`` in that directory.
This list is extensive; it is not recommended to run all of the tests as some are computationally
expensive. A subset of the tests supported for this release of the SRW Application can be found
in the file ``testlist.release_public_v1.txt``.

The base experiment configuration file for each test is located in the ``baseline_configs``
subdirectory. Each file is named ``config.${expt_name}.sh``, where ``${expt_name}`` is the
name of the corresponding test configuration. These base configuration files are subsets of
UFS Weather Model, UPP post-processing, and MET/METPlus verification scripts and codes.
These are referred to as workflow end-to-end (WE2E) tests. The script to run these tests is named
``run_WE2E_tests.sh`` and is located in the directory ``ufs-srweather-app/regional_workflow/tests/WE2E``.

Each WE2E test has associated with it a configuration file named ``config.${test_name}.sh``,
where ``${test_name}`` is the name of the corresponding test.
These configuration files are subsets of
the full ``config.sh`` experiment configuration file used in :numref:`Section %s <SetUpConfigFileC>`
and described in :numref:`Section %s <UserSpecificConfig>`. For each test that the user wants
to run, the ``run_experiments.sh`` script reads in its base configuration file and generates from
it a full ``config.sh`` file (a copy of which is placed in the experiment directory for the test).
to run, the ``run_WE2E_tests.sh`` script reads in its configuration file and generates from
it a complete ``config.sh`` file. It then calls ``generate_FV3LAM_wflow.sh``, which in turn
reads in ``config.sh`` and generates a new experiment for the test.
The name of each experiment directory is set to that of the corresponding test,
and a copy of ``config.sh`` for each test is placed in its experiment directory.

The WE2E tests are currently grouped into the following categories:

* ``grids_extrn_mdls_suites_community``

This category of tests ensures that the SRW App workflow running in **community mode**
completes successfully for various combinations of predefined grids, physics
suites, and external models for ICs and LBCs.

* ``grids_extrn_mdls_suites_nco``

This category of tests ensures that the workflow running in **NCO mode** (i.e. using
an operational environment and directory structure)
completes successfully for various combinations of predefined grids, physics
suites, and external models for ICs and LBCs.

* ``wflow_features``

This category of tests ensures that the workflow with various features/capabilities activated
completes successfully. Thus, their focus is the various scripts in the ``ufs-srweather-app``
and ``regional_workflow`` repositories. To reduce computational cost, these tests generally
use coarser grids.

The test configuration files for these categories are located in the following directories,
respectively:

Since ``run_experiments.sh`` calls ``generate_FV3LAM_wflow.sh`` for each test to be run, the
Python modules required for experiment generation must be loaded before ``run_experiments.sh``
.. code-block::

ufs-srweather-app/regional_workflow/tests/WE2E/test_configs/grids_extrn_mdls_suites_community
ufs-srweather-app/regional_workflow/tests/WE2E/test_configs/grids_extrn_mdls_suites_nco
ufs-srweather-app/regional_workflow/tests/WE2E/test_configs/wflow_features

Since ``run_WE2E_tests.sh`` calls ``generate_FV3LAM_wflow.sh`` for each test, the
Python modules required for experiment generation must be loaded before ``run_WE2E_tests.sh``
can be called. See :numref:`Section %s <SetUpPythonEnv>` for information on loading the Python
environment on supported platforms. Note also that ``run_experiments.sh`` assumes that all of
the executables have been built.

The user specifies the set of test configurations that the ``run_experiments.sh`` script will
run by creating a text file, say ``expts_list.txt``, that contains a list of tests (one per line)
and passing the name of that file to the script. For each test in the file, ``run_experiments.sh``
will generate an experiment directory and, by default, will continuously (re)launch its workflow
by inserting a new cron job in the user's cron table. This cron job calls the workflow launch script
environment on supported platforms. Note also that ``run_WE2E_tests.sh`` assumes that all of
the executables have been built. If they are not, then ``run_WE2E_tests.sh`` will still
generate the experiment directories, but the workflows will fail.

Running the WE2E Tests
----------------------

The user specifies the set of tests that ``run_WE2E_tests.sh`` will run by creating a text
file, say ``my_tests.txt``, that contains a list of the WE2E tests to run (one per line)
and passing the name of that file to ``run_WE2E_tests.sh``. For example, if the user
wants to run the tests ``new_ESGgrid`` and ``grid_RRFS_CONUScompact_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16``
(from the ``wflow_features`` and ``grids_extrn_mdls_suites_community`` categories, respectively), we would have:

.. code-block:: console

> cat my_tests.txt
new_ESGgrid
grid_RRFS_CONUScompact_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16


For each test in ``my_tests.txt``, ``run_WE2E_tests.sh``
will generate a new experiment directory and, by default, create a new cron job in the user's cron
table that will (re)launch the workflow every two minutes. This cron job calls the workflow launch script
``launch_FV3LAM_wflow.sh`` located in the experiment directory until the workflow either
completes successfully (i.e. all tasks are successful) or fails (i.e. at least one task fails).
The cron job is then removed from the user's cron table.

Next, we show several common ways that ``run_WE2E_tests.sh`` may be called with
the ``my_tests.txt`` file above.

The script ``run_experiments.sh`` accepts the command line arguments shown in
:numref:`Table %s <WE2ECommandLineArgs>`.

.. _WE2ECommandLineArgs:

.. list-table:: Command line arguments for the WE2E testing script ``run_experiments.sh``.
:widths: 20 40 40
:header-rows: 1

* - Command Line Argument
- Description
- Optional
* - expts_file
- Name of the file containing the list of tests to run. If ``expts_file`` is the absolute path
to a file, it is used as is. If it is a relative path (including just a file name), it is assumed
to be given relative to the path from which this script is called.
- No
* - machine
- Machine name
- No
* - account
- HPC account to use
- No
* - use_cron_to_relaunch
- Flag that specifies whether or not to use a cron job to continuously relaunch the workflow
- Yes. Default value is ``TRUE`` (set in ``run_experiments.sh``).
* - cron_relaunch_intvl_mnts
- Frequency (in minutes) with which cron will relaunch the workflow
- Used only if ``use_cron_to_relaunch`` is set to ``TRUE``. Default value is "02" (set in ``run_experiments.sh``).

For example, to run the tests named ``grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2``
and ``grid_RRFS_CONUS_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1alpha`` on Cheyenne, first create the file
``expts_list.txt`` containing the following lines:
1) To run the tests listed in ``my_tests.txt`` on Hera and charge the computational
resources used to the "rtrr" account, use:

.. code-block:: console
.. code-block::

> run_WE2E_tests.sh tests_file="my_tests.txt" machine="hera" account="rtrr"

This will create the experiment subdirectories for the two tests in
the directory

.. code-block::

${SR_WX_APP_TOP_DIR}/../expt_dirs

where ``SR_WX_APP_TOP_DIR`` is the directory in which the ufs-srweather-app
repository is cloned (usually set to something like ``/path/to/ufs-srweather-app``).
Thus, the following two experiment directories will be created:

.. code-block::

${SR_WX_APP_TOP_DIR}/../expt_dirs/new_ESGgrid
${SR_WX_APP_TOP_DIR}/../expt_dirs/grid_RRFS_CONUScompact_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16

In addition, by default, cron jobs will be added to the user's cron
table to relaunch the workflows of these experiments every 2 minutes.

2) To change the frequency with which the cron relaunch jobs are submitted
from the default of 2 minutes to 1 minute, use:

.. code-block::

> run_WE2E_tests.sh tests_file="my_tests.txt" machine="hera" account="rtrr" cron_relaunch_intvl_mnts="01"

3) To disable use of cron (which implies the worfkow for each test will
have to be relaunched manually from within each experiment directory),
use:

.. code-block::

> run_WE2E_tests.sh tests_file="my_tests.txt" machine="hera" account="rtrr" use_cron_to_relaunch="FALSE"

In this case, the user will have to go into each test's experiment directory and
either manually call the ``launch_FV3LAM_wflow.sh`` script or use the Rocoto commands described
in :numref:`Chapter %s <RocotoInfo>` to (re)launch the workflow. Note that if using the Rocoto
commands directly, the log file ``log.launch_FV3LAM_wflow`` will not be created; in this case,
the status of the workflow can be checked using the ``rocotostat`` command (see :numref:`Chapter %s <RocotoInfo>`).

4) To place the experiment subdirectories in a subdirectory named ``test_set_01`` under
``${SR_WX_APP_TOP_DIR}/../expt_dirs`` (instead of immediately under the latter), use:

.. code-block::

> run_WE2E_tests.sh tests_file="my_tests.txt" machine="hera" account="rtrr" expt_basedir="test_set_01"

In this case, the full paths to the experiment directories will be:

.. code-block::

${SR_WX_APP_TOP_DIR}/../expt_dirs/test_set_01/new_ESGgrid
${SR_WX_APP_TOP_DIR}/../expt_dirs/test_set_01/grid_RRFS_CONUScompact_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16

This is useful for grouping various sets of tests.

5) To use a test list file (again named ``my_tests.txt``) located in ``/path/to/custom/location``
instead of in the same directory as ``run_WE2E_tests.sh``, and to have the experiment directories
be placed in an arbitrary location, say ``/path/to/custom/expt_dirs``, use:

.. code-block::

> run_WE2E_tests.sh tests_file="/path/to/custom/location/my_tests.txt" machine="hera" account="rtrr" expt_basedir="/path/to/custom/expt_dirs"


The full usage statement for ``run_WE2E_tests.sh`` is as follows:

.. code-block::

run_WE2E_tests.sh \
tests_file="..." \
machine="..." \
account="..." \
[expt_basedir="..."] \
[exec_subdir="..."] \
[use_cron_to_relaunch="..."] \
[cron_relaunch_intvl_mnts="..."] \
[verbose="..."] \
[machine_file="..."] \
[stmp="..."] \
[ptmp="..."] \
[compiler="..."] \
[build_env_fn="..."]

The arguments in brackets are optional. A complete description of these arguments can be
obtained by issuing

.. code-block::

run_WE2E_tests.sh --help

in the directory ``ufs-srweather-app/regional_workflow/tests/WE2E``.




Checking Test Status
--------------------
If cron jobs are being used to periodically relaunch the tests, the status of
each test can be checked by viewing the end of the log file ``log.launch_FV3LAM_wflow``
(since the cron jobs use ``launch_FV3LAM_wflow.sh`` for this purpose, which
generates that log file). Otherwise (or alternatively), the ``rocotorun``/``rocotostat``
combination of commands can be used. See :numref:`Section %s <RocotoRun>` for
details.

The App also provides the script ``get_expts_status.sh`` in the directory
``ufs-srweather-app/regional_workflow/tests/WE2E`` that can be used to generate
a status summary for all tests in a given base directory. This script updates
the workflow status of each test (by internally calling ``launch_FV3LAM_wflow.sh``)
and then prints out to screen the status of the various tests. It also creates
a status report file named ``expts_status_${create_date}.txt`` (where ``create_date``
is a time stamp of the form ``YYYYMMDDHHmm`` corresponding to the creation date/time
of the report) and places it in the experiment base directory. This status file
contains the last 40 lines (by default; this can be adjusted) from the end of each
``log.launch_FV3LAM_wflow`` log file. These lines include the experiment status
as well as the task status table generated by ``rocotostat`` (so that, in
case of failure, it is convenient to pinpoint the task that failed).
For details on the usage of ``get_expts_stats.sh``, issue

.. code-block::

> get_expts_status.sh --help

For example:

.. code-block:: console

> ./get_expts_status.sh expts_basedir=/path/to/expt_dirs/set01
Checking for active experiment directories in the specified experiments
base directory (expts_basedir):
expts_basedir = "/path/to/expt_dirs/set01"
...

grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
grid_RRFS_CONUS_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1alpha
The number of active experiments found is:
num_expts = 2
The list of experiments whose workflow status will be checked is:
'new_ESGgrid'
'grid_RRFS_CONUScompact_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16'

Then, from the ``ufs-srweather-app/regional_workflow/tests`` directory, issue the following command:
======================================
Checking workflow status of experiment "new_ESGgrid" ...
Workflow status: SUCCESS
======================================

.. code-block:: console
======================================
Checking workflow status of experiment "grid_RRFS_CONUScompact_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16" ...
Workflow status: IN PROGRESS
======================================

./run_experiments.sh expts_file="expts_list.txt" machine=cheyenne account="account_name"
A status report has been created in:
expts_status_fp = "/path/to/expt_dirs/set01/expts_status_202204211440.txt"

where ``account_name`` should be replaced by the account to which to charge the core-hours used
by the tests. Running this command will automatically insert an entry into the user's crontab
that regularly (re)launches the workflow. The experiment directories will be created under
``ufs-srweather-app/../expt_dirs``, and the name of each experiment directory will be identical
to the name of the corresponding test.
DONE.

To see if a test completed successfully, look at the end of the ``log.launch_FV3LAM_wflow`` file (which
is the log file that ``launch_FV3LAM_wflow.sh`` appends to every time it is called) located in the
experiment directory for that test:

.. code-block:: console
The "Workflow status" field of each test indicates the status of its workflow.
The values that this can take on are "SUCCESS", "FAILURE", and "IN PROGRESS".

Summary of workflow status:
~~~~~~~~~~~~~~~~~~~~~~~~~~

1 out of 1 cycles completed.
Workflow status: SUCCESS

========================================================================
End of output from script "launch_FV3LAM_wflow.sh".
========================================================================

Adding New WE2E Tests
---------------------
To add a new test named, for example ``new_test01``, to one of the existing categories listed
above, say ``wflow_features``:

Use of cron for all tests to be run by ``run_experiments.sh`` can be turned off by instead issuing
the following command:
1) Choose an existing test configuration file in any one of the category directories that
matches most closely the new test to be added. Copy that file to ``config.new_test01.sh``
and, if necessary, move it to the ``wflow_features`` category directory.

.. code-block:: console
2) Edit ``config.new_test01.sh`` so that the header containing the test description properly
describes the new test.

3) Further edit ``config.new_test01.sh`` by modifying existing experiment variable values
and/or adding new variables such that the test runs with the intended configuration.

To create a new test category called, e.g. ``new_category``:

1) In the directory ``ufs-srweather-app/regional_workflow/tests/WE2E/test_configs``.
create a new directory named ``new_category``.

2) In the file ``get_WE2Etest_names_subdirs_descs.sh``, add the element ``"new_category"``
to the array ``category_subdirs`` that contains the list of categories/subdirectories
in which to search for test configuration files. Thus, ``category_subdirs`` becomes:

.. code-block:: console

category_subdirs=( \
"." \
"grids_extrn_mdls_suites_community" \
"grids_extrn_mdls_suites_nco" \
"wflow_features" \
"new_category" \
)

New tests can now be added to ``new_category`` using the procedure described above.

./run_experiments.sh expts_file="expts_list.txt" machine=cheyenne account="account_name" use_cron_to_relaunch=FALSE

In this case, the experiment directories for the tests will be created, but their workflows will
not be (re)launched. For each test, the user will have to go into the experiment directory and
either manually call the ``launch_FV3LAM_wflow.sh`` script or use the Rocoto commands described
in :numref:`Chapter %s <RocotoInfo>` to (re)launch the workflow. Note that if using the Rocoto
commands directly, the log file ``log.launch_FV3LAM_wflow`` will not be created; in this case,
the status of the workflow can be checked using the ``rocotostat`` command (see :numref:`Chapter %s <RocotoInfo>`).