Bundle jobs #227

forsyth2 · 2022-03-23T17:10:01Z

Bundle jobs. Resolves #189.

Code review
Address comments
Test on NERSC (after "Create example configuration files for other machines" #233 is merged)
Merge
Update expected files for integration tests

forsyth2

@golaz This is ready for review. I've included comments explaining the code. My suggested order of review is:

docs/ and tests/ (to get context on how zppy is run with bundles)
zppy/__main__.py and zppy/utils.py (for the majority of the code that enables the bundle functionality)
Other files in zppy: the task files & zppy/templates/default.ini

forsyth2 · 2022-04-13T21:44:57Z

docs/source/dev_guide/testing.rst

+       # bundle1 and bundle2 should run. After they finish, run:
+       rm /lcrc/group/e3sm/ac.forsyth2/zppy_test_bundles_output/v2.LR.historical_0201/post/scripts/bundle3.bash
+       # (If this file isn't deleted, zppy will fail because it assumes bundle3 is already running).
+       zppy -c tests/integration/test_bundles.cfg


Currently, bundle3 and ilamb_run have to be run after bundle1 finishes, because each depends on a task that runs inside bundle1. Unfortunately, I haven't come up with a good way to automatically launch later tasks/bundles when the bundle they depend on finishes.

One possibility is:

Launch bundle1, keeping track of its job id.

When we reach bundle3, we'd see that we're missing a dependency. (In the current implementation, we simply skip bundle3, so the user has to re-run zppy manually once bundle1 finishes.)

Comb through every bundle's bash file to find the dependency. We'd find that bundle1 contains our dependency.

Submit bundle3 with a dependency on the job id of bundle1.

That would presumably work for one bundle depending on another -- but having ilamb_run (not in a bundle in this case) depend on bundle1 also causes problems. In this implementation, we process bundles after all the tasks, so we'd presumably have to cycle through all the tasks again to launch ilamb_run, this time with the job id of bundle1 as a dependency.

In any case, it probably makes sense to consider this in its own issue/pull request.
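The lookup-and-submit steps in the proposal above could be sketched roughly as follows. This is only an illustration under stated assumptions, not zppy's implementation: the helper names `find_bundle_containing` and `build_sbatch_command` are hypothetical, each bundle script is assumed to be named `<bundle>.bash` and to mention the scripts it runs by file name, and submission is assumed to go through Slurm's `sbatch --dependency=afterok:<jobid>`.

```python
import os
from typing import List, Optional


def find_bundle_containing(
    script_dir: str, bundle_names: List[str], dependency_script: str
) -> Optional[str]:
    """Return the first bundle whose bash file mentions dependency_script.

    Hypothetical helper: assumes bundle scripts are named <bundle>.bash and
    refer to the task scripts they run by file name.
    """
    for bundle_name in bundle_names:
        bundle_file = os.path.join(script_dir, bundle_name + ".bash")
        if not os.path.exists(bundle_file):
            # This bundle's script has not been written (yet); skip it.
            continue
        with open(bundle_file) as f:
            if dependency_script in f.read():
                return bundle_name
    return None


def build_sbatch_command(script_file: str, job_ids: List[str]) -> List[str]:
    """Build an sbatch command that waits for job_ids to finish successfully."""
    command = ["sbatch"]
    if job_ids:
        # Slurm accepts several job ids separated by colons.
        command.append("--dependency=afterok:" + ":".join(job_ids))
    command.append(script_file)
    return command
```

With something like this, zppy would still need to record the job id returned when bundle1 is submitted; the same command builder could then serve ilamb_run, at the cost of a second pass over the tasks.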

forsyth2 · 2022-04-13T21:55:05Z

docs/source/dev_guide/testing.rst

+       rm -rf /lcrc/group/e3sm/ac.forsyth2/zppy_test_bundles_output/v2.LR.historical_0201/post
+       zppy -c tests/integration/test_bundles.cfg
+       # bundle1 and bundle2 should run. After they finish, run:
+       rm /lcrc/group/e3sm/ac.forsyth2/zppy_test_bundles_output/v2.LR.historical_0201/post/scripts/bundle3.bash


This is an interesting quirk. The current implementation doesn't create status files for the bundles, so if you re-run zppy immediately, it will try to re-run tasks that were just done. To prevent users from accidentally overwriting anything, I just made zppy raise an error in that case.

Unfortunately, by the time dependencies are checked, the bundle's bash file already exists. So, if the bundle depends on something that isn't finished, it gets skipped. That means if you re-run zppy once that dependency has finished, that error will still get raised (because the bash file exists). So, the bash file needs to be deleted (or renamed) before re-running.

Now that I'm typing this however, let me look into this a bit more. It seems like zppy wouldn't try to run individual tasks with status files that say "RUNNING", but I believe I was running into that. Perhaps status files for bundles could fix the issue.

@golaz a few options included in the new second commit:

Simply remove the FileExistsError: tests still pass except for the bash file content test. Basically, if you rerun zppy while bundle1 is running, bundle1.bash gets overwritten (so it doesn't have the lines to run scripts that already have status files with "OK"). Presumably work begins twice on scripts that have not yet been run.

Check for existence of a status file (e.g., bundle1.status) rather than a bash file: this seems to be the best solution.

Downside is if bundle1 is still running but has finished the task that bundle3 needs, zppy will error out before having a chance to run bundle3 (it will try to re-run bundle1, which is currently running, before getting to bundle3).

Another downside is that it takes a few seconds for the status files to get created, so it's still possible to launch two identical jobs if you re-run zppy fast enough.

Upside is that there is no need to delete bundle3.bash or any other files anymore -- bundle3.status doesn't get created until after it actually starts running.

Tests still pass with this change, aside from the bash file content test (will need to include the line printing to the bundle's status file).

Lastly, I included, in a commented-out section, code to print an error status if a particular script fails. While this would allow us to have a status beyond simply "the bundle has started", it does add a lot of lines to the scripts for not much added value, I think.
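For reference, the status-file option could look roughly like the sketch below. It is a minimal illustration, not the code in this PR: the function names are hypothetical, and it assumes the status file sits next to the bash file as `<bundle>.status`.

```python
import os


def bundle_has_started(script_dir: str, bundle_name: str) -> bool:
    """Return True if <bundle_name>.status exists in script_dir.

    The existence of <bundle_name>.bash is deliberately ignored, because the
    bash file is written before dependencies are checked.
    """
    return os.path.exists(os.path.join(script_dir, bundle_name + ".status"))


def check_bundle_can_be_submitted(script_dir: str, bundle_name: str) -> None:
    """Raise FileExistsError if the bundle appears to have started already."""
    if bundle_has_started(script_dir, bundle_name):
        raise FileExistsError(
            "{}.status exists; the bundle may already be running".format(bundle_name)
        )
```

This check does not close the race described above: until the job writes its status file, a quick re-run would still pass the check.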

forsyth2 · 2022-04-13T21:56:00Z

tests/integration/test_bundles.py

+    def test_bundles_bash_file_content(self):
+        actual_directory = "/lcrc/group/e3sm/ac.forsyth2/zppy_test_bundles_output/v2.LR.historical_0201/post/scripts"
+        expected_directory = "/lcrc/group/e3sm/public_html/zppy_test_resources/expected_bundles/bundle_files"
+        # Check that bundle files are correct


Note: move comment to immediately after def line.

forsyth2 · 2022-04-13T21:56:40Z

tests/integration/test_complete_run.py

-                os.path.join(diff_dir, "{}_diff.png".format(simple_image_name)),
-                "PNG",
-            )
+from tests.integration.utils import check_mismatched_images


Since similar logic is used for the bundles test, I put the common logic in utils.

forsyth2 · 2022-04-13T21:57:50Z

zppy/__main__.py

    # climo tasks
-    climo(config, scriptDir)
+    existing_bundles = climo(config, scriptDir, existing_bundles)


We start with an empty list of bundles. As we go through each task, we build up a list of existing bundles. That way, we don't end up with two versions of the same bundle.

forsyth2 · 2022-04-13T22:07:17Z

zppy/utils.py

+                # If any script fails, no new ones should try to run.
+                # It's possible the failed script is a dependency for later scripts.


Set the script to fail if any task fails, since there may be dependencies.
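The control flow is easier to see in isolation. The sketch below is written in Python purely for illustration (the bundle itself is a bash script), and `run_until_failure` is a hypothetical name rather than anything in zppy.

```python
import subprocess
import sys
from typing import List, Sequence


def run_until_failure(commands: Sequence[List[str]]) -> int:
    """Run commands in order and stop at the first one that fails.

    Returns the number of commands that completed successfully.
    """
    completed = 0
    for command in commands:
        result = subprocess.run(command)
        if result.returncode != 0:
            # A later command may depend on this one, so do not continue.
            break
        completed += 1
    return completed


# Example commands: two that succeed around one that fails.
ok = [sys.executable, "-c", "import sys; sys.exit(0)"]
fail = [sys.executable, "-c", "import sys; sys.exit(1)"]
```

Calling `run_until_failure([ok, fail, ok])` runs the first two commands and never starts the third.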

forsyth2 · 2022-04-13T22:08:12Z

zppy/utils.py

+        return existing_bundles
+    for b in existing_bundles:
+        if b.bundle_name == bundle_name:
+            # This bundle already exists


This prevents us from creating a bundle that already exists.
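As a standalone illustration of the lookup, the sketch below threads a list of bundles through successive calls and reuses an existing bundle when the name matches. The `Bundle` class here is a hypothetical stand-in that models only the name and the scripts added to it, and `add_script_to_bundle` is not a function in zppy.

```python
from typing import List


class Bundle:
    """Hypothetical stand-in for a bundle; only a name and scripts are modeled."""

    def __init__(self, bundle_name: str) -> None:
        self.bundle_name = bundle_name
        self.script_names: List[str] = []


def add_script_to_bundle(
    existing_bundles: List[Bundle], bundle_name: str, script_name: str
) -> List[Bundle]:
    """Add script_name to the named bundle, creating the bundle only if needed."""
    for b in existing_bundles:
        if b.bundle_name == bundle_name:
            # This bundle already exists, so reuse it.
            b.script_names.append(script_name)
            return existing_bundles
    bundle = Bundle(bundle_name)
    bundle.script_names.append(script_name)
    existing_bundles.append(bundle)
    return existing_bundles


# Start with an empty list and build it up task by task.
existing_bundles: List[Bundle] = []
existing_bundles = add_script_to_bundle(existing_bundles, "bundle1", "climo.bash")
existing_bundles = add_script_to_bundle(existing_bundles, "bundle1", "ts.bash")
existing_bundles = add_script_to_bundle(existing_bundles, "bundle2", "diags.bash")
```

After these three calls the list holds two bundles, with both `climo.bash` and `ts.bash` attached to `bundle1`.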

forsyth2 · 2022-04-13T22:09:27Z

zppy/utils.py

+        # If one task requires export="ALL", then the bundle script will need it as well
+        bundle.export = export


Since we're only launching one job rather than many, we have to make a decision on the value for export. I decided that if any task required export="ALL", then the whole bundle would.
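The rule amounts to "ALL wins". A minimal sketch follows, assuming the only other value a task uses is a default such as "NONE"; both that default and the name `resolve_bundle_export` are assumptions for illustration.

```python
from typing import Iterable


def resolve_bundle_export(task_exports: Iterable[str], default: str = "NONE") -> str:
    """Return "ALL" if any task in the bundle requires it, else the default."""
    for export in task_exports:
        if export == "ALL":
            return "ALL"
    return default
```

For example, a bundle whose tasks request "NONE", "ALL", and "NONE" would be submitted with "ALL".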

forsyth2 · 2022-04-13T22:09:48Z

zppy/utils.py

+
+
+# -----------------------------------------------------------------------------
+def submitScript(scriptFile, export, dependFiles=[]):


export is now a required parameter

forsyth2 · 2022-04-13T22:12:05Z

zppy/__main__.py

+        b.display_dependencies()
+        if not b.dry_run:
+            submitScript(
+                b.bundle_file, b.export, dependFiles=b.dependencies_not_in_bundle_file


Regarding my comment in docs/source/dev_guide/testing.rst: when we submit bundle3, it will have a dependency listed in dependencies_not_in_bundle_file which is run as part of bundle1. As the code currently is, however, it has no idea that this dependency is in bundle1.

golaz

@forsyth2,

Very nice work on this PR. I've made a few more modifications to your PR. The changes are available on a separate branch: https://github.com/E3SM-Project/zppy/tree/bundle-jobs. Maybe you can import them into this PR.

My changes add the following:

Bundle bash scripts are now constructed from a Jinja2 template. This will facilitate future changes.
Ability to explicitly declare bundle jobs in the cfg file via new section and sub-sections.

[bundle]

  [[ bundle2 ]]
  nodes=2

Some code restructuring.

Notes:

black doesn't work. You branched before the changes to fix black were merged in, so I committed with --no-verify.
The integration tests run fine, but I have not updated the expected output, so one reports a failure.
For the complete integration tests, the status files for mpas_analysis and tc have the weird 'NING ' status.

forsyth2 · 2022-05-15T22:31:53Z

Closing in favor of #243.

forsyth2 added the semver: new feature New feature (will increment minor version) label Mar 23, 2022

forsyth2 self-assigned this Mar 23, 2022

forsyth2 force-pushed the bundle-jobs branch from 58a9735 to dd57669 Compare March 24, 2022 17:57

forsyth2 force-pushed the bundle-jobs branch 3 times, most recently from 1d1c40f to f34516b Compare April 13, 2022 21:08

Bundle jobs

8ca7892

forsyth2 force-pushed the bundle-jobs branch from f34516b to 8ca7892 Compare April 13, 2022 21:33

forsyth2 commented Apr 13, 2022

forsyth2 marked this pull request as ready for review April 13, 2022 22:22

forsyth2 requested a review from golaz April 13, 2022 22:22

Update to bundle status

5964d31

golaz requested changes May 12, 2022

This was referenced May 12, 2022

"OK" overwriting "RUNNING" status #241

Closed

Corrupted status files on chrysalis (NING bug) #242

Closed

Bundle jobs revision #243

Merged

forsyth2 closed this May 15, 2022

forsyth2 deleted the bundle-jobs branch May 17, 2022 22:38



