Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bundle jobs #227

Closed
wants to merge 2 commits into from
Closed

Bundle jobs #227

wants to merge 2 commits into from

Conversation

forsyth2
Copy link
Collaborator

@forsyth2 forsyth2 commented Mar 23, 2022

Bundle jobs. Resolves #189.

@forsyth2 forsyth2 added the semver: new feature New feature (will increment minor version) label Mar 23, 2022
@forsyth2 forsyth2 self-assigned this Mar 23, 2022
@forsyth2 forsyth2 force-pushed the bundle-jobs branch 3 times, most recently from 1d1c40f to f34516b Compare April 13, 2022 21:08
Copy link
Collaborator Author

@forsyth2 forsyth2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@golaz This is ready for review. I've included comments explaining the code. My suggested order of review is:

  1. docs/ and tests/ (to get context on how zppy is run with bundles)
  2. zppy/__main__.py and zppy/utils.py (for the majority of the code that enables the bundle functionality)
  3. Other files in zppy: the task files & zppy/templates/default.ini

# bundle1 and bundle2 should run. After they finish, run:
rm /lcrc/group/e3sm/ac.forsyth2/zppy_test_bundles_output/v2.LR.historical_0201/post/scripts/bundle3.bash
# (If this file isn't deleted, zppy will fail because it assumes bundle3 is already running).
zppy -c tests/integration/test_bundles.cfg
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, bundle3 and ilamb_run have to be run after bundle1 finishes. This is because of a dependency within bundle1. Unfortunately, I haven't come up with a good way to automatically launch later tasks/bundles when the bundle they depend on finishes.

One possibility is:

  1. Launch bundle1, keeping track of its job id.
  2. When we reach bundle3 we'd see that we're missing a dependency. [In the current implementation, we'd simply skip bundle3 (meaning the user has to manually re-run zppy at a later point -- once bundle1 finishes.)]
  3. Comb through every bundle's bash file to find the dependency. We'd find that bundle1 contains our dependency.
  4. Submit bundle3 with a dependency on the job id of bundle1.

That would presumably work for a bundle depending on another -- but having ilamb_run (not in a bundle in this case) depend on bundle1 also causes problems. In this implementation, we process bundles after all the tasks -- so we'd presumably have to once again cycle through all the tasks to launch ilamb_run this time with the job id of bundle1 as a dependency.

In any case, it probably makes sense to consider this in its own issue/pull request.

rm -rf /lcrc/group/e3sm/ac.forsyth2/zppy_test_bundles_output/v2.LR.historical_0201/post
zppy -c tests/integration/test_bundles.cfg
# bundle1 and bundle2 should run. After they finish, run:
rm /lcrc/group/e3sm/ac.forsyth2/zppy_test_bundles_output/v2.LR.historical_0201/post/scripts/bundle3.bash
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an interesting quirk. The current implementation doesn't create status files for the bundles, so if you re-run zppy immediately, then it will try to re-run tasks that were just done. To prevent users from accidentally overwriting anything, I just made zppy raise an error in such case.

Unfortunately, by the time dependencies are checked, the bundle's bash file already exists. So, if the bundle depends on something that isn't finished, it gets skipped. That means if you re-run zppy once that dependency has finished, that error will still get raised (because the bash file exists). So, the bash file needs to be deleted (or renamed) before re-running.

Now that I'm typing this however, let me look into this a bit more. It seems like zppy wouldn't try to run individual tasks with status files that say "RUNNING", but I believe I was running into that. Perhaps status files for bundles could fix the issue.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@golaz a few options included in the new second commit:

Simply remove the FileExistsError: tests still pass except for the bash file content test. Basically, if you rerun zppy while bundle1 is running, bundle1.bash gets overwritten (so it doesn't have the lines to run scripts that already have status files with "OK"). Presumably work begins twice on scripts that have not yet been run.

Check for existence of a status file (e.g., bundle1.status) rather than a bash file: this seems to be the best solution.

  • Downside is if bundle1 is still running but has finished the task that bundle3 needs, zppy will error out before having a chance to run bundle3 (it will try to re-run bundle1, which is currently running, before getting to bundle3).
  • Another downside is that it takes a few seconds for the status files to get created, so it's still possible to launch two identical jobs if you re-run zppy fast enough.
  • Upside is that there is no need to delete bundle3.bash or any other files anymore -- bundle3.status doesn't get created until after it actually starts running.
  • Tests still pass with this change, aside from the bash file content test (will need to include the line printing to the bundle's status file).

Lastly, I included, in a commented out section, code to print an error status if a particular script fails: while this would allow us to have a status simply beyond "the bundle has started", it does add of a lot lines to the scripts, for not much added value, I think.

def test_bundles_bash_file_content(self):
actual_directory = "/lcrc/group/e3sm/ac.forsyth2/zppy_test_bundles_output/v2.LR.historical_0201/post/scripts"
expected_directory = "/lcrc/group/e3sm/public_html/zppy_test_resources/expected_bundles/bundle_files"
# Check that bundle files are correct
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: move comment to immediately after def line.

os.path.join(diff_dir, "{}_diff.png".format(simple_image_name)),
"PNG",
)
from tests.integration.utils import check_mismatched_images
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since similar logic is used for the bundles test, I put the common logic in utils.

# climo tasks
climo(config, scriptDir)
existing_bundles = climo(config, scriptDir, existing_bundles)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We start with an empty list of bundles. As we go through each task, we build up a list of existing bundles. That way, we don't end up with two versions of the same bundle.

Comment on lines +188 to +189
# If any script fails, no new ones should try to run.
# It's possible the failed script is a dependency for later scripts.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Set the script to fail if any task fails, since there may be dependencies.

return existing_bundles
for b in existing_bundles:
if b.bundle_name == bundle_name:
# This bundle already exists
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This prevents us from creating a bundle that already exists.

Comment on lines +266 to +267
# If one task requires export="ALL", then the bundle script will need it as well
bundle.export = export
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're only launching one job rather than many, we have to make a decision on the value for export. I decided that if any task required export="ALL", then the whole bundle would.



# -----------------------------------------------------------------------------
def submitScript(scriptFile, export, dependFiles=[]):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

export is now a required parameter

b.display_dependencies()
if not b.dry_run:
submitScript(
b.bundle_file, b.export, dependFiles=b.dependencies_not_in_bundle_file
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding my comment in docs/source/dev_guide/testing.rst: when we submit bundle3, it will have a dependency listed in dependencies_not_in_bundle_file which is run as part of bundle1. As the code currently is, however, it has no idea that this dependency is in bundle1.

@forsyth2 forsyth2 marked this pull request as ready for review April 13, 2022 22:22
@forsyth2 forsyth2 requested a review from golaz April 13, 2022 22:22
Copy link
Collaborator

@golaz golaz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@forsyth2,

Very nice work on this PR. I've made a few more modifications to your PR. The changes are available on a separate branch: https://github.com/E3SM-Project/zppy/tree/bundle-jobs. Maybe you can import them into this PR.

My changes add the following:

  • Bundle bash scripts are now constructed from a Jinja2 template. This will facilitate future changes.
  • Ability to explicitly declare bundle jobs in the cfg file via new section and sub-sections.
[bundle]

  [[ bundle2 ]]
  nodes=2
  • Some code restructuring.

Notes:

  • black doesn't work. You branched before the changes to fix black were merged in. I used --no-verify
  • The integration tests run fine, but I have not updated the expected output, so one reports a failure.
  • For the complete integration tests, the status files for mpas_analysis and tc have the weird 'NING ' status.

@forsyth2
Copy link
Collaborator Author

Closing in favor of #243.

@forsyth2 forsyth2 closed this May 15, 2022
@forsyth2 forsyth2 deleted the bundle-jobs branch May 17, 2022 22:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
semver: new feature New feature (will increment minor version)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add option to bundle zppy jobs
2 participants