CIME5: is short term archiving supposed to work? #1305
Comments
What was the compset and resolution? Try an ERR test with it on your platform. |
I assume this was edison with the A_WCYCL case? |
Yes, this was a A_WCYCL case. |
@golaz, |
Here is my edison script. |
We are not saying this is only an issue on edison, right? Can I lower the number of days to reproduce the problem?
|
I would also like a shorter reproducer. But we would need to tell all the models, including MPAS, to output daily and I'm not sure how to do that. |
The best way to cheaply reproduce the problem would be to run the ultra low-res coupled model (ne4_oQU240). I read somewhere that it gets around 38 SYPD. You probably just need to run it for a few years to get enough output files. |
@rljacob: is there a place where it is explained how short-term archiving for MPAS will be handled? I suppose the history files will go in the hist/ subdirectory, but I also wonder about the namelist and streams files. Will they go in a rest/ subdirectory or log/? |
There's some developing documentation here: http://esmci.github.io/cime/doc/build/html/users_guide/running-a-case.html#archiving-model-output-data For mpas-o, only files with "mpaso" and "hist" in them will be copied to the ocn/hist dir in the archive. But I think we can add an entry for streams. |
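To make the matching rule above concrete, here is a minimal sketch of a filter that keeps only names containing both "mpaso" and "hist". It is only an illustration of the rule as stated in the comment: the function name and the example file names are hypothetical, and the real archiver presumably uses its own pattern matching rather than plain substring tests.

```python
def select_mpaso_history(filenames):
    """Return the names that contain both 'mpaso' and 'hist'."""
    return [name for name in filenames if "mpaso" in name and "hist" in name]


# Hypothetical run-directory listing; these file names are made up for illustration.
run_dir_files = [
    "case.mpaso.hist.am.0001-01-01.nc",
    "case.mpaso.rst.0001-02-01.nc",
    "case.cam.h0.0001-01.nc",
    "streams.ocean",
]
selected = select_mpaso_history(run_dir_files)
print(selected)
```

Under this rule a streams file would not be picked up, which is consistent with the remark that a separate entry would be needed for streams.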
I see. Thanks, that's helpful. |
A low res case is a good test. I confirmed that master is not copying the mpas files. I'll try using next which has CIME5.2 |
namelists are copied to CaseDocs in your case directory. |
I tried CIME5.2 and got the same result with archiving. I'll open a new issue for that. @golaz for archiving interfering with a running job: are you using the run_script's re-submit or auto-chaining? I can see how those might interfere with what CIME is trying to do. I think if you want to use CIME's archiving while also having jobs automatically re-submit, you'll have to use CIME's resubmit feature. |
@rljacob: thanks for clarifying. I thought that the interference between runs and short-term archiving might be due to the fact that I'm not using CIME's resubmit feature. That's disappointing because it means that even if short term archiving actually worked, I probably would not be able to use it. |
Why not? |
I tried to turn it on last week but I failed on Titan. So, I am also curious about whether it works or not.
Xiaoying Shi
|
I opened another issue for this in ESMCI. ESMCI/cime#1252 |
@rljacob in response to your question above. CIME5 takes a "one-size-fits-all" approach to running ACME jobs. While that approach works well for anvil, it is unfortunately poorly suited for other machines. CIME5 tools would be much more useful to me if they could be assembled to create a custom workflow tailored for a specific need and environment (which is sometimes a moving target), rather than tools that come pre-assembled to only work in one specific fashion, such as
But that's off-topic and probably a philosophical difference that we will not reconcile here 😄 |
You can call case.st_archive yourself after a run is done. Just leave DOUT_S as false so CIME doesn't try to call it for you. That way you can control when exactly it happens in conjunction with your calling of case.run. I forgot about the chaining. We can work on supporting that. |
@golaz et al., The main advantage to using the CIME scripts as documented is that those use patterns are tested before any updates make their way to ACME master. Therefore, I would like to get your use cases adopted as CIME standard usage so that they are always tested. In particular, the current need seems to be job bundling (as mentioned by @huiwanpnnl above). Rather than solving this problem for each machine and update (either CIME update or system update), I would like to make this a supported feature. Please consider opening a new issue on ESMCI/CIME to specify the requirements of this feature. Flagging @mfdeakin-sandia to help with this process. |
@gold2718, so how are supported use patterns typically identified? It is not as if job bundling is something new. It is definitely not peculiar to ACME. "We" don't know that something is going away until the next version comes out and it is determined that a capability has disappeared. The perception, correct or not, is that each version of CIME is less flexible than the previous one, perhaps because the required use patterns are not yet sufficiently broad? While adding this as a request may get the capability added back in the future, the capability is needed now. |
By users speaking up as @golaz has done.
If it is not new, at least to ACME, then perhaps something is wrong with the ACME development processes since ACME team members provided approximately half the CIME development efforts. Do you have any suggestions for process improvement or are you just venting?
Perhaps, you meant 'My perception ...' or if not, please name the cohort which has 'The' perception so that we can poll them.
I am all for finding short-term workarounds but in order to ensure stability, we do need to identify required use patterns and make sure they are tested against regressions. This is software engineering 101, I assume it is not news to you? |
@huiwanpnnl, use "./case.submit --no-batch" if you have your own batch script controlling things. |
This is an ACME github page, so of course my audience is ACME CIME developers :-). Partially venting since capabilities that I care about seem to disappear with each release, and I have to spend time figuring out how to put them back in. And not being a CIME developer, that has been increasingly difficult. I have always had the hope that each generation of CIME should at least have the capabilities of the previous version. Documenting what those are appears to be a bottleneck? Bundling jobs has been a use case preceding ACME - I expect that CESM users would appreciate this capability as well.
I will only speak for myself. Others will have to self-identify.
Short-term workarounds can be difficult to design, but I am probably being overly pessimistic here. I'll let the CIME wizards determine how to put this back in. (Update: corrected misattribution to Confluence, as pointed out by @gold2718 .) |
Repeating in case it gets lost: job scripts that "bundle" like @huiwanpnnl pointed to should still work with CIME5. You just need to use "./case.submit --no-batch" in place of "./case.run". It turns out you can invoke case.st_archive outside of case.submit/run. You can run it at command line inside your case directory. The bigger problem is that it doesn't process mpas files and we're working on that. |
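For anyone driving CIME from their own batch script, the sequence described above (DOUT_S left false, "./case.submit --no-batch", then case.st_archive by hand) could be wrapped roughly as follows. This is a sketch under assumptions, not a tested workflow: the function name and the injectable `runner` argument are hypothetical, it assumes "./case.submit --no-batch" returns only once the run has finished, and it only uses the three commands already mentioned in this thread.

```python
import subprocess


def run_then_archive(case_dir, runner=subprocess.run):
    """Run one segment, then archive it, with automatic archiving disabled."""
    steps = [
        ["./xmlchange", "DOUT_S=FALSE"],   # keep CIME from launching the archiver itself
        ["./case.submit", "--no-batch"],   # run directly instead of submitting to the batch system
        ["./case.st_archive"],             # archive only after the previous step has returned
    ]
    for cmd in steps:
        # check=True makes a failing step raise, so later steps are not attempted
        runner(cmd, cwd=case_dir, check=True)
    return steps
```

Calling `run_then_archive("/path/to/case")` from inside a bundled job script keeps the archiving step strictly after the run for that segment, which is the ordering that is known to be safe.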
I'm trying to understand where we stand. This issue was originally opened because of two problems:
Looking at https://acme-climate.atlassian.net/browse/S2-130, it appears that (2) has now been fixed, but (1) has not. This would imply that it is currently not safe to invoke short term archiving while the model is running? @rljacob : can you confirm? |
Yes, that's right. It's not safe to invoke while the model is running. |
The log file request is a new feature so it's been opened as an issue in JIRA. |
My understanding was that pre-CIME, it was safe to invoke short term archiving while the model was running. The log file request was a compromise to make it easier for SE to modify short term archiving to be safe. So I view everything here as a bug fix. |
I agree with Chris' understanding. In the past when you turned on short-term archiving, each time a job submission completed a 1-node job would be launched which would move all the files from that job into the short-term archiving location while the next job was running. The ability to have short-term archiving work as part of the normal job submission process is important because otherwise the user has to stop doing model runs and invoke short-term archiving by hand. This is a lot of unnecessary work which slows things down and provides ample opportunity for screwing things up. |
Short term archiving with job submission is working. If DOUT_S is TRUE, when you run "case.submit", 2 jobs will be submitted, one for the run and one for the archiving. The archiving will start as soon as the run finishes. A new feature lets you put the archiving job in a faster queue. If RESUBMIT > 0, then when the first pair finishes, CIME will submit another pair of jobs to continue the run, decrement RESUBMIT by 1 and keep doing that until RESUBMIT is 0. That should all work. |
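The submission pattern described above can be sketched as a small simulation. It only models the counting of jobs as described in this comment (one run/archive pair per submission, RESUBMIT decremented until it reaches 0); the function name and job labels are hypothetical and nothing here calls CIME.

```python
def submission_sequence(resubmit, dout_s=True):
    """List the jobs submitted for a case, one inner list per submission."""
    sequence = []
    while True:
        jobs = ["run"]
        if dout_s:
            jobs.append("st_archive")  # the archive job waits for the run to finish
        sequence.append(jobs)
        if resubmit <= 0:
            break
        resubmit -= 1  # decremented each time another submission is made
    return sequence


print(submission_sequence(2))
```

With RESUBMIT set to 2 this gives three run/archive pairs in total: the initial pair plus two resubmissions.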
Ok, so if our executables are 'acme' and 'archiving', I think you are saying that job submission works like this:
submit acme and {archiving with dependency on acme}
I think this will work, but it is less efficient than what used to happen, which is this:
submit acme
Having archiving ruin logs of currently-running acme simulations would clearly not be acceptable in this old workflow and is, I think, what Chris was concerned about. Can you confirm that my assumptions about how jobs are submitted now is correct? |
Yes with the CIME script system, the second pair is not submitted until the first pair is complete. If there's an error, the whole process stops and the run directory is left as-is. I don't understand your example of what used to happen. Only an acme job would be submitted and when that finished then the archiver (for that run) would be submitted as well as the next job? That sounds less efficient. |
I'll try to explain the old version in more detail: start by imagining a job without short-term archiving. As soon as it completes one submission, it starts another. Now add short term archiving by having acme launch an archiving script at the end of each of its submissions which cleans up all the files created by the job that just finished. Don't impose any dependency on this short-term archiving job because you know that the job it is meant to clean up has already finished. This old way is more efficient because archiving and simulation can occur in parallel, while the way you've set things up now serializes archiving and simulation. Is this serialization a big deal? Not if the archiving job gets through the queue and runs quickly (say, in less than 1/2 hr). But on machines where you spend more time waiting in the queue than you do running, this serialization could slow down time-to-solution by a huge amount! |
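A toy model may help put numbers on the serialization cost discussed above. It is a rough sketch under stated assumptions, not a measurement: every segment is assumed to have the same queue wait and run time, the archive job's queue wait is assumed to start only once its run has finished, and all names and numbers are hypothetical.

```python
def serialized_hours(n, run_wait, run_time, arch_wait, arch_time):
    """Each run/archive pair must finish before the next pair is submitted."""
    return n * (run_wait + run_time + arch_wait + arch_time)


def overlapped_hours(n, run_wait, run_time, arch_wait, arch_time):
    """Runs are chained back to back; each archive job overlaps the next run."""
    runs_done = n * (run_wait + run_time)
    # Only the archive of the final segment extends past the last run.
    return runs_done + arch_wait + arch_time


# Hypothetical example: 10 segments, 2 h queue wait and 6 h run per segment,
# 2 h queue wait and 0.5 h of work per archive job.
serial = serialized_hours(10, 2, 6, 2, 0.5)
overlap = overlapped_hours(10, 2, 6, 2, 0.5)
print(serial, overlap)
```

In this model the gap between the two totals grows with the archive job's queue wait, so a fast queue for the archiver shrinks the difference.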
Ok I think I understand. But this "old version" was not a workflow implemented by an earlier version of CIME or by the pre-CIME CESM scripts. It's always been pairs of run-and-archive, submitted in sequence. The archiver has always moved all the log and history files and has never been documented as safe to use while another run is going. |
As mentioned above in an older comment, the archiver can be submitted to the "xfer" queue on edison which allows it to run quickly and reduce the time-cost of the serialization. |
It's definitely true that short term archiving has been broken ever since ACME branched from CESM, but I could have sworn that it operated as I described before that. At the time I was using it, I wasn't involved in the code-level details so perhaps I'm misunderstanding how it worked. In any case, my goal here is to alleviate what I see as a breakdown in communication about 'how things should work'. I'm fine with your 'pairs of jobs' approach as long as serialization cost doesn't kill productivity. |
Sounds good. If you can recall the version of CESM where it worked that way we can check that out, look at the short term archiver code and confirm if something was lost. |
The CIME issues for these implementations are here: ESMCI/cime#1503 ESMCI/cime#1485 |
And the JIRA issues are: |
Short Term Archiving Features
This implements features to the short term archiver to enable running it while the model is running without obviously breaking things (see ESMCI/cime#1503 for potential issues with the --last-date option). Other options added include --copy-only, which copies the files to be archived instead of moving them; and --no-incomplete-logs, which ignores logs which are not gzipped, and thus not complete.
Fixes #1305
Passes scripts_regression_tests
BFB
* origin/mfdeakin-sandia/in_run_archive:
  Adds the --force-move option and implies --copy-only when --last-date is specified without --force-move
  Adds a warning when using the --last-date option and to its help
  Implement the copy_only option for short term archiving. This copies files rather than moving them
  Implemented most of the machinery for testing with "incomplete" log files
  Fix code format issue - replace unused variable with _
  Update template.st_archive
  Adds options to the st_archive to specify the last date (--last-date) to archive, and whether to disable archiving incomplete log files (--no-incomplete-logs)
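The behaviour described for --copy-only and --no-incomplete-logs can be illustrated with a small standalone sketch. This is not the CIME implementation: the function, its arguments, the "*.log.*" glob, and the log file names are all hypothetical, and the only rule taken from the commit message is that a log that is not gzipped is treated as incomplete.

```python
import shutil
import tempfile
from pathlib import Path


def archive_logs(run_dir, archive_dir, copy_only=False, no_incomplete_logs=False):
    """Transfer log files to archive_dir and return the names handled."""
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    handled = []
    for log in sorted(Path(run_dir).glob("*.log.*")):
        if no_incomplete_logs and log.suffix != ".gz":
            continue  # not gzipped yet, so the model may still be writing it
        if copy_only:
            shutil.copy2(log, archive_dir / log.name)  # original stays in run_dir
        else:
            shutil.move(str(log), str(archive_dir / log.name))
        handled.append(log.name)
    return handled


# Hypothetical demo: one finished (gzipped) log and one still being written.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "run"
    run_dir.mkdir()
    (run_dir / "cpl.log.170316-151700.gz").write_text("finished segment")
    (run_dir / "cpl.log.170316-201700").write_text("segment still running")
    archived = archive_logs(run_dir, Path(tmp) / "archive" / "logs",
                            copy_only=True, no_incomplete_logs=True)
    remaining = sorted(p.name for p in run_dir.iterdir())
print(archived, remaining)
```

Copying instead of moving leaves the run directory untouched, so a segment that is still running does not lose the files it has open.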
Allow the case.st_archive script to work with mpaso and mpascice history and restart files. Also should work with mpasli but not tested. From the case directory, executing ./case.st_archive should move all history and restart files to the short term archive for all ACME components.
Fixes #1305
S2-131 #close
[BFB]
* rljacob/cime/fix-mpas-starchive:
  fix mpas pattern matching so only interim restart files are deleted
  Add ability to archive MPAS land ice files
  Add ability to handle mpas files
  Change regex for mpaso and mpascice files
I had the impression that short term archiving was now supposed to work with CIME5, so I decided to turn it on in my latest coupled simulation. Turns out that was a bad idea.
The short term archiving ran when the next job segment had already started, which I understand should be perfectly safe. Here are some obvious issues I encountered.
So, as a result, I have a bit of a mess now: log files with missing information, some component files that have been moved to their short term archiving location, and other component files still in their original location.