Current state of CIME data files #2161
Comments
I've fixed the issues in Headers and config_pio in my jpe/hide_xml_interface branch. I think we decided pretty early on that not all XML files would be compliant with the entryid format; that's why it's a separate class from genericxml. The more compliance we have, the easier it is to maintain, but I think that there are valid reasons for non-compliant files or blocks within files.
Update: the scan_children cost has been significantly reduced (factor of 6 or so) by pre-caching entries in EnvBase, a PR that went in a while ago. The next hotspot in case.setup is in the buildnmls. Some poking around revealed another significant source of performance loss for us in CIME: we read and parse the same XML files over and over even though they aren't changing. We are pretty good about not doing this unless necessary for the env XML files managed under the Case class. The problem seems to be with other XML files. I printed the XML file reads taking place during buildnml for a basic CIME case:
To summarize the above, we did 32 XML file reads, 24 of which were redundant. I believe the next step in our quest for performance is to add a filename + hash cache to GenericXML. This has an additional robustness benefit: multiple GenericXML objects that wrap the same XML file will share the same ElementRoot, so the in-memory representation will be consistent across all of the objects.
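To make the filename + hash cache idea concrete, here is a minimal sketch. It is not CIME's implementation: `_XML_CACHE` and `read_xml_cached` are hypothetical names, and the sketch uses the standard library's `xml.etree.ElementTree` directly rather than GenericXML.

```python
import hashlib
import xml.etree.ElementTree as ET

# Hypothetical module-level cache: path -> (content digest, parsed root).
_XML_CACHE = {}


def read_xml_cached(path):
    """Return a parsed root for path, re-parsing only if the bytes changed."""
    with open(path, "rb") as fd:
        data = fd.read()
    digest = hashlib.sha256(data).hexdigest()

    cached = _XML_CACHE.get(path)
    if cached is not None and cached[0] == digest:
        # Same file, same content: every caller shares one root object.
        return cached[1]

    root = ET.fromstring(data)
    _XML_CACHE[path] = (digest, root)
    return root
```

Note that this version still reads the file's bytes on every call in order to hash them; what it saves is the repeated parse, and it gives every caller the same in-memory root for an unchanged file.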
Update: full-file caching of read-only XML files has been implemented. I think we can now cautiously say that CIME offers reasonable performance when used correctly. That brings me to my next concern: the weak guarantees offered by the CIME data model (by that I mean the database of env XML values for a case). The value for an XML entry id for an active CIME process exists in at least 3 places:
As a rough analogy, think of (1) as a CPU register, (2) as a CPU cache, (3) as main memory, and (4) as disk. All these things need to be coherent or things will get very weird fast, and the same is true for CIME. There are two very obvious problems in our data model.
CIME currently works because we almost never have multiple non-const GenericXML objects wrapping the same XML file at the same time. This is because we have nicely encapsulated the Env XML objects in the Case class. This mostly addresses (1), although not with much rigor, because some parts of the code were violating this assumption and getting lucky that it never caused a failure. We addressed (2) by calling case.read_xml() every time we think any of the env_*.xml files may have changed behind our backs. This does a very expensive full re-read of all the env_*.xml files. Unfortunately, it's completely up to the CIME developer to remember to call this when needed. So, again, there's nothing to catch potential mistakes. Thoughts?
Some naive thoughts on (2)... sorry if these are obvious, dumb, unworkable or all three :-)
It seems like there are two approaches for addressing this:
(a) Don't allow an xml file to be modified "behind the back". If there are reasons why this needs to be done now, make these operations achievable via the class interfaces - for example, providing a method that does the copy command you gave in your example... the method would know that it then needs to re-load the file. It would be hard (impossible?) to completely prevent someone from going "behind the back" of the interfaces (e.g., stopping someone from writing code that modifies an xml file directly), but this should never pass code review.
(b) Build some more smarts into the "get" routines. For example, could they check the last-modified date/time-stamp on the file, and reread the file if it has been changed since the last read? However, I could imagine this carrying some real performance cost, since we'd need to query this file metadata on disk for every "get".
I like (a) better, but I don't know how feasible it is.
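For reference, option (b) could look roughly like the sketch below. Everything here is an assumption for illustration: `MtimeCheckedXML` is a hypothetical wrapper, not GenericXML, and it assumes a simplified layout in which `<entry id="..." value="..."/>` elements sit directly under the root.

```python
import os
import xml.etree.ElementTree as ET


class MtimeCheckedXML:
    """Hypothetical wrapper that re-reads its file when the mod-time changes."""

    def __init__(self, path):
        self.path = path
        self._load()

    def _load(self):
        # Record the mod-time before parsing, so a write that lands during
        # the parse is still noticed on the next get_value().
        self._mtime_ns = os.stat(self.path).st_mtime_ns
        self.root = ET.parse(self.path).getroot()

    def get_value(self, entry_id):
        # One stat() per get: this is the cost that option (b) worries about.
        if os.stat(self.path).st_mtime_ns != self._mtime_ns:
            self._load()
        node = self.root.find("./entry[@id='%s']" % entry_id)
        return None if node is None else node.get("value")
```

The `stat()` call is much cheaper than a re-parse, but it is still a system call on every get, so whether the cost is acceptable would need to be measured on the target filesystems.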
@billsacks , I was thinking similar things. Options include periodically checking the timestamp or checksum of the file, or just setting the files to read-only and only letting GenericXML chmod +w the file when it flushes. (a) is the correct long-term objective. Everything that wants to modify these files needs to be a python library that takes a Case object.
@jhkennedy , I thought this discussion might interest you and/or serve as an example of a discussion ticket. |
Continuing this discussion. One thing to note: with the new caching system, the file structure is locked. When the file is being built during create_newcase, no caching occurs. In later phases, the file structure is assumed to be static, so adding and removing children won't work. The only things expected to change are entry values. Since it's the structure (children) that is cached, any change to an entry value should immediately be visible to all GenericXML objects that wrap that file, so having multiple GenericXML objects wrapping the same XML file is not a problem. That leaves the "behind-the-back" modification of XML files (modification of env XML files by subprocesses of CIME) as the remaining issue. I don't think it's too hard to use file timestamps to prevent use of old cache values when opening new GenericXML objects, but I am unsure of how to detect and invalidate existing GenericXML objects. We currently handle this by calling case.read_xml() to re-read all these files. This allows CIME to tolerate behind-the-back modifications in the following places:
NOTE: this means the buildnml scripts for components should not ever modify env XML files. |
Upon case.exit all XML files call write, even for read-only cases. If we do a timestamp check there, we should at least be able to raise an error if it looks like a file was changed without CIME's knowing about it. |
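A rough sketch of such a check at write time, assuming a hypothetical `GuardedXML` class and `StaleFileError` exception (neither is CIME's API), might be:

```python
import os
import xml.etree.ElementTree as ET


class StaleFileError(RuntimeError):
    """Hypothetical error: the file changed on disk after we read it."""


class GuardedXML:
    """Hypothetical wrapper that refuses to overwrite an externally changed file."""

    def __init__(self, path):
        self.path = path
        self._mtime_ns = os.stat(path).st_mtime_ns
        self.tree = ET.parse(path)

    def write(self):
        # If the mod-time moved since our last read or write, someone else
        # touched the file; writing now would silently discard their change.
        if os.stat(self.path).st_mtime_ns != self._mtime_ns:
            raise StaleFileError("%s was modified outside this object" % self.path)
        self.tree.write(self.path)
        self._mtime_ns = os.stat(self.path).st_mtime_ns
```

This only detects the problem; it does not resolve it. There is also a small window between the `stat()` and the write in which another process could still slip in a change.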
One last comment: the reason this issue has been painful for us is that we are using CIME's XML files like a database (multiple processes accessing concurrently) without using a proper database. That said, with the additional checks I'm pushing today, I feel like we are robust enough to close this issue for now.
Big improvement to robustness of CIME's XML handling

Change list:
Propagate Case read_only to its XML objects
Cache all files, but use file mod-time to detect changes.
Try to pass Case, not caseroot, to run_sub_or_cmd
Add a CIME performance test
Add tests for CIME XML handling

Test suite: scripts_regression_tests
Test baseline:
Test namelist changes:
Test status: bit for bit

Fixes #2161
Fixes #2850

User interface changes?: GenericXML will now throw an error if XML files are changed behind its back without a re-read/invalidate. It is now expected that all python buildlib calls take a Case object, not caseroot.

Instructions for updating python buildlib scripts: The main function should open a Case object (preferably, read-only) and pass that object to the buildlib function, not the caseroot. The buildlib function should take a case object, not a caseroot, and should therefore not have to open a Case object.

Update gh-pages html (Y/N)?:

Code review: @jedwards4b @billsacks
Reopening this issue since we are still having problems. The key issue is subprocesses accessing XML while CIME is running. There are two failure modes here:
Idea from telecon: Only allow one instance of CIME active in a case at a time. Implementation: upon import of generic xml, lock the file. Do we have enough exception handling in place? |
To elaborate a bit on @rljacob 's last comment: For locking, I was imagining creating an empty file in the case directory like '.cime_in_use', then not allowing any other processes to interact with xml files if that file is present. Actually, I guess that file would need a process ID, so we'd just prevent processes with a different process ID from interacting with xml files in that case? But @jgfouca might have something else in mind.
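As a sketch of what that could look like, here is a context manager that creates the lock file atomically and records the owning process ID. The file name '.cime_in_use' comes from the comment above; `case_lock` is a hypothetical helper, not an existing CIME function.

```python
import os
from contextlib import contextmanager

LOCK_NAME = ".cime_in_use"


@contextmanager
def case_lock(caseroot):
    """Hold an exclusive lock file in caseroot for the duration of the block."""
    lock_path = os.path.join(caseroot, LOCK_NAME)
    try:
        # O_EXCL makes creation atomic: it fails if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        with open(lock_path) as f:
            owner = f.read().strip()
        raise RuntimeError("case %s is in use by process %s" % (caseroot, owner))
    try:
        with os.fdopen(fd, "w") as f:
            f.write(str(os.getpid()))
        yield lock_path
    finally:
        # Runs on normal exit and on exceptions, but not if the process is
        # killed outright, so stale locks remain possible.
        os.remove(lock_path)
```

Usage would be along the lines of `with case_lock(caseroot): ...` around any code that touches the case's XML files. A process that dies without running the `finally` clause leaves the file behind, so some way of clearing stale locks would still be needed.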
This is referring to: We need to be sure to "unlock" (e.g., remove the '.cime_in_use' file) whenever a cime process exits. As much as possible, then, we'd want to catch exceptions and clean up after ourselves - flush any unflushed changes to xml files and unlock the case. @jgfouca pointed out that there will always be some instances where we can't exit gracefully - e.g., if someone |
@jgfouca I was thinking about this - what if we add a timestamp to each xml variable and use it to determine whether the value on file is newer than the value in memory? Would this allow us to do things like change the RESUBMIT value while the model is running without otherwise slowing things down too much? |
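A minimal sketch of the per-variable timestamp idea follows. The attribute name `mtime` and the helpers `set_value` and `resolve` are made up for illustration; they operate on plain ElementTree elements rather than CIME's classes.

```python
import time

STAMP = "mtime"  # hypothetical attribute name holding the last-set time


def set_value(entry, value, now=None):
    """Set an entry's value and stamp it with the time of the change."""
    entry.set("value", str(value))
    entry.set(STAMP, repr(time.time() if now is None else now))


def resolve(mem_entry, file_entry):
    """Return the value from whichever copy was stamped most recently."""
    mem_t = float(mem_entry.get(STAMP, "0"))
    file_t = float(file_entry.get(STAMP, "0"))
    return (file_entry if file_t > mem_t else mem_entry).get("value")
```

To compare stamps, the on-file entry still has to be read from disk, so this does not by itself avoid the file access; it only decides which copy wins once both are available. It also relies on all writers sharing a consistent clock.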
@jedwards4b , I hadn't thought of that. I'm going to begin this effort soon and so will be thinking about it more |
Discussion from telecon: would add an attribute to each variable that has a timestamp. But checking that could be slow. |
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. |
This issue was closed because it has been stalled for 5 days with no activity. |
Hi all,
This is a follow-on discussion to the XML API discussion work that has been completed (not yet merged though as of this typing). Using the new API, I've confirmed that scanning for children is killing our performance. Note the 46% of runtime spent in scan_children (it would be a far higher % except for the fact that loading environment modules is crazy slow on this machine):
We need so many scans because our file formats are so inconsistent: it's hard to rely on assumptions about file structure to do vastly more efficient direct-child searches.
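To illustrate the difference between the two kinds of search, here is a small sketch using the standard library's ElementTree. The function names are hypothetical and are not the ones in the CIME XML API.

```python
import xml.etree.ElementTree as ET


def matches(node, attrs):
    """True if node carries every attribute/value pair in attrs."""
    return all(node.get(k) == v for k, v in attrs.items())


def scan_all(root, tag, attrs):
    """Visit every descendant of root: cost grows with the whole tree."""
    return [n for n in root.iter(tag) if matches(n, attrs)]


def direct_children(parent, tag, attrs):
    """Look only at parent's immediate children: cost grows with one level."""
    return [n for n in parent.findall(tag) if matches(n, attrs)]
```

The direct-child version is only safe when we know which parent an entry lives under, which is exactly the structural assumption that inconsistent file formats make hard to rely on.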
To help inform the discussion, here's a class diagram of the python classes in XML:
Black lines denote inheritance, red lines denote a "has-a" relationship with the critical Case class. Note that the Case class splits its has-a relationships to XML files between "entry-id" and "generic" files.
Diving deeper, here's an analysis of the files themselves:
The things that stick out to me:
Short-term path forward:
Long-term path forward: