Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Current state of CIME data files #2161

Closed
jgfouca opened this issue Dec 19, 2017 · 18 comments
Closed

Current state of CIME data files #2161

jgfouca opened this issue Dec 19, 2017 · 18 comments

Comments

@jgfouca
Copy link
Contributor

jgfouca commented Dec 19, 2017

Hi all,

This is a follow-on discussion to the XML API discussion work that has been completed (not yet merged though as of this typing). Using the new API, I've confirmed that scanning for children is killing our performance. Note the 46% of runtime spent in scan_children (it would be a far higher % except for the fact that loading environment modules is crazy slow on this machine):

screen shot 2017-12-19 at 3 46 42 pm

And the reason that we need to do so many scans is because it's hard to utilize assumptions about file structure to do vastly-more-efficient direct-child searches because our file formats are so inconsistent.

To help inform the discussion, here's a class diagram of the python classes in XML:

screen shot 2017-12-19 at 3 50 00 pm

Black lines denote inheritance, red lines denote a "has-a" relationship with the critical Case class. Note that the case class splits it's has-a relationships to XML files between "entry-id" and "generic" files.

Diving deeper, here's an analysis of the files themselves:

Archive(GenericXML):
  * CIME/config/$MODEL/config_archive.xml
  * Format: custom

Batch(GenericXML):
  * CIME/config/$MODEL/machines/config_archive.xml
  * Format: custom

Compilers(GenericXML):
  * CIME/config/$MODEL/machines/config_compilers.xml
  * Format: custom, ACME and CESM have different formats (1.0 vs. 2.0)

Compsets(GenericXML):
  * CIME/config/acme/allactive/config_compsets.xml
  * CIME/src/drivers/mct/cime_config/config_compsets.xml
  * components can add more
  * Format: custom

EnvArchive(GenericXML):
  * $CASE/env_archive.xml : Note, this is the ONLY direct subclass of GenericXML that lives in the CASE directory
  * Format: custom, similar to Archive

Grids(GenericXML):
  * CIME/config/$MODEL/config_grids.xml
  * Format: custom, ACME and CESM have different formats (1.0 vs. 2.0)

Machines(GenericXML):
  * CIME/config/$MODEL/machines/config_machines.xml
  * Format: custom

Pes(GenericXML):
  * CIME/config/acme/allactive/config_pesall.xml
  * Format: custom

TestReporter(GenericXML):
  * ? (CESM only)
  * Format: custom

Tests(GenericXML):
  * CIME/config/config_tests.xml
  * Format: custom

TestSpec(GenericXML):
  * ? (CESM only)
  * Format: custom



Component(EntryId):
  * CIME/src/drivers/mct/cime_config/config_component.xml + config_component_$MODEL.xml
  * Format: entry_id->entry->(various entry info)

Headers(EntryId):
  * CIME/config/config_headers.xml
  * Format: custom, why is this a subclass of EntryId?

Files(EntryId):
  * CIME/config/config_headers.xml
  * Format: entry_id->entry->(various entry info)

NamelistDefinitions(EntryId):
  * Lots
  * Format: entry_id->entry->(various entry info)

PIO(EntryId):
  * CIME/config/$MODEL/machines/config_pio.xml
  * Format: entries->entry->(values) , why different root element tagname?



EnvBatch(EnvBase):
  * $CASE/env_batch.xml
  * Format: file->group(job)->entry + noncompliant batch_system block

EnvBuild(EnvBase):
  * $CASE/env_build.xml
  * Format: file->group->entry fully compliant file!

EnvCase(EnvBase):
  * $CASE/env_case.xml
  * Format: file->group->entry fully compliant file!

EnvMachPes(EnvBase):
  * $CASE/env_mach_pes.xml
  * Format: file->group->entry fully compliant file!

EnvMachSpecific(EnvBase):
  * $CASE/env_mach_specific.xml
  * Format: custom

EnvRun(EnvBase):
  * $CASE/env_run.xml
  * Format: file->group->entry fully compliant file!

EnvTest(EnvBase):
  * $CASE/env_test.xml
  * Format: file->group(job)->entry + noncompliant TEST block

The things that stick out to me:

  • EnvArchive is strange. It's the only direct subclass of GenericXML that lives in the case directory, but it's totally non compliant with our standard env file formatting.
  • Headers seems like is should be a direct subclass of GenericXML, not an EntryId, since it has no entries.
  • The root element of PIO is inconsistent with its siblings
  • EnvBatch is almost compliant except for one block
  • EnvTest is almost compliant except for one block
  • EnvMachPes makes no effort at all to be compliant

Short-term path forward:

  • Re-work class hierarchy to differentiate compliant files from noncompliant
  • The performance of all compliant Env*.xml files can be drastically increased by leveraging the consistent file structure they all share when retrieving entry id values
  • This would most-likely involve a cached entry_name -> (group or entry node) map

Long-term path forward:

  • It sure would be nice if there was more consistency all around
  • break backwards compatibility by forcing more Env* files to be compliant
  • more consistency between the direct subclasses of GenericXML would be nice (config_machines, config_grid), but is not a necessity since create_newcase is already pretty fast and the penalties for parsing these files is only incurred once (at create_newcase).
@jedwards4b
Copy link
Contributor

I've fixed the issues in Headers and config_pio in my jpe/hide_xml_interface branch. I think we decided pretty early on that not all xml files would be compliant with the entryid format, that's why its a separate class from genericxml. The more compliance we have the easier it is to maintain, but I think that there are valid reasons for non compliant files or blocks within files.

@jgfouca
Copy link
Contributor Author

jgfouca commented Jan 18, 2018

Update: the scan_children cost has been significantly reduced (factor of 6 or so) by pre-caching entries in EnvBase, a PR that went in a while ago. The next hotspot in case.setup is in the buildnmls. Some poking around revealed another significant source of performance loss for us in CIME: we read and parse the same XML files over and over even though they aren't changing. We are pretty good about not doing this unless necessary for the env XML files managed under the Case class. The problem seems to be with other XML files.

I printed the XML file reads taking place during buildnml for a basic CIME case:

JGF read XML: config/acme/config_files.xml
JGF read XML: config/acme/config_files.xml
JGF read XML: config/acme/config_files.xml
JGF read XML: config/acme/config_files.xml
JGF read XML: config/acme/config_files.xml
JGF read XML: config/acme/config_files.xml
JGF read XML: config/acme/config_files.xml
JGF read XML: config/acme/config_files.xml
JGF read XML: config/acme/config_files.xml
JGF read XML: config/acme/config_files.xml
JGF read XML: config/acme/config_files.xml
JGF read XML: config/acme/config_files.xml
JGF read XML: config/acme/config_files.xml
JGF read XML: config/acme/config_files.xml
JGF read XML: config/acme/config_files.xml
JGF read XML: config/acme/config_files.xml
JGF read XML: config/acme/config_files.xml
JGF read XML: config/acme/config_grids.xml
JGF read XML: src/components/data_comps/datm/cime_config/namelist_definition_datm.xml
JGF read XML: src/components/data_comps/dice/cime_config/namelist_definition_dice.xml
JGF read XML: src/components/data_comps/docn/cime_config/namelist_definition_docn.xml
JGF read XML: src/components/data_comps/drof/cime_config/namelist_definition_drof.xml
JGF read XML: src/drivers/mct/cime_config/namelist_definition_drv.xml
JGF read XML: src/drivers/mct/cime_config/namelist_definition_modelio.xml
JGF read XML: src/drivers/mct/cime_config/namelist_definition_modelio.xml
JGF read XML: src/drivers/mct/cime_config/namelist_definition_modelio.xml
JGF read XML: src/drivers/mct/cime_config/namelist_definition_modelio.xml
JGF read XML: src/drivers/mct/cime_config/namelist_definition_modelio.xml
JGF read XML: src/drivers/mct/cime_config/namelist_definition_modelio.xml
JGF read XML: src/drivers/mct/cime_config/namelist_definition_modelio.xml
JGF read XML: src/drivers/mct/cime_config/namelist_definition_modelio.xml
JGF read XML: src/drivers/mct/cime_config/namelist_definition_modelio.xml

To summarize the above, we did 32 XML file reads, 24 of which were redundant.

I believe the next step in our quest for performance is to add a filename + hash cash to GenericXML. This will have the additional robustness benefit of multiple different GenericXML objects that wrap the same XML file will have the same ElementRoot, so the in-memory representation will be consistent across all of the objects.

@jgfouca jgfouca reopened this Feb 1, 2018
@jgfouca
Copy link
Contributor Author

jgfouca commented Feb 1, 2018

Update: full-file caching of read-only XML files has been implemented. I think we can now cautiously say that CIME offers reasonable performance when used correctly.

That brings me to my next concern: the weak guarantees offered by the CIME data model (by that I mean the database of env XML values for a case). The value for an XML entry id for an active CIME process exists in at least 3 places:

  1. storage in a local variable, e.g. foo = case.get_value("FOO")
  2. The special entry-id cache in EnvBase
  3. The in-memory XML representation in GenericXML
  4. The value it has on-disk in its XML file

As a rough analogy, think of (1) as a CPU register, (2) as a CPU cache, (3) as main memory, and (4) as disk. All these things need to be coherent or things will get very weird fast, and the same is true for CIME.

There are two very obvious problems in our data model.

  1. If you create two non-const GenericXML objects around the same XML file, these two instances will quickly fall out of sync:
obj1 = GenericXML("foo.xml")
obj2 = GenericXML("foo.xml")
obj2.add_child("bar")
obj1.get_child("bar") # None!
  1. If the XML file is modified "behind the back" of GenericXML, things will be incoherent:
obj = GenericXML("foo.xml")
run_cmd("cp different_stuff.xml foo.xml")
obj.get_child("different_stuff") # None!
  1. Unflushed writes are an issue when the client accesses the Env objects directly (not through Case). We should consider having all XML files be accessed with a context manager.
obj = GenericXML("foo.xml")
obj.add_child("bar")
* Process exists, client never called write *
obj = GenericXML("foo.xml")
obj.get_child("bar") # None!

CIME currently works because we almost never have multiple non-const GenericXML objects wrapping the same XML file at the same time. This is because we have nicely encapsulated the Env XML objects in the Case class. This mostly addresses (1), although not with much rigor because there were some parts of the code violating this assumption and getting lucky that it never caused a failure.

We addressed (2) by calling case.read_xml() every time we think any of the env_.xml files may have change behind our backs. This does a very expensive full re-read of all the env.xml files. Unfortunately, it's completely up to the CIME developer to remember to call this when needed. So, again, there's nothing to catch potential mistakes.

Thoughts?

@billsacks
Copy link
Member

Some naive thoughts on (2)... sorry if these are obvious, dumb, unworkable or all three :-)

If the XML file is modified "behind the back" of GenericXML, things will be incoherent

It seems like there are two approaches for addressing this:

(a) Don't allow an xml file to be modified "behind the back". If there are reasons why this needs to be done now, make these operations achievable via the class interfaces - for example, providing a method that does the copy command you gave in your example... the method would know that it then needs to re-load the file. It would be hard (impossible?) to completely prevent someone from going "behind the back" of the interfaces (e.g., stopping someone from writing code that modifies an xml file directly), but this should never pass code review.

(b) Build some more smarts into the "get" routines. For example, could they check the last-modified date/time-stamp on the file, and reread the file if it has been changed since the last read? However, I could imagine this carrying some real performance cost, since we'd need to query this file metadata on disk for every "get".

I like (a) better, but I don't know how feasible it is.

@jgfouca
Copy link
Contributor Author

jgfouca commented Feb 1, 2018

@billsacks , I was thinking similar things. Periodically checking the timestamp or checksum of the file or just setting the files to read-only and only let genericXML chmod+w the file when it flushes.

(a) is the correct longterm objective. Everything that wants to modify these files needs to be a python library that takes a Case object.

@jgfouca
Copy link
Contributor Author

jgfouca commented Feb 8, 2018

@jhkennedy , I thought this discussion might interest you and/or serve as an example of a discussion ticket.

@jgfouca
Copy link
Contributor Author

jgfouca commented Dec 3, 2018

Continuing this discussion.

One thing to note: with the new caching system, the file structure is locked. When the file is being built during create_newcase, no caching occurs. In later phases, the file structure is assumed to be static, adding and removing children won't work. The only thing expected to change are entry values. Since it's the structure (children) that are cached, any change to an entry value should immediately be visible to all GenericXML objects that wrap that file, so having multiple GenericXML objects wrapping the same XML file is not a problem.

That leaves the "behind-the-back" modification (modification of env XML files by subprocesses of CIME) of XML files as the remaining issue. I don't think it's too hard to use file timestamps to prevent use of old cache values when opening new GenericXML objects, but I am unsure of how to detect and invalidate existing GenericXML objects. We currently handle this by calling case.read_xml() to re-read all these files. This allows CIME to tolerate behind-the-back modifications in the following places:

  • component build scripts
  • case_run prerun script
  • case_run data-assimilate script
  • case_run postrun script

NOTE: this means the buildnml scripts for components should not ever modify env XML files.

@jgfouca
Copy link
Contributor Author

jgfouca commented Dec 3, 2018

Upon case.exit all XML files call write, even for read-only cases. If we do a timestamp check there, we should at least be able to raise an error if it looks like a file was changed without CIME's knowing about it.

@jgfouca
Copy link
Contributor Author

jgfouca commented Dec 3, 2018

One last comment: the reason this issue has been been painful for us is because we are using CIME's XML files like a database (multiple processes accessing concurrently) without using a proper database. That said, with the additional checks I'm pushing today, I feel like we are robust enough to close this issue for now.

jgfouca added a commit that referenced this issue Dec 7, 2018
Big improvement to robustness of CIME's XML handling

Change list:

Propagate Case read_only to its XML objects
Cache all files, but use file mod-time to detect changes.
Try to pass Case, not caseroot, to run_sub_or_cmd
Add a CIME performance test
Add tests for CIME XML handling
Test suite: scripts_regression_tests
Test baseline:
Test namelist changes:
Test status: bit for bit

Fixes #2161
Fixes #2850

User interface changes?: GenericXML will now throw an error if XML files are changed behind its back without a re-read/invalidate. It is now expected that all python buildlib calls take a Case object, not caseroot.

Instructions for updating python buildlib scripts:

The main function should open a Case object (preferably, read-only) and pass that object to the buildlib function, not the caseroot.
The buildlib function should take a case object, not a caseroot, and should therefore not have to open a Case object.
Update gh-pages html (Y/N)?:

Code review: @jedwards4b @billsacks
@jgfouca
Copy link
Contributor Author

jgfouca commented Jan 17, 2019

Reopening this issue since we are still having problems. The key issue is subprocesses accessing XML while CIME is running. There are two failure modes here:

  1. The subprocess modifies an XML file and the parent CIME process doesn't "see" the change (failed to call read_xml after the subprocess returned). This error should be caught in current CIME with the timestamp checking that was recently added. The remaining issue here is that it can be pretty hard to debug this kind of error.
  2. The CIME process has unflushed XML changes and then calls a subprocess that is sensitive to those changes. We are currently vulnerable to this error and have no checks to prevent it. We could potentially check all open Case objects for unflushed changes every time a run_cmd is issued, but this could yield lots of false positives.

@jgfouca jgfouca reopened this Jan 17, 2019
@rljacob
Copy link
Member

rljacob commented Jan 30, 2019

Idea from telecon: Only allow one instance of CIME active in a case at a time.

Implementation: upon import of generic xml, lock the file.

Do we have enough exception handling in place?

@billsacks
Copy link
Member

To elaborate a bit on @rljacob 's last comment:

For locking, I was imagining creating an empty file in the case directory like '.cime_in_use', then not allowing any other processes to interact with xml files if that file is present. Actually, I guess that file would need a process ID, so we'd just prevent processes with a different process ID from interacting with xml files in that case? But @jgfouca might have something else in mind.

Do we have enough exception handling in place?

This is referring to: We need to be sure to "unlock" (e.g., remove the '.cime_in_use' file) whenever a cime process exits. As much as possible, then, we'd want to catch exceptions and clean up after ourselves - flush any unflushed changes to xml files and unlock the case.

@jgfouca pointed out that there will always be some instances where we can't exit gracefully - e.g., if someone kill -9s us, or if there is an uncaught exception. But, as was pointed out on the telecon, in those cases, it might be okay for the lock to be left in place, forcing the user to start over - because the state of the xml files would be unknown in that case.

@jedwards4b
Copy link
Contributor

@jgfouca I was thinking about this - what if we add a timestamp to each xml variable and use it to determine whether the value on file is newer than the value in memory? Would this allow us to do things like change the RESUBMIT value while the model is running without otherwise slowing things down too much?

@jgfouca
Copy link
Contributor Author

jgfouca commented Aug 19, 2020

@jedwards4b , I hadn't thought of that. I'm going to begin this effort soon and so will be thinking about it more

@rljacob
Copy link
Member

rljacob commented Aug 19, 2020

Discussion from telecon: would add an attribute to each variable that has a timestamp. But checking that could be slow.
If you change something in env_build.xml while a job is running AND don't actually rebuild before the job finishes, it will also error (we want that).
"env_run.xml" is described as variables you can change after build and before submit. We've said nothing about what you can change after submit or while running.

@github-actions
Copy link
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Aug 12, 2023
@jasonb5 jasonb5 removed the Stale label Aug 12, 2023
@billsacks billsacks removed their assignment Sep 14, 2023
Copy link
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Nov 11, 2023
Copy link
Contributor

This issue was closed because it has been stalled for 5 days with no activity.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment