Current state of CIME data files #2161

jgfouca · 2017-12-19T22:59:22Z

Hi all,

This is a follow-on discussion to the XML API discussion work that has been completed (not yet merged though as of this typing). Using the new API, I've confirmed that scanning for children is killing our performance. Note the 46% of runtime spent in scan_children (it would be a far higher % except for the fact that loading environment modules is crazy slow on this machine):

And the reason that we need to do so many scans is because it's hard to utilize assumptions about file structure to do vastly-more-efficient direct-child searches because our file formats are so inconsistent.

To help inform the discussion, here's a class diagram of the python classes in XML:

Black lines denote inheritance, red lines denote a "has-a" relationship with the critical Case class. Note that the case class splits it's has-a relationships to XML files between "entry-id" and "generic" files.

Diving deeper, here's an analysis of the files themselves:

Archive(GenericXML):
  * CIME/config/$MODEL/config_archive.xml
  * Format: custom

Batch(GenericXML):
  * CIME/config/$MODEL/machines/config_archive.xml
  * Format: custom

Compilers(GenericXML):
  * CIME/config/$MODEL/machines/config_compilers.xml
  * Format: custom, ACME and CESM have different formats (1.0 vs. 2.0)

Compsets(GenericXML):
  * CIME/config/acme/allactive/config_compsets.xml
  * CIME/src/drivers/mct/cime_config/config_compsets.xml
  * components can add more
  * Format: custom

EnvArchive(GenericXML):
  * $CASE/env_archive.xml : Note, this is the ONLY direct subclass of GenericXML that lives in the CASE directory
  * Format: custom, similar to Archive

Grids(GenericXML):
  * CIME/config/$MODEL/config_grids.xml
  * Format: custom, ACME and CESM have different formats (1.0 vs. 2.0)

Machines(GenericXML):
  * CIME/config/$MODEL/machines/config_machines.xml
  * Format: custom

Pes(GenericXML):
  * CIME/config/acme/allactive/config_pesall.xml
  * Format: custom

TestReporter(GenericXML):
  * ? (CESM only)
  * Format: custom

Tests(GenericXML):
  * CIME/config/config_tests.xml
  * Format: custom

TestSpec(GenericXML):
  * ? (CESM only)
  * Format: custom



Component(EntryId):
  * CIME/src/drivers/mct/cime_config/config_component.xml + config_component_$MODEL.xml
  * Format: entry_id->entry->(various entry info)

Headers(EntryId):
  * CIME/config/config_headers.xml
  * Format: custom, why is this a subclass of EntryId?

Files(EntryId):
  * CIME/config/config_headers.xml
  * Format: entry_id->entry->(various entry info)

NamelistDefinitions(EntryId):
  * Lots
  * Format: entry_id->entry->(various entry info)

PIO(EntryId):
  * CIME/config/$MODEL/machines/config_pio.xml
  * Format: entries->entry->(values) , why different root element tagname?



EnvBatch(EnvBase):
  * $CASE/env_batch.xml
  * Format: file->group(job)->entry + noncompliant batch_system block

EnvBuild(EnvBase):
  * $CASE/env_build.xml
  * Format: file->group->entry fully compliant file!

EnvCase(EnvBase):
  * $CASE/env_case.xml
  * Format: file->group->entry fully compliant file!

EnvMachPes(EnvBase):
  * $CASE/env_mach_pes.xml
  * Format: file->group->entry fully compliant file!

EnvMachSpecific(EnvBase):
  * $CASE/env_mach_specific.xml
  * Format: custom

EnvRun(EnvBase):
  * $CASE/env_run.xml
  * Format: file->group->entry fully compliant file!

EnvTest(EnvBase):
  * $CASE/env_test.xml
  * Format: file->group(job)->entry + noncompliant TEST block

The things that stick out to me:

EnvArchive is strange. It's the only direct subclass of GenericXML that lives in the case directory, but it's totally non compliant with our standard env file formatting.
Headers seems like is should be a direct subclass of GenericXML, not an EntryId, since it has no entries.
The root element of PIO is inconsistent with its siblings
EnvBatch is almost compliant except for one block
EnvTest is almost compliant except for one block
EnvMachPes makes no effort at all to be compliant

Short-term path forward:

Re-work class hierarchy to differentiate compliant files from noncompliant
The performance of all compliant Env*.xml files can be drastically increased by leveraging the consistent file structure they all share when retrieving entry id values
This would most-likely involve a cached entry_name -> (group or entry node) map

Long-term path forward:

It sure would be nice if there was more consistency all around
break backwards compatibility by forcing more Env* files to be compliant
more consistency between the direct subclasses of GenericXML would be nice (config_machines, config_grid), but is not a necessity since create_newcase is already pretty fast and the penalties for parsing these files is only incurred once (at create_newcase).

The text was updated successfully, but these errors were encountered:

jedwards4b · 2017-12-20T18:36:03Z

I've fixed the issues in Headers and config_pio in my jpe/hide_xml_interface branch. I think we decided pretty early on that not all xml files would be compliant with the entryid format, that's why its a separate class from genericxml. The more compliance we have the easier it is to maintain, but I think that there are valid reasons for non compliant files or blocks within files.

jgfouca · 2018-01-18T23:48:58Z

Update: the scan_children cost has been significantly reduced (factor of 6 or so) by pre-caching entries in EnvBase, a PR that went in a while ago. The next hotspot in case.setup is in the buildnmls. Some poking around revealed another significant source of performance loss for us in CIME: we read and parse the same XML files over and over even though they aren't changing. We are pretty good about not doing this unless necessary for the env XML files managed under the Case class. The problem seems to be with other XML files.

I printed the XML file reads taking place during buildnml for a basic CIME case:

JGF read XML: config/acme/config_files.xml
JGF read XML: config/acme/config_files.xml
JGF read XML: config/acme/config_files.xml
JGF read XML: config/acme/config_files.xml
JGF read XML: config/acme/config_files.xml
JGF read XML: config/acme/config_files.xml
JGF read XML: config/acme/config_files.xml
JGF read XML: config/acme/config_files.xml
JGF read XML: config/acme/config_files.xml
JGF read XML: config/acme/config_files.xml
JGF read XML: config/acme/config_files.xml
JGF read XML: config/acme/config_files.xml
JGF read XML: config/acme/config_files.xml
JGF read XML: config/acme/config_files.xml
JGF read XML: config/acme/config_files.xml
JGF read XML: config/acme/config_files.xml
JGF read XML: config/acme/config_files.xml
JGF read XML: config/acme/config_grids.xml
JGF read XML: src/components/data_comps/datm/cime_config/namelist_definition_datm.xml
JGF read XML: src/components/data_comps/dice/cime_config/namelist_definition_dice.xml
JGF read XML: src/components/data_comps/docn/cime_config/namelist_definition_docn.xml
JGF read XML: src/components/data_comps/drof/cime_config/namelist_definition_drof.xml
JGF read XML: src/drivers/mct/cime_config/namelist_definition_drv.xml
JGF read XML: src/drivers/mct/cime_config/namelist_definition_modelio.xml
JGF read XML: src/drivers/mct/cime_config/namelist_definition_modelio.xml
JGF read XML: src/drivers/mct/cime_config/namelist_definition_modelio.xml
JGF read XML: src/drivers/mct/cime_config/namelist_definition_modelio.xml
JGF read XML: src/drivers/mct/cime_config/namelist_definition_modelio.xml
JGF read XML: src/drivers/mct/cime_config/namelist_definition_modelio.xml
JGF read XML: src/drivers/mct/cime_config/namelist_definition_modelio.xml
JGF read XML: src/drivers/mct/cime_config/namelist_definition_modelio.xml
JGF read XML: src/drivers/mct/cime_config/namelist_definition_modelio.xml

To summarize the above, we did 32 XML file reads, 24 of which were redundant.

I believe the next step in our quest for performance is to add a filename + hash cash to GenericXML. This will have the additional robustness benefit of multiple different GenericXML objects that wrap the same XML file will have the same ElementRoot, so the in-memory representation will be consistent across all of the objects.

jgfouca · 2018-02-01T01:05:21Z

Update: full-file caching of read-only XML files has been implemented. I think we can now cautiously say that CIME offers reasonable performance when used correctly.

That brings me to my next concern: the weak guarantees offered by the CIME data model (by that I mean the database of env XML values for a case). The value for an XML entry id for an active CIME process exists in at least 3 places:

storage in a local variable, e.g. foo = case.get_value("FOO")
The special entry-id cache in EnvBase
The in-memory XML representation in GenericXML
The value it has on-disk in its XML file

As a rough analogy, think of (1) as a CPU register, (2) as a CPU cache, (3) as main memory, and (4) as disk. All these things need to be coherent or things will get very weird fast, and the same is true for CIME.

There are two very obvious problems in our data model.

If you create two non-const GenericXML objects around the same XML file, these two instances will quickly fall out of sync:

obj1 = GenericXML("foo.xml")
obj2 = GenericXML("foo.xml")
obj2.add_child("bar")
obj1.get_child("bar") # None!

If the XML file is modified "behind the back" of GenericXML, things will be incoherent:

obj = GenericXML("foo.xml")
run_cmd("cp different_stuff.xml foo.xml")
obj.get_child("different_stuff") # None!

Unflushed writes are an issue when the client accesses the Env objects directly (not through Case). We should consider having all XML files be accessed with a context manager.

obj = GenericXML("foo.xml")
obj.add_child("bar")
* Process exists, client never called write *
obj = GenericXML("foo.xml")
obj.get_child("bar") # None!

CIME currently works because we almost never have multiple non-const GenericXML objects wrapping the same XML file at the same time. This is because we have nicely encapsulated the Env XML objects in the Case class. This mostly addresses (1), although not with much rigor because there were some parts of the code violating this assumption and getting lucky that it never caused a failure.

We addressed (2) by calling case.read_xml() every time we think any of the env_.xml files may have change behind our backs. This does a very expensive full re-read of all the env.xml files. Unfortunately, it's completely up to the CIME developer to remember to call this when needed. So, again, there's nothing to catch potential mistakes.

Thoughts?

billsacks · 2018-02-01T15:11:34Z

Some naive thoughts on (2)... sorry if these are obvious, dumb, unworkable or all three :-)

If the XML file is modified "behind the back" of GenericXML, things will be incoherent

It seems like there are two approaches for addressing this:

(a) Don't allow an xml file to be modified "behind the back". If there are reasons why this needs to be done now, make these operations achievable via the class interfaces - for example, providing a method that does the copy command you gave in your example... the method would know that it then needs to re-load the file. It would be hard (impossible?) to completely prevent someone from going "behind the back" of the interfaces (e.g., stopping someone from writing code that modifies an xml file directly), but this should never pass code review.

(b) Build some more smarts into the "get" routines. For example, could they check the last-modified date/time-stamp on the file, and reread the file if it has been changed since the last read? However, I could imagine this carrying some real performance cost, since we'd need to query this file metadata on disk for every "get".

I like (a) better, but I don't know how feasible it is.

jgfouca · 2018-02-01T17:40:15Z

@billsacks , I was thinking similar things. Periodically checking the timestamp or checksum of the file or just setting the files to read-only and only let genericXML chmod+w the file when it flushes.

(a) is the correct longterm objective. Everything that wants to modify these files needs to be a python library that takes a Case object.

jgfouca · 2018-02-08T01:38:52Z

@jhkennedy , I thought this discussion might interest you and/or serve as an example of a discussion ticket.

jgfouca · 2018-12-03T21:22:38Z

Continuing this discussion.

One thing to note: with the new caching system, the file structure is locked. When the file is being built during create_newcase, no caching occurs. In later phases, the file structure is assumed to be static, adding and removing children won't work. The only thing expected to change are entry values. Since it's the structure (children) that are cached, any change to an entry value should immediately be visible to all GenericXML objects that wrap that file, so having multiple GenericXML objects wrapping the same XML file is not a problem.

That leaves the "behind-the-back" modification (modification of env XML files by subprocesses of CIME) of XML files as the remaining issue. I don't think it's too hard to use file timestamps to prevent use of old cache values when opening new GenericXML objects, but I am unsure of how to detect and invalidate existing GenericXML objects. We currently handle this by calling case.read_xml() to re-read all these files. This allows CIME to tolerate behind-the-back modifications in the following places:

component build scripts
case_run prerun script
case_run data-assimilate script
case_run postrun script

NOTE: this means the buildnml scripts for components should not ever modify env XML files.

jgfouca · 2018-12-03T22:52:49Z

Upon case.exit all XML files call write, even for read-only cases. If we do a timestamp check there, we should at least be able to raise an error if it looks like a file was changed without CIME's knowing about it.

jgfouca · 2018-12-03T23:00:48Z

One last comment: the reason this issue has been been painful for us is because we are using CIME's XML files like a database (multiple processes accessing concurrently) without using a proper database. That said, with the additional checks I'm pushing today, I feel like we are robust enough to close this issue for now.

@jedwards4b

Big improvement to robustness of CIME's XML handling Change list: Propagate Case read_only to its XML objects Cache all files, but use file mod-time to detect changes. Try to pass Case, not caseroot, to run_sub_or_cmd Add a CIME performance test Add tests for CIME XML handling Test suite: scripts_regression_tests Test baseline: Test namelist changes: Test status: bit for bit Fixes #2161 Fixes #2850 User interface changes?: GenericXML will now throw an error if XML files are changed behind its back without a re-read/invalidate. It is now expected that all python buildlib calls take a Case object, not caseroot. Instructions for updating python buildlib scripts: The main function should open a Case object (preferably, read-only) and pass that object to the buildlib function, not the caseroot. The buildlib function should take a case object, not a caseroot, and should therefore not have to open a Case object. Update gh-pages html (Y/N)?: Code review: @jedwards4b @billsacks

jgfouca · 2019-01-17T00:12:10Z

Reopening this issue since we are still having problems. The key issue is subprocesses accessing XML while CIME is running. There are two failure modes here:

The subprocess modifies an XML file and the parent CIME process doesn't "see" the change (failed to call read_xml after the subprocess returned). This error should be caught in current CIME with the timestamp checking that was recently added. The remaining issue here is that it can be pretty hard to debug this kind of error.
The CIME process has unflushed XML changes and then calls a subprocess that is sensitive to those changes. We are currently vulnerable to this error and have no checks to prevent it. We could potentially check all open Case objects for unflushed changes every time a run_cmd is issued, but this could yield lots of false positives.

rljacob · 2019-01-30T21:19:48Z

Idea from telecon: Only allow one instance of CIME active in a case at a time.

Implementation: upon import of generic xml, lock the file.

Do we have enough exception handling in place?

billsacks · 2019-01-30T21:26:45Z

To elaborate a bit on @rljacob 's last comment:

For locking, I was imagining creating an empty file in the case directory like '.cime_in_use', then not allowing any other processes to interact with xml files if that file is present. Actually, I guess that file would need a process ID, so we'd just prevent processes with a different process ID from interacting with xml files in that case? But @jgfouca might have something else in mind.

Do we have enough exception handling in place?

This is referring to: We need to be sure to "unlock" (e.g., remove the '.cime_in_use' file) whenever a cime process exits. As much as possible, then, we'd want to catch exceptions and clean up after ourselves - flush any unflushed changes to xml files and unlock the case.

@jgfouca pointed out that there will always be some instances where we can't exit gracefully - e.g., if someone kill -9s us, or if there is an uncaught exception. But, as was pointed out on the telecon, in those cases, it might be okay for the lock to be left in place, forcing the user to start over - because the state of the xml files would be unknown in that case.

jedwards4b · 2020-08-18T22:29:11Z

@jgfouca I was thinking about this - what if we add a timestamp to each xml variable and use it to determine whether the value on file is newer than the value in memory? Would this allow us to do things like change the RESUBMIT value while the model is running without otherwise slowing things down too much?

jgfouca · 2020-08-19T18:13:58Z

@jedwards4b , I hadn't thought of that. I'm going to begin this effort soon and so will be thinking about it more

rljacob · 2020-08-19T19:09:41Z

Discussion from telecon: would add an attribute to each variable that has a timestamp. But checking that could be slow.
If you change something in env_build.xml while a job is running AND don't actually rebuild before the job finishes, it will also error (we want that).
"env_run.xml" is described as variables you can change after build and before submit. We've said nothing about what you can change after submit or while running.

github-actions · 2023-08-12T01:55:46Z

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions · 2023-11-11T02:36:06Z

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions · 2023-11-17T02:06:23Z

This issue was closed because it has been stalled for 5 days with no activity.

jgfouca added tp: config ty: Discussion labels Dec 19, 2017

jgfouca assigned jedwards4b, jgfouca, mvertens, rljacob, agsalin and billsacks Dec 19, 2017

jgfouca added the Assigned label Jan 10, 2018

jgfouca closed this as completed Jan 10, 2018

jgfouca removed the Assigned label Jan 10, 2018

jgfouca reopened this Feb 1, 2018

rljacob added the Assigned label Feb 7, 2018

jgfouca assigned jhkennedy Feb 8, 2018

jgfouca mentioned this issue Apr 11, 2018

Possible race condition regarding env_build.xml during build #2448

Closed

jgfouca assigned mnlevy1981 Apr 11, 2018

jgfouca closed this as completed Dec 3, 2018

jgfouca removed the Assigned label Dec 3, 2018

jgfouca reopened this Dec 3, 2018

jgfouca mentioned this issue Dec 3, 2018

Big improvement to robustness of CIME's XML handling #2932

Merged

jgfouca added the in progress label Dec 3, 2018

jgfouca closed this as completed in #2932 Dec 7, 2018

jgfouca removed the in progress label Dec 7, 2018

jgfouca reopened this Jan 17, 2019

rljacob added the Assigned label Feb 20, 2019

jgfouca mentioned this issue Sep 3, 2019

Overzealous LockedFiles scheme #3225

Closed

rljacob unassigned jhkennedy Oct 2, 2019

jgfouca mentioned this issue Jan 8, 2020

Cannot change RESUBMIT while model is running #3338

Closed

jgfouca mentioned this issue Mar 18, 2020

Jgfouca/branch for acme split 2020 03 16 #3457

Merged

jgfouca assigned jasonb5 Jan 27, 2021

jgfouca mentioned this issue Mar 17, 2021

CIME V6 Planning! #3886

Open

github-actions bot added the Stale label Aug 12, 2023

jasonb5 removed the Stale label Aug 12, 2023

billsacks removed their assignment Sep 14, 2023

github-actions bot added the Stale label Nov 11, 2023

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 17, 2023

rljacob added Low Priority and removed Stale labels Nov 17, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Current state of CIME data files #2161

Current state of CIME data files #2161

jgfouca commented Dec 19, 2017 •

edited

Loading

jedwards4b commented Dec 20, 2017

jgfouca commented Jan 18, 2018

jgfouca commented Feb 1, 2018 •

edited

Loading

billsacks commented Feb 1, 2018

jgfouca commented Feb 1, 2018

jgfouca commented Feb 8, 2018

jgfouca commented Dec 3, 2018 •

edited

Loading

jgfouca commented Dec 3, 2018

jgfouca commented Dec 3, 2018

jgfouca commented Jan 17, 2019

rljacob commented Jan 30, 2019

billsacks commented Jan 30, 2019

jedwards4b commented Aug 18, 2020

jgfouca commented Aug 19, 2020

rljacob commented Aug 19, 2020 •

edited

Loading

github-actions bot commented Aug 12, 2023

github-actions bot commented Nov 11, 2023

github-actions bot commented Nov 17, 2023

Current state of CIME data files #2161

Current state of CIME data files #2161

Comments

jgfouca commented Dec 19, 2017 • edited Loading

jedwards4b commented Dec 20, 2017

jgfouca commented Jan 18, 2018

jgfouca commented Feb 1, 2018 • edited Loading

billsacks commented Feb 1, 2018

jgfouca commented Feb 1, 2018

jgfouca commented Feb 8, 2018

jgfouca commented Dec 3, 2018 • edited Loading

jgfouca commented Dec 3, 2018

jgfouca commented Dec 3, 2018

jgfouca commented Jan 17, 2019

rljacob commented Jan 30, 2019

billsacks commented Jan 30, 2019

jedwards4b commented Aug 18, 2020

jgfouca commented Aug 19, 2020

rljacob commented Aug 19, 2020 • edited Loading

github-actions bot commented Aug 12, 2023

github-actions bot commented Nov 11, 2023

github-actions bot commented Nov 17, 2023

jgfouca commented Dec 19, 2017 •

edited

Loading

jgfouca commented Feb 1, 2018 •

edited

Loading

jgfouca commented Dec 3, 2018 •

edited

Loading

rljacob commented Aug 19, 2020 •

edited

Loading