Permuting reference input sets; Tracking data history #27
I have some suggestions for c)-f) (no miracles, but fairly workable), but they'll take a little while to write up. My solution to a) so far has run counter to b), i.e., I have an easy way to create a different inputs directory for each permutation, but that creates a lot of data (e.g., four full inputs directories to study two different price trajectories and two different demand response trajectories). I've been thinking about three solutions to this, and I think I prefer the third. Here they are:
Now that I think more about implementing this, I'm leaning more toward option 1: allow subdirectories within each inputs directory, each of which contains a subset of the files needed for a model run. Then load_aug() will check the main directory if a required file doesn't exist in the subdirectory. I may do this by defining a model.open_input_file(tab_file_name) method, which can be used instead of open(os.path.join(inputs_dir, tab_file_name)). Then the same logic can be used by modules that directly open their own input files. Disadvantages of the inputs subdir approach:
Advantages:
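A minimal sketch of the fallback lookup described above, assuming a standalone helper rather than the proposed model.open_input_file(tab_file_name) method. The function names, the inputs_subdir parameter and the directory layout are hypothetical, for illustration only; this is not Switch's actual implementation.

```python
import os


def open_input_file(inputs_dir, tab_file_name, inputs_subdir=None, mode="r"):
    """Open an input file, preferring the scenario subdirectory.

    Hypothetical helper: if inputs_subdir is given and contains
    tab_file_name, that copy is opened; otherwise the reference copy
    in the main inputs directory is opened.
    """
    if inputs_subdir is not None:
        candidate = os.path.join(inputs_dir, inputs_subdir, tab_file_name)
        if os.path.exists(candidate):
            return open(candidate, mode)
    # Fall back to the reference file in the main inputs directory.
    return open(os.path.join(inputs_dir, tab_file_name), mode)


def read_input(inputs_dir, tab_file_name, inputs_subdir=None):
    """Convenience wrapper that returns the whole file as text."""
    with open_input_file(inputs_dir, tab_file_name, inputs_subdir) as f:
        return f.read()
```

With this shape, a module that reads its own input files could call open_input_file(...) instead of open(os.path.join(inputs_dir, tab_file_name)) and get the same subdirectory-then-main-directory behavior.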
So your basic pattern is to specify one reference scenario, then allow permutations to selectively replace individual input files. Of your choices, I would favor 1 or 3, depending on use cases. For 3, I generally prefer storing a long series of arguments in a config file rather than passing them through the command line (easier record keeping), but that's an implementation detail. The main difference between 1 & 3 is whether you look for the list of diffs in a directory or in a config file/command line arguments. Option 1 would probably be a bit easier for casual users to dig through: both options require you to open a file browser, but option 3 also requires you to open a text document. Option 3 is a bit more optimized for computers, allowing comprehensive permutations without data duplication or needing to wait for OS reads of directory contents (which can add up to a significant overhead for network drives when computations are relatively fast and disk scans are frequent). Another option I was starting to think through is to make data repositories in git, and use branches and/or github forks to track permutations. This seems like a good strategy for most issues, but could be harder for casual users to grasp. A static set of folders & files could be easier for casual users to deal with than git branches. Different use cases may require different strategies, but I wanted to explore whether git repos & branches might work for everything.
I was leaning toward option 3 originally, but when I contemplated implementing it, I got scared off. Take an easy example -- suppose I want to run two different scenarios which use two different fuel cost series. In my back-end database, I have a fuel_costs table with a column showing fuel_scen_id. In my model setup script, I can specify a fuel_scen_id as an argument, and the extraction query(ies) use that to pull out the right fuel cost series. If I use option 1, I can make one pass through all the queries with my default fuel_scen_id, and dump those files into the main directory. Then I can make another pass with (only) a new scenario name and fuel_scen_id. In this pass, any queries that are affected by the fuel_scen_id argument will automatically dump their output into a subdir matching the scenario name (e.g., "high_fuel_costs"). If I use option 3, the user would need to give a name to each alternative version of each data setup parameter (e.g., "high") and then the model setup script would have to munge that into the file name ("fuel_costs_high.tab"). Then the user would need to identify all the file(s) that are affected by this change, and refer to the alternative version when they set up the scenario (e.g., "--alias fuel_costs.tab=fuel_costs_high.tab"). To make this work, the user will need to look through the data extraction module or the inputs directory to find out which .tab files are affected by each argument. This is probably workable in this case, but it's messy. It will be harder to use and more error-prone if one parameter affects multiple files, e.g., a model setup parameter that excludes certain technologies. So I'm leaning toward option 1 basically because it creates a more natural pipeline for scenario definition: the user can say in their model setup script, "I want a scenario called 'high_high' with fuel_scen_id='high' and ev_scen_id='high'".
Then they can run that scenario by saying "switch solve --inputs-subdir high_high" instead of "switch solve --alias fuel_costs.tab=fuel_costs_high_fuel_cost.tab --alias ev_adoption.tab=ev_adoption_high_ev_adoption.tab". For most of my use cases, I don't think that git branches and forks would work very well for permuting the data files. Usually I use different permutations to analyze different policies or risks as part of a single study, i.e., I would usually want to have both datasets on disk at the same time, so I can compare the data files, present them as a coherent set of scenarios, run the scenarios in parallel, etc. I would use commits and tags to represent different versions of the same basic study (e.g., if I change my solar dataset and begin doing new studies with that). If I had multiple qualitatively different studies that shared the same code, I might use forks or branches for that. But more likely I'd just promote the shared code up to the regional code repository (switch_mod.hawaii) and maintain separate repositories for the separate datasets.
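To make the option 3 bookkeeping concrete, here is a rough sketch of how repeated "--alias original=replacement" arguments could be collected into a lookup table. The --alias flag is only a proposal in this thread, and parse_aliases and resolve_file_name are hypothetical names; this is not an existing Switch feature.

```python
import argparse


def parse_aliases(argv):
    """Collect repeated --alias ORIGINAL=REPLACEMENT arguments into a dict.

    Hypothetical parser for the flag proposed in this thread.
    Other arguments are ignored so the function can be tested in isolation.
    """
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--alias", action="append", default=None,
                        metavar="ORIGINAL=REPLACEMENT")
    args, _unknown = parser.parse_known_args(argv)

    aliases = {}
    for item in args.alias or []:
        original, sep, replacement = item.partition("=")
        if not sep or not original or not replacement:
            raise ValueError("expected ORIGINAL=REPLACEMENT, got %r" % item)
        aliases[original] = replacement
    return aliases


def resolve_file_name(tab_file_name, aliases):
    """Return the replacement file name if one was given, else the original."""
    return aliases.get(tab_file_name, tab_file_name)
```

The sketch also shows where the burden lands: every affected .tab file needs its own alias entry, whereas the subdirectory approach only needs one scenario name.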
By the way, this is the general file structure I have been moving towards:
In this setup, there is one repository for each different category of study that I do. Each repository either holds a complete set of source data (often in Excel files) and code to make that into .tab files, or it holds a lightweight script (get_scenario_data.py) which passes arguments to a shared script (switch_mod.hawaii.scenario_data.write_tables()) which creates all the .tab files by extracting data from our back-end database. This allows me to have multiple studies going at once, which aren't really related to each other. Even the studies that draw on the same back-end database just use a lightweight script to say which data they want, so there's no real reason to create these as branches or forks of some "standard" study repository. Managing each of these with git/github enables goals c, d and e. There is a straightforward evolution of the data used for a particular study, tracked in git. And to run a particular study, people just need to install switch, clone a study repository (possibly a particular commit/tag/release of that repository), cd into it and then run "switch solve" or "switch solve-scenarios". I'm not sure about goal f). I haven't found that I need to do a lot of derivative work based on individual studies. It's more like my back-end database, data extraction scripts and main repository evolve in tandem, and then it's pretty easy to tweak the other study repositories to use the revised dataset (since there's not much code in each repository). But I suppose if you wanted to start with the "main" or "pha" repository and tweak a few parameters to make a derived model, that would be easy to do by branching or forking the repository. That could even allow a combination of automated .tab file creation (as I use) with manual changes (as an interested outside party might do). As I said, there are no miracles here, but it seems to work well enough for me.
Regarding ease of implementation and use of option 1 vs 3, you could make option 3 about as easy as option 1. This is under the assumption that command-line parameters can be stored in a text file, which I assume is straightforward. In both cases each individual scenario is defined in terms of a reference dataset and a set of data diffs. The reference dataset will probably be stored in a single directory for simplicity. The set of data diffs will be a set of paths that resolve to the diff files. For fastest performance on random file systems of random clusters, you'll probably want to have a text file that specifies the paths to each scenario's diff files. A few years ago, I ran into significant and persistent disk lag during the secondary production-cost simulation on a UC Berkeley EECS cluster when scanning a directory that held ~365 folders. My solution was to tweak my database export script to write a text file listing each of the relevant paths while it was creating the directories. If option 1 includes that performance detail, it starts to resemble option 3 with the added convention of one set of diffs per subdirectory. In your example, the user would execute the scenario with "switch solve --scenario high_high", and switch would look for the appropriate aliases in the scenarios.cfg file; failing to find descriptions in the text file, it could look for a subdirectory named high_high before giving up. Yeah, the need to keep multiple semi-human-readable scenarios on disk at the same time is pretty crucial. If we have a git repository for the compiled data and branches for each scenario, then setting up the local runtime data directories could entail:
This is a different process than just cloning the entire repository and relevant branches. If some script is automating the grunt-work, then using git vs direct database export to set up input data doesn't seem like a big deal. The next question is then: does git offer enough functionality to bother using it to package data as well as track changes & authorship? I gotta run, but I'll reply to your file structure thread soon. Maybe we should move these kinds of discussions to our Google Groups forum? It has better features to track dialogue. https://groups.google.com/forum/#!forum/switch-model
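The lookup order suggested above (text file first, then a subdirectory named after the scenario) could be sketched roughly as follows. The scenarios.cfg layout assumed here, one scenario per line as "name original=replacement ...", is made up for illustration; the thread does not define a format, and load_scenarios and resolve_scenario are hypothetical names.

```python
import os


def load_scenarios(cfg_path):
    """Parse an assumed scenarios.cfg format into {scenario: {orig: repl}}.

    Assumed (hypothetical) line format:
        high_high fuel_costs.tab=fuel_costs_high.tab ev_adoption.tab=ev_high.tab
    Blank lines and text after '#' are ignored.
    """
    scenarios = {}
    if not os.path.exists(cfg_path):
        return scenarios
    with open(cfg_path) as f:
        for line in f:
            line = line.split("#", 1)[0].strip()
            if not line:
                continue
            name, *pairs = line.split()
            scenarios[name] = dict(pair.split("=", 1) for pair in pairs)
    return scenarios


def resolve_scenario(inputs_dir, scenario, cfg_name="scenarios.cfg"):
    """Look in the config file first, then for a matching subdirectory."""
    scenarios = load_scenarios(os.path.join(inputs_dir, cfg_name))
    if scenario in scenarios:
        # One file read, no directory scan needed.
        return ("aliases", scenarios[scenario])
    subdir = os.path.join(inputs_dir, scenario)
    if os.path.isdir(subdir):
        return ("subdir", subdir)
    raise KeyError("scenario %r not found in %s or as a subdirectory"
                   % (scenario, cfg_name))
```

Reading one small text file avoids repeated directory scans, which is the performance concern raised for network drives; the subdirectory check only runs when the scenario is not listed.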
We need more robust methods for re-using reference input sets, that let us:
a) specify permutations of inputs for exploring a wider space
b) not duplicate data on disk
c) allow clean and compact diffs
d) deploy readily
e) track development, history, and stakeholder approval process
f) easily enable derivative work
These issues of permutation and history tracking may have distinct solutions, but I am wondering if we could design a way to use git and data organizational conventions to accomplish all of these.
Matthias has many constant use cases for a-c.
Sergio and his team are actively working through d as they compile data for Switch-Mexico. They have a chance to do it well, and could use some help in figuring out how to navigate tools. They are using Google Drive; I suggested moving to git (and GitHub if their repositories don't have a size restriction).
I've had separate conversations with Mark and Ana about these issues lately.
That's it for now. I wanted to start a thread on this topic before leaving on vacation for the week.
-Josiah