Feature/one step save baseline #193
Conversation
Builds a docker image with fv3net, vcm, and fv3gfs-python installed.
It looks cleaner to use a closure for this
The former is constant for all runs, while the latter changes for each time step. It is cleaner to separate them, and remove some of the "and" functions.
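The separation described above could be sketched as a closure that binds the run-constant configuration once and accepts only the per-step state on each call. This is a minimal illustration with hypothetical names (`make_step_writer`, `write_step`); it is not the actual code in the PR.

```python
def make_step_writer(monitor_paths, variables):
    """Bind the run-constant configuration (output paths, variable names)."""

    def write_step(step_name, state):
        """Handle the part that changes each time step."""
        path = monitor_paths[step_name]
        # Keep only the variables this run is configured to save.
        subset = {k: state[k] for k in variables if k in state}
        return path, subset

    return write_step


writer = make_step_writer(
    {"begin": "out/begin.zarr", "after_physics": "out/after_physics.zarr"},
    ["air_temperature"],
)
path, subset = writer("begin", {"air_temperature": 273.0, "extra": 1.0})
```

The inner function closes over `monitor_paths` and `variables`, so the per-timestep call site no longer needs the combined "and" helpers.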
workflows/one_step_jobs/runfile.py
    mpi_comm=MPI.COMM_WORLD,
)

begin_monitor = fv3gfs.ZarrMonitor(
Is the "physics" being referred to here in "begin_physics" the same physics that's being stepped in L261? I am kind of confused with the naming here and what the difference is between "begin physics" and "before physics", since the former stores the state right before the dynamics step.
Edit:
Never mind, I think I see what the names refer to. Could the intermediate zarr names use the same names as the step dim ["begin", "after_dynamics", "after_physics"]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. I rewrote the code to pass the paths of the monitors as a dictionary whose keys are the "step".
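The dictionary-of-paths approach could look like the following sketch, where the keys match the step labels used in the output ("begin", "after_dynamics", "after_physics"). The helper name and URL layout are assumptions for illustration, not the PR's actual code.

```python
STEPS = ("begin", "after_dynamics", "after_physics")


def monitor_paths(output_url):
    # One intermediate zarr per step, keyed by the step name so the
    # key can double as the "step" coordinate label later on.
    return {step: f"{output_url}/{step}.zarr" for step in STEPS}


paths = monitor_paths("gs://bucket/one-step")
```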
@@ -0,0 +1,30 @@
import fsspec
Is this file supposed to stay in the final PR?
I guess we can remove it. I thought it was a helpful script though.
Thanks Noah, this PR is really useful. Minor comments/requests and one spot in one_step.py that is breaking on my end.
def rename_sfc_dt_atmos(sfc: xr.Dataset) -> xr.Dataset:
    DIMS = {"grid_xt": "x", "grid_yt": "y", "time": "forecast_time"}
Not really in the scope of this PR but a note for future PRs to the develop branch: need to make sure all references to the old set of variable names in downstream workflows are updated to work with this output.
Okay, maybe we could move this note to the PR that will be created from the develop branch to master.
import time

# avoid out of memory errors
# dask.config.set(scheduler='single-threaded')
Could you add some info to readme instructing people to use this if this is an issue?
Added a section to the README.
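For reference, the README suggestion boils down to forcing dask's single-threaded scheduler before running the job, which keeps only one chunk in flight at a time. A minimal sketch, assuming a standard dask installation:

```python
import dask

# If one-step jobs hit out-of-memory errors, disable dask's default
# threaded scheduler so chunks are processed one at a time.
dask.config.set(scheduler="single-threaded")
```

`dask.config.set` can also be used as a context manager to scope the change to a single block of work.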
@@ -1,89 +0,0 @@
import logging
With the deletion of this file, is the intention for all one step jobs to use orchestrate_submit_jobs.py? Could you update the README for this change?
Indeed. It doesn't seem like the original script was used by anything.
@frodre let me know if I should add it back.
I updated the README and deleted the help string, since that is bound to get out of date.
Thanks for updating the README. If the entry point is the orchestrator, can you update the one_step config section in workflows/end_to_end/full_workflow_config.yaml with a set of working parameters? FYI I was able to run this using _run_steps.sh, but when I tried to use the orchestrator I was having some issues.
curr_output_url,
job_labels=job_labels,
**kube_config,
model_config_url, "/tmp/null", job_labels=labels, **kube_kwargs
Is /tmp/null supposed to be the output url in this line?
Yes. This job no longer needs to upload the run directory, since all the useful info is in the big zarr. Setting outdir=/tmp/null was the easiest way to do this.
I added a comment to make this clear.
When I run a test (writing to my own test dir), it complains that the output /tmp/null needs to be a remote path; any idea if I'm doing something wrong? Running the test script seems to work for others.
INFO:fv3net.pipelines.kube_jobs.one_step:Number of input times: 11
INFO:fv3net.pipelines.kube_jobs.one_step:Number of completed times: 0
INFO:fv3net.pipelines.kube_jobs.one_step:Number of times to process: 11
INFO:fv3net.pipelines.kube_jobs.one_step:Running the first time step to initialize the zarr store
Traceback (most recent call last):
File "/home/AnnaK/fv3net/workflows/one_step_jobs/orchestrate_submit_jobs.py", line 116, in <module>
local_vertical_grid_file=local_vgrid_file,
File "/home/AnnaK/fv3net/fv3net/pipelines/kube_jobs/one_step.py", line 330, in submit_jobs
run_job(index=k, init=True, wait=True, timesteps=timestep_list)
File "/home/AnnaK/fv3net/fv3net/pipelines/kube_jobs/one_step.py", line 322, in run_job
model_config_url, local_tmp_dir, job_labels=labels, **kube_kwargs
File "/home/AnnaK/fv3net/external/fv3config/fv3config/fv3run/_kubernetes.py", line 74, in run_kubernetes
job_labels,
File "/home/AnnaK/fv3net/external/fv3config/fv3config/fv3run/_kubernetes.py", line 92, in _get_job
_ensure_locations_are_remote(config_location, outdir)
File "/home/AnnaK/fv3net/external/fv3config/fv3config/fv3run/_kubernetes.py", line 113, in _ensure_locations_are_remote
_ensure_is_remote(location, description)
File "/home/AnnaK/fv3net/external/fv3config/fv3config/fv3run/_kubernetes.py", line 119, in _ensure_is_remote
f"{description} must be a remote path when running on kubernetes, "
ValueError: output directory must be a remote path when running on kubernetes, instead is /tmp/null
I ran into this too, and in my case it was because I hadn't pulled and installed the latest fv3config that includes Noah's changes. Once I did that it worked.
Oh I see, thanks
Another good reason to run these kinds of tests in CI.
logger.info(f"Writing {variable} to {group}")
dims = group[variable].attrs["_ARRAY_DIMENSIONS"][1:]
dask_arr = ds[variable].transpose(*dims).data
dask_arr.store(group[variable], regions=(index,))
I think index needs to get passed?
whoops...how did that not get passed? I am surprised this didn't cause an error.
I guess index is defined in the __main__ block...anyway, I fixed this.
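The `regions` argument discussed here selects which slab of a pre-allocated target a dask array is written into. A toy sketch of the pattern (a NumPy array stands in for the zarr group, and all names here are illustrative, not the PR's code):

```python
import numpy as np
import dask.array as da

# Pre-allocate the full output with NaN fill values, mirroring how the
# big zarr store is initialized before individual timesteps fill it in.
target = np.full((3, 4), np.nan)
source = da.from_array(np.arange(4.0).reshape(1, 4), chunks=(1, 4))

index = 1  # which timestep slot this job owns
# Without the correct `regions`, store() would write to the wrong slot
# (or the whole array), which is easy to miss since no error is raised.
da.store(source, target, regions=(slice(index, index + 1),))
```

Because unfilled slots stay NaN, a forgotten or wrong index shows up as missing data rather than silently plausible output.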
state = fv3gfs.get_state(names=VARIABLES + (TIME,))
if rank == 0:
    logger.info("Beginning steps")
for i in range(fv3gfs.get_step_count()):
The output zarr has a forecast_time length of 15, whereas our previous approach produced a length of 16 (initial conditions + results after 15 steps). Is the thinking that we could extract the 16 values from the current approach using the "begin" and "after_physics" points in the step dimension, or only use 15 values and compute 14 tendencies (instead of 15 as of now)? Not sure this matters much, but it's slightly different than before.
Indeed, the "begin" and "after_physics" steps are redundant. The idea is that we can compute the total model tendency by taking differences along the "step" dimension (e.g. ds.sel(step="after_physics") - ds.sel(step="begin")) rather than along the "forecast_time" dimension.
Oh OK, that makes sense. So after_physics at time i is the same as begin at time i + 1?
That's right. It's redundant, but does simplify the logic IMO.
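The step-differencing idea can be seen on a toy dataset mirroring the output layout, with a "step" dimension alongside "forecast_time" (the variable name and values below are made up for illustration):

```python
import numpy as np
import xarray as xr

# Two steps x two forecast times; after_physics at time i would equal
# begin at time i + 1 in the real (redundant) output.
ds = xr.Dataset(
    {"T": (("step", "forecast_time"), np.array([[1.0, 2.0], [1.5, 2.7]]))},
    coords={"step": ["begin", "after_physics"], "forecast_time": [0, 1]},
)

# Total model tendency per forecast time, differencing along "step"
# rather than along "forecast_time":
tendency = ds.sel(step="after_physics") - ds.sel(step="begin")
```

This yields one tendency per forecast time without needing adjacent forecast times, which is what the redundancy buys.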
@@ -0,0 +1,14 @@
workdir=$(pwd)
What's up with having this script and the empty run_steps.sh script?
These were for prototyping only, so I deleted them and pasted them into the README as a quick example.
Thanks, this worked great overall. See comments about using the orchestrator and the forecast time dim.
LGTM thanks
Looks good to me, approving
Great, thanks a lot guys!
* Feature/one step save baseline (#193). This adds several features to the one-step pipeline:
  - big zarr: everything is stored as one zarr file
  - saves physics outputs
  - some refactoring of the job submission
  Sample output: https://gist.github.com/nbren12/84536018dafef01ba5eac0354869fb67
* save lat/lon grid variables from sfc_dt_atmos (#204)
* Feature/use onestep zarr train data (#207). Use the big zarr from the one-step workflow as input to the create-training-data pipeline.
* One-step sfc variables time alignment (#214). This makes the diagnostics variables appended to the big zarr have the appropriate step and forecast_time dimensions, just as the variables extracted by the wrapper do.
* One step zarr fill values (#223). This accomplishes two things: 1) preventing true model 0s from being cast to NaNs in the one-step big zarr output, and 2) initializing big zarr arrays with NaNs via `full` so that if they are not filled in, due to a failed timestep or another reason, it is more apparent than using `empty`, which produces arbitrary output.
* Adjustments to be able to run workflows in dev branch (#218). Remove references to hard-coded dims and data variables or imports from vcm.cubedsphere.constants; replace them with arguments. Coords and dims can be provided as args for a mappable var.
* One steps start index (#231). Allows starting the one-step jobs at a specified index in the timestep list, to allow for testing/avoiding spinup timesteps.
* Dev fix/integration tests (#234)
  - change order of required args so output is last
  - fix arg for one-step input to be the dir containing the big zarr
  - update end-to-end integration test ymls
  - prognostic run adjustments
* Improved fv3 logging (#225). This PR introduces several improvements to the logging capability of our prognostic run image:
  - include upstream changes to disable output capturing in `fv3config.fv3run`
  - add `capture_fv3gfs_func` function. When called, this captures the raw fv3gfs output and re-emits it as DEBUG-level logging statements that can more easily be filtered
  - refactor `runtime` to `external/runtime/runtime`. This was easy since it did not depend on any other module in fv3net (except implicitly the code in `fv3net.regression`, which is imported when loading the sklearn model with pickle)
  - update fv3config to master
* manually merge in the refactor from master while keeping new names from develop (#237)
* lint
* remove logging from testing
* Dev fix/arg order (#238)
* update history
* fix positional args
* fix function args
* update history
* linting

Co-authored-by: Anna Kwa <[email protected]>
Co-authored-by: brianhenn <[email protected]>