
Feature/use onestep zarr train data #207

Merged: AnnaKwa merged 29 commits into develop-one-steps from feature/use-onestep-zarr-train-data on Apr 3, 2020

Conversation

@AnnaKwa (Contributor) commented Mar 30, 2020

Changes the training data pipeline to use the single big zarr output from the one step workflow.

Adds an option to specify the variable names from a file, in case they change in the one-step data.
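For context, a minimal sketch of what consuming the one-step big zarr could look like on the training side; the bucket path, helper usage, and variable list here are illustrative assumptions, not the actual pipeline code:

```python
import fsspec
import xarray as xr

# Hypothetical location of the one-step workflow's big zarr output.
ONE_STEP_ZARR = "gs://bucket/one-step-output/big.zarr"

# Open the consolidated zarr store lazily.
mapper = fsspec.get_mapper(ONE_STEP_ZARR)
ds = xr.open_zarr(mapper)

# Variable names are passed in rather than hard coded, so they can
# change with the one-step data; this list is just an example.
training_vars = ["sphum", "T", "u", "v"]
ds_train = ds[training_vars]
```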

@frodre (Contributor) commented Mar 31, 2020

@nbren12 (Contributor) left a comment

Thanks Anna. Overall it looks like a good implementation of the approach you outlined to me yesterday. I think the handling of variable names could be improved substantially though. See my comments below.

Also, did you intentionally include the changes to the fv3config submodule? If not, please revert it to whatever is on develop-one-steps. There are also a bunch of formatting changes showing up in the diff, which I would revert.

Resolved (outdated) review threads on:
external/vcm/tests/test_calc.py
external/vcm/vcm/calc/calc.py
fv3net/pipelines/coarsen_restarts/__main__.py
fv3net/pipelines/common.py
fv3net/pipelines/create_training_data/__init__.py
fv3net/pipelines/create_training_data/pipeline.py (5 threads)
@brianhenn (Contributor) left a comment

Thanks Anna, there's a lot here that I am trying to use and stay consistent with in the one-step diags step. I had one suggestion/question about sharing code and names across multiple steps.

Resolved (outdated) review thread on fv3net/pipelines/create_training_data/__main__.py
@@ -0,0 +1,69 @@
# suffixes that denote whether diagnostic variable is from the coarsened
Contributor:

This is pretty much exactly the list of var names that the one-step diags will use. If we're in agreement that fv3net.pipelines.common is a good idea for sharing/consistency across workflow steps, then this file (and the yaml, if it's being used) seems like a good candidate to reside there.

Contributor Author (@AnnaKwa):

Referencing this comment from @nbren12's review: #207 (comment)

Since in a previous discussion we decided against the "import names from common.py" route, to avoid linking the workflows in that manner, if we're using a lot of common var names across the workflows then I think we should go with option (2) and pass the variable name information to the workflows' respective main/run functions.

Contributor Author (@AnnaKwa):

@brianhenn As discussed offline, I'll change the source of the var names to be read in and passed to the run function, so that a common list can be provided to both workflows.

Contributor Author (@AnnaKwa):

In particular, this commit should address your comment: 354d675

Resolved (outdated) review thread on external/vcm/vcm/calc/calc.py
@AnnaKwa (Contributor Author) left a comment

I'm not actually reviewing this, but I somehow started a review by responding to Brian's comment.

@AnnaKwa (Contributor Author) commented Apr 1, 2020

Addressed PR comments, ready for re-review.

As we discussed earlier about where the various workflows would get variable names from, the module-level global vars are replaced with a dict that is read in and passed to the run function.
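A rough sketch of that pattern; the config file argument, yaml keys, and the run signature below are assumptions for illustration, not the exact code in this PR:

```python
import argparse
import yaml


def run(args, pipeline_args, names):
    # `names` is the dict of variable/dimension names read from the yaml
    # config, so workflows no longer rely on module-level globals.
    ...


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("variable_names_file")  # hypothetical argument name
    args, pipeline_args = parser.parse_known_args()
    with open(args.variable_names_file) as f:
        names = yaml.safe_load(f)
    run(args, pipeline_args, names)
```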

@AnnaKwa requested a review from nbren12 on April 1, 2020, 22:39
@nbren12 (Contributor) left a comment

Thanks for making all those changes! I made some suggestions below for how to reduce the complexity of _create_train_cols, but they probably aren't strictly necessary for this PR. I do, however, feel a little more strongly about not changing vcm.apparent_source. This is almost there. FWIW, I think we should keep using this pipeline instead of dask jobs, since Beam is very robust for this kind of big, complicated calculation.

Resolved (outdated) review thread on external/vcm/vcm/calc/calc.py
try:
    names = yaml.safe_load(stream)
except yaml.YAMLError as exc:
    raise ValueError(f"Bad yaml config: {exc}")
Contributor:

This try/except seems redundant since you re-raise the error raised by yaml and don't have any special error handling logic. The traceback should make it pretty clear where the error is from.
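In other words, the load could reduce to something like the following sketch (the `variable_names_file` name is taken from the surrounding discussion and is assumed, not quoted from the PR):

```python
import yaml

with open(variable_names_file) as stream:
    # Let yaml.YAMLError propagate; its traceback already points at the bad config.
    names = yaml.safe_load(stream)
```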

@@ -19,6 +19,11 @@
logger.setLevel(logging.INFO)


def convert_forecast_time_to_timedelta(ds, forecast_time_dim):
Contributor:

is this a public function?

Contributor Author (@AnnaKwa):

No, added a leading _ to the name.
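For reference, a helper along these lines might look like the sketch below; the exact conversion and the assumption that forecast time is stored in seconds are illustrative, and the real function may differ:

```python
import pandas as pd
import xarray as xr


def _convert_forecast_time_to_timedelta(ds: xr.Dataset, forecast_time_dim: str) -> xr.Dataset:
    # Reinterpret the forecast-time coordinate as timedeltas (assuming seconds).
    timedeltas = pd.to_timedelta(ds[forecast_time_dim].values, unit="s")
    return ds.assign_coords({forecast_time_dim: timedeltas})
```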


logger.info(f"Processing {len(data_batch_urls)} subsets...")
def run(args, pipeline_args, names):
    fs = get_fs(args.gcs_input_data_path)
Contributor:

I think this could use a docstring or at least some type hints.

Contributor Author (@AnnaKwa):

Added

fs: GCSFileSystem
run_dirs: list of GCS urls to open
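A sketch of what the annotated signature could look like; the type hints and wording beyond what is shown in the diff are assumptions:

```python
import argparse
from typing import List, Mapping


def run(args: argparse.Namespace, pipeline_args: List[str], names: Mapping[str, list]) -> None:
    """Run the create_training_data pipeline.

    Args:
        args: parsed command line arguments for this workflow.
        pipeline_args: extra arguments forwarded to the Beam pipeline.
        names: variable and dimension names read from the yaml config.
    """
    ...
```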

ds (xarray dataset): must have the specified feature vars in cols_to_keep
Contributor:

This docstring seems largely out of date. Since this is a helper function, you could probably just delete all the argument information below.

ds[VAR_Q_U_WIND_ML] = apparent_source(ds.u)
ds[VAR_Q_V_WIND_ML] = apparent_source(ds.v)
ds[VAR_Q_HEATING_ML] = apparent_source(ds.T)
ds[VAR_Q_MOISTENING_ML] = apparent_source(ds.sphum)
ds = (
Contributor:

Is this not already done by the apparent_source function you use above?

Contributor Author (@AnnaKwa):

The other feature variables that aren't created by the loop above still have values at coordinates that we want to drop.

]
features_diags_c48 = diags_c48.rename(RENAMED_PROG_DIAG_VARS)
features_diags_c48 = diags_c48.rename(renamed_high_res_vars)
return xr.merge([ds_run, features_diags_c48])
except (KeyError, AttributeError, ValueError, TypeError) as e:
logger.error(f"Failed to merge in features from high res diagnostics: {e}")
Contributor:

Shouldn't the job fail in this case?

Contributor Author (@AnnaKwa):

Yes, but the problem used to be that the LoadCloudData step would often fail when open_restarts was called, and if I did not catch that exception the entire pipeline would stop because it hit the limit of 4 failed jobs. But if that exception is caught, then downstream steps will fail instead, and if those exceptions aren't caught we'll just end up with the whole pipeline stopping again.

I had a discussion with @spencerkclark about this in the initial PR, because it goes against the Dataflow guidelines: there's not really a way to increase the number of failures allowed before the whole pipeline stops, since they expect the input to always be clean/valid. We decided that this was OK as a temporary solution, but that we should add some kind of step after the one-step jobs that writes out which run dirs are good to use in this step, which could then be used to filter out the bad data before it enters the pipeline.
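As a sketch of that temporary-solution idea, filtering inputs before they enter the Beam pipeline; the manifest file and function name here are hypothetical, not something this PR adds:

```python
import json

import fsspec


def filter_to_successful_runs(run_dirs, manifest_url):
    """Keep only run dirs listed as successful in a (hypothetical) manifest
    written after the one-step jobs, so bad data never enters the pipeline."""
    with fsspec.open(manifest_url) as f:
        successful = set(json.load(f))
    return [d for d in run_dirs if d in successful]
```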

dict of train and test timesteps
)
"""
init_datetime_coords = [
Contributor:

This datetime parsing seems out of place. Please lift it to the calling function.

Contributor Author (@AnnaKwa):

Moved

| "CreateTrainingCols"
>> beam.Map(
    _create_train_cols,
    cols_to_keep=names["one_step_vars"] + names["target_vars"],
Contributor:

As far as I can tell, you only ever use one_step_vars + target_vars, and never each individually. If that's true I would just make them one list in the yaml.

Contributor Author (@AnnaKwa):

Condensed to single list

ds[VAR_Q_U_WIND_ML] = apparent_source(ds.u)
ds[VAR_Q_V_WIND_ML] = apparent_source(ds.v)
ds[VAR_Q_HEATING_ML] = apparent_source(ds.T)
ds[VAR_Q_MOISTENING_ML] = apparent_source(ds.sphum)
ds = (
    ds[cols_to_keep]
Contributor:

You can get rid of the cols_to_keep argument as well by lifting it into a beam.Map, e.g.:

...
 | "SelectVariables" >> beam.Map(lambda x: x[list(cols_to_keep)])
...

Also be sure to use a list. I have been burned by passing tuples or other iterables to Dataset.__getitem__ many times.
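A tiny illustration of that pitfall; the behavior described here (a list selects a sub-dataset, while a bare tuple is treated as a single key) is an assumption about xarray's Dataset.__getitem__, not code from this PR:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"a": ("x", np.zeros(3)), "b": ("x", np.ones(3))})

cols_to_keep = ("a", "b")

subset = ds[list(cols_to_keep)]  # list: selects a sub-Dataset with just these variables
print(list(subset.data_vars))    # ['a', 'b']

# ds[cols_to_keep] with the bare tuple would instead be treated as a single
# (hashable) key and raise an error rather than selecting the variables.
```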

Contributor Author (@AnnaKwa):

Done

Contributor Author (@AnnaKwa):

Moved

@AnnaKwa (Contributor Author) commented Apr 2, 2020

Addressed PR comments, ready for re-review.

I also saw a couple of parts that had been overlooked with regard to renaming or using the newer dim names: i) centering vars on cell edges, and ii) renaming high-res variables. Those changes were also added.

@AnnaKwa requested a review from nbren12 on April 2, 2020, 23:35
@nbren12 (Contributor) left a comment

Thanks for all the hard work!

@AnnaKwa merged commit 3fca7b9 into develop-one-steps on Apr 3, 2020
@AnnaKwa deleted the feature/use-onestep-zarr-train-data branch on April 3, 2020, 22:41
nbren12 added a commit that referenced this pull request Apr 13, 2020
* Feature/one step save baseline (#193)

This adds several features to the one-step pipeline

- big zarr. Everything is stored as one zarr file
- saves physics outputs
- some refactoring of the job submission.

Sample output: https://gist.github.com/nbren12/84536018dafef01ba5eac0354869fb67

* save lat/lon grid variables from sfc_dt_atmos (#204)

* save lat/lon grid variables from sfc_dt_atmos

* Feature/use onestep zarr train data (#207)

Use the big zarr from the one step workflow as input to the create training data pipeline

* One-step sfc variables time alignment (#214)

This makes the diagnostics variables appended to the big zarr have the appropriate step and forecast_time dimensions, just as the variables extracted by the wrapper do.

* One step zarr fill values (#223)

This accomplishes two things: 1) preventing true model 0s from being cast to NaNs in the one-step big zarr output, and 2) initializing big zarr arrays with NaNs via `full` so that, if they are not filled in due to a failed timestep or other reason, it is more apparent than using `empty`, which produces arbitrary output.

* adjustments to be able to run workflows in dev branch (#218)

Removes references to hard-coded dims and data variables, or imports from vcm.cubedsphere.constants, replacing them with arguments.
Coords and dims can be provided as args for the mappable var.

* One steps start index (#231)

Allows the one-step jobs to start at a specified index in the timestep list, for testing and for avoiding spinup timesteps.

* Dev fix/integration tests (#234)

* change order of required args so output is last

* fix arg for onestep input to be dir containing big zarr

* update end to end integration test ymls

* prognostic run adjustments

* Improved fv3 logging (#225)

This PR introduces several improvements to the logging capability of our prognostic run image

- include upstream changes to disable output capturing in `fv3config.fv3run`
- Add `capture_fv3gfs_func` function. When called, this captures the raw fv3gfs output and re-emits it as DEBUG-level logging statements that can more easily be filtered.
- Refactor `runtime` to `external/runtime/runtime`. This was easy since it did not depend on any other module in fv3net (except implicitly the code in `fv3net.regression`, which is imported when loading the sklearn model with pickle).
- updates fv3config to master

* manually merge in the refactor from master while keeping new names from develop (#237)

* lint

* remove logging from testing

* Dev fix/arg order (#238)

* update history

* fix positional args

* fix function args

* update history

* linting

Co-authored-by: Anna Kwa <[email protected]>
Co-authored-by: brianhenn <[email protected]>
spencerkclark pushed a commit that referenced this pull request May 7, 2021
* API: user_project falls back to project

Closes #207