Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to plot utilities to generate PNG output rather than inline plots #42

Merged
merged 17 commits into from
Apr 22, 2021

Conversation

mnlevy1981
Copy link
Contributor

This PR will generate images and metadata for @andersy005's dashboard. The first commit produces a separate JSON file for each generated plot, but @andersy005 will follow behind with a tool to clean that all up :) Additional commits are needed to generate PNG files from other plot types (time series, etc).

mnlevy1981 and others added 2 commits January 26, 2021 12:19
First pass at generating image files (and a JSON file with associated metadata)
rather than including images inline in the notebook. Additional commits on this
branch will address different types of plots.
savefig() is part of the parent class, get_filename() is part of the child
class. This current maintains the current API for summary_plots, but next
commit will try to make that API more extensible. Still need to get casename
into the SummaryMapClass to use in file name.
Use **plot_options for arguments beyond da and diag_metadata. Also updated
get_filepaths() to return both the png path and the json path (this let me put
output in images/summary_maps/CASENAME/... instead of requiring images/CASENAME
to be the root)
Note that this doesn't update the notebooks yet, it's just the utils package
Note that I noticed a pretty big bug to fix in the next commit -- I'm not
accounting for isel_dict in the file name, so plotting the same variable at two
vertical levels will create a plot at the first level and then over-write it
with the plot at the second level. I'll fix that tomorrow.
E.g.

SiO3.0095-01-16_0104-12-16.z_t--500.00.png
SiO3.0095-01-16_0104-12-16.z_t--35109.35.png
The issue only affected time series, and was due to a bad argument in
self.get_timeseries_files() [wanted files for a specific variable, but was
getting file names for ALL variables]
Added two new child classes of _PlotTypeBaseClass for trend maps and trend
histograms; modified the two trend_maps notebooks to write files instead of
generating plots.
@mnlevy1981
Copy link
Contributor Author

I'm going to turn off draft status because at this point it looks Sanity Check.ipynb is the only notebook with inline plots still, and it's only 133 KB on disk. I've verified that the number of plots I'm creating matches the number of plots being removed from each of the following notebooks (comments in the code block indicate how many png files exist in each directory; totals should sum to the number of plots removed when diffing the notebook against master):

  • plot_suite_003.ipynb

    # histogram/g.e22.G1850ECO_JRA_HR.TL319_t13.003: 160
    # time_series/g.e22.G1850ECO_JRA_HR.TL319_t13.003: 24
    $ git diff master plot_suite_003.ipynb | grep ^-.*png | wc -l
    184
    
  • plot_suite_004.ipynb

    # histogram/g.e22.G1850ECO_JRA_HR.TL319_t13.003: 680
    # time_series/g.e22.G1850ECO_JRA_HR.TL319_t13.003: 24
    $ git diff master plot_suite_004.ipynb | grep ^-.*png | wc -l
    704
    
  • plot_suite_1deg.ipynb

    # histogram/g.e22b05.G1850ECOIAF_JRA.TL319_g17.cocco.001: 400
    # time_series/g.e22b05.G1850ECOIAF_JRA.TL319_g17.cocco.001: 24
    $ git diff master plot_suite_1deg.ipynb | grep ^-.*png | wc -l
    184
    
  • plot_suite_maps_0001_003.ipynb

    # summary_map/g.e22.G1850ECO_JRA_HR.TL319_t13.003: 480
    $ git diff master plot_suite_maps_0001_003.ipynb | grep ^-.*png | wc -l
    480
    
  • plot_suite_maps_0001_004.ipynb

    # summary_map/g.e22.G1850ECO_JRA_HR.TL319_t13.004: 480
    $ git diff master plot_suite_maps_0001_004.ipynb | grep ^-.*png | wc -l
    480
    
  • plot_suite_maps_0095_1deg.ipynb

    # summary_map/g.e22b05.G1850ECOIAF_JRA.TL319_g17.cocco.001: 480
    $ git diff master plot_suite_maps_0095_1deg.ipynb | grep ^-.*png | wc -l
    480
    
  • trend_maps.003.ipynb

    # trend_hist/g.e22.G1850ECO_JRA_HR.TL319_t13.003: 30
    # trend_map/g.e22.G1850ECO_JRA_HR.TL319_t13.003: 30
    $ git diff trend_maps.003.ipynb | grep ^-.*png | wc -l
    60
    
  • trend_maps.004.ipynb

    # trend_hist/g.e22.G1850ECO_JRA_HR.TL319_t13.004: 30
    # trend_map/g.e22.G1850ECO_JRA_HR.TL319_t13.004: 30
    $ git diff trend_maps.004.ipynb | grep ^-.*png | wc -l
    60
    

Still, I think there are several open questions we should discuss before merging:

  1. File naming conventions: for now, running any of the notebooks that produce png output will create an images/ directory. Inside that directory, there are subdirectories for each type of plot (histogram/, summary_map/, time_series/, trend_hist/, and trend_map/). Each of those contains subdirectories for each of the cases, and then there is a pattern defined for each type of plot in utils/PlotTypeClass.py. Subdirectories for each case is not necessary, and I'm happy to just include the case name in the file name instead

  2. Metadata conventions

    • As a first pass, I am just creating one json file per plot... so the plot plot_type/case/filename.png is associated with plot_type/case/metadata/filename.json. I haven't tied in Anderson's generate_plot_catalog() function yet
    • We should look closely at what is currently going into the json files and make sure I'm not omitting anything
  3. Should we do some refactoring to bring routines from Plotting.py into PlotTypeClass.py? Perhaps the notebooks should be importing PlotTypeClass and the plots themselves can be generated by class method objects?

    • Perhaps we would have an additional level of inheritance in PlotTypeClass classes... rather than _PlotTypeBaseClass -> SummaryMapClass & TrendMapClass we could have _PlotTypeBaseClass -> _MapBaseClass -> SummaryMapClass & TrendMapClass and have the actual plotting be defined in _MapBaseClass? I haven't actually sat down and thought about what this would look like, but it seems like SummaryMapClass and TrendMapClass would have more in common with each other than, say TrendMapClass and TrendHistClass

@mnlevy1981 mnlevy1981 marked this pull request as ready for review January 29, 2021 19:58
@mnlevy1981
Copy link
Contributor Author

@andersy005 -- I should have mentioned this in the previous comment, but png files are in /glade/work/mlevy/hi-res_BGC_JRA/plots_from_analysis/ if you want to play with those in your dashboard; I think the best way to figure out what metadata is missing is to look at a mock-up and see what we wish we could sort by :)

@mnlevy1981
Copy link
Contributor Author

Also on the to-do list: I need to add an argument to allow control over the figure size / resolution when writing to disk.

@klindsay28
Copy link
Contributor

I suggest that the __init__ and get_filepaths methods in class _PlotTypeBaseClass have *args and **kwargs arguments.

That way, calls to these methods, that should never be called, will actually get to the raise NotImplementedError statements.

As is, a user of a derived class that doesn't implement these methods will get errors about unexpected keyword arguments or too many positional arguments when the __init__ and get_filepaths methods in _PlotTypeBaseClass are called (, assuming that the user passes arguments to the methods). These argument error messages aren't as helpful as the message in the raise NotImplementedError statements.

1. Address Keith's comment about *args and **kwargs
2. Use time_bounds to determine interval for histograms and time series
3. Reorganize output so that case name is the top-level directory rather than
   plot type -- this will get clunky when comparing multiple cases, but it
   lines up better with the directory structure of webext.cgd.ucar.edu. When we
   have plots comparing multiple cases, we can figure out a good naming scheme
4. generate_plot_catalog() can print relative paths (and the filepath in the
   json file is relative to the casename rather than the notebooks directory)
@mnlevy1981
Copy link
Contributor Author

mnlevy1981 commented Mar 12, 2021

eefc145 introduces a few updates that are mostly related:

 A few changes to plot generation:

1. Address Keith's comment about *args and **kwargs
2. Use time_bounds to determine interval for histograms and time series
3. Reorganize output so that case name is the top-level directory rather than
   plot type -- this will get clunky when comparing multiple cases, but it
   lines up better with the directory structure of webext.cgd.ucar.edu. When we
   have plots comparing multiple cases, we can figure out a good naming scheme
4. generate_plot_catalog() can print relative paths (and the filepath in the
   json file is relative to the casename rather than the notebooks directory)

The first bullet point is just some low-hanging fruit; the rest ties into making data available for the dashboard. I've started posting some PNG files (and associated CSV catalogs) on webext; so now we can inline images directly from that server:

POC Flux at 100m timeseries from 004

Also, @mgrover1 pointed out that we should use time_bounds instead of time to determine the beginning and end of the interval -- that's why the image above is [correctly] labelled from 0001-01-01 to 0017-12-31 where previous versions claimed it was a time series from 0001-01-16 to 0017-12-16

I'm waiting for one more notebook to finish running, then I'll make a commit with those updates.

edit: I guess the dates don't actually appear in the plot; they're in the file name (POC_FLUX_100m.0001-01-01_0017-12-31.png) as well as the csv catalog

edit2: inline image is now a link to original (I think github cached a version when I originally posted)

These notebooks put images in {casename}/{plottype}/ rather than
{plottype}/{casename}/; there is also a gen_csv.ipynb notebook that creates csv
catalogs for each case (the catalog is {casename}/png_catalog.csv)
@mnlevy1981
Copy link
Contributor Author

I noticed that png_catalog.csv has the index in the first column... but when I set index_label=False it doesn't go away. I'll play with it more during the week.

@mnlevy1981
Copy link
Contributor Author

From chatting with @mgrover1, it looks like we pull useful information from isel_dict for the filename (e.g. basins--Pacific and z_t--35109.35) but the metadata contains entry like "isel_dict": {"basins": 0} and "isel_dict": {"z_t": 0}. Those should be replaced with the actual dimension values... and maybe instead of isel_dict I can call it dim_reduction or something?

For this commit, I only pass savefig_kwargs to summary_plot_histogram() and
summary_plot_global_ts(); for the plot_suite_{003,004,1deg}.ipynb files, I use
kwargs to set dpi=72 (which matches the default value). The next commit(s)
should update the rest of the plotting functions to also use
plot_options['savefig_kwargs'].
A few other small changes snuck in with this commit

1. run_notebooks.sh uses PBS rather than SLURM
2. the location of the zonally-averaged output was moved to campaign
2b. the old datasets were missing year 2 from 003, so I regenerated all the za
output files
    -- I did this based on time series rather than history files because that
       is what we have on disk
    -- I used a slightly updated region mask that doesn't seem to have changed
       the plots but did introduce O(1e-4) relative error in the Atlantic and
       Pacific basins (I believe it moved some grid cells from the Atlantic to
       the Pacific while leaving the global and Indian basins untouched)
I replaced metadata['isel_dict'] with metadata['reduced_dims'] and instead of
including the index of the dim reduction, it lists the actual value (e.g.
instead of basins=0, it contains basins="Global")

I had also switched from ncar-jobqueue to dask-jobqueue in a previous commit,
but with ncar-jobqueue now recognizing the PBS nodes on casper I was able to
switch back.
I regenerated the csv files for each case, and this notebook reflects the
change from isel_dict -> reduced_dims
@mnlevy1981
Copy link
Contributor Author

@mgrover1 and @klindsay28 -- this is ready for review.

@mgrover1 -- I've moved the latest output to webext, hopefully the dashboard reflects the change from isel_dict to reduced_dims.

@klindsay28 -- for now, I've left isel_dict in diag_metadata.yaml; we talked earlier this week about changing that so (for example) that the code uses sel_dict(z_t=35000, method="nearest") instead of isel_dict(z_t=28) but I'd rather address that in a separate issue / PR.

I believe all the other comments mentioned in meetings / this PR thread have been addressed.

1. use "sel_dict" instead of "reduced_dims" for name of dictionary in metadata
2. no longer pass casename = ds.attrs["title"] to plotting functions; all
   functions receive ds, and can determine casename itself (updated trend_map
   notebooks to copy time series attrs to the zonally-averaged datasets)
3. use ds[ds["time"].attrs["bounds"]] instead of ds["time_bound"] or
   ds.time_bound
4. clean up gen_csv.ipynb:
   * use relative path for finding pngs / reading metadata
   * use relative path for writing csv file
   * don't explicitly skip trend_map and trend_hist for the 1 degree run;
   * generate_plot_catalog() can handle non-existent directories
5. Add png_catalog.csv to the repository
@mnlevy1981 mnlevy1981 merged commit 01626d7 into marbl-ecosys:master Apr 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants