Updating Along-slope-velocities.ipynb to use intake from Cosima Cookbook #458

Draft
wants to merge 5 commits into base: main

Conversation

charles-turner-1 (Author):

Updating Along-slope-velocities.ipynb to use intake from Cosima Cookbook.

Couple of issues:

  1. Cosima Cookbook seems to allow searching for coordinates, e.g.:

         st_edges_ocean = cc.querying.getvar(experiment, 'st_edges_ocean', session,
                                             start_time=start_time, end_time=end_time, n=1)
         st_edges_array = st_edges_ocean.expand_dims({'yu_ocean': u.yu_ocean, 'xu_ocean': u.xu_ocean}, axis=[1, 2])

This doesn't seem to be easily achievable with intake (Cell 11 pre-update)? I can't currently figure out how to do this & will keep digging.

  2. Updating from `cosima_cookbook` => `intake` seems to massively increase compute time for the `topographic_slop_magnitude` calculation:
    `UserWarning: Sending large graph of size 47.19 MiB.` => `UserWarning: Sending large graph of size 123.9 MiB.`

Not quite sure of the source of issue 2 yet - the discussion in #313 suggests there might have been a few other similar performance hits - will keep digging.

I currently don't have the right permissions to create a new branch in cosima-recipes - if someone could add me, I'll create a new branch for this.

Updating Along-slope-velocities.ipynb to use intake from Cosima Cookbook.

Couple of issues:

1. Cosima Cookbook seems to allow searching for coordinates, e.g.:

       st_edges_ocean = cc.querying.getvar(experiment, 'st_edges_ocean', session,
                                           start_time=start_time, end_time=end_time, n=1)
       st_edges_array = st_edges_ocean.expand_dims({'yu_ocean': u.yu_ocean, 'xu_ocean': u.xu_ocean}, axis=[1, 2])
This doesn't seem to be easily achievable with intake (Cell 11 pre-update)? I can't currently figure out how to do this.

2. Updating from `cosima_cookbook` => `intake` seems to massively increase compute time for the `topographic_slop_magnitude` calculation:
`UserWarning: Sending large graph of size 47.19 MiB.` => `UserWarning: Sending large graph of size 123.9 MiB.`

The source of issue 2 is currently unclear to me & is dramatically increasing computation time.

Recipes/Along-slope-velocities.ipynb (outdated)
@navidcy (Collaborator), Sep 16, 2024 (via ReviewNB):

can we use lower case for variables?



@navidcy (Collaborator), Sep 16, 2024 (via ReviewNB):

lower case and more verbose? e.g., latitude_slice?



@marc-white (Contributor):

@charles-turner-1 The pattern for selecting out ranges in intake appears to be to load the full data set, and then apply the .sel function, e.g.:

    lat_range = slice(-82, -59)
    darray = some_intake_catalog_dataset.to_dask()
    ht = darray['ht'].sel(yt_ocean=lat_range)

@marc-white (Contributor), Sep 17, 2024 (via ReviewNB):

Line #2.    cat_filtered = catalog.search(name=EXPERIMENT)

This line can go directly to the experiment by key instead:

cat_filtered = catalog[EXPERIMENT]


@marc-white (Contributor), Sep 17, 2024 (via ReviewNB):

If you use catalog[EXPERIMENT] above, this step becomes redundant.



@marc-white (Contributor), Sep 17, 2024 (via ReviewNB):

Line #18.    hu = hu_vars.to_dataset_dict(progress_bar=False).get('ocean_grid.fx').drop(

I suggest splitting this into multiple steps - this makes it clearer what each step is doing, and also makes it easier for those interested to stop the process at a given step and look at what they have.
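For illustration, a hedged sketch of what that split might look like - the dataset key is taken from the quoted line, while the variable pulled out and the final `.drop(...)` (whose argument is truncated in the diff) are assumptions:

    hu_datasets = hu_vars.to_dataset_dict(progress_bar=False)  # dict of xarray Datasets, keyed e.g. 'ocean_grid.fx'
    hu_grid = hu_datasets.get('ocean_grid.fx')                  # intermediate Dataset that can be inspected on its own
    hu = hu_grid['hu']                                          # then re-apply the original .drop(...) as a separate, visible step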



@marc-white (Contributor), Sep 17, 2024 (via ReviewNB):

Move this description above the cell it references



@marc-white (Contributor), Sep 17, 2024 (via ReviewNB):

Line #2.    u_vars, v_vars = (esm_datastore.search(variable='u',filename='ocean.nc'),

I have pulled a copy of the notebook and am playing with it now, but my initial thought is, why can't this be pulled into one array that contains both variables?
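A hedged sketch of what that single search might look like - intake-esm's `search` accepts a list of values for a facet, so both components can come back together (this assumes they sit in the same file/frequency so `to_dask()` still resolves to a single dataset):

    uv_vars = esm_datastore.search(variable=['u', 'v'], filename='ocean.nc')
    uv_ds = uv_vars.to_dask()       # one Dataset holding both velocity components
    u, v = uv_ds['u'], uv_ds['v']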



@marc-white (Contributor), Sep 17, 2024 (via ReviewNB):

Line #8.    grid = xgcm.Grid(ds, periodic=['X'])

What does this line do?



@marc-white (Contributor), Sep 17, 2024 (via ReviewNB):

Line #5.    alongslope_velocity = alongslope_velocity.load()

I'd want to check if the load here is actually necessary - I think the lazy loading magic happens automatically once it's required.



Reply (Contributor):

Can confirm loading here isn't necessary, but I'm wondering if not doing that is what's made the final plotting cell run so slowly. Although admittedly, I'm only running on 7 cores, and I'm thinking there's some clever Dask/xarray things we could do to speed it up (see below)

@marc-white (Contributor), Sep 17, 2024 (via ReviewNB):

The hack I found is that you need to search for and convert to Dask a variable that uses that coordinate scheme. I agree though, there really should be a way to load coordinates only.
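As a rough illustration of that workaround (the variable and coordinate names below are only examples, not necessarily the notebook's actual ones):

    # Load some variable defined on the grid of interest, then pull the
    # coordinate off the resulting Dataset.
    tmp = esm_datastore.search(variable='u', filename='ocean.nc').to_dask()
    yu_ocean = tmp['yu_ocean']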



@marc-white (Contributor), Sep 17, 2024 (via ReviewNB):

Line #1.    salt_ds = esm_datastore.search(path='/g/data/ik11/outputs/access-om2-01/01deg_jra55v13_ryf9091/*',variable='salt',frequency='1mon').to_dataset_dict()

Again, avoid searching on filenames (or other 'magic things' we know because we're experiment experts)
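A hedged sketch of the path-free form being suggested (this assumes the datastore has already been narrowed to the experiment, e.g. via `catalog[EXPERIMENT]` earlier in the notebook):

    salt_ds = esm_datastore.search(variable='salt', frequency='1mon').to_dataset_dict()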



@marc-white (Contributor), Sep 17, 2024 (via ReviewNB):

Line #28.    ax.set_title('Along-slope barotropic velocity (m s$^{-1}$)');

Stray ;



Reply (Contributor):

Also, using a 7-CPU ARE instance, running this cell without pre-loading the alongslope_velocity data took about 90 mins

Reply (Contributor):

This speed was back down to <1 min once I put the time select back in.

@marc-white marc-white self-requested a review September 17, 2024 00:57
@marc-white (Contributor) commented Sep 17, 2024:

Couple of general comments:

  • Something I found really helpful in a previous notebook I looked at was that someone had put the %%time magic at the top of major cells; this left a guideline output in the notebook about how long a cell should take to run (roughly). The notebook then had a comment at the top stating what system (i.e. how many cores) those time estimates were calculated on (see the sketch after this list).
  • I'm watching the final plot cell run now, and I think there's a lot more we can do to improve performance by properly chunking the input data (I don't think there should be 300k+ concatenate jobs waiting to go :\ ). However, there's probably not much point doing that until cc is totally eliminated from this cookbook. (What I'm seeing is a lot of tiny computation jobs being run, but not a lot of memory being utilized on each worker; that makes me think the data is being parceled into far-too-small pieces.)
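A minimal sketch of that convention (the cell body and the core count in the comment are illustrative only):

    %%time
    # Timed on a 14-core ARE instance; expect this cell to take a few minutes.
    alongslope_velocity = alongslope_velocity.load()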

@charles-turner-1 (Author):

Thanks @marc-white - just digging through your comments now.

I've noticed that accessing u & v through intake is blowing their task graphs up to 400 layers, from the 16 that cosima_cookbook was returning. I'm trying to figure out why now.

@marc-white (Contributor):

> I've noticed that accessing u & v through intake is blowing their task graphs up to 400 layers, from the 16 that cosima_cookbook was returning. I'm trying to figure out why now.

Check the data chunking, perhaps? The default chunking respects the file structure, so if there's one file per time step, it's going to make every time step its own chunk... I'm not familiar with cosima_cookbook, but I'm wondering if it did its own magic under the hood in that regard.
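For what it's worth, a quick hedged way to inspect the chunking you're actually getting (assuming `u` is a dask-backed DataArray):

    print(u.chunks)          # per-dimension chunk sizes xarray/dask are using
    print(u.data.numblocks)  # number of dask blocks along each dimension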

  - Where possible, all direct file reads have been removed - still
    requires a bit of a hack to get the grid file (Cell [10]) &
    `st_edges` (Cell [13]).
  - Varnames lowercased.
  - Typos fixed.
  - Direct indexing of catalog.
  - Redundant xarray calls fixed.

Notes:
  - It appears the catalog has several data frequencies available for
    this experiment, unlike `cosima_cookbook`.
    - I made the choice to disambiguate to 3 month freq. to minimise
    	computation size (lowest frequency available).
    - Migrating from cosima_cookbook => intake substantially increased
      dask_graph layers - was necessary to call dask.optimize() to
      flatten these out. All chunking, etc. is the same as it was -
      currently not clear to me where the discrepancy in compute graph
      size originates.
  - Currently doesn't appear that intake supports any good way to search
    for coordinate variables? Leads to aforementioned hacks.
@charles-turner-1 (Author):

I've updated the notebook to address all comments & remove all direct file opens / use of known paths, where possible - there are still a couple of dodgy workarounds to get coordinates & grid data.

@navidcy: My general approach would be to treat things like lon/lat slices, time slices, etc. that we are doing the analysis over as constants - hence uppercasing them. Does this go against the general convention in Cosima? Happy to follow that if so.

@marc-white (Contributor):

@charles-turner-1 can you explain your method for comparing the dask_graph layers?

@anton-seaice (Collaborator):

The cookbook did attempt to do some stuff with chunk sizes, e.g. https://github.com/COSIMA/cosima-cookbook/blob/a9f5fc9b2ff3daf1cca0b9dcd7173636c2e274e4/cosima_cookbook/querying.py#L353, but not without its issues (e.g. COSIMA/cosima-cookbook#333).

The other change that may impact timing is that xarray changed the default chunks it applies: it now uses the chunk sizes defined in the source netCDF files (which, for a lot of ACCESS-OM data, are very small). I think we should go to town on specifying some better sizes.


review-notebook-app bot commented Sep 19, 2024


anton-seaice commented on 2024-09-19T07:22:43Z
----------------------------------------------------------------

Line #3.                    .to_dask()

I think you can add an `xarray_open_kwargs={'chunks': {'time': ...}}` to this `to_dask()` call.

I would try to aim for the chunks before the sel and mean operations to be ~300 MB?


charles-turner-1 commented on 2024-09-20T02:23:44Z
----------------------------------------------------------------

Cheers, fiddled with this a little bit more and managed to remove a good chunk of the explicit optimisations - interestingly, the cosima_cookbook version was using the same chunk sizes I was getting here using defaults.

Do you know if the order of operations can affect the graph size? I'm struggling to see any other likely causes of the number of graph layers blowing up.

anton-seaice commented on 2024-09-20T03:45:53Z
----------------------------------------------------------------

The number of graph layers seems to be coming from the number of files - there are two per file opened. I didn't realise that's how graph layers worked! I don't know of optimisations for that, but maybe there are?

setting

   chunks={'time':-1, 'st_ocean':28, 'yu_ocean':600, 'xu_ocean':1200},

seemed like an ok starting point
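For concreteness, a hedged sketch of how that spec could be passed through when opening (the datastore/search names are illustrative, and `xarray_open_kwargs` is the route suggested above):

    chunk_spec = {'time': -1, 'st_ocean': 28, 'yu_ocean': 600, 'xu_ocean': 1200}
    ds = u_vars.to_dask(xarray_open_kwargs={'chunks': chunk_spec})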

marc-white commented on 2024-09-20T06:58:00Z
----------------------------------------------------------------

I've been playing with chunking as well this afternoon, and I think that's the 'solution' (or as far as you might get) - all of the .load commands and optimize commands don't seem to have any further effect on the run-time of the final plot call once chunking is fixed up (I'm getting about 7 mins on a 14-core instance).


review-notebook-app bot commented Sep 19, 2024


anton-seaice commented on 2024-09-19T07:22:44Z
----------------------------------------------------------------

Line #3.    

I would search the catalog for the data which has st_edges_ocean as a coordinate, and load that instead? And then grab st_edges_ocean from it? Saying that, if it's fixed frequency, then you might still end up with this ['path'][0].


charles-turner-1 commented on 2024-09-20T02:18:47Z
----------------------------------------------------------------

AFAIK there is currently no way of searching for coordinates? See intake/intake-esm#660, I'm working on seeing if we can patch that.

I agree the current approach is messy.

anton-seaice commented on 2024-09-20T03:52:05Z
----------------------------------------------------------------

Nevermind, my suggestion won't work. st_edges_ocean doesn't appear to be the coordinate for any variable (at least not obviously so)

@charles-turner-1 (Author) commented Sep 20, 2024:

> @charles-turner-1 can you explain your method for comparing the dask_graph layers?

@marc-white To be honest, I've largely just eyeballed this - I tried to take a look at the dask graph with dask.visualize(u) in the notebook, but it was prohibitively expensive, presumably on account of the large graph size. I computed a couple of cursory statistics pre & post optimisation & satisfied myself that manual optimisation wasn't having any unintended side effects.

I'm still not sure why manual optimisation is necessary, nor where the intermediate layers (not found in the cookbook version) are coming from here. The Dask API reference reads to me as if it just prunes redundant graph layers & should leave the result unchanged.
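For reference, one hedged way to compare layer counts programmatically rather than eyeballing (this assumes `u` is a dask-backed DataArray, and uses the same `dask.optimize()` call discussed above):

    import dask

    n_before = len(u.data.dask.layers)    # layers in the high-level graph as built
    (u_opt,) = dask.optimize(u)           # the manual optimisation step under discussion
    n_after = len(u_opt.data.dask.layers)
    print(n_before, n_after)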

- Extracted depth_slice & data freq to constants & changed default data
  frequency to 1 month (`1mon`)
- Added a bunch of comments & explanation surrounding chunking.
- Added some comments surrounding operations & what they're doing

@marc-white (Contributor):

I realized that the time selection from the cell creating uv_data_arr had vanished - adding that back in makes the subsequent dask.optimize command really quick and reduces things to a single graph layer. However, I'm pretty confused as to why the uv_data_arr still has 1800+ graph layers after adding the time selection back in - I would have thought that would have massively reduced the number of files to touch (which I'm assuming is tightly coupled to graph layer numbers).

@marc-white (Contributor):

> I realized that the time selection from the cell creating uv_data_arr had vanished - adding that back in makes the subsequent dask.optimize command really quick and reduces things to a single graph layer. However, I'm pretty confused as to why the uv_data_arr still has 1800+ graph layers after adding the time selection back in - I would have thought that would have massively reduced the number of files to touch (which I'm assuming is tightly coupled to graph layer numbers).

Interestingly, if I omit the optimize call and then attempt to load, say, u, the speed/outcome/number of processes shown in the Dask dashboard is the same. I'm wondering if the original recipe/cosima-cookbook is doing something to pre-compute the number of graph layers it actually needs, while the intake-based recipe is showing the full number of graph layers it might have, but then realizes it doesn't need them all once the data are actually loaded.


review-notebook-app bot commented Sep 23, 2024


marc-white commented on 2024-09-23T06:31:14Z
----------------------------------------------------------------

Line #6.    

Add in a time slice - will be needed later!



review-notebook-app bot commented Sep 23, 2024


marc-white commented on 2024-09-23T06:31:15Z
----------------------------------------------------------------

Line #15.            xarray_open_kwargs = {'chunks': chunk_spec} 

Chunking on time here will have no effect, given the data are split into files by time step. You'd need to separately chunk on time after creating the array.
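A hedged sketch of that follow-up re-chunk (the array name follows the notebook's `uv_data_arr`; `-1` means one chunk spanning the whole time axis):

    uv_data_arr = uv_data_arr.chunk({'time': -1})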



review-notebook-app bot commented Sep 23, 2024


marc-white commented on 2024-09-23T06:31:15Z
----------------------------------------------------------------

Line #17.                   ).sel(st_ocean=depth_slice

Need to select on the time slice created above, otherwise data volume is huge!

Also, ignore the Dask graph layer reports - optimizing seems to have no real effect on the speed of the rest of the notebook.


@charles-turner-1 charles-turner-1 marked this pull request as draft October 9, 2024 03:36