Use Case Notebook for "Atmospheric Moisture Budgets" #1
We should add something about setting up the environment and enumerating the dependencies. I have a guide to setting up a conda environment for xarray work, but @darothen's "getting started" guide for gcpy is better and more comprehensive.
You also asked about the gist extension.
Please feel free to cannibalize my guide. If I can help in any way, do let me know.
Curious, what motivated this (and "the other three") use case? Just something that satisfies your above constraints? I ask because I have worked on the problem of closing atmospheric column tracer budgets computed post-hoc. Depending on how concerned you are with closure, this can be pretty involved. The Appendices of our recent paper touch on this: http://journals.ametsoc.org/doi/abs/10.1175/JCLI-D-16-0785.1. But I suspect this is more detail than you are looking for? If not, I can provide the Python functions that I defined for these efforts (they relied on my infinite-diff package and were computed via an aospy pipeline).
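(For readers outside the subfield: the vertically integrated moisture budget at issue is, schematically,

$$\frac{\partial \langle q \rangle}{\partial t} + \nabla \cdot \langle \mathbf{u}\,q \rangle = E - P, \qquad \langle \cdot \rangle \equiv \frac{1}{g}\int_0^{p_s} (\cdot)\,dp,$$

where $q$ is specific humidity, $\mathbf{u}$ the horizontal wind, $E$ evaporation, and $P$ precipitation; closing this budget post hoc from archived model output is what makes the problem involved.)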
@spencerahill: hopefully the project description I published earlier today clarifies what I meant by "use cases": these are the official use cases from the NSF proposal. However, we are certainly in need of as many use cases as possible and welcome contributions from everyone. If you have use cases you want to contribute, I would submit a PR to the pangeo-tutorials repo (currently empty, but hopefully will fill up with cool stuff!)
Just a heads-up that the lack of any use case notebooks is now blocking progress on our project. It would be great to get some basic use cases checked in. They don't have to be fancy! Any sort of semi-realistic workflow on real data is enough for the systems people to move forward with analyzing performance.
The calculation of the eddy moisture transports is the most I/O-intensive part of the budgets, so I am developing a preliminary notebook which computes the monthly mean covariances (e.g., u'q') from the 6-hourly data. I will get the notebook working on cheyenne for the CMIP5 version of CESM. Once it is working, I will submit it to the group for critique before cleaning it up. Then I will start working on the vertical integrals, etc. Sound ok?
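(A minimal xarray sketch of the covariance calculation described here, using CMIP5 variable names ua and hus, a hypothetical file pattern, and current resample syntax -- not the notebook's actual code:)

```python
import xarray as xr

# Open the 6-hourly fields lazily (file pattern is hypothetical):
ds = xr.open_mfdataset('6hrLev_CCSM4_*.nc')
u, q = ds['ua'], ds['hus']

# Monthly-mean eddy covariance: <u'q'> = <uq> - <u><q>,
# where <.> is the monthly time mean.
u_bar = u.resample(time='M').mean('time')
q_bar = q.resample(time='M').mean('time')
uq_bar = (u * q).resample(time='M').mean('time')
eddy_uq = uq_bar - u_bar * q_bar
```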
@naomi-henderson if you run into performance issues (like I/O bottlenecks) and want to try scaling things, let me know. I'd be happy to collaborate to try to accelerate things.
@naomi-henderson your plan sounds perfect. Also feel free to ping us for help with any xarray or general python questions you may encounter while developing your notebook. I know you're still spinning up on these tools, so we are here to help with that too.
@rabernat - I have made a preliminary version of an eddy covariance notebook and it works fine, albeit slowly, on my local machine (one year requires processing 24 GB). I am having a few issues with the time grids, groupby, etc., as you can see here: covar notebook. I may have wandered off into the deep end ...
Awesome @naomi-henderson! This looks like a great start. I'm very pleased you have gotten your calculation to a preliminary working version. I would love to sit with you for an hour on Wednesday and go through this in detail. It should be possible to eliminate the loops over files: Dask is unable to parallelize such operations, and they can also be confusing to read. The idea is to first create one big xarray dataset with everything in it (although not in memory) and then perform the desired grouping / aggregation on it using xarray's built-in capabilities, as sketched below. You should never have to manually accumulate an average, as you do in your notebook, although this is a very natural place to start for someone coming from matlab or another lower-level language! I'm free all day on Wednesday, so please suggest a time that works for you!
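(Schematically, the pattern being recommended, with placeholder paths and grouping rather than the notebook's actual calculation:)

```python
import xarray as xr

# One lazy Dataset spanning all the files -- no manual loop over files:
ds = xr.open_mfdataset('/path/to/model/output/*.nc')

# Aggregate with xarray's built-in grouping instead of
# accumulating an average by hand:
monthly_clim = ds.groupby('time.month').mean('time')
```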
It might be useful to keep this notebook around to show before-and-after examples. This Rosetta-stone-style comparison might help other scientists more accustomed to C++/Fortran-style development understand XArray-style development, and vice versa.
We only have a few opportunities for this kind of comparison before everyone gets exposed to both types.
Thanks @rabernat, I was hoping you would have time to help me with this! Does 10am on Wed work for you?
Yes, perfect.
Thanks to @rabernat this morning, I have a new version of the preliminary eddy covariance notebook: New version, no loops. I love the suggestion by @mrocklin to keep my preliminary notebooks around to help others migrate to the xarray/dask world! I hope to check it out on cheyenne (using the 'export LD_LIBRARY_PATH=' temporary fix). My next step is to make a new notebook which does the vertical interpolation of the monthly data from sigma levels to pressure levels. I know that there are python scripts to do this on the command line, file by file, but we need something that works on the whole xarray Dataset. Expect many questions.
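(One possible approach to that interpolation, sketched assuming CESM hybrid-coefficient names hyam/hybm/P0/PS and a monthly Dataset ds already in memory; dask-backed data would additionally need dask='parallelized'. This is not the notebook's actual code:)

```python
import numpy as np
import xarray as xr

# Full pressure on model (hybrid sigma-pressure) levels: p = hyam*P0 + hybm*PS
p_full = ds['hyam'] * ds['P0'] + ds['hybm'] * ds['PS']

plevs = np.array([100000.0, 85000.0, 50000.0, 25000.0])  # target levels [Pa]

def to_plev(column, p_column):
    # np.interp needs increasing x; CESM levels run top-down,
    # so pressure already increases along 'lev'.
    return np.interp(plevs, p_column, column, left=np.nan, right=np.nan)

ua_plev = xr.apply_ufunc(
    to_plev, ds['ua'], p_full,
    input_core_dims=[['lev'], ['lev']],
    output_core_dims=[['plev']],
    vectorize=True,  # loop over every (time, lat, lon) column
).assign_coords(plev=plevs)
```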
Cool-looking notebook, @naomi-henderson! Makes me wish I understood what you all were doing :) Do you have any thoughts about the experience? Both on what felt pleasant and on what felt frustrating at first? It would be useful to understand your thought process as you're going through this.
(only if you feel like sharing, didn't mean to put you on the spot)
I forgot to mention the (many) reasons why the eddy covariance notebook is still preliminary. Although it works for the NCAR/CCSM4 model, it will not work for the models with multiple ensemble members. This will require adding a new 'member' dimension. The notebook also needs various checks for the existence of the correct variables as well as debugging for all possible calendars (leap, no leap, 360 days/year, etc).
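(A minimal sketch of the kinds of checks described above; the required-variable list is assumed, not prescribed:)

```python
# Verify that the variables the budget needs are present:
required = ['ua', 'va', 'hus', 'ps']
missing = [v for v in required if v not in ds.data_vars]
if missing:
    raise KeyError('missing variables: {}'.format(missing))

# Inspect which CF calendar the model uses ('noleap', '360_day', ...):
calendar = ds['time'].encoding.get('calendar', 'standard')
```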
@mrocklin: the markdown cell at the end of the notebook gives a description of the calculation and data used. But it probably still assumes too much domain knowledge and uses lots of jargon. Let us know how we can improve this to make it more understandable to a wider audience.
@spencerahill: any thoughts on how aospy might fit into the CMIP data processing workflow that Naomi is describing above?
These kinds of things are definitely in the spirit of aospy, but it may be overkill to implement a full aospy-based workflow in this case. Frankly, we require a fair amount of boilerplate at present, which I think cuts against the intent of the notebook. (Perhaps, @rabernat, a separate notebook specifically presenting an aospy workflow for automating calculations across CMIP runs would be welcome?) For ensemble members, this is where xarray's data model shines: insert a new dimension, e.g. 'member', into each dataset and concatenate along it. For the calendars, again xarray does the heavy lifting: so long as the calendars and time data are CF-compliant (which CMIP data all should be), xarray should decode them properly without much trouble. Where aospy adds additional functionality is in dealing with dates outside the (dare I say narrow) range of dates permitted by numpy's nanosecond-precision datetime64. See e.g. aospy.utils.times.numpy_datetime_workaround_encode_cf, which re-encodes the time array of a not-yet-CF-decoded Dataset with out-of-bounds dates such that, when it is decoded, the time array starts at (nearly) the earliest permissible date. @spencerkclark has much more expertise on calendars/datetimes than me, both within aospy and without, so Spencer, feel free to chime in.
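(Concretely, the ensemble pattern described might look like this, with hypothetical member ids and paths:)

```python
import pandas as pd
import xarray as xr

members = ['r1i1p1', 'r2i1p1', 'r3i1p1']
datasets = [xr.open_mfdataset('/data/CCSM4/%s/*.nc' % m) for m in members]

# Concatenate along a brand-new 'member' dimension:
ds = xr.concat(datasets, dim=pd.Index(members, name='member'))
```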
Thanks @spencerahill for the hints on creating a new dimension for ensemble members. I have now tried the eddy covariance notebook on cheyenne, using JupyterLab and the Dask dashboard. @mrocklin and @rabernat, I am certainly having performance issues, but it looks like most of the time goes to reading the CMIP5 data from '/glade/p/CMIP/CMIP5/output1/...'. The open_mfdataset call is very slow on cheyenne, whereas the same call on my linux box here at Lamont (data stored on an internal 4 TB drive) takes less than 10 s. The resampling and load() from 6-hourly to monthly, on the other hand, is fast on cheyenne compared to a slightly longer time on my linux box. This is using the sample dask.sh: two nodes, 36 cores each, on cheyenne.
How many files are in your dataset? @rabernat is this possibly the slow file system thing going on, where we might want to avoid opening every individual file to check metadata?
Is this related to the 'too many open files' issue we had in xarray a while back?
In this test case, I open only 12 files in order to process a single year of a 55-year dataset. Is there a way to turn off the metadata and coordinate checking entirely? Ryan and I did pre-process the files, trying to avoid the overhead of checking metadata. Since it is only a problem on cheyenne and not on my linux box, I was assuming it was something in the file system. The directory structure and files are the same on both.
I'm guessing if you try ...
Seeing all the time being spent in ... The solution is to tell dask to use a single local thread and then try again:

```python
import dask
dask.set_options(get=dask.local.get_sync)
```

I get the following profile output:
So there is a ton of filesystem stuff going on (which is not surprising). I'm also working through the notebook now with a cluster active and am seeing some odd behavior. If anyone with knowledge of XArray internals has some time, I would appreciate a video chat where I screenshare and we look at a few things. There is a lot of short-sequential behavior going on. cc @jhamman, or @shoyer if he is interested in seeing things run on a cluster.
This line can take a surprisingly long time when we first start up: ...
The odd behavior that I'm seeing is tons of very small sequential computations. It is as though somewhere in XArray there is a for loop calling compute many times over each of the chunks.
Is that sequential computation happening during open_mfdataset?
Yes, also resample
If you'd like to see, I can easily spin up a cluster and screenshare
Or maybe I'll make a quick screencast
Neither of these operations is parallelized. They are basically loops. open_mfdataset just opens each file in a list and then loops over them again when concatenating.
You have arrived at one of the cruxes of xarray's performance bottlenecks. Welcome!
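(Schematically -- this is the pattern being described, not xarray's actual source:)

```python
import xarray as xr

paths = ['file0.nc', 'file1.nc', 'file2.nc']   # placeholder file list

# Each file is opened one at a time...
datasets = [xr.open_dataset(p) for p in paths]
# ...and then looped over again during concatenation:
combined = xr.concat(datasets, dim='time')
```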
I personally don't need to see it. I think I know exactly what your bokeh dashboard will show, because I see it myself every time I do these things: serial dispatching of dask tasks to the cluster. A screencast would be very useful, however, to communicate the problem to a wider audience (e.g. xarray devs).
@rabernat can you point me to that loop?
You mean the one in open_mfdataset?
@mrocklin - I'm guessing you're working off the last release of xarray. I think pydata/xarray#1551 and a few other associated improvements may help with some of what you're seeing here. These changes are on the master branch now and may be worth trying out.
Yes, that helped considerably. I also made the chunk sizes along time smaller, which both helped a lot with the parallelism and avoided crashing the main machine.
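(The chunking change would look something like this; the chunk size of 100 is a placeholder, not the value actually used:)

```python
# Smaller chunks along time => more, finer-grained dask tasks:
ds = xr.open_mfdataset('6hrLev_CCSM4_*.nc', chunks={'time': 100})
```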
The computations at the end happen in around 20-30 seconds now. I think that the next largest performance boost would be to use persist, so that we can keep data in RAM rather than recomputing each time.
Yeah, most of the time in the plotting lines is just redoing the data loading from disk. Updated notebook: https://gist.github.com/4800484676155745f3cf318bce2e78a4
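(The persist step suggested above is a single extra call once a distributed scheduler is attached; the variable name here is hypothetical:)

```python
# Keep the computed covariances in cluster memory so the plotting cells
# reuse them instead of re-reading the 6-hourly data from disk:
eddy_uq = eddy_uq.persist()
```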
Currently, a call to ...
Much better, @mrocklin! The new version is also working much faster on my machine. However, I am having trouble with your new resample line; instead I still need the old syntax. Is this a difference in our versions of xarray? (I am on 0.9.6.)
Yes, I think so. I got this warning message when I switched to xarray master. I suspect that if you also switch, then things will get even faster for you as well.
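(The change at issue is the resample refactor that landed on master for xarray 0.10; schematically:)

```python
# Old syntax (xarray <= 0.9.6):
monthly = ds.resample('M', dim='time', how='mean')

# New syntax (xarray 0.10 / master at the time):
monthly = ds.resample(time='M').mean('time')
```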
Now that we have the sphinx documentation site working, this use-case notebook can be added into the docs. @naomi-henderson: would you be comfortable making a pull request to add the latest and greatest version of your notebook to this repo?
@rabernat: I will give it a try, but it will have to wait until tomorrow. I will let you know if I have any trouble with sphinx.
Great! You don't necessarily have to build the docs yourself locally. You can just add your notebook in the ...
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.
We need a preliminary notebook that uses Xarray to calculate something related to this use case. It doesn't have to be super long or complicated. It just has to express a real-world, scientifically relevant calculation on a large dataset.
Requirements:
cc: @naomi-henderson
I would like to use this as a template for the other three use cases, so I would appreciate general feedback on the checklist above. Are these the right requirements?