Planning issue #1
@salvis2 might be interested in helping out with that. It would be pretty great to have terraform/aws and terraform/gcp subfolders to manage this.
For AWS, any of the geospatial public datasets would work for a demo. Perhaps @cgentemann's https://github.com/pangeo-gallery/osm2020tutorial/blob/master/GCP-notebooks/Access_cloud_data_examples.ipynb is a good starting point, since it uses CMIP6 on GCP and MUR SST on AWS?
👍 The NCAR CESM LENS Dataset on AWS is also good.
Thanks for the dataset suggestions. Terraform update: going to throw Dask Gateway on the clusters next. For now, I think I'm going to skip JupyterHub and just manually connect to the Gateways. We can think about ways of doing auth / config properly later.
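Connecting manually looks roughly like this (a sketch: the gateway address and password are placeholders, and BasicAuth assumes the gateways are configured for password auth):

```python
from dask_gateway import Gateway
from dask_gateway.auth import BasicAuth

# Placeholder address/password; the real values come out of the
# terraform deploy for each cloud.
gateway = Gateway(
    "http://<aws-gateway-load-balancer>",
    auth=BasicAuth(password="<shared-password>"),
)
cluster = gateway.new_cluster()
cluster.scale(4)
client = cluster.get_client()
```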
@TomAugspurger - I'd love to help you brainstorm a simple yet scientifically useful example to show. What about comparing some statistics of precipitation between the LENS data on AWS and the ERA5 reanalysis data on GCS (https://catalog.pangeo.io/browse/master/atmosphere/era5_hourly_reanalysis_single_levels_sa/)? Would be happy to hop on a chat later today to brainstorm.
@rabernat that sounds good. Specifically: doing some big reduction on each cluster and then comparing the reduced values back on the client machine. I'm free the rest of the day to chat.
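Roughly the pattern I have in mind (a sketch with placeholder gateway addresses, and toy random arrays standing in for the real datasets):

```python
import dask.array as da
from dask_gateway import Gateway

# Hypothetical gateway addresses, one per cloud.
client_aws = Gateway("http://<aws-gateway>").new_cluster().get_client()
client_gcp = Gateway("http://<gcp-gateway>").new_cluster().get_client()

# Toy reductions standing in for the real per-cloud datasets.
reduced_aws = da.random.random((10_000, 10_000)).mean()
reduced_gcp = da.random.random((10_000, 10_000)).mean()

# Each reduction runs on its own cluster; only the small reduced
# values travel back to the client for comparison.
result_aws = client_aws.compute(reduced_aws).result()
result_gcp = client_gcp.compute(reduced_gcp).result()
print(result_aws - result_gcp)
```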
OK, as of master right now, running the container should start up a JupyterLab session on localhost with the right versions of packages (assuming you have Docker running locally and nothing already on port 8888). The image is built from https://github.com/pangeo-data/multicloud-demo/tree/binder and uploaded to my personal Docker Hub account. Let me know if you want the password to access the Dask clusters and don't already have it. I'll look into accessing data tomorrow.
@TomAugspurger - this is coming together nicely! I just tried your notebook but am not seeing dask workers on the AWS cluster after ~10 min. I also noticed it is running in a different region from the data.
OK, switching to different regions shouldn't be difficult. It'll just change the URLs. One thing I noticed: for some reason the number of workers wasn't updating correctly in the AWS GatewayCluster, but the workers were definitely there and could run tasks. A few other things are broken as well.
OK, another thought: it would be neat to map the obscure load balancer URLs to static names. Then the notebooks wouldn't have to change even if the deployments do.
I've spent the last week and today working with DNS in k8s :) A key question driving the options is which environments need to be able to look up which domain names and resolve them to which IPs: user browsers around the world, only pods inside a single k8s cluster, a few clusters, etc.
Going to figure out loading ERA5 data on GCP next, and then this should be ready to hand off to an actual scientist to do the interesting bits :) Managing the "active" Dask client / cluster is a bit tricky, but I'm hopeful that we can asynchronously load data onto each cluster at the same time. We'll see.
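The idea for the concurrent load (an assumption on my part, not something verified yet): `Client.persist` submits work without blocking, so kicking off a persist against each cluster back-to-back lets both clouds load at the same time. A sketch, with hypothetical store paths:

```python
import fsspec
import xarray as xr
from dask_gateway import Gateway

client_aws = Gateway("http://<aws-gateway>").new_cluster().get_client()
client_gcp = Gateway("http://<gcp-gateway>").new_cluster().get_client()

# Hypothetical zarr store paths for the two datasets.
ds_lens = xr.open_zarr(fsspec.get_mapper("s3://<lens-bucket>/<path>"), consolidated=True)
ds_era5 = xr.open_zarr(fsspec.get_mapper("gs://<era5-bucket>/<path>"), consolidated=True)

# persist returns immediately, so both clusters load their chunks concurrently.
lens_loaded = client_aws.persist(ds_lens)
era5_loaded = client_gcp.persist(ds_era5)
```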
Me me me! Let me know when it's up and I'll use it, please! Thanks Tom!!!! ERA5!!!
Thanks, hoping to have it done later today. Getting the workers to be able to read the requester pays bucket seems extremely complicated :/
Should we just make it public for the time being?
Actually, all the data in that bucket is already public.
Hmm, does the pangeo-datastore catalog need to be updated then? I'm working off https://github.com/pangeo-data/pangeo-datastore/blob/09692e8c5a1e0b49a03a92dde0ed47dca859ca7e/intake-catalogs/atmosphere.yaml#L75-L88, which has requester_pays set in its storage options.
Can you temporarily override that option when opening the data? I never understood that part of intake very well. If not, then yes, I suppose we should update the catalog. Actually, long term we would prefer it to be requester_pays, so this is kind of a fluke.
Yes, you can override any argument (this has been possible for a while now).
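Something like this (a sketch: the catalog URL and entry name are taken from the links above, and the override semantics are as I understand intake's keyword handling):

```python
import intake

cat = intake.open_catalog(
    "https://raw.githubusercontent.com/pangeo-data/pangeo-datastore"
    "/master/intake-catalogs/atmosphere.yaml"
)
# Keyword arguments passed when instantiating an entry override the values
# baked into the catalog's args, so requester_pays can be switched off here.
source = cat.era5_hourly_reanalysis_single_levels_sa(
    storage_options={"requester_pays": False}
)
ds = source.to_dask()
```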
Thanks Martin. I think I'm ~80% of the way to getting workers to read data from the buckets. I'll try a bit longer, and fall back to overriding with requester_pays=False if I can't get it working.
Got the ERA5 reduction running. Had to bump up the size of the scheduler and workers. I have some notes on how I gave the workers the ability to read from the requester pays bucket; I'll clean them up and post them later. Right now, in order to run this notebook you need the secret key to unlock the credential files.
@cgentemann I need to share a couple secret files with you. Do you have a keybase account?
@TomAugspurger - there is a 'pangeo' group on keybase that @rabernat owns. Could use that for sharing credentials? But do we need to share these credentials? It seems like only the dask workers need credentials to read data in requester-pays buckets, correct? The client running locally isn't making GET requests? In which case, if you create the credentials along with the cluster (terraform or separate kubectl), the dask-gateway helm configuration just needs to point to the service account with the permissions you've created, as we do here: https://github.com/pangeo-data/pangeo-binder/blob/c58c869045b8f1374825ad01dcbf360bdc57a77b/pangeo-binder/values.yaml#L227-L232

@salvis2 - could you point to an example of how you do the AWS role permissions linking with a kubernetes service account via terraform?
The local client needs some GCP credentials for reading the era5 metadata.
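For the local side, that's just a matter of handing gcsfs a token, e.g. (a sketch; the key path is a placeholder):

```python
import gcsfs

# Hypothetical service-account key; any mechanism that gives gcsfs a
# token for authenticated reads would do.
fs = gcsfs.GCSFileSystem(token="/path/to/service-account.json")
```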
Is this a bad time to trial a "dask::" url? :)
Perhaps :) Does anyone have time to do a similar analysis to https://github.com/pangeo-data/multicloud-demo/blob/master/multicloud.ipynb on the LENS data, or perhaps write some prose on what it's doing? If not I can attempt that later tonight. I'm free for the next couple hours to assist with infrastructure things.
Except it doesn't! This data is fully public. The catalog is wrong!
Thanks for the reminder. Changed to anonymous access. The only auth requirement now is that you know the password to connect to the Dask gateways.
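Concretely, anonymous access with gcsfs looks like this (a sketch):

```python
import gcsfs

# The bucket is fully public, so an anonymous token is enough.
fs = gcsfs.GCSFileSystem(token="anon")
```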
Are you hoping that I will contribute more of a science use-case than what I have in #2? No problem if so, just need to plan my next 24 hours.
To answer my own question... #2 just looks at ERA5. To make this more complete, we would want to compare to the Large Ensemble data in AWS.
Yep, that's my question: is there a specific variable from LENS that I should use for the comparison?
Ok, I have just added a LENS notebook to #2. They could easily be merged into one notebook. It will look very familiar. The main differences between the ERA5 data and LENS are (see the sketch after this list):

- ERA5 is hourly, LENS 6-hourly, so ERA5 had to be coarsened (note that coarsen works much better than resample with dask)
- ERA5 precip units are m, LENS are m/s, so a unit conversion is needed for comparison
- LENS has an extra dimension, member_id, which corresponds to the ensemble member of the simulation. It can be used to assess natural variability in the climate system
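For the first two points, the operations look roughly like this (the dataset handles and variable names here are assumptions based on the notebooks):

```python
# 1. ERA5 is hourly: sum six consecutive hourly accumulations to get
#    6-hourly totals on the same cadence as LENS ("trim" drops a ragged end).
era5_tp_6hr = era5.tp.coarsen(time=6, boundary="trim").sum()

# 2. LENS PRECT is a rate in m/s: multiply by the seconds in six hours to
#    get an accumulation in metres, comparable with ERA5.
lens_tp_6hr = lens.PRECT * 6 * 60 * 60
```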
Thanks. Trying it out in the multicloud notebook now.
Thanks @rabernat. In the notebook you do

```python
precip_in_m = ds.PRECT.isel(member_id=0) * (6 * hour)
```
That does fine on the LENS subset, but seems to have some trouble with the full dataset. I'll scale up the instance types and memory per worker and try it again later.
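If the gateway server exposes them, per-worker resources can also be adjusted through cluster options. A sketch only: the option names (worker_memory, worker_cores) are server-defined and assumed here:

```python
from dask_gateway import Gateway

gateway = Gateway()  # assumes address/auth come from local dask config
options = gateway.cluster_options()
options.worker_memory = 16  # GiB; option name is an assumption
options.worker_cores = 4
cluster = gateway.new_cluster(options)
cluster.scale(100)
```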
Just pushed an update: https://nbviewer.jupyter.org/github/pangeo-data/multicloud-demo/blob/master/multicloud.ipynb.
Sorry, just seeing this now. If you remove …
@rabernat unfortunately, no. Same behavior: I'm seeing workers killed off. The pattern of computation looks quite different too. From what I can tell, for LENS all of the zarr files are being loaded at once before progressing on any of the histogram tasks; I don't think that's the case for the ERA5 data.
I wouldn't lose too much sleep over this. You may be getting into the weeds with xhistogram (https://github.com/xgcm/xhistogram/blob/master/xhistogram/core.py). I could have sworn I opened a dask issue about a related problem (related to how reshape works, which xhistogram uses internally), but I can't find it.
This should work:

```python
tp_hist = histogram(
    precip_in_m.rename('tp_6hr'), bins=[tp_6hr_bins], dim=['lon', 'member_id']
).mean(dim='time')
```

Apparently the order of operations matters a lot here. To compare with the other results, you'll have to renormalize, i.e.

```python
tp_hist /= len(ds.member_id)
```
Ah yes, here it is:
With 200 workers and

```python
lens_tp_hist = histogram(
    precip_in_m.rename('tp_6hr'), bins=[tp_6hr_bins], dim=['lon', 'member_id']
).mean(dim=('time'))
lens_tp_hist /= len(lens.member_id)
lens_tp_hist.data
```

we get about 1/2 of the way through. Will maybe look a bit more later, but won't spend too much time on it.
Tom, all I can say here is: welcome to my life! 😂 Perhaps this exercise helps you appreciate some of the frustrations that Pangeo users are feeling around Dask! As always, thanks for your persistence and patience.
It's good to get some real-world experience :) Would it make sense to take the histogram of the mean over the ensemble members?

```python
lens_tp_hist = histogram(
    precip_in_m.rename('tp_6hr').mean(dim="member_id"),
    bins=[tp_6hr_bins], dim=['lon']
).mean(dim=('time'))
```

Or is a simple average of the predictions a bad idea?
Histogram of the mean != mean of the histograms. In particular, by taking the histogram of the ensemble mean, we will definitely lose lots of the tail of the distribution. Scientifically, the mean of the histograms is much more useful and relevant to assessing rainfall extremes.
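A toy numeric illustration of the difference (my own example, not from the notebook):

```python
import numpy as np

# Two fake ensemble members whose extremes cancel in the mean.
member_a = np.array([0.0, 2.0])
member_b = np.array([2.0, 0.0])
bins = [0, 1, 3]

# Histogram of the ensemble mean: everything lands in the upper bin.
hist_of_mean = np.histogram((member_a + member_b) / 2, bins=bins)[0]  # [0, 2]

# Mean of the per-member histograms: the tails survive.
mean_of_hists = (
    np.histogram(member_a, bins=bins)[0] + np.histogram(member_b, bins=bins)[0]
) / 2  # [1., 1.]
```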
Yeah, that was my fear. I just pushed a commit to take each histogram separately:

```python
histograms = [
    histogram(
        precip_in_m.sel(member_id=member_id).rename("tp_6hr"),
        bins=[tp_6hr_bins], dim=["lon"]
    ).mean(dim=('time'))
    for member_id in precip_in_m.member_id.values
]
```

To visualize, I made a FacetGrid plotting each member separately. Is there anything interesting to say about it? https://nbviewer.jupyter.org/github/pangeo-data/multicloud-demo/blob/master/multicloud.ipynb#Compare-results
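The faceting step is roughly this (assuming the histograms list from above; the concat dimension name is mine):

```python
import xarray as xr

# Stack the per-member histograms along a new member_id dimension; xarray's
# FacetGrid machinery then draws one panel per ensemble member.
all_members = xr.concat(histograms, dim="member_id")
all_members.plot(col="member_id", col_wrap=5)
```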
No. 😄 It looks like there is very little variability across the ensemble. So you may be better off sticking with just one member, and keeping the notebook clean.
OK, back to simple! I'll submit this later tonight.
Tom - any chance you can add some more ERA5 variables?
FYI, I updated the README with a link to a screencast walking through the notebook: https://www.youtube.com/watch?v=IeKjLiUqpT4
I think the notebook is due on May 15th. I'd like to quickly spike out a demo that:

Stuff that would be helpful soonish:

Stuff that would be helpful later:
cc @martindurant @rabernat @jhamman (and @scottyhq since I'm using your AWS credits 😄)