
Planning issue #1

Open
6 tasks done
TomAugspurger opened this issue May 8, 2020 · 51 comments

Comments

@TomAugspurger
Member

TomAugspurger commented May 8, 2020

I think the notebook is due on May 15th. I'd like to quickly spike out a demo that covers:

  • AWS k8s
  • GCP k8s
  • AWS Dask Gateway
  • GCP Dask Gateway
  • Load AWS data
  • Load GCP data

Stuff that would be helpful soonish

  1. What's a good GCP-hosted dataset to play with (probably something from https://pangeo-data.github.io/pangeo-datastore/)? AWS?
  2. Start thinking about UX. I'm comfortable just manually passing around / setting tokens. But thinking forward, we'd need that to be handled on a per-user basis.

Stuff that would be helpful later

  • Assuming we get all the infrastructure pieces working, do we have an actually useful analysis that we could put together in time?

cc @martindurant @rabernat @jhamman (and @scottyhq since I'm using your AWS credits 😄)

@scottyhq
Member

scottyhq commented May 8, 2020

Deploys two Dask-Gateways, one in some GCP region and one in some AWS region (will try terraform for cloud infra stuff first, but might give up quickly)

@salvis2 might be interested in helping out with that. It would be pretty great to have terraform/aws and terraform/gcp subfolders to manage this.

What's a good GCP-hosted dataset to play with (probably something from https://pangeo-data.github.io/pangeo-datastore/)? AWS?

for AWS any of the geospatial public datasets for a demo.

Perhaps @cgentemann 's https://github.com/pangeo-gallery/osm2020tutorial/blob/master/GCP-notebooks/Access_cloud_data_examples.ipynb is a good starting point since it uses CMIP6 on GCP and MUR SST on AWS ?

@rabernat
Member

rabernat commented May 8, 2020

Perhaps @cgentemann 's https://github.com/pangeo-gallery/osm2020tutorial/blob/master/GCP-notebooks/Access_cloud_data_examples.ipynb is a good starting point since it uses CMIP6 on GCP and MUR SST on AWS ?

👍

The NCAR CESM LENS Dataset on AWS is also good:
https://github.com/NCAR/cesm-lens-aws/

@TomAugspurger
Member Author

TomAugspurger commented May 8, 2020

Thanks for the dataset suggestions.

Terraform update:

  • GCP seems to be working? I did auth / service account stuff and VPC manually (see the README) but terraform took it from there. I think terraform can at least handle the VPC things.
  • ran into an issue with AWS: Error creating AutoScaling Group terraform-deploy#29

Going to throw Dask Gateway on the clusters next.

For now, I think I'm going to skip jupyterhub and just manually connect to the Gateways. We can think about ways of doing auth / config properly later.

@rabernat
Member

rabernat commented May 8, 2020

@TomAugspurger - I'd love to help you brainstorm a simple yet scientifically useful example to show. What about something like comparing some statistics of precipitation between the LENS data on AWS to the ERA5 Reanalysis Data on GCS (https://catalog.pangeo.io/browse/master/atmosphere/era5_hourly_reanalysis_single_levels_sa/)?

Would be happy to hop on a chat later today to brainstorm.

@TomAugspurger
Member Author

TomAugspurger commented May 8, 2020

@rabernat that sounds good. Specifically: doing some big reduction on each cluster and then comparing the reduced values back on the client machine.

I'm free the rest of the day to chat.

@TomAugspurger
Member Author

OK, as of master right now, running

make lab

should start up a JupyterLab session on localhost with the right versions of packages (assuming you have Docker running locally and don't have anything running on port 8888 already).

Image is built from https://github.com/pangeo-data/multicloud-demo/tree/binder, and uploaded to my personal dockerhub account.

Let me know if you want the password to access the Dask clusters and don't already have it.

I'll look into accessing data tomorrow.

@scottyhq
Member

@TomAugspurger - this is coming together nicely! I just tried your notebook but am not seeing dask workers on the aws cluster after ~10min.

I noticed it is running in us-east-1. If using CESM LENS, we should change to us-west-2. I looked at running a terraform apply command myself but don't want to overwrite any of your current settings (we'd have to set up an S3 backend for multiple people to modify the terraform infrastructure - https://github.com/ICESAT-2HackWeek/terraform-deploy/tree/d64e1d129aeff74fb99e9fe52d9b3b8c2f0b07a0). I don't think that'll be necessary for this demo though.

@TomAugspurger
Member Author

OK, switching to different regions shouldn't be difficult. It'll just change the URLs.

One thing I noticed, for some reason the number of workers wasn't updating correctly in the AWS GatewayCluster. But the workers were definitely there and could run tasks.

A few other things that are broken:

  1. client.run(...) / client.run_on_scheduler(...). Anything that creates a new connection. Getting None for the security object when it should be an SSLContext.
  2. Client repr / Dashboard link. Need to set JUPYTERHUB_USER

@scottyhq
Member

Ok, another thought is that it would be neat to map the obscure load balancer URLs to static names. Then the notebooks wouldn't have to change even if the deployments do. For example, aef3XXXXXXXXXXXX.us-west-2.elb.amazonaws.com --> http://gateway.aws-uswest2.pangeo.io and http://gateway.gcp-us-central-1.pangeo.io. @yuvipanda pointed out https://github.com/kubernetes-sigs/external-dns/blob/master/docs/faq.md as a solution that we haven't yet explored. Maybe @consideRatio also has some experience with externalDNS? We could also punt this down the road for future work...

@consideRatio
Member

I've spent the last week and today working with DNS in k8s :)

A key question driving the options is which environment needs to be able to look up a given domain name and resolve it to a given IP: is it user browsers around the world, or only pods inside a single k8s cluster, or a few clusters, etc.?

@TomAugspurger
Member Author

The multicloud.ipynb notebook now has an example loading CESM LENS data onto the cluster with intake-esm on the AWS cluster.

Going to figure out loading ERA5 data on GCP next, and then should be ready to hand off to an actual scientist to do the interesting bits :)

Managing the "active" Dask client / cluster is a bit tricky. But I'm hopeful that we can asynchronously load data onto each cluster at the same time. We'll see.

@cgentemann
Member

me me me. let me know when up & I'll use please! Thanks Tom!!!! ERA5!!!

@TomAugspurger
Member Author

Thanks, hoping to have it done later today. Getting the workers to be able to read the requester pays bucket seems extremely complicated :/

@rabernat
Member

Should we just make it public for the time being?

@rabernat
Member

Actually, all the data in the pangeo-era5 bucket is already public, not requester pays: https://console.cloud.google.com/storage/browser/pangeo-era5

@TomAugspurger
Member Author

Hmm does the pangeo-datastore catalog need to be updated then? I'm working off https://github.com/pangeo-data/pangeo-datastore/blob/09692e8c5a1e0b49a03a92dde0ed47dca859ca7e/intake-catalogs/atmosphere.yaml#L75-L88, which has requester_pays=True as a storage_option.

@rabernat
Member

Can you temporarily override that option when opening the data? I never understood that part of intake very well.

If not, then yes, I suppose we should update the catalog. Actually, long term we would prefer to have it be requester_pays, so this is kind of a fluke.

@martindurant

martindurant commented May 12, 2020

Yes, you can override any argument (this has been possible for a while now) - in this case it would look something like

catalog.era5_hourly_reanalysis_single_levels_sa(storage_options={'token': 'anon'})
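Intake's actual parameter-merging machinery aside, the effect of overriding at open time is that user-supplied storage_options shadow the catalog defaults. A stdlib-only sketch of that shadowing (the dict contents mirror the catalog entry and the override above; this is an illustration, not intake's implementation):

```python
from collections import ChainMap

# Catalog default, as in the pangeo-datastore atmosphere.yaml entry
catalog_storage_options = {"requester_pays": True}

# User override at open time, mirroring
# catalog.era5_hourly_reanalysis_single_levels_sa(storage_options={'token': 'anon'})
user_override = {"requester_pays": False, "token": "anon"}

# In a ChainMap the first mapping wins, so user values shadow catalog defaults
effective = dict(ChainMap(user_override, catalog_storage_options))
print(effective)
```

The resulting dict has requester_pays=False and token='anon', which is what ends up being passed to the filesystem layer.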

@TomAugspurger
Member Author

Thanks Martin.

I think I'm ~80% of the way to getting workers to read data from buckets. I'll try a bit longer and override with requester_pays=False if necessary.

@TomAugspurger
Member Author

Got the ERA5 reduction running. Had to bump up the size of the scheduler and workers.

I have some notes on how I got the workers the ability to read from the requester pays bucket. Will clean them up and post them later.

Right now, in order to run this notebook you need the secret key to unlock the secrets folder. That has a service-account file so that you can read the requester pays bucket on the client (your local machine). I have to run now but will reach out to people with the secret key later.

@TomAugspurger
Member Author

@cgentemann I need to share a couple secret files with you. Do you have a Keybase account?

@scottyhq
Member

@TomAugspurger - there is a 'pangeo' group on keybase that @rabernat owns. could use that for sharing credentials?

But, do we need to share these credentials? It seems like only the dask workers need credentials to read data in requester-pays buckets, correct? The client running locally isn't making GET requests? In that case, if you create the credentials along with the cluster (terraform or a separate kubectl step), the dask-gateway helm configuration just needs to point to the service account with permissions that you've created, as we do here:
https://github.com/pangeo-data/pangeo-binder/blob/c58c869045b8f1374825ad01dcbf360bdc57a77b/pangeo-binder/values.yaml#L227-L232

@salvis2 - could you point to an example of how you do the AWS role permissions linking with kubernetes service account via terraform?

@TomAugspurger
Member Author

TomAugspurger commented May 13, 2020 via email

@martindurant

martindurant commented May 13, 2020 via email

@TomAugspurger
Member Author

Is this a bad time to trial a "dask::" url? :)

Perhaps :)


Does anyone have time to do a similar analysis to https://github.com/pangeo-data/multicloud-demo/blob/master/multicloud.ipynb on the LENS data, or perhaps write some prose on what it's doing? If not I can attempt that later tonight.

I'm free for the next couple hours to assist with infrastructure things.

@rabernat
Member

The local client needs some GCP credentials for reading the era5 metadata.

Except it doesn't! This data is fully public. The catalog is wrong!

@TomAugspurger
Member Author

Thanks for the reminder. Changed to catalog.era5_hourly_reanalysis_single_levels_sa(storage_options={"requester_pays": False, 'token': 'anon'}) so we don't need to share any secret files to run this locally.

The only auth requirement is that you know the password to connect to the Dask gateways.

@rabernat
Member

Are you hoping that I will contribute more of a science use-case than what I have in #2? No problem if so, just need to plan my next 24 hours.

@rabernat
Member

To answer my own question...#2 just looks at ERA5. To make this more complete, we would want to compare to the Large Ensemble data in AWS.

@TomAugspurger
Member Author

Yep, that's my question: Is there a specific variable from LENS that I should call histogram on? :) If I knew what data to plug in I think I can take it from there :)

@rabernat
Member

Ok, I have just added a LENS notebook to #2. They could easily be merged into one notebook. It will look very familiar.

The main differences between the ERA5 data and LENS are

  • ERA5 is hourly, LENS is 6-hourly, so ERA5 had to be coarsened (note that coarsen works much better than resample with dask)
  • ERA5 precip units are m, LENS's are m/s, so a unit conversion is needed for comparison
  • LENS has an extra dimension, member_id, which corresponds to the ensemble member of the simulation. It can be used to assess natural variability in the climate system
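The frequency and unit harmonization in the first two bullets boils down to simple arithmetic. A toy numpy sketch (whether the 6-hour coarsening should sum or average the hourly accumulations is an assumption here; the LENS conversion mirrors the notebook's `* (6 * hour)`):

```python
import numpy as np

# ERA5-style hourly accumulated precipitation in metres (24 hourly steps)
hourly_m = np.full(24, 0.001)            # 1 mm accumulated each hour

# Coarsen hourly -> 6-hourly by summing each 6-hour window
# (the xarray analogue would be da.coarsen(time=6).sum())
era5_6hr_m = hourly_m.reshape(-1, 6).sum(axis=1)

# LENS-style 6-hourly precipitation *rate* in m/s -> accumulated metres per 6 h,
# the `* (6 * hour)` conversion from the notebook (hour = 3600 s)
hour = 3600
lens_rate_ms = np.full(4, 0.001 / hour)  # the same rainfall expressed as a rate
lens_6hr_m = lens_rate_ms * (6 * hour)

print(era5_6hr_m, lens_6hr_m)
```

After both conversions the two arrays are in the same units (metres of rain per 6-hour window), so their histograms are directly comparable.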

@TomAugspurger
Member Author

TomAugspurger commented May 14, 2020 via email

@TomAugspurger
Member Author

TomAugspurger commented May 14, 2020

Thanks @rabernat.

In the notebook you do .isel(0). To be comparable with the ERA5 dataset do I want to remove that?

precip_in_m = ds.PRECT.isel(member_id=0)  * (6 * hour)

@TomAugspurger
Member Author

It does fine on the LENS subset, but seems to have some trouble with the full dataset (with the .isel(member_id=0) removed): killed workers.

I'll scale up the instance types and memory per worker and try it again later.

@TomAugspurger
Member Author

Just pushed an update: https://nbviewer.jupyter.org/github/pangeo-data/multicloud-demo/blob/master/multicloud.ipynb. Notes:

  1. I didn't have luck with removing the .isel(member_id=0). Workers kept running out of memory. @rabernat does histogram do something different when the input is a 4d object rather than 3d? Do the ensemble members need to be pre-aggregated or aggregated independently?
  2. I switched to adaptive mode. This works well, at least when there are already VMs around from a previous run. ERA5 needs at least 100-150 workers. LENS needs fewer (25 - 50 is plenty)
  3. I added bits of prose, but would welcome additions / revisions. I have a "Compare results" section that is empty aside from the plots of the two histograms.

@TomAugspurger
Member Author

Probably spent too long on this image, but the diagram is cool.

[diagram: multi-cloud architecture]

Source is in images/diagram.py. I've included it in the notebook.


@rabernat @jhamman I'll plan to submit the notebook at about 5:00 Central (5.5 hours from now) but I can do it later if needed.

@rabernat
Member

Sorry, just seeing this now.

  1. I didn't have luck with removing the .isel(member_id=0). Workers kept running out of memory. @rabernat does histogram do something different when the input is a 4d object rather than 3d? Do the ensemble members need to be pre-aggregated or aggregated independently?

If you remove .isel(member_id=0), you probably want to add .mean(dim=['time', 'member_id']) at the end of the histogram computation (rather than just .mean(dim='time')). Does this work any better?

@TomAugspurger
Member Author

@rabernat unfortunately, no. Same behavior: Seeing workers killed off, I think as part of the histogram.

The pattern of computation looks quite different too. From what I can tell, for LENS all of the zarr files are being loaded at once before progressing on any of the histogram tasks. I don't think that's the case for the ERA data.

[screenshot: Dask dashboard task stream]

Seems like the time dim of LENS might have larger chunks? 504 vs. 1? I'll try rechunking along that dim before doing the histogram.
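A back-of-envelope check of why 504-step time chunks would be so much heavier than 1-step chunks (the grid shape below is an assumed CESM-like 192x288 lat/lon grid, not read from the actual zarr stores):

```python
# Rough per-chunk memory comparison for float64 data on an assumed
# 192 x 288 lat/lon grid: a 504-step time chunk vs a 1-step chunk.
nlat, nlon, itemsize = 192, 288, 8

lens_chunk_mb = 504 * nlat * nlon * itemsize / 2**20
era5_chunk_mb = 1 * nlat * nlon * itemsize / 2**20

print(f"504-step chunk: {lens_chunk_mb:.0f} MiB, 1-step chunk: {era5_chunk_mb:.2f} MiB")
```

Hundreds of MiB per chunk, times a few chunks held per worker, is easily enough to push a modestly-sized worker past its memory limit, which would explain the KilledWorker behavior.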

@rabernat
Member

I wouldn't lose too much sleep over this. You may be getting into the weeds with xhistogram.

https://github.com/xgcm/xhistogram/blob/master/xhistogram/core.py

I could have sworn I opened a dask issue about a related problem (related to how reshape works, which xhistogram uses internally), but I can't find it.

@rabernat
Member

This should work:

tp_hist = histogram(precip_in_m.rename('tp_6hr'), bins=[tp_6hr_bins], dim=['lon', 'member_id']).mean(dim='time')

Apparently the order of operations matters a lot here. To compare with the other results, you'll have to renormalize, i.e.

tp_hist /= len(ds.member_id)
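The renormalization works because histogram counts are additive across the member dimension: histogramming over member_id and then dividing by the number of members gives the same result as averaging per-member histograms. A small numpy check (plain np.histogram stands in for xhistogram here):

```python
import numpy as np

rng = np.random.default_rng(42)
n_members, n_time = 5, 1000
data = rng.exponential(size=(n_members, n_time))
bins = np.linspace(0, 8, 33)

# One histogram over both the member and time dimensions...
joint = np.histogram(data, bins=bins)[0]

# ...divided by the number of members (the tp_hist /= len(ds.member_id) step)...
joint_per_member = joint / n_members

# ...equals the average of per-member histograms, since counts just add up.
per_member = np.stack([np.histogram(m, bins=bins)[0] for m in data]).mean(axis=0)

print(np.allclose(joint_per_member, per_member))  # True
```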

@rabernat
Member

Ah yes, here it is:
dask/dask#5544

@TomAugspurger
Member Author

With 200 workers and

lens_tp_hist = histogram(
    precip_in_m.rename('tp_6hr'), bins=[tp_6hr_bins], dim=['lon', 'member_id']
).mean(dim=('time'))

lens_tp_hist /= len(lens.member_id)

lens_tp_hist.data

we get about halfway through the digitize tasks, whereas previously it failed immediately, but still seeing some killed workers :/

Will maybe look a bit more later, but won't spend too much time on it.

@rabernat
Member

Tom, all I can say here is, welcome to my life! 😂 Perhaps this exercise helps you appreciate a bit some of the frustrations that Pangeo users are feeling around Dask!

As always, thanks for your persistence and patience.

@TomAugspurger
Member Author

It's good to get some real-world experience :)

Would it make sense to take the histogram of the mean over the ensemble members?

lens_tp_hist = histogram(
    precip_in_m.rename('tp_6hr').mean(dim="member_id"),
    bins=[tp_6hr_bins], dim=['lon']
).mean(dim=('time'))

Or is a simple average of the predictions a bad idea?

@TomAugspurger
Member Author

TomAugspurger commented May 15, 2020

Doing the .mean(dim="member_id") before the histogram "worked". Dunno if it's appropriate though. I've pushed that to master.

[screenshot: Dask dashboards for both clusters]

I was fortunately recording my screen at the time, and they happened to finish within seconds of each other :)

[animated GIF: screen recording of the two runs]

@rabernat
Member

Would it make sense to take the histogram of the mean over the ensemble members?

Histogram of the mean != mean of the histograms

In particular, by taking the histogram of the ensemble mean, we will definitely lose lots of the tail of the distribution. Scientifically, the mean of the histograms is much more useful and relevant to assessing rainfall extremes.
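The difference is easy to demonstrate with synthetic heavy-tailed data: averaging the members first collapses the variability, so the tail of the distribution vanishes. A numpy sketch (exponential samples stand in for precipitation; the ensemble size and threshold are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# 10 "ensemble members" of heavy-tailed (exponential) pseudo-precipitation
members = rng.exponential(scale=1.0, size=(10, 100_000))
bins = np.linspace(0, 10, 51)

# Mean of the histograms: histogram each member, then average
mean_of_hists = np.stack(
    [np.histogram(m, bins=bins)[0] for m in members]
).mean(axis=0)

# Histogram of the mean: average the members first, then histogram
hist_of_mean = np.histogram(members.mean(axis=0), bins=bins)[0]

tail = bins[:-1] >= 4  # "extreme rainfall" bins
print(mean_of_hists[tail].sum(), hist_of_mean[tail].sum())
```

The first number is in the thousands while the second is essentially zero: the ensemble mean almost never exceeds the threshold even though individual members regularly do.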

@TomAugspurger
Member Author

Yeah that was my fear. I just pushed a commit to take each histogram separately, and then xr.concat() the results.

histograms = [
    histogram(
        precip_in_m.sel(member_id=member_id).rename("tp_6hr"),
        bins=[tp_6hr_bins], dim=["lon"]
    ).mean(dim=('time'))
    for member_id in precip_in_m.member_id.values
]

To visualize, I made a FacetGrid plotting each member separately. Is there anything interesting to say about it? https://nbviewer.jupyter.org/github/pangeo-data/multicloud-demo/blob/master/multicloud.ipynb#Compare-results

@rabernat
Member

Is there anything interesting to say about it

No. 😄 It looks like there is very little variability across the ensemble. So you may be better off sticking with just one member, and keeping the notebook clean.

@TomAugspurger
Member Author

OK, back to simple!

I'll submit this later tonight.

@cgentemann
Member

Tom - any chance you can add some more ERA5 variables?
specifically SST, Surface latent heat flux, Surface sensible heat flux, Significant height of combined wind waves and swell, ...
Chelle

@TomAugspurger
Member Author

FYI, I updated the README with a link to a screencast walking through the notebook: https://www.youtube.com/watch?v=IeKjLiUqpT4
