
Planning issue #1

Open
6 tasks done
TomAugspurger opened this issue May 8, 2020 · 51 comments

Comments

@TomAugspurger
Member

TomAugspurger commented May 8, 2020

I think the notebook is due on May 15th. I'd like to quickly spike out a demo that covers:

  • AWS k8s
  • GCP k8s
  • AWS Dask Gateway
  • GCP Dask Gateway
  • Load AWS data
  • Load GCP data

Stuff that would be helpful soonish

  1. What's a good GCP-hosted dataset to play with (probably something from https://pangeo-data.github.io/pangeo-datastore/)? AWS?
  2. Start thinking about UX. I'm comfortable just manually passing around / setting tokens. But thinking forward, we'd need that to be handled on a per-user basis.

Stuff that would be helpful later

  • Assuming we get all the infrastructure pieces working, do we have an actually useful analysis that we could put together in time?

cc @martindurant @rabernat @jhamman (and @scottyhq since I'm using your AWS credits 😄)

@scottyhq
Member

scottyhq commented May 8, 2020

Deploys two Dask-Gateways, one in some GCP region and one in some AWS region (will try terraform for cloud infra stuff first, but might give up quickly)

@salvis2 might be interested in helping out with that. It would be pretty great to have terraform/aws and terraform/gcp subfolders to manage this.

What's a good GCP-hosted dataset to play with (probably something from https://pangeo-data.github.io/pangeo-datastore/)? AWS?

for AWS any of the geospatial public datasets for a demo.

Perhaps @cgentemann 's https://github.com/pangeo-gallery/osm2020tutorial/blob/master/GCP-notebooks/Access_cloud_data_examples.ipynb is a good starting point since it uses CMIP6 on GCP and MUR SST on AWS ?

@rabernat
Member

rabernat commented May 8, 2020

Perhaps @cgentemann 's https://github.com/pangeo-gallery/osm2020tutorial/blob/master/GCP-notebooks/Access_cloud_data_examples.ipynb is a good starting point since it uses CMIP6 on GCP and MUR SST on AWS ?

👍

The NCAR CESM LENS Dataset on AWS is also good:
https://github.com/NCAR/cesm-lens-aws/

@TomAugspurger
Member Author

TomAugspurger commented May 8, 2020

Thanks for the dataset suggestions.

Terraform update:

  • GCP seems to be working? I did auth / service account stuff and VPC manually (see the README) but terraform took it from there. I think terraform can at least handle the VPC things.
  • ran into an issue with AWS: Error creating AutoScaling Group terraform-deploy#29

Going to throw Dask Gateway on the clusters next.

For now, I think I'm going to skip jupyterhub and just manually connect to the Gateways. We can think about ways of doing auth / config properly later.

@rabernat
Member

rabernat commented May 8, 2020

@TomAugspurger - I'd love to help you brainstorm a simple yet scientifically useful example to show. What about something like comparing some statistics of precipitation between the LENS data on AWS to the ERA5 Reanalysis Data on GCS (https://catalog.pangeo.io/browse/master/atmosphere/era5_hourly_reanalysis_single_levels_sa/)?

Would be happy to hop on a chat later today to brainstorm.

@TomAugspurger
Member Author

TomAugspurger commented May 8, 2020

@rabernat that sounds good. Specifically: doing some big reduction on each cluster and then comparing the reduced values back on the client machine.

I'm free the rest of the day to chat.

@TomAugspurger
Member Author

OK, as of master right now, running

make lab

should start up a JupyterLab session on localhost with the right versions of packages (assuming you have Docker running locally and don't have anything running on port 8888 already).

Image is built from https://github.com/pangeo-data/multicloud-demo/tree/binder, and uploaded to my personal dockerhub account.

Let me know if you want the password to access the Dask clusters and don't already have it.

I'll look into accessing data tomorrow.

@scottyhq
Member

@TomAugspurger - this is coming together nicely! I just tried your notebook but am not seeing dask workers on the aws cluster after ~10min.

I noticed it is running in us-east-1. If using CESM LENS, we should change to us-west-2. I looked at running a terraform apply command myself but don't want to overwrite any of your current settings (we'd have to set up an S3 backend for multiple people to modify the terraform infrastructure - https://github.com/ICESAT-2HackWeek/terraform-deploy/tree/d64e1d129aeff74fb99e9fe52d9b3b8c2f0b07a0). I don't think that'll be necessary for this demo though.

@TomAugspurger
Member Author

OK, switching to different regions shouldn't be difficult. It'll just change the URLs.

One thing I noticed, for some reason the number of workers wasn't updating correctly in the AWS GatewayCluster. But the workers were definitely there and could run tasks.

A few other things that are broken:

  1. client.run(...) / client.run_on_scheduler(...). Anything that creates a new connection. Getting None for the security object when it should be an SSLContext.
  2. Client repr / Dashboard link. Need to set JUPYTERHUB_USER

@scottyhq
Member

Ok, another thought is that it would be neat to map the obscure load balancer URLs to static names. Then the notebooks wouldn't have to change even if the deployments do. For example, aef3XXXXXXXXXXXX.us-west-2.elb.amazonaws.com --> http://gateway.aws-uswest2.pangeo.io and http://gateway.gcp-us-central-1.pangeo.io. @yuvipanda pointed out https://github.com/kubernetes-sigs/external-dns/blob/master/docs/faq.md as a solution that we haven't yet explored. Maybe @consideRatio also has some experience with externalDNS? We could also punt this down the road for future work...

@consideRatio
Member

I've spent the last week and today working with DNS in k8s :)

A key question driving the options is which environment needs to be able to look up a given domain name and resolve it to a given IP: is it user browsers around the world, or only pods inside a single k8s cluster, or a few clusters, etc.?

@TomAugspurger
Member Author

The multicloud.ipynb notebook now has an example loading CESM LENS data onto the cluster with intake-esm on the AWS cluster.

Going to figure out loading ERA5 data on GCP next, and then should be ready to hand off to an actual scientist to do the interesting bits :)

Managing the "active" Dask client / cluster is a bit tricky. But I'm hopeful that we can asynchronously load data onto each cluster at the same time. We'll see.

@cgentemann
Member

me me me. let me know when up & I'll use please! Thanks Tom!!!! ERA5!!!

@TomAugspurger
Member Author

Thanks, hoping to have it done later today. Getting the workers to be able to read the requester pays bucket seems extremely complicated :/

@rabernat
Member

Should we just make it public for the time being?

@rabernat
Member

Actually, all the data in the pangeo-era5 bucket is already public, not requester pays: https://console.cloud.google.com/storage/browser/pangeo-era5

@TomAugspurger
Member Author

Hmm does the pangeo-datastore catalog need to be updated then? I'm working off https://github.com/pangeo-data/pangeo-datastore/blob/09692e8c5a1e0b49a03a92dde0ed47dca859ca7e/intake-catalogs/atmosphere.yaml#L75-L88, which has requester_pays=True as a storage_option.

@rabernat
Member

Can you temporarily override that option when opening the data? I never understood that part of intake very well.

If not, then yes, I suppose we should update the catalog. Actually, long term we would prefer to have it be requester_pays, so this is kind of a fluke.

@martindurant

martindurant commented May 12, 2020

Yes, you can override any argument (this has been possible for a while now) - in this case it would look something like

catalog.era5_hourly_reanalysis_single_levels_sa(storage_options={'token': 'anon'})
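Intake's actual parameter-merging machinery aside, the effect of overriding at open time is that user-supplied storage_options shadow the catalog defaults. A stdlib-only sketch of that shadowing (the dict contents mirror the catalog entry and the override above; this is an illustration, not intake's implementation):

```python
from collections import ChainMap

# Catalog default, as in the pangeo-datastore atmosphere.yaml entry
catalog_storage_options = {"requester_pays": True}

# User override at open time, mirroring
# catalog.era5_hourly_reanalysis_single_levels_sa(storage_options={'token': 'anon'})
user_override = {"requester_pays": False, "token": "anon"}

# In a ChainMap the first mapping wins, so user values shadow catalog defaults
effective = dict(ChainMap(user_override, catalog_storage_options))
print(effective)
```

The resulting dict has requester_pays=False and token='anon', which is what ends up being passed to the filesystem layer.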

@TomAugspurger
Member Author

Thanks Martin.

I think I'm ~80% of the way to getting workers to read data from buckets. I'll try a bit longer and override with requester_pays=False if necessary.

@TomAugspurger
Member Author

Got the ERA5 reduction running. Had to bump up the size of the scheduler and workers.

I have some notes on how I got the workers the ability to read from the requester pays bucket. Will clean them up and post them later.

Right now, in order to run this notebook you need the secret key to unlock the secrets folder. That has a service-account file so that you can read the requester pays bucket on the client (your local machine). I have to run now but will reach out to people with the secret key later.

@TomAugspurger
Member Author

@cgentemann I need to share a couple secret files with you. Do you have a Keybase account?

@scottyhq
Member

@TomAugspurger - there is a 'pangeo' group on keybase that @rabernat owns. could use that for sharing credentials?

But, do we need to share these credentials? It seems like only the dask workers need credentials to read data in requester-pays buckets, correct? The client running locally isn't making GET requests? In that case, if you create the credentials along with the cluster (terraform or a separate kubectl step), the dask-gateway helm configuration just needs to point to the service account with permissions that you've created, as we do here:
https://github.com/pangeo-data/pangeo-binder/blob/c58c869045b8f1374825ad01dcbf360bdc57a77b/pangeo-binder/values.yaml#L227-L232

@salvis2 - could you point to an example of how you do the AWS role permissions linking with kubernetes service account via terraform?

@TomAugspurger
Member Author

TomAugspurger commented May 13, 2020 via email

@martindurant

martindurant commented May 13, 2020 via email

@TomAugspurger
Member Author

Is this a bad time to trial a "dask::" url? :)

Perhaps :)


Does anyone have time to do a similar analysis to https://github.com/pangeo-data/multicloud-demo/blob/master/multicloud.ipynb on the LENS data, or perhaps write some prose on what it's doing? If not I can attempt that later tonight.

I'm free for the next couple hours to assist with infrastructure things.

@rabernat
Member

The local client needs some GCP credentials for reading the era5 metadata.

Except it doesn't! This data is fully public. The catalog is wrong!

@TomAugspurger
Member Author

Thanks for the reminder. Changed to catalog.era5_hourly_reanalysis_single_levels_sa(storage_options={"requester_pays": False, 'token': 'anon'}) so we don't need to share any secret files to run this locally.

The only auth requirement is that you know the password to connect to the Dask gateways.

@rabernat
Member

Are you hoping that I will contribute more of a science use-case than what I have in #2? No problem if so, just need to plan my next 24 hours.

@rabernat
Member

To answer my own question...#2 just looks at ERA5. To make this more complete, we would want to compare to the Large Ensemble data in AWS.

@TomAugspurger
Member Author

Yep, that's my question: Is there a specific variable from LENS that I should call histogram on? :) If I knew what data to plug in I think I can take it from there :)

@rabernat
Member

Ok, I have just added a LENS notebook to #2. They could easily be merged into one notebook. It will look very familiar.

The main differences between the ERA5 data and LENS are

  • ERA5 is hourly, LENS is 6-hourly, so ERA5 had to be coarsened (note that coarsen works much better than resample with dask)
  • ERA5 precip units are m, LENS's are m/s, so a unit conversion is needed for comparison
  • LENS has an extra dimension, member_id, which corresponds to the ensemble member of the simulation. It can be used to assess natural variability in the climate system
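The frequency and unit harmonization in the first two bullets boils down to simple arithmetic. A toy numpy sketch (whether the 6-hour coarsening should sum or average the hourly accumulations is an assumption here; the LENS conversion mirrors the notebook's `* (6 * hour)`):

```python
import numpy as np

# ERA5-style hourly accumulated precipitation in metres (24 hourly steps)
hourly_m = np.full(24, 0.001)            # 1 mm accumulated each hour

# Coarsen hourly -> 6-hourly by summing each 6-hour window
# (the xarray analogue would be da.coarsen(time=6).sum())
era5_6hr_m = hourly_m.reshape(-1, 6).sum(axis=1)

# LENS-style 6-hourly precipitation *rate* in m/s -> accumulated metres per 6 h,
# the `* (6 * hour)` conversion from the notebook (hour = 3600 s)
hour = 3600
lens_rate_ms = np.full(4, 0.001 / hour)  # the same rainfall expressed as a rate
lens_6hr_m = lens_rate_ms * (6 * hour)

print(era5_6hr_m, lens_6hr_m)
```

After both conversions the two arrays are in the same units (metres of rain per 6-hour window), so their histograms are directly comparable.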

@TomAugspurger
Member Author

TomAugspurger commented May 14, 2020 via email

@TomAugspurger
Member Author

TomAugspurger commented May 14, 2020

Thanks @rabernat.

In the notebook you do .isel(0). To be comparable with the ERA5 dataset do I want to remove that?

precip_in_m = ds.PRECT.isel(member_id=0)  * (6 * hour)

@TomAugspurger
Member Author

It does fine on the LENS subset, but seems to have some trouble with the full dataset (with the .isel(member_id=0) removed): killed workers.

I'll scale up the instance types and memory per worker and try it again later.

@TomAugspurger
Member Author

Just pushed an update: https://nbviewer.jupyter.org/github/pangeo-data/multicloud-demo/blob/master/multicloud.ipynb. Notes:

  1. I didn't have luck with removing the .isel(member_id=0). Workers kept running out of memory. @rabernat does histogram do something different when the input is a 4d object rather than 3d? Do the ensemble members need to be pre-aggregated or aggregated independently?
  2. I switched to adaptive mode. This works well, at least when there are already VMs around from a previous run. ERA5 needs at least 100-150 workers. LENS needs fewer (25 - 50 is plenty)
  3. I added bits of prose, but would welcome additions / revisions. I have a "Compare results" section that is empty aside from the plots of the two histograms.

@TomAugspurger
Member Author

Probably spent too long on this image, but the diagram is cool.

[diagram: multi-cloud architecture]

Source is in images/diagram.py. I've included it in the notebook.


@rabernat @jhamman I'll plan to submit the notebook at about 5:00 Central (5.5 hours from now) but I can do it later if needed.

@rabernat
Member

Sorry, just seeing this now.

  1. I didn't have luck with removing the .isel(member_id=0). Workers kept running out of memory. @rabernat does histogram do something different when the input is a 4d object rather than 3d? Do the ensemble members need to be pre-aggregated or aggregated independently?

If you remove .isel(member_id=0), you probably want to add .mean(dim=['time', 'member_id']) at the end of the histogram computation (rather than just .mean(dim='time')). Does this work any better?

@TomAugspurger
Member Author

@rabernat unfortunately, no. Same behavior: Seeing workers killed off, I think as part of the histogram.

The pattern of computation looks quite different too. From what I can tell, for LENS all of the zarr files are being loaded at once before progressing on any of the histogram tasks. I don't think that's the case for the ERA data.

[screenshot: Dask dashboard task stream]

Seems like the time dim of LENS might have larger chunks? 504 vs. 1? I'll try rechunking along that dim before doing the histogram.
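A back-of-envelope check of why 504-step time chunks would be so much heavier than 1-step chunks (the grid shape below is an assumed CESM-like 192x288 lat/lon grid, not read from the actual zarr stores):

```python
# Rough per-chunk memory comparison for float64 data on an assumed
# 192 x 288 lat/lon grid: a 504-step time chunk vs a 1-step chunk.
nlat, nlon, itemsize = 192, 288, 8

lens_chunk_mb = 504 * nlat * nlon * itemsize / 2**20
era5_chunk_mb = 1 * nlat * nlon * itemsize / 2**20

print(f"504-step chunk: {lens_chunk_mb:.0f} MiB, 1-step chunk: {era5_chunk_mb:.2f} MiB")
```

Hundreds of MiB per chunk, times a few chunks held per worker, is easily enough to push a modestly-sized worker past its memory limit, which would explain the KilledWorker behavior.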

@rabernat
Member

I wouldn't lose too much sleep over this. You may be getting into the weeds with xhistogram.

https://github.com/xgcm/xhistogram/blob/master/xhistogram/core.py

I could have sworn I opened a dask issue about a related problem (related to how reshape works, which xhistogram uses internally), but I can't find it.

@rabernat
Member

This should work:

tp_hist = histogram(precip_in_m.rename('tp_6hr'), bins=[tp_6hr_bins], dim=['lon', 'member_id']).mean(dim='time')

Apparently the order of operations matters a lot here. To compare with the other results, you'll have to renormalize, i.e.

tp_hist /= len(ds.member_id)
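The renormalization works because histogram counts are additive across the member dimension: histogramming over member_id and then dividing by the number of members gives the same result as averaging per-member histograms. A small numpy check (plain np.histogram stands in for xhistogram here):

```python
import numpy as np

rng = np.random.default_rng(42)
n_members, n_time = 5, 1000
data = rng.exponential(size=(n_members, n_time))
bins = np.linspace(0, 8, 33)

# One histogram over both the member and time dimensions...
joint = np.histogram(data, bins=bins)[0]

# ...divided by the number of members (the tp_hist /= len(ds.member_id) step)...
joint_per_member = joint / n_members

# ...equals the average of per-member histograms, since counts just add up.
per_member = np.stack([np.histogram(m, bins=bins)[0] for m in data]).mean(axis=0)

print(np.allclose(joint_per_member, per_member))  # True
```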

@rabernat
Member

Ah yes, here it is:
dask/dask#5544

@TomAugspurger
Member Author

With 200 workers and

lens_tp_hist = histogram(
    precip_in_m.rename('tp_6hr'), bins=[tp_6hr_bins], dim=['lon', 'member_id']
).mean(dim=('time'))

lens_tp_hist /= len(lens.member_id)

lens_tp_hist.data

we get about halfway through the digitize tasks, whereas previously it failed immediately, but still seeing some killed workers :/

Will maybe look a bit more later, but won't spend too much time on it.

@rabernat
Member

Tom, all I can say here is, welcome to my life! 😂 Perhaps this exercise helps you appreciate a bit some of the frustrations that Pangeo users are feeling around Dask!

As always, thanks for your persistence and patience.

@TomAugspurger
Member Author

It's good to get some real-world experience :)

Would it make sense to take the histogram of the mean over the ensemble members?

lens_tp_hist = histogram(
    precip_in_m.rename('tp_6hr').mean(dim="member_id"),
    bins=[tp_6hr_bins], dim=['lon']
).mean(dim=('time'))

Or is a simple average of the predictions a bad idea?

@TomAugspurger
Member Author

TomAugspurger commented May 15, 2020

Doing the .mean(dim="member_id") before the histogram "worked". Dunno if it's appropriate though. I've pushed that to master.

[screenshot: Dask dashboards for both clusters]

I was fortunately recording my screen at the time, and they happened to finish within seconds of each other :)

[animated GIF: screen recording of the two runs]

@rabernat
Member

Would it make sense to take the histogram of the mean over the ensemble members?

Histogram of the mean != mean of the histograms

In particular, by taking the histogram of the ensemble mean, we will definitely lose lots of the tail of the distribution. Scientifically, the mean of the histograms is much more useful and relevant to assessing rainfall extremes.
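The difference is easy to demonstrate with synthetic heavy-tailed data: averaging the members first collapses the variability, so the tail of the distribution vanishes. A numpy sketch (exponential samples stand in for precipitation; the ensemble size and threshold are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# 10 "ensemble members" of heavy-tailed (exponential) pseudo-precipitation
members = rng.exponential(scale=1.0, size=(10, 100_000))
bins = np.linspace(0, 10, 51)

# Mean of the histograms: histogram each member, then average
mean_of_hists = np.stack(
    [np.histogram(m, bins=bins)[0] for m in members]
).mean(axis=0)

# Histogram of the mean: average the members first, then histogram
hist_of_mean = np.histogram(members.mean(axis=0), bins=bins)[0]

tail = bins[:-1] >= 4  # "extreme rainfall" bins
print(mean_of_hists[tail].sum(), hist_of_mean[tail].sum())
```

The first number is in the thousands while the second is essentially zero: the ensemble mean almost never exceeds the threshold even though individual members regularly do.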

@TomAugspurger
Member Author

Yeah that was my fear. I just pushed a commit to take each histogram separately, and then xr.concat() the results.

histograms = [
    histogram(
        precip_in_m.sel(member_id=member_id).rename("tp_6hr"),
        bins=[tp_6hr_bins], dim=["lon"]
    ).mean(dim=('time'))
    for member_id in precip_in_m.member_id.values
]

To visualize, I made a FacetGrid plotting each member separately. Is there anything interesting to say about it? https://nbviewer.jupyter.org/github/pangeo-data/multicloud-demo/blob/master/multicloud.ipynb#Compare-results

@rabernat
Member

Is there anything interesting to say about it

No. 😄 It looks like there is very little variability across the ensemble. So you may be better off sticking with just one member, and keeping the notebook clean.

@TomAugspurger
Member Author

OK, back to simple!

I'll submit this later tonight.

@cgentemann
Member

Tom - any chance you can add some more ERA5 variables?
specifically SST, Surface latent heat flux, Surface sensible heat flux, Significant height of combined wind waves and swell, ...
Chelle

@TomAugspurger
Member Author

FYI, I updated the README with a link to a screencast walking through the notebook: https://www.youtube.com/watch?v=IeKjLiUqpT4
