YAXArrays seems to download too much data #358

SimonDanisch · 2024-01-11T15:03:17Z

I'm trying the example from the docs:

using Zarr, YAXArrays, Dates, DimensionalData

store = "gs://cmip6/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp585/r1i1p1f1/3hr/tas/gn/v20190710/"
g = open_dataset(zopen(store, consolidated=true))
c = g["tas"]
ct = c[Ti=At(Date("2018-08-01"):Day(10):Date("2050-08-01"))]

in_memory = ct.data[:, :, :]

This takes really long and fills up all my RAM (32 GB).
Some details:

The selected slice:

Download speed of the julia process

I was expecting it to download only the 328 MB of the selection, but judging from the download speed and RAM usage I suspect it is downloading much more data, which makes it almost impossible to fetch this part of the dataset.
Am I missing something or is this a bug, or just a limitation of the package?
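
As a sanity check on the 328 MB expectation, the uncompressed size of the selection can be worked out by hand. The sketch below assumes the 384×192 grid and Float32 element type shown in the array summary later in this thread; the variable names are illustrative only.

```python
from datetime import date

# Assumed from the array summary later in the thread: 384 x 192 grid, Float32 values.
N_LON, N_LAT, BYTES_PER_VALUE = 384, 192, 4

start, stop = date(2018, 8, 1), date(2050, 8, 1)
span_days = (stop - start).days      # days between the two endpoints
n_steps = span_days // 10 + 1        # start date, then every 10th day up to stop

selection_bytes = N_LON * N_LAT * BYTES_PER_VALUE * n_steps
print(f"{n_steps} time steps, {selection_bytes / 2**20:.1f} MiB uncompressed")
```

This gives 1169 time steps and about 328.8 MiB, which matches the size reported for the selected slice.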

Balinus · 2024-01-12T18:32:50Z

One thought after reading the example, though I might be wrong.

Depending on how the Zarr store on Google Cloud is chunked, reading this slice may still require downloading the whole dataset between 2018 and 2050, and probably a little more at the 2018 and 2050 edges. The whole dataset between 2018 and 2050 is 3.21 GB. Is that closer to what you measured?

c = g["tas"]
ct = c[Ti=At(Date("2018-08-01"):Date("2050-08-01"))]
384×192×11689 YAXArray{Float32,3} with dimensions: 
  Dim{:lon} Sampled{Float64} 0.0:0.9375:359.0625 ForwardOrdered Regular Points,
  Dim{:lat} Sampled{Float64} Float64[-89.28422753251364, -88.35700351866494, …, 88.35700351866494, 89.28422753251364] ForwardOrdered Irregular Points,
  Ti Sampled{DateTime} DateTime[2018-08-01T00:00:00, …, 2050-08-01T00:00:00] ForwardOrdered Irregular Points
units: K
name: tas
Total size: 3.21 GB
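
The chunking argument can be made concrete with a little arithmetic. The sketch below is an illustration, not a measurement. It assumes the time axis is 3-hourly (as the `3hr` in the store path suggests), that the store is chunked only along time, and that chunk boundaries line up with the start of the window. The two chunk lengths (40 and 600 time steps) are hypothetical, since the real chunk shape is not stated in this thread. Sizes are uncompressed, so the bytes actually transferred would be smaller.

```python
from datetime import date

BYTES_PER_STEP = 384 * 192 * 4      # one Float32 lon-lat field, from the summary above
STEPS_PER_DAY = 8                   # assumption: 3-hourly data

span_days = (date(2050, 8, 1) - date(2018, 8, 1)).days
n_total = span_days * STEPS_PER_DAY + 1          # 3-hourly steps in the 2018-2050 window
wanted = range(0, n_total, 10 * STEPS_PER_DAY)   # one step every 10 days

def fetched_gib(indices, chunk_len, n_total):
    """Uncompressed GiB read if every chunk holding a wanted step is fetched whole."""
    chunk_ids = {i // chunk_len for i in indices}
    # The last chunk of the window may be shorter than chunk_len.
    steps = sum(min(chunk_len, n_total - c * chunk_len) for c in chunk_ids)
    return steps * BYTES_PER_STEP / 2**30

for chunk_len in (40, 600):          # hypothetical chunk lengths along time
    print(chunk_len, round(fetched_gib(wanted, chunk_len, n_total), 2))
```

Under these assumptions, any time-chunk length of 80 steps or more means every chunk in the window is touched, so the read approaches the full 3-hourly window (roughly 25.7 GiB uncompressed) rather than the size of the selection. The 3.21 GB summary above holds 11689 time steps, that is, one per day, so it may understate what a chunk-aligned read of 3-hourly data has to fetch.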

Balinus · 2024-01-12T19:04:25Z

Note that I tried the same approach in Python, and it seems to behave similarly.

(In Python, I selected the whole time series between 2018 and 2050 for simplicity.)

import xarray as xr
import zarr

file = 'gs://cmip6/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp585/r1i1p1f1/3hr/tas/gn/v20190710/'
ds = xr.open_dataset(file, engine='zarr')

c = ds.tas
ct = c.sel(time=slice("2018-08-01", "2050-08-01"))
%time ct.values

CPU times: user 3min 19s, sys: 1min 29s, total: 4min 49s
Wall time: 21min 58s
Out[12]:
array([[[216.41226, 216.48257, 216.44742, ..., 216.32828, 216.38297,
         216.40054],
