diff --git a/docs/src/UserGuide/read.md b/docs/src/UserGuide/read.md
index 68fd2259..d703b694 100644
--- a/docs/src/UserGuide/read.md
+++ b/docs/src/UserGuide/read.md
@@ -2,7 +2,17 @@
 
 This section describes how to read files, URLs, and directories into YAXArrays and datasets.
 
-## Read Zarr
+## open_dataset
+
+This function is the usual entry point for reading data in any supported format. See its docstring for more information.
+
+````@docs
+open_dataset
+````
+
+Now, let's explore a few examples.
+
+### Read Zarr
 
 Open a Zarr store as a `Dataset`:
 
@@ -23,7 +33,7 @@ Individual arrays can be accessed using subsetting:
 ds.tas
 ````
 
-## Read NetCDF
+### Read NetCDF
 
 Open a NetCDF file as a `Dataset`:
 
@@ -55,7 +65,7 @@ end
 
 This code will ensure that the data is only accessed by one thread at a time, i.e. making it actual single-threaded but thread-safe.
 
-## Read GDAL (GeoTIFF, GeoJSON)
+### Read GDAL (GeoTIFF, GeoJSON)
 
 All GDAL compatible files can be read as a `YAXArrays.Dataset` after loading [ArchGDAL](https://yeesian.com/ArchGDAL.jl/latest/):
 
@@ -68,11 +78,11 @@ path = download("https://github.com/yeesian/ArchGDALDatasets/raw/307f8f0e584a39a050c042849c5ca9a62284fcf9/data/utmsmall.tif", "utmsmall.tif")
 ds = open_dataset(path)
 ````
 
-## Load data into memory
+### Load data into memory
 
 For datasets or variables that could fit in RAM, you might want to load them completely into memory. This can be done using the `readcubedata` function. As an example, let's use the NetCDF workflow; the same should be true for other cases.
 
-### readcubedata
+#### readcubedata
 
 :::tabs
 
@@ -99,4 +109,86 @@ ds_loaded["tos"] # Load the variable of interest; the loaded status is shown for the variable.
 ````
 
 :::
-Note how the loading status changes from `loaded lazily` to `loaded in memory`.
\ No newline at end of file
+Note how the loading status changes from `loaded lazily` to `loaded in memory`.
+
+## open_mfdataset
+
+There are situations when we would like to open a list of dataset paths and concatenate them along a certain dimension. For example, to concatenate a list of `NetCDF` files along a new `time` dimension, one can use:
+
+::: details creation of NetCDF files
+
+````@example open_list_netcdf
+using YAXArrays, NetCDF, Dates
+using YAXArrays: YAXArrays as YAX
+
+dates_1 = [Date(2020, 1, 1) + Dates.Day(i) for i in 1:3]
+dates_2 = [Date(2020, 1, 4) + Dates.Day(i) for i in 1:3]
+
+a1 = YAXArray((lon(1:5), lat(1:7)), rand(5, 7))
+a2 = YAXArray((lon(1:5), lat(1:7)), rand(5, 7))
+
+a3 = YAXArray((lon(1:5), lat(1:7), YAX.time(dates_1)), rand(5, 7, 3))
+a4 = YAXArray((lon(1:5), lat(1:7), YAX.time(dates_2)), rand(5, 7, 3))
+
+savecube(a1, "a1.nc")
+savecube(a2, "a2.nc")
+savecube(a3, "a3.nc")
+savecube(a4, "a4.nc")
+````
+:::
+
+### along a new dimension
+
+````@example open_list_netcdf
+using YAXArrays, NetCDF, Dates
+using YAXArrays: YAXArrays as YAX
+import DimensionalData as DD
+
+files = ["a1.nc", "a2.nc"]
+
+dates_read = [Date(2024, 1, 1) + Dates.Day(i) for i in 1:2]
+ds = open_mfdataset(DD.DimArray(files, YAX.time(dates_read)))
+````
+
+We can even open files that already have a `time` dimension along a new `Time` dimension:
+
+````@example open_list_netcdf
+files = ["a3.nc", "a4.nc"]
+ds = open_mfdataset(DD.DimArray(files, YAX.Time(dates_read)))
+````
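+
+The stacked cube can then be subset along the new `Time` axis with the usual DimensionalData selectors. As a minimal sketch (assuming the saved variables keep the default name `layer`):
+
+````julia
+# hypothetical usage: select the slice that came from the first file
+ds["layer"][Time=DD.At(dates_read[1])]
+````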
+
+Note that opening along a new dimension name without specifying values also works; however, it defaults to `1:length(files)` for the dimension values.
+
+````@example open_list_netcdf
+files = ["a1.nc", "a2.nc"]
+ds = open_mfdataset(DD.DimArray(files, YAX.time))
+````
+
+### along an existing dimension
+
+Another use case is opening files along an existing dimension. In this case, `open_mfdataset` will concatenate the paths along the specified dimension:
+
+````@example open_list_netcdf
+using YAXArrays, NetCDF, Dates
+using YAXArrays: YAXArrays as YAX
+import DimensionalData as DD
+
+files = ["a3.nc", "a4.nc"]
+
+ds = open_mfdataset(DD.DimArray(files, YAX.time()))
+````
+
+where the contents of the `time` dimension are the merged values from both files:
+
+````@ansi open_list_netcdf
+ds["time"]
+````
+
+This gives us a wide range of options for assembling datasets from multiple files.
\ No newline at end of file
diff --git a/src/DatasetAPI/Datasets.jl b/src/DatasetAPI/Datasets.jl
index e8375135..a8f6f5dd 100644
--- a/src/DatasetAPI/Datasets.jl
+++ b/src/DatasetAPI/Datasets.jl
@@ -348,7 +348,11 @@ open_mfdataset(g::Vector{<:AbstractString}; kwargs...) =
     merge_datasets(map(i -> open_dataset(i; kwargs...), g))
 
 function merge_new_axis(alldatasets, firstcube,var,mergedim)
-    newdim = DD.rebuild(mergedim,1:length(alldatasets))
+    newdim = if !(typeof(DD.lookup(mergedim)) <: DD.NoLookup)
+        DD.rebuild(mergedim, DD.val(mergedim))
+    else
+        DD.rebuild(mergedim, 1:length(alldatasets))
+    end
     alldiskarrays = map(ds->ds.cubes[var].data,alldatasets).data
     newda = diskstack(alldiskarrays)
     newdims = (DD.dims(firstcube)...,newdim)
@@ -407,10 +411,21 @@ end
 
 """
-    open_dataset(g; driver=:all)
+    open_dataset(g; skip_keys=(), driver=:all)
 
 Open the dataset at `g` with the given `driver`.
 The default driver will search for available drivers and tries to detect
 the useable driver from the filename extension.
+
+### Keyword arguments
+
+- `skip_keys` are passed as symbols, e.g. `skip_keys = (:a, :b)`
+- `driver=:all`: common options are `:netcdf` or `:zarr`.
+
+Example:
+
+````julia
+ds = open_dataset(f, driver=:zarr, skip_keys = (:c,))
+````
 """
 function open_dataset(g; skip_keys=(), driver = :all)
     str_skipkeys = string.(skip_keys)