Merge pull request #481 from JuliaDataCubes/la/new_dim

new dim fix

lazarusA authored Dec 16, 2024
2 parents b0da856 + cfff98d commit 82d9954
Showing 2 changed files with 108 additions and 8 deletions.
97 changes: 91 additions & 6 deletions docs/src/UserGuide/read.md
@@ -2,7 +2,17 @@

This section describes how to read files, URLs, and directories into YAXArrays and datasets.

## open_dataset

This function is the usual entry point for reading any supported format. See its docstring for more information.

````@docs
open_dataset
````

Now, let's explore different examples.

### Read Zarr

Open a Zarr store as a `Dataset`:
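
The example itself is collapsed in this diff view; here is a minimal sketch of the call, assuming the public CMIP6 Zarr store used elsewhere in the YAXArrays docs:

````julia
using YAXArrays, Zarr

# Assumed example store: a public CMIP6 dataset on Google Cloud Storage
store = "gs://cmip6/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp585/r1i1p1f1/3hr/tas/gn/"
ds = open_dataset(zopen(store, consolidated=true))
````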

@@ -23,7 +33,7 @@
Individual arrays can be accessed using subsetting:

````@example
ds.tas
````

### Read NetCDF

Open a NetCDF file as a `Dataset`:
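
The snippet is collapsed in this diff; a minimal sketch, assuming the Unidata sample file used in the YAXArrays docs:

````julia
using YAXArrays, NetCDF
using Downloads: download

# Assumed example file: a small sea-surface-temperature sample
path = download("https://www.unidata.ucar.edu/software/netcdf/examples/tos_O1_2001-2002.nc", "example.nc")
ds = open_dataset(path)
````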

@@ -55,7 +65,7 @@ end

This code ensures that the data is accessed by only one thread at a time, i.e., it becomes effectively single-threaded but thread-safe.
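
The snippet referenced here is also collapsed in the diff; a sketch of the locking pattern it describes, with assumed names (a file `example.nc` containing a 3-D variable `tos`):

````julia
using YAXArrays, NetCDF

ds = open_dataset("example.nc")   # hypothetical file
lk = ReentrantLock()
Threads.@threads for i in 1:4
    lock(lk) do
        # Only one thread reads from the underlying file at a time
        ds["tos"][:, :, i]
    end
end
````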

### Read GDAL (GeoTIFF, GeoJSON)

All GDAL compatible files can be read as a `YAXArrays.Dataset` after loading [ArchGDAL](https://yeesian.com/ArchGDAL.jl/latest/):

@@ -68,11 +78,11 @@
path = download("https://github.com/yeesian/ArchGDALDatasets/raw/307f8f0e584a39a
ds = open_dataset(path)
````

### Load data into memory

For datasets or variables that fit into RAM, you might want to load them completely into memory. This can be done with the `readcubedata` function. As an example, let's use the NetCDF workflow; the same applies to the other formats.

#### readcubedata

:::tabs

@@ -99,4 +109,79 @@
ds_loaded["tos"] # Load the variable of interest; the loaded status is shown for each variable

:::

Note how the loading status changes from `loaded lazily` to `loaded in memory`.
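
The tab contents above are partially collapsed in this diff; a minimal sketch of the workflow, assuming a NetCDF file `example.nc` containing a variable `tos`:

````julia
using YAXArrays, NetCDF

ds = open_dataset("example.nc")     # lazy: data stays on disk
ds_loaded = readcubedata(ds)        # load the whole Dataset into memory
tos_loaded = readcubedata(ds.tos)   # or load a single variable
````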

## open_mfdataset

There are situations when we would like to open and concatenate a list of dataset paths along a certain dimension. For example, to concatenate a list of `NetCDF` files along a new `time` dimension, one can use:

::: details creation of NetCDF files

````@example open_list_netcdf
using YAXArrays, NetCDF, Dates
using YAXArrays: YAXArrays as YAX
dates_1 = [Date(2020, 1, 1) + Dates.Day(i) for i in 1:3]
dates_2 = [Date(2020, 1, 4) + Dates.Day(i) for i in 1:3]
a1 = YAXArray((lon(1:5), lat(1:7)), rand(5, 7))
a2 = YAXArray((lon(1:5), lat(1:7)), rand(5, 7))
a3 = YAXArray((lon(1:5), lat(1:7), YAX.time(dates_1)), rand(5, 7, 3))
a4 = YAXArray((lon(1:5), lat(1:7), YAX.time(dates_2)), rand(5, 7, 3))
savecube(a1, "a1.nc")
savecube(a2, "a2.nc")
savecube(a3, "a3.nc")
savecube(a4, "a4.nc")
````
:::

### along a new dimension

````@example open_list_netcdf
using YAXArrays, NetCDF, Dates
using YAXArrays: YAXArrays as YAX
import DimensionalData as DD
files = ["a1.nc", "a2.nc"]
dates_read = [Date(2024, 1, 1) + Dates.Day(i) for i in 1:2]
ds = open_mfdataset(DD.DimArray(files, YAX.time(dates_read)))
````

We can even open files along a new `Time` dimension when they already have a `time` dimension:

````@example open_list_netcdf
files = ["a3.nc", "a4.nc"]
ds = open_mfdataset(DD.DimArray(files, YAX.Time(dates_read)))
````

Note that opening along a new dimension also works without specifying values; the dimension values then default to `1:length(files)`.

````@example open_list_netcdf
files = ["a1.nc", "a2.nc"]
ds = open_mfdataset(DD.DimArray(files, YAX.time))
````

### along an existing dimension

Another use case is opening files along an existing dimension. In this case, `open_mfdataset` concatenates the paths along the specified dimension:

````@example open_list_netcdf
using YAXArrays, NetCDF, Dates
using YAXArrays: YAXArrays as YAX
import DimensionalData as DD
files = ["a3.nc", "a4.nc"]
ds = open_mfdataset(DD.DimArray(files, YAX.time()))
````

where the contents of the `time` dimension are the merged values from both files:

````@ansi open_list_netcdf
ds["time"]
````

providing us with a wide range of options to work with.
19 changes: 17 additions & 2 deletions src/DatasetAPI/Datasets.jl
@@ -348,7 +348,11 @@
open_mfdataset(g::Vector{<:AbstractString}; kwargs...) =
    merge_datasets(map(i -> open_dataset(i; kwargs...), g))

function merge_new_axis(alldatasets, firstcube, var, mergedim)
    # Keep explicit lookup values when the merge dimension provides them;
    # otherwise fall back to a plain 1:n index along the new axis.
    newdim = if !(typeof(DD.lookup(mergedim)) <: DD.NoLookup)
        DD.rebuild(mergedim, DD.val(mergedim))
    else
        DD.rebuild(mergedim, 1:length(alldatasets))
    end
    # Collect the per-file disk arrays (unwrap the DimArray container)
    alldiskarrays = map(ds -> ds.cubes[var].data, alldatasets).data
    newda = diskstack(alldiskarrays)
    newdims = (DD.dims(firstcube)..., newdim)
@@ -407,10 +411,21 @@


"""
open_dataset(g; driver=:all)
open_dataset(g; skip_keys=(), driver=:all)
Open the dataset at `g` with the given `driver`.
The default driver will search for available drivers and tries to detect the useable driver from the filename extension.
### Keyword arguments
- `skip_keys` are passed as symbols, i.e., `skip_keys = (:a, :b)`
- `driver=:all`, common options are `:netcdf` or `:zarr`.
Example:
````julia
ds = open_dataset(f, driver=:zarr, skip_keys = (:c,))
````
"""
function open_dataset(g; skip_keys=(), driver = :all)
    str_skipkeys = string.(skip_keys)
