
writing a lazy raster doesn't use chunks #889

Open
tiemvanderdeure opened this issue Feb 6, 2025 · 8 comments
Labels
bug Something isn't working

Comments

@tiemvanderdeure
Collaborator

MWE:

using Rasters, NCDatasets
ras1 = Rasters.create("ras1.nc", Float64, (X(1:100), Y(1:100)); force = true, chunks = (10, 10))
@time write("ras2.nc", ras1; force = true, chunks = (10,10))

is slow! 0.284892 seconds (1.14 M allocations: 41.458 MiB). I checked and it does call setindex_disk! for every index instead of once per chunk.
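
For reference, the pattern this should reduce to is one read and one write per chunk, roughly like this sketch built on DiskArrays' public eachchunk API (copy_chunked! is a hypothetical helper, not an existing function):

using DiskArrays

# Sketch: copy between two same-size disk arrays chunk by chunk.
function copy_chunked!(dest, src)
    # eachchunk yields one tuple of index ranges per chunk of src
    for c in DiskArrays.eachchunk(src)
        dest[c...] = src[c...]  # one disk read and one disk write per chunk
    end
    return dest
end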

YAXArrays does manage to do this chunk-wise (although it's extremely slow because of a GC.gc())

using YAXArrays, NetCDF
yax1 = open_dataset("ras1.nc")
@time savecube(yax1.unnamed, "ras2.nc", overwrite = true)

Something like this is similarly slow:

open(ras1) do src
    open(Raster("ras2.nc"; lazy = true); write = true) do dest
        @time dest .= src
        nothing
    end
end
tiemvanderdeure added the bug label on Feb 6, 2025
@tiemvanderdeure
Collaborator Author

I guess there isn't always a super efficient way to do this since the raster to be read and the raster to be written might have different chunks. What if one is chunked in rows and the other one in columns?
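
To make the conflict concrete with plain arithmetic on the chunk sizes (assuming regular chunks on a 100×100 array):

src_chunks  = (100, 10)   # chunked along the first axis
dest_chunks = (10, 100)   # chunked along the second
lcm.(src_chunks, dest_chunks)  # (100, 100): the smallest grid aligned with both is the whole array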

@rafaqz
Owner

rafaqz commented Feb 6, 2025

Yeah, that's the hard problem. And what if one is a view with an offset? RechunkedDiskArray can go a long way by making larger chunks that match both arrays as much as possible. But also DiskArrays makes a lot of effort to handle this now.

The garbage collection is another problem, but mostly because your chunks are waaay too small. Try 256 * 256. Reusing memory blobs would make more sense inside a broadcast, as we know it's not threaded.
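
E.g., the MWE above with larger chunks (the same Rasters.create call, only the extent and chunks keyword changed):

ras1 = Rasters.create("ras1.nc", Float64, (X(1:1024), Y(1:1024)); force = true, chunks = (256, 256))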

But this is likely a simple bug where dispatch isn't working somewhere.

What happens with a tiff?

@tiemvanderdeure
Collaborator Author

I found out DiskArrays has a function called common_chunks that should handle it, but it's not used due to a bug :)

> The garbage collection is another problem, but mostly because your chunks are waaay too small.

This was just a toy example, so I don't have to wait for minutes if an attempted fix doesn't work.

@tiemvanderdeure
Collaborator Author

tiemvanderdeure commented Feb 6, 2025

But otherwise I think this might be a CommonDataModel or NCDatasets problem.

To compare:

ras1 = Raster("ras1.nc"; lazy = true, missingval = missing)
ras2 = Raster("ras1.nc"; lazy = true)

open(ras1; write = true) do dest
    @show typeof(parent(dest))
    @time dest .= dest
end;

open(ras2; write = true) do dest
    @show typeof(parent(dest))
    @time dest .= dest
end;
typeof(parent(dest)) = Rasters.ModifiedDiskArray{false, Union{Missing, Float64}, 2, NCDatasets.Variable{Float64, 2, NCDataset{Nothing, Missing}}, Rasters.Mod{Union{Missing, Float64}, Float64, Pair{Nothing, Missing}, Nothing, Nothing, typeof(convert)}}
  0.007569 seconds (28.43 k allocations: 1.455 MiB)
typeof(parent(dest)) = NCDatasets.Variable{Float64, 2, NCDataset{Nothing, Missing}}
  0.218933 seconds (1.14 M allocations: 41.432 MiB, 5.79% gc time)

So I think open is flattening the parent into a Variable, which is then indexed one by one.

I'm not sure if we should change it so that it's always wrapped in a FileArray, or if this is just a bug in NCDatasets.

@rafaqz
Owner

rafaqz commented Feb 6, 2025

Maybe make a DiskArrays.jl issue about GC.gc()

It can't be wrapped in a FileArray - that always just holds a String. But Variable should work, so an NCDatasets.jl bug is likely.

(welcome to hell btw)

@tiemvanderdeure
Collaborator Author

tiemvanderdeure commented Feb 6, 2025

Okay I think I figured it out!

Actually Variable does work, but... viewing into a Variable returns a CommonDataModel.SubVariable, and the DiskArrays interface isn't implemented for it. So as soon as you create a view (e.g. by selecting a chunk), it gets super slow.

Just to illustrate this:

using Rasters, NCDatasets
ras = Rasters.create("ras1.nc", Float64, (X(1:100), Y(1:100)); force = true, chunks = (10, 10))
A = read(ras)
open(ras; write = true) do dest
    @time view(dest, :,:) .= A
    @time dest .= A
    @time dest .= dest
    return
end
  0.378773 seconds (1.14 M allocations: 41.352 MiB, 2.50% gc time)
  0.002886 seconds (10.72 k allocations: 560.047 KiB)
  0.342204 seconds (1.15 M allocations: 41.738 MiB, 2.46% gc time)

The last one is slow because DiskArrays loops over the chunks when you copy data from a chunked raster.
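
Roughly the idea (a simplified sketch, not DiskArrays' actual code): the chunk loop takes a view of the source for each chunk, and on an NCDatasets Variable each of those views is a SubVariable that falls back to scalar indexing:

for c in DiskArrays.eachchunk(dest)
    # view(::Variable, ...) returns a CommonDataModel.SubVariable,
    # which doesn't implement the DiskArrays interface
    dest[c...] = view(src, c...)
end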

So yeah basically a bug in NCDatasets. I don't know what the best fix is, though.

> (welcome to hell btw)

Thanks

@rafaqz
Owner

rafaqz commented Feb 6, 2025

I've actually spoken to @Alexander-Barth about this bug in person. Alexander, did you try the solution from the DiskArrays code we discussed in Lisbon? (The fix is to remove the SubArray thing in CommonDataModel/NCDatasets wherever it is and use SubDiskArray, which will work by default on Variable if we just remove the specialised view methods.)

But @tiemvanderdeure, if you also make an NCDatasets issue for it, that would help make it explicit - we need more people pushing for this interop to actually work. I'm getting tired.

@rafaqz
Owner

rafaqz commented Feb 7, 2025

An immediate solution is to always wrap with ModifiedDiskArray even when there is no modification. Then we control dispatch and can ignore implementation problems in other packages as long as the basic Variable has correct chunks and chunk traits.
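
Sketched out, the idea is a thin wrapper that subtypes AbstractDiskArray and just forwards the chunk traits and block reads/writes to the parent (NoMod is a hypothetical name here, and Rasters' real ModifiedDiskArray carries more machinery):

using DiskArrays

struct NoMod{T,N,A<:AbstractArray{T,N}} <: DiskArrays.AbstractDiskArray{T,N}
    parent::A
end

Base.size(A::NoMod) = size(A.parent)
# Forward the chunk traits so broadcasting sees the parent's chunks
DiskArrays.haschunks(A::NoMod) = DiskArrays.haschunks(A.parent)
DiskArrays.eachchunk(A::NoMod) = DiskArrays.eachchunk(A.parent)
# Forward block reads/writes; views and broadcasts then go through
# DiskArrays' generic machinery instead of the parent's own methods
DiskArrays.readblock!(A::NoMod, out, r::AbstractUnitRange...) =
    DiskArrays.readblock!(A.parent, out, r...)
DiskArrays.writeblock!(A::NoMod, v, r::AbstractUnitRange...) =
    DiskArrays.writeblock!(A.parent, v, r...)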
