Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

type-stable variable #256

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

haakon-e
Copy link
Contributor

This is an attempt to write a type-stable variant of variable, which enables completely type stable reading of data from NetCDF files.

Currently, reading data is not type stable since at read-time Julia doesn't know the type of the data to be read, nor the number of dimensions. See for example:

@code_warntype variable(ds, "temperature")

@code_warntype ds["temperature"]
output
julia> @code_warntype variable(ds, "temperature")
MethodInstance for CommonDataModel.variable(::NCDataset{Nothing, Missing}, ::String)
  from variable(ds::NCDataset, varname::AbstractString) @ NCDatasets ~/.julia/packages/NCDatasets/JrRxr/src/variable.jl:83
Arguments
  #self#::Core.Const(CommonDataModel.variable)
  ds::NCDataset{Nothing, Missing}
  varname::String
Body::NCDatasets.Variable{_A, _B, NCDataset{Nothing, Missing}} where {_A, _B}
1%1 = NCDatasets._variable(ds, varname)::NCDatasets.Variable{_A, _B, NCDataset{Nothing, Missing}} where {_A, _B}
└──      return %1
# "Body" is not inferred

julia> @code_warntype ds["temperature"]
MethodInstance for getindex(::NCDataset{Nothing, Missing}, ::String)
  from getindex(ds::CommonDataModel.AbstractDataset, varname::Union{AbstractString, Symbol}) @ CommonDataModel ~/.julia/packages/CommonDataModel/GGvMn/src/dataset.jl:169
Arguments
  #self#::Core.Const(getindex)
  ds::NCDataset{Nothing, Missing}
  varname::String
Body::CommonDataModel.CFVariable{_A, _B, NCDatasets.Variable{_A1, _B1, NCDataset{Nothing, Missing}}, CommonDataModel.Attributes{TDS}, NamedTuple{(:fillvalue, :missing_values, :scale_factor, :add_offset, :calendar, :time_origin, :time_factor, :maskingvalue), var"#s178"}} where {_A, _B, _A1, _B1, TDS<:(NCDatasets.Variable{_A, _B, NCDataset{Nothing, Missing}} where {_A, _B}), var"#s178"<:Tuple{Any, Tuple, Any, Any, Any, Any, Union{Nothing, Int64}, Missing}}
1%1 = CommonDataModel.cfvariable(ds, varname)::CommonDataModel.CFVariable{_A, _B, NCDatasets.Variable{_A1, _B1, NCDataset{Nothing, Missing}}, CommonDataModel.Attributes{TDS}, NamedTuple{(:fillvalue, :missing_values, :scale_factor, :add_offset, :calendar, :time_origin, :time_factor, :maskingvalue), var"#s178"}} where {_A, _B, _A1, _B1, TDS<:(NCDatasets.Variable{_A, _B, NCDataset{Nothing, Missing}} where {_A, _B}), var"#s178"<:Tuple{Any, Tuple, Any, Any, Any, Any, Union{Nothing, Int64}, Missing}}
└──      return %1
# "Body" is not inferred

In some cases you have both these pieces of information. This PR extends variable with another signature on the form variable(ds, varname, DType, dimnames) where ds and varname is as before. Now also supplied is the data type DType (e.g. Float32) and dimension names dimnames (e.g. ("lon", "lat") or ("time",)).

Supplying incorrect dimnames errors. However, correct dimnames, but ordered incorrectly, does not error (and in fact gives "correct" results) unless you go out-of-bonds of the stored data (see tests for an example).

More worryingly, providing incorrect DType gives garbage output with no warning. As such, it might be preferable to give this function a more scary name such as unsafe_variable and not export it (like load!).

example of incorrect loading
julia> d1 = __variable(ds, "temperature", Float32, ("lon", "lat"))[1:5, 1:3]
5×3 Matrix{Float32}:
 2.0  3.0  4.0
 3.0  4.0  5.0
 4.0  5.0  6.0
 5.0  6.0  7.0
 6.0  7.0  8.0

julia> d2 = __variable(ds, "temperature", Float64, ("lon", "lat"))[1:5, 1:3]
5×3 Matrix{Float64}:
    32.0   2048.0           5.4e-323
  2048.0  32768.0           6.0e-323
    32.0      5.38788e-315  6.4e-323
  2048.0      4.4e-323      7.0e-323
 32768.0      5.0e-323      7.4e-323

julia> __variable(ds, "temperature", Int32, ("lon", "lat"))[1:5, 1:3]
5×3 Matrix{Int32}:
 1073741824  1077936128  1082130432
 1077936128  1082130432  1084227584
 1082130432  1084227584  1086324736
 1084227584  1086324736  1088421888
 1086324736  1088421888  1090519040

Other info

create dataset for above examples
fname = tempname()
ds = NCDataset(fname,"c")
defDim(ds,"lon",100)
defDim(ds,"lat",110)
v = defVar(ds,"temperature",Float32,("lon","lat"))
data = [Float32(i+j) for i = 1:100, j = 1:110];
v[:,:] = data;
close(ds)

# reading
ds = NCDataset(fname)

@haakon-e
Copy link
Contributor Author

haakon-e commented Apr 25, 2024

The failing tests appear related to adding JET to test dependencies. It runs locally for me, so not sure what's going on. This can alternatively be replaced with Test.@inferred, which to my understanding only checks that the return type is as expected, as opposed to everything within the call (ref)

Surprisingly,

variable(ds, "temperature", Float32, ("lon",))

appears to pass the test, but fail when run in REPL

REPL error
julia> variable(ds, "temperature", Float32, ("lon",))
100-element NCDatasets.Variable{Float32, 1, NCDataset{Nothing, Missing}}

Error showing value of type NCDatasets.Variable{Float32, 1, NCDataset{Nothing, Missing}}:
ERROR: MethodError: Cannot `convert` an object of type 
  Tuple{Int64, Int64} to an object of type 
  Tuple{T} where T

Closest candidates are:
  convert(::Type{T}, ::Tuple{Vararg{Any, N}}) where {N, T<:Tuple}
   @ Base essentials.jl:457
  convert(::Type{T}, ::T) where T<:Tuple
   @ Base essentials.jl:456
  convert(::Type{T}, ::T) where T
   @ Base Base.jl:84
  ...

Stacktrace:
       internal @ Base
  [3] (Tuple{T} where T)(x::Tuple{Int64, Int64})
    @ Base ./tuple.jl:386
  [4] _chunking(v::NCDatasets.Variable{Float32, 1, NCDataset{Nothing, Missing}})
    @ NCDatasets ~/Documents/CliMA/other-repos/NCDatasets.jl/src/variable.jl:306
  [5] haschunks
    @ ~/Documents/CliMA/other-repos/NCDatasets.jl/src/variable.jl:467 [inlined]
  [6] _show(io::IOContext{Base.TTY}, mime::MIME{Symbol("text/plain")}, A::NCDatasets.Variable{Float32, 1, NCDataset{Nothing, Missing}})
    @ DiskArrays ~/.julia/packages/DiskArrays/bZBJE/src/show.jl:15
  [7] show(io::IOContext{Base.TTY}, mime::MIME{Symbol("text/plain")}, A::NCDatasets.Variable{Float32, 1, NCDataset{Nothing, Missing}})
    @ NCDatasets ~/.julia/packages/DiskArrays/bZBJE/src/show.jl:5
       internal @ OhMyREPL, REPL, Base.Multimedia, AbbreviatedStackTraces, REPL.LineEdit, Unknown
 [24] (::REPL.var"#62#68"{REPL.LineEditREPL, REPL.REPLBackendRef})()
    @ REPL /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-honeycrisp-HL2F7YQ3XH.0/build/default-honeycrisp-HL2F7YQ3XH-0/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:386
Use `err` to retrieve the full stack trace.

edit: It appears the code successfully runs. but unsuccessfully prints to terminal. To see this, run e.g.

variable(ds, "temperature", Float32, ("lon",));  # note the ";"

however, attempting to index into the resulting variable always fails.

@Alexander-Barth
Copy link
Member

Alexander-Barth commented Apr 29, 2024

As in Julia we have e.g. Array{Float32,2} maybe we could a function call Variable{Float32,2}(ds,varname):

using NCDatasets
import NCDatasets: Variable, _jltype, nc_inq_varid, nc_inq_vardimid, nc_inq_vartype
function Variable{T,N}(ds::NCDataset,varname) where {T,N}
    varid = nc_inq_varid(ds.ncid,varname)
    dimids = nc_inq_vardimid(ds.ncid,varid)
    nctype = _jltype(ds.ncid,nc_inq_vartype(ds.ncid,varid))
    ndims = length(dimids)

    # proper error message to add
    @assert nctype == T
    @assert ndims == N
    # reverse dimids to have the dimension order in Fortran style
    return Variable{T,N,typeof(ds)}(ds,varid, (reverse(dimids)...,))
end

fname = tempname()
ds = NCDataset(fname,"c")
defDim(ds,"lon",100)
defDim(ds,"lat",110)
v = defVar(ds,"temperature",Float32,("lon","lat"))
data = [Float32(i+j) for i = 1:100, j = 1:110];
v[:,:] = data;
close(ds)

ds = NCDataset(fname)
v = Variable{Float32,2}(ds,"temperature")

It seems that the type-instability does not spread:

function foo(ds)
    v = Variable{Float32,2}(ds,"temperature")
    v[:,:]
end

@code_warntype foo(ds)
MethodInstance for foo(::NCDataset{Nothing, Missing})
  from foo(ds) @ Main REPL[10]:1
Arguments
  #self#::Core.Const(foo)
  ds::NCDataset{Nothing, Missing}
Locals
  v::Variable{Float32, 2, NCDataset{Nothing, Missing}}
Body::Matrix{Float32}
1 ─ %1 = Core.apply_type(Main.Variable, Main.Float32, 2)::Core.Const(Variable{Float32, 2})                                                      
│        (v = (%1)(ds, "temperature"))
│   %3 = Base.getindex(v, Main.:(:), Main.:(:))::Matrix{Float32}
└──      return %3

The type check is still type unstable, but it will not spread. I think it is better to return an error for some inconsistent user input.

May the JET test can only be run on julia 1.10? On julia 1.6, a relatively old version of JET is installed (JET v0.4.6 versus v0.8.29 on 1.10).

@haakon-e
Copy link
Contributor Author

haakon-e commented Apr 30, 2024

I like the idea of Variable{Float32,2}(ds,varname). It would be nice if it is possible to opt out of type checking if you're confident (e.g. through unit testing of the code) of what you're passing through.

For example, this should let you skip typecheck

function Variable{T,N,checktypes}(ds::NCDataset,varname) where {T,N,checktypes}
     varid = nc_inq_varid(ds.ncid,varname)
     dimids = nc_inq_vardimid(ds.ncid,varid)
     if checktypes
         nctype = _jltype(ds.ncid,nc_inq_vartype(ds.ncid,varid))
         ndims = length(dimids)
       
         # proper error message to add
         @assert nctype == T
         @assert ndims == N
     end
     # reverse dimids to have the dimension order in Fortran style
     return Variable{T,N,typeof(ds)}(ds,varid, (reverse(dimids)...,))
end
function Variable{T,N}(ds::NCDataset,varname) where {T,N}
    Variable{T,N,true}(ds::NCDataset,varname)
end

One note: There is still type instability even if you remove the assertions. This is due to the uninferred length of dimids. Is there a way around that? (I get around it by explicitly passing the dimension names)

@Alexander-Barth
Copy link
Member

Do you have a real-word use case where checking the type and dimension would be a problem? There are already a quite a few internal checks in the netcdf c library. I am curious to know where these additional checks would be a problem. Note that also Array{Int,2} errors if you provide size e.g. (2,3,4).

This is due to the uninferred length of dimids. Is there a way around that?

Of course! :-)

using NCDatasets
import NCDatasets: Variable, _jltype, ncType, nc_inq_varid, nc_inq_vardimid, nc_inq_vartype
function Variable{T,N}(ds::NCDataset,varname) where {T,N}
    ncid = ds.ncid
    varid = nc_inq_varid(ncid,varname)
    dimids = nc_inq_vardimid(ncid,varid)
    xtype = nc_inq_vartype(ncid,varid)
    ndims = length(dimids)

    if ncType[T] != xtype
        error("Variable '$varname' has the type $(_jltype(ncid,xtype)) (instead of $T)")
    end

    if ndims != N
        error("Variable '$varname' has $ndims dimension(s) (instead of $N)")
    end

    # reverse dimids to have the dimension order in Fortran style
    dimeids_reversed = ntuple(N) do i
        dimids[end-i+1]
    end

    return Variable{T,N,typeof(ds)}(ds,varid, dimeids_reversed)
end

fname = tempname()
ds = NCDataset(fname,"c")
defDim(ds,"lon",100)
defDim(ds,"lat",110)
v = defVar(ds,"temperature",Float32,("lon","lat"))
data = [Float32(i+j) for i = 1:100, j = 1:110];
v[:,:] = data;
close(ds)

ds = NCDataset(fname)
varname = "temperature"

v = Variable{Float32,2}(ds,varname)

# the only type-unstable code is in the error message to provide julia types (e.g. Float64 or Vector{Float64} rather than netCDF code like the value of NC_DOUBLE, NC_VLEN
@code_warntype Variable{Float32,2}(ds,varname)

using Test
@test_throws ErrorException Variable{Float32,3}(ds,varname)
@test_throws ErrorException Variable{Int32,2}(ds,varname)

@haakon-e
Copy link
Contributor Author

haakon-e commented May 2, 2024

@dennisYatunin can you chip in here?

@dennisYatunin
Copy link

Sure! It sounds like we're planning to read from around a dozen netcdf files on every update to our atmosphere model, and each of these unstable type checks seems to add a few kilobytes of allocations. After a several thousand timesteps (each of which includes multiple model evaluations on different Runge-Kutta stages), this will amount to nearly a gigabyte of allocations. For now, our model is still runs at around one timestep per second, so the garbage collector could easily keep up with these allocations. In the future, though, we will probably be able to run simple configurations much more quickly, and this garbage collection will take up a larger fraction of our runtime.

We still have a fair number of allocations triggered by our atmosphere model as it runs, so this particular type instability is not a huge issue at the moment. However, our end goal is to have as few allocations and instabilities as possible, and it's good to tackle simple issues like this as they come up. That being said, we don't necessarily need the non-allocating version of this constructor to be part of the NCDatasets API. If this is considered too unsafe to be publicly exported, we could always just write our own version of it.

@Alexander-Barth
Copy link
Member

Just to be clear, you are not instantiating a NCDatasets.Variable at every time step, or do you? Only when you have a new file to read?

I use typically this approach:

ds = NCDataset('file.nc')
ncvar = ds["your_variable"]
buffer = zeros(Float32,2,3,4)

for n = 1:nmax
   step!(n,ncvar,buffer)
end

where step should not allocate or instantiate a NCDatasets.Variable (not at least at every time step). step would not allocate if you use NCDatasets.load!:

function step!(n,ncvar,buffer)
  NCDatasets.load!(ncvar,buffer,:,n)
  # some useful with buffer
end

Are you doing something like this already? Or is this not possible?
(however NCDatasets.load! allocates some small vectors for the indices as the C API can only work with vectors not tuples)

@haakon-e
Copy link
Contributor Author

haakon-e commented May 8, 2024

Currently, we do instantiate NCDatasets.Variable at every time step.

This is because I only keep the dataset open for the duration I read into memory, then close it, using code (roughly) on the format

function read_data!(file, var, buffer, n)
    NCDataset(file) do ds
        NCDatasets.load!(variable(ds, var), buffer, :, n)
    end
end

I suppose in principle we could open the dataset at the start of the simulation, then create the variables we need to access at every time step, and store them in the cache (which is what you suggest above).

opening the file "directly" (ds = NCDataset("...")) is something I've been taught not to do (because the do-block safely closes the file in case of errors). Perhaps I should rethink this belief?

A few comments/questions:

  • These simulations may be running for several hours -- so I'd need to "remember" to close the file at the end of the simulation (or when it crashes)
  • I will be running many (10-500) copies of the same simulation at the same time in separate Julia processes, would all of them keeping the file open throughout the simulation be a concern of some sort?
  • I can add the variables to our cache. After reading the data to memory, it will typically be moved to the GPU (I assume we cannot read directly onto the gpu).

@Alexander-Barth
Copy link
Member

I would prefer to keep the discussion here a bit more focused. As I mentioned above, it is preferable to instantiate Variable outside the time loop for performance. You can always call the sync function in the time loop to ensure that all data are saved on disk.
In this tests here, there is a factor of 500 between an allocation (and deallocation) cycle and the C function nc_open/nc_close. Note that there are likely allocation in the C layer that are not reported by julia.

julia> using NCDatasets, BenchmarkTools

julia> @btime begin; ncid = NCDatasets.nc_open("large_file.nc",NCDatasets.NC_NOWRITE); NCDatasets.nc_close(ncid); end
  992.927 μs (0 allocations: 0 bytes)

julia> @btime begin; a = zeros(100,100); a = 0; end
  2.063 μs (2 allocations: 78.17 KiB)
0

As far as I know the number of open files is an OS setting, see here
https://stackoverflow.com/questions/34588/how-do-i-change-the-number-of-open-files-limit-in-linux

I would be surprised if you hit a limit with 500.

Yes, do-blocks are better as the file automaticall closed, but this discussion is orthogonal to the present one. I do not use do-blocks for experiments in the repl.

I am not aware of a way to load the netcdf data directly into the GPU memory.

Feel free to open a thread at the julia discourse forum and ping me there for a more general discussion. 😀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants