Segmentation fault reading file with NCDatasets v0.12.6 #187

Closed
sjdaines opened this issue Aug 21, 2022 · 13 comments
Comments

@sjdaines

Describe the bug

Julia exits with a segmentation fault when attempting to read from a NetCDF file, apparently at random.

To Reproduce
No issues seen with NCDatasets v0.12.5 (with NetCDF_jll v400.702.400+0, using Julia v1.7), or with earlier versions (going back roughly one year).

Occurs with NCDatasets v0.12.6 (with NetCDF_jll v400.902.5+0), using either Julia v1.7 or v1.8

This happens while running an application that repeatedly opens and closes two NetCDF files. It fails seemingly at random while opening either file, after about 10 successful open/read/close cycles, and does not seem to be associated with opening or reading any particular field or file.

Apologies, this isn't an example or dataset I can share. The code is of the form:

NCDatasets.Dataset(netcdf_filename) do ds
        prepare_data(ds)
end

and it looks like the failure occurs when opening the NetCDF file (see stacktrace below).

julia> Pkg.test("NCDatasets") passes all tests.

Environment

  • operating system: Ubuntu 16.04.7 LTS
  • Julia version: v1.7 and v1.8, official binaries from https://julialang.org/downloads/
    julia> versioninfo()
    Julia Version 1.7.3
    Commit 742b9abb4d (2022-05-06 12:58 UTC)
    Platform Info:
    OS: Linux (x86_64-pc-linux-gnu)
    CPU: Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz
    WORD_SIZE: 64
    LIBM: libopenlibm
    LLVM: libLLVM-12.0.1 (ORCJIT, broadwell)
    Environment:
    JULIA_NUM_PRECOMPILE_TASKS = 4
    JULIA_DEPOT_PATH = /data/sd336/software/julia/depot
  • NCDatasets v0.12.6
  • NetCDF_jll v400.902.5+0

Full output

signal (11): Segmentation fault
in expression starting at /data/sd336/runtests.jl:11
posixio_open at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
ncio_open at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
NC3_open at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
NC_open at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
nc_open at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
nc_open at /data/sd336/software/julia/depot/packages/NCDatasets/gV1QR/src/netcdf_c.jl:267
unknown function (ip: 0x7f91e3a5fcf9)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
#NCDataset#12 at /data/sd336/software/julia/depot/packages/NCDatasets/gV1QR/src/dataset.jl:203
NCDataset at /data/sd336/software/julia/depot/packages/NCDatasets/gV1QR/src/dataset.jl:172 [inlined]
NCDataset at /data/sd336/software/julia/depot/packages/NCDatasets/gV1QR/src/dataset.jl:172 [inlined]
#NCDataset#13 at /data/sd336/software/julia/depot/packages/NCDatasets/gV1QR/src/dataset.jl:239
NCDataset at /data/sd336/software/julia/depot/packages/NCDatasets/gV1QR/src/dataset.jl:239 [inlined]
prepare_do_force_grid at /data/sd336/software/julia/depot/packages/PALEOboxes/iA0AD/src/reactioncatalog/GridForcings.jl:125

...etc...

@Alexander-Barth
Member

As a test, does the error persist if you comment out this line? (assuming that you do not need OPENDAP over HTTPS support):

https://github.com/Alexander-Barth/NCDatasets.jl/blob/master/src/NCDatasets.jl#L31

If the error is still present without the call to init_certificate_authority(), can you provide a minimal reproducible example? I don't need the whole data set or the complete function prepare_data, just a minimal one, possibly with random data, that still exhibits the segfault.

@sjdaines
Author

The error is still there if I comment out init_certificate_authority(), this is using Julia 1.7.3

Here's a cut down (although probably not minimal) code example that still fails, although less frequently than the full app.
The two netcdf files here report as classic using ncdump -k (I'll see if I can reproduce this with files I can share):

file testnc2.jl contains:

import NCDatasets

dataarrays = []

niter = 1

netcdf_filename1 = "unshareable_classic_netcdf_1.nc"
fields1 = ["time"]
netcdf_filename2 = "unshareable_classic_netcdf_2.nc"
fields2 = ["time", "phys_ocn_v"]

@noinline function prepare_data(darrays, fields, ds)
    for f in fields
        push!(darrays, ds[f][:])
    end
end

while niter < 100
    println("niter: ", niter)
    
    NCDatasets.Dataset(netcdf_filename1) do ds
        prepare_data(dataarrays, fields1, ds)
    end

    NCDatasets.Dataset(netcdf_filename2) do ds
        prepare_data(dataarrays, fields2, ds)
    end

    global niter += 1
end

and then the test was:

julia> nouter = 1
julia> while true; println("nouter: ", nouter);include("testnc2.jl");global nouter += 1;end

Example stacktrace (this is the most common failure, although it can fail in different ways, see below):

...
nouter: 34
...
niter: 80

signal (11): Segmentation fault
in expression starting at /data/sd336/PALEOdev.jl/PALEOexamples/testnc2.jl:18
posixio_open at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
ncio_open at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
NC3_open at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
NC_open at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
nc_open at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
nc_open at /data/sd336/NCDatasets.jl/src/netcdf_c.jl:267
unknown function (ip: 0x7fd81440e259)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
#NCDataset#12 at /data/sd336/NCDatasets.jl/src/dataset.jl:203
NCDataset at /data/sd336/NCDatasets.jl/src/dataset.jl:172 [inlined]
NCDataset at /data/sd336/NCDatasets.jl/src/dataset.jl:172 [inlined]
#NCDataset#13 at /data/sd336/NCDatasets.jl/src/dataset.jl:239
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
NCDataset at /data/sd336/NCDatasets.jl/src/dataset.jl:239
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1788 [inlined]
do_call at /buildworker/worker/package_linux64/build/src/interpreter.c:126
eval_value at /buildworker/worker/package_linux64/build/src/interpreter.c:215
...

@sjdaines
Author

A couple of examples of less common failures (with init_certificate_authority() commented out, using Julia 1.7.3):

With a similar, but not identical, test script:

nouter: 12
niter: 1
niter: 2
niter: 3
niter: 4
niter: 5
niter: 6
niter: 7
niter: 8
niter: 9
niter: 10
niter: 11
*** Error in `julia': double free or corruption (out): 0x000000005bc3eb60 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777f5)[0x7ff2b16797f5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8038a)[0x7ff2b168238a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7ff2b168658c]
/data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so(free_NC+0x30)[0x7ff243acbb69]
/data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so(NC_open+0x4a3)[0x7ff243abacdd]
/data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so(nc_open+0x3f)[0x7ff243ab9cc0]
[0x7ff24560fd23]
[0x7ff24560fefa]
/data/biogeochemdata/software/julia-1.7.3/bin/../lib/julia/libjulia-internal.so.1(jl_apply_generic+0x1fa)[0x7ff2b083fe7a]
[0x7ff24560d257]
[0x7ff245625b1d]
/data/biogeochemdata/software/julia-1.7.3/bin/../lib/julia/libjulia-internal.so.1(jl_apply_generic+0x1fa)[0x7ff2b083fe7a]
/data/biogeochemdata/software/julia-1.7.3/bin/../lib/julia/libjulia-internal.so.1(+0xd71c9)[0x7ff2b08521c9]
[0x7ff245607a4a]
/data/biogeochemdata/software/julia-1.7.3/bin/../lib/julia/libjulia-internal.so.1(jl_apply_generic+0x1fa)[0x7ff2b083fe7a]
[0x7ff2456253f3]
[0x7ff24562583d]
/data/biogeochemdata/software/julia-1.7.3/bin/../lib/julia/libjulia-internal.so.1(+0x1039ea)[0x7ff2b087e9ea]
/data/biogeochemdata/software/julia-1.7.3/bin/../lib/julia/libjulia-internal.so.1(+0x102b35)[0x7ff2b087db35]
/data/biogeochemdata/software/julia-1.7.3/bin/../lib/julia/libjulia-internal.so.1(jl_toplevel_eval_in+0xaa)[0x7ff2b087f77a]
/data/biogeochemdata/software/julia-1.7.3/lib/julia/sys.so(+0x11b79eb)[0x7ff29cb149eb]
/data/biogeochemdata/software/julia-1.7.3/bin/../lib/julia/libjulia-internal.so.1(jl_apply_generic+0x1fa)[0x7ff2b083fe7a]
/data/biogeochemdata/software/julia-1.7.3/lib/julia/sys.so(+0x11fa16b)[0x7ff29cb5716b]
[0x7ff2456065ac]
/data/biogeochemdata/software/julia-1.7.3/bin/../lib/julia/libjulia-internal.so.1(+0x1039ea)[0x7ff2b087e9ea]
/data/biogeochemdata/software/julia-1.7.3/bin/../lib/julia/libjulia-internal.so.1(+0x102b35)[0x7ff2b087db35]
/data/biogeochemdata/software/julia-1.7.3/bin/../lib/julia/libjulia-internal.so.1(jl_toplevel_eval_in+0xaa)[0x7ff2b087f77a]
/data/biogeochemdata/software/julia-1.7.3/lib/julia/sys.so(+0xf4f1b3)[0x7ff29c8ac1b3]
/data/biogeochemdata/software/julia-1.7.3/lib/julia/sys.so(+0xf4f9d5)[0x7ff29c8ac9d5]
/data/biogeochemdata/software/julia-1.7.3/lib/julia/sys.so(+0x824d3d)[0x7ff29c181d3d]
/data/biogeochemdata/software/julia-1.7.3/lib/julia/sys.so(+0x839692)[0x7ff29c196692]
/data/biogeochemdata/software/julia-1.7.3/lib/julia/sys.so(+0x83994c)[0x7ff29c19694c]
/data/biogeochemdata/software/julia-1.7.3/bin/../lib/julia/libjulia-internal.so.1(jl_apply_generic+0x1fa)[0x7ff2b083fe7a]
/data/biogeochemdata/software/julia-1.7.3/lib/julia/sys.so(+0x8e61ab)[0x7ff29c2431ab]
/data/biogeochemdata/software/julia-1.7.3/lib/julia/sys.so(+0x8e624c)[0x7ff29c24324c]
/data/biogeochemdata/software/julia-1.7.3/bin/../lib/julia/libjulia-internal.so.1(jl_apply_generic+0x1fa)[0x7ff2b083fe7a]
/data/biogeochemdata/software/julia-1.7.3/bin/../lib/julia/libjulia-internal.so.1(jl_f__call_latest+0x47)[0x7ff2b0850647]
/data/biogeochemdata/software/julia-1.7.3/lib/julia/sys.so(+0x12ef06f)[0x7ff29cc4c06f]
/data/biogeochemdata/software/julia-1.7.3/lib/julia/sys.so(+0x12fa59d)[0x7ff29cc5759d]
/data/biogeochemdata/software/julia-1.7.3/lib/julia/sys.so(+0xd97068)[0x7ff29c6f4068]
/data/biogeochemdata/software/julia-1.7.3/lib/julia/sys.so(+0xd971d9)[0x7ff29c6f41d9]
/data/biogeochemdata/software/julia-1.7.3/bin/../lib/julia/libjulia-internal.so.1(jl_apply_generic+0x1fa)[0x7ff2b083fe7a]
/data/biogeochemdata/software/julia-1.7.3/bin/../lib/julia/libjulia-internal.so.1(+0x128426)[0x7ff2b08a3426]
/data/biogeochemdata/software/julia-1.7.3/bin/../lib/julia/libjulia-internal.so.1(jl_repl_entrypoint+0x8d)[0x7ff2b08a3dcd]
julia(main+0x9)[0x4007d9]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7ff2b1622840]
julia[0x400809]
======= Memory map: ========
...

With the full app:

signal (11): Segmentation fault
in expression starting at /data/sd336/runtests.jl:11
strlen at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
processuri at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
NC_infermodel at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
NC_open at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
nc_open at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
nc_open at /data/sd336/NCDatasets.jl/src/netcdf_c.jl:267
unknown function (ip: 0x7f5c93d44cc9)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
#NCDataset#12 at /data/sd336/NCDatasets.jl/src/dataset.jl:203
NCDataset at /data/sd336/NCDatasets.jl/src/dataset.jl:172 [inlined]
NCDataset at /data/sd336/NCDatasets.jl/src/dataset.jl:172 [inlined]
#NCDataset#13 at /data/sd336/NCDatasets.jl/src/dataset.jl:239
unknown function (ip: 0x7f5c93d4c939)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
NCDataset at /data/sd336/NCDatasets.jl/src/dataset.jl:239
#CartesianGrid#29 at /data/sd336/software/julia/depot/packages/PALEOboxes/iA0AD/src/Grids.jl:597
CartesianGrid at /data/sd336/software/julia/depot/packages/PALEOboxes/iA0AD/src/Grids.jl:591 [inlined]

@sjdaines
Author

Also fails with a minimal netcdf file coords.nc (attached), although much less frequently, and in a different place. This file reports as netCDF-4 with ncdump -k

Test script is modified with:

netcdf_filename1 = "coords.nc"
fields1 = ["latitude"] 
netcdf_filename2 = "coords.nc"
fields2 = ["latitude", "longitude"]

Example stack trace of failure:

...
nouter: 4241
...
niter: 99
...

signal (11): Segmentation fault
in expression starting at /data/sd336/testnc3.jl:18
strlen at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
ncindexadd at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
nc4_att_list_add at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
att_read_callbk at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
H5A__attr_iterate_table at /data/sd336/software/julia/depot/artifacts/eb2e2f806144b25e5db6cc8941eb33b9a13f58a3/lib/libhdf5.so (unknown line)
H5O_attr_iterate_real at /data/sd336/software/julia/depot/artifacts/eb2e2f806144b25e5db6cc8941eb33b9a13f58a3/lib/libhdf5.so (unknown line)
H5O__attr_iterate at /data/sd336/software/julia/depot/artifacts/eb2e2f806144b25e5db6cc8941eb33b9a13f58a3/lib/libhdf5.so (unknown line)
H5A__iterate_common at /data/sd336/software/julia/depot/artifacts/eb2e2f806144b25e5db6cc8941eb33b9a13f58a3/lib/libhdf5.so (unknown line)
H5A__iterate at /data/sd336/software/julia/depot/artifacts/eb2e2f806144b25e5db6cc8941eb33b9a13f58a3/lib/libhdf5.so (unknown line)
H5VL__native_attr_specific at /data/sd336/software/julia/depot/artifacts/eb2e2f806144b25e5db6cc8941eb33b9a13f58a3/lib/libhdf5.so (unknown line)
H5VL__attr_specific.isra.0 at /data/sd336/software/julia/depot/artifacts/eb2e2f806144b25e5db6cc8941eb33b9a13f58a3/lib/libhdf5.so (unknown line)
H5VL_attr_specific at /data/sd336/software/julia/depot/artifacts/eb2e2f806144b25e5db6cc8941eb33b9a13f58a3/lib/libhdf5.so (unknown line)
H5Aiterate2 at /data/sd336/software/julia/depot/artifacts/eb2e2f806144b25e5db6cc8941eb33b9a13f58a3/lib/libhdf5.so (unknown line)
nc4_read_atts at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
nc4_hdf5_find_grp_var_att at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
NC4_HDF5_inq_var_all at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
nc_inq_var at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
nc_inq_varname at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
nc_inq_varname at /data/sd336/NCDatasets.jl/src/netcdf_c.jl:1515
listVar at /data/sd336/NCDatasets.jl/src/variable.jl:12
keys at /data/sd336/NCDatasets.jl/src/dataset.jl:258 [inlined]
initboundsmap! at /data/sd336/NCDatasets.jl/src/dataset.jl:80
NCDataset#1 at /data/sd336/NCDatasets.jl/src/types.jl:109
NCDataset at /data/sd336/NCDatasets.jl/src/types.jl:90 [inlined]
#NCDataset#12 at /data/sd336/NCDatasets.jl/src/dataset.jl:227
NCDataset at /data/sd336/NCDatasets.jl/src/dataset.jl:172 [inlined]
NCDataset at /data/sd336/NCDatasets.jl/src/dataset.jl:172 [inlined]
#NCDataset#13 at /data/sd336/NCDatasets.jl/src/dataset.jl:239
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
NCDataset at /data/sd336/NCDatasets.jl/src/dataset.jl:239
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]

coords.zip

@Alexander-Barth
Member

Alexander-Barth commented Aug 22, 2022

Can you also test whether the issue is present in NCDatasets v0.12.6 with NetCDF_jll v400.702.400+0, by forcing that version with ]add NetCDF_jll@400.702.400? Also, is it necessary to have both the outer and the inner loop, or is a single long inner loop enough?
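For anyone repeating this check from a script rather than the REPL's pkg mode, the version forcing described above could look roughly like the sketch below. It assumes a throwaway environment is acceptable and that the resolver accepts this combination of versions; the names specs and install_pinned are made up for illustration and are not part of NCDatasets.

```julia
using Pkg

# The two versions named in this thread: the NCDatasets release that showed
# the crash, together with the older NetCDF_jll build that did not.
specs = [
    Pkg.PackageSpec(name="NCDatasets", version="0.12.6"),
    Pkg.PackageSpec(name="NetCDF_jll", version="400.702.400"),
]

# Hypothetical helper: install both packages into a temporary environment
# and pin NetCDF_jll so a later update does not move it.
function install_pinned(specs)
    Pkg.activate(temp=true)   # scratch environment, discarded on exit
    Pkg.add(specs)            # needs network access
    Pkg.pin("NetCDF_jll")
    Pkg.status()              # shows which versions were actually resolved
end
```

Calling install_pinned(specs) in a fresh session and then running the reproducer in that same session keeps the test isolated from the default environment.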

A smaller reproducer:

import NCDatasets

netcdf_filename1 = "coords.nc"

total = 0.
niter = 0
tmp = zeros(Float32,90)   # buffer reused for every read

while true
    global total, niter
    # report progress every 1000 iterations
    (niter % 1000 == 0) && println("niter: ", niter)

    # open the file, read variable 0 through the low-level wrapper, close again
    NCDatasets.Dataset(netcdf_filename1) do ds
        varid = 0
        NCDatasets.nc_get_var!(ds.ncid,varid,tmp)
        total += sum(tmp)
    end

    niter += 1
end

crashes with:

niter: 922000                                                                                                                                                         
                                                                                                                                                                      
signal (11): Segmentation fault
in expression starting at /mnt/data1/abarth/.julia/dev/NCDatasets/test/test_segfault3.jl:9                                                                            
unknown function (ip: 0x7f1fee67f507)                                                                                                                                 
ncindexadd at /home/abarth/.julia/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)                                                  
nc4_att_list_add at /home/abarth/.julia/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)                                            
att_read_callbk at /home/abarth/.julia/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)                                             
H5A__attr_iterate_table at /home/abarth/.julia/artifacts/eb2e2f806144b25e5db6cc8941eb33b9a13f58a3/lib/libhdf5.so (unknown line)                                       
H5O_attr_iterate_real at /home/abarth/.julia/artifacts/eb2e2f806144b25e5db6cc8941eb33b9a13f58a3/lib/libhdf5.so (unknown line)                                         
H5O__attr_iterate at /home/abarth/.julia/artifacts/eb2e2f806144b25e5db6cc8941eb33b9a13f58a3/lib/libhdf5.so (unknown line)        

(using NetCDF_jll v400.902.5+0)

@Alexander-Barth
Member

This is likely to be an upstream issue.
Unidata/netcdf-c#2486

@sjdaines
Author

I can reproduce the first failure above ('classic' format netcdf files, fails in nc_open) using a single publicly available test file downloaded from the Unidata website.

The 'inner' and 'outer' loop do seem to be necessary (at least to provoke a failure quickly, a test with a single loop is still running after >1500 outer iterations).

To me it looks like this is a different error and plausibly a different issue to the netCDF-4 case?

This is using:
julia 1.7.3
NetCDF_jll v400.902.5+0
NCDatasets v0.12.7

(Also, there is no failure after changing to NetCDF_jll v400.702.400+0 using ]add NetCDF_jll@400.702.400, at least after 1000 iterations.)

File testnc6.jl contains:

import NCDatasets

dataarrays = []

niter = 1

# Test file downloaded from 
# https://www.unidata.ucar.edu/software/netcdf/examples/files.html
# ('classic' format)
# https://www.unidata.ucar.edu/software/netcdf/examples/ECMWF_ERA-40_subset.nc
netcdf_filename2 = "/data/sd336/ECMWF_ERA-40_subset.nc"
fields2 = ["time", "tcw"]

@noinline function prepare_data(darrays, fields, ds)
    for f in fields
        push!(darrays, ds[f][:])
    end
end

while niter < 100
    println("niter: ", niter)

    NCDatasets.Dataset(netcdf_filename2) do ds
        prepare_data(dataarrays, fields2, ds)
    end

    global niter += 1
end

with test:

julia> nouter = 1
julia> while true; println("nouter: ", nouter);include("testnc6.jl");global nouter += 1;end

and stacktrace:

...
nouter: 132
...
niter: 52

signal (11): Segmentation fault
in expression starting at /data/sd336/testnc6.jl:24
posixio_open at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
ncio_open at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
NC3_open at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
NC_open at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
nc_open at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
nc_open at /data/sd336/software/julia/depot/packages/NCDatasets/EkOvO/src/netcdf_c.jl:267
unknown function (ip: 0x7f91acdf93b9)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
#NCDataset#12 at /data/sd336/software/julia/depot/packages/NCDatasets/EkOvO/src/dataset.jl:203
NCDataset at /data/sd336/software/julia/depot/packages/NCDatasets/EkOvO/src/dataset.jl:172 [inlined]
NCDataset at /data/sd336/software/julia/depot/packages/NCDatasets/EkOvO/src/dataset.jl:172 [inlined]
#NCDataset#13 at /data/sd336/software/julia/depot/packages/NCDatasets/EkOvO/src/dataset.jl:239
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
NCDataset at /data/sd336/software/julia/depot/packages/NCDatasets/EkOvO/src/dataset.jl:239
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1788 [inlined]
do_call at /buildworker/worker/package_linux64/build/src/interpreter.c:126
...



@Alexander-Barth
Member

Thanks for your additional testing! There also seems to be an issue during initialization of NetCDF:
Unidata/netcdf-c#2486 (comment)

Hopefully this is the same problem, because this error happens right away.

@Alexander-Barth
Member

Can you test with https://github.com/Alexander-Barth/NetCDF_jll.jl/releases/tag/NetCDF-v400.902.29%2B0 ?

  1. start with an empty environment: julia --project=some_empty_folder
  2. install NCDatasets for development: ]dev NCDatasets
  3. comment out this line: https://github.com/Alexander-Barth/NCDatasets.jl/blob/master/Project.toml#L19
  4. install the new NetCDF_jll via ]add https://github.com/Alexander-Barth/NetCDF_jll.jl
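
The steps above can also be sketched with the Pkg API, which may be handy on a batch machine. This is only an outline: the function names setup_dev_env and add_test_jll are invented here, and step 3 (commenting out the linked Project.toml line in the development checkout) still has to be done by hand between the two calls.

```julia
using Pkg

# Steps 1 and 2: activate an (empty) project folder and check out NCDatasets
# for development, the scripted counterpart of `]dev NCDatasets`.
function setup_dev_env(project_dir::AbstractString)
    Pkg.activate(project_dir)
    Pkg.develop("NCDatasets")   # needs network access
end

# Step 4: install NetCDF_jll from the repository URL given in step 4.
# Run this only after the manual Project.toml edit from step 3.
function add_test_jll()
    Pkg.add(url="https://github.com/Alexander-Barth/NetCDF_jll.jl")
end
```

Typical use would be setup_dev_env("some_empty_folder"), then the manual edit, then add_test_jll(), followed by Pkg.status() to confirm which NetCDF_jll was picked up.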

On my end, it does not crash any more after 3000 outer iterations using this reproducer:

import NCDatasets

dataarrays = []

niter = 1

netcdf_filename1 = "coords.nc"
fields1 = ["latitude"]
netcdf_filename2 = "coords.nc"
fields2 = ["latitude", "longitude"]

@noinline function prepare_data(darrays, fields, ds)
    for f in fields
        push!(darrays, ds[f][:])
    end
end

while niter < 100
    #println("niter: ", niter)

    NCDatasets.Dataset(netcdf_filename1) do ds
        prepare_data(dataarrays, fields1, ds)
    end

    NCDatasets.Dataset(netcdf_filename2) do ds
        prepare_data(dataarrays, fields2, ds)
    end

    global niter += 1
end

Run with:

nouter = 1; while true; println("nouter: ", nouter);include("testnc2.jl");global nouter += 1;end

@sjdaines
Author

Looks good using https://github.com/Alexander-Barth/NetCDF_jll.jl/releases/tag/NetCDF-v400.902.29%2B0 !!

I've run three tests:

  1. Reproducer with 'coords.nc' netCDF-4 file as above:
    8000 outer iterations (cf failure at ~4000 outer iterations before)
  2. Reproducer with 'classic' format file downloaded from https://www.unidata.ucar.edu/software/netcdf/examples/ECMWF_ERA-40_subset.nc
    3000 outer iterations (cf failure at ~150 outer iterations before)
  3. Our application test suite (that led to the initial report);
    2 runs (cf used to fail every time before halfway)

@Alexander-Barth
Member

Thanks a lot for this comprehensive testing! In this build I removed the NetCDF C flag -std=c99.

@sjdaines
Author

Many thanks for addressing this issue, as well as your work on NCDatasets !

(and I can confirm all still looks good here after updating to the latest released packages)

@Alexander-Barth
Member

Great! Thank you for testing and creating the reproducer! The new NetCDF_jll has been released. I think that a Pkg.update() should be sufficient to get it.
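
To see which NetCDF_jll build an environment has actually resolved after updating, something along these lines should work. It is a sketch that assumes the active project is the one used to run the application; resolved_version is a hypothetical helper, not an NCDatasets or Pkg function.

```julia
using Pkg

# Look up the resolved version of a package (direct or indirect dependency)
# in the active environment; returns `nothing` if it is not present.
function resolved_version(pkgname::AbstractString)
    for (_, info) in Pkg.dependencies()
        info.name == pkgname && return info.version
    end
    return nothing
end

println("NetCDF_jll: ", resolved_version("NetCDF_jll"))
println("NCDatasets: ", resolved_version("NCDatasets"))
```

If the first line still reports 400.902.5+0, the build this thread reported as crashing, running Pkg.update() and checking again is the obvious next step.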


2 participants