NetCDF 4.9.0: segmentation fault after repeatedly opening a NetCDF 4 file, reading a vector and closing the file #2486
Interesting, and thank you for the comprehensive information! I'll set up to replicate this.
Notes to self: v4.9.0 w/ HDF5 1.12.2 does not manifest this issue on macOS M1. Setting up a 1.12.1 environment. @Alexander-Barth, would it be possible to get the corresponding libhdf5.settings (and libnetcdf.settings) files?
Also, @Alexander-Barth, on the system that's failing, what happens if you compile and run the program with AddressSanitizer enabled (the -fsanitize=address -fno-omit-frame-pointer flags that appear in the reproduction command later in this thread)?
This makes some assumptions about the capabilities of the underlying compiler/system, and it's possible it will simply fail to compile due to unrecognized arguments. But if it compiles, does it still fail after iteration ~944000?
This is libhdf5.settings (HDF5 is taken from https://anaconda.org/conda-forge/hdf5/1.12.1/download/linux-64/hdf5-1.12.1-nompi_h2750804_103.tar.bz2)
This is libnetcdf.settings
Interestingly with
I am rerunning this test case with gcc 12.1 and NetCDF compiled with -g, which gives a more complete stack trace with line numbers:
The failure in
This is the line where the error occurs:
In the build environment the
This also involves the
@sjdaines also reported an error with classical NetCDF files:
Interesting, thank you. I will continue working on duplicating this; I am thus far unable to replicate the error on macOS or Linux, using the .nc file and C code provided, with or without memory sanitizing. I'll take a closer look at the
For completeness, this is the only patch that we apply to 4.9.0:
Based on this:
But I don't think this is relevant here. Thank you for looking into this!
Surprisingly, if netcdf is compiled without
I am running the long test now.
No problem looking into this; we have a long list of things to address, but it comes down more to resource management than anything else; we will get to everything in time XD. One hopes. The Sanitizer is great (I was aware of it, but thanks to @edwardhartnett for bringing it into our regular toolbox); it helps flag memory management issues the moment they occur, not when they eventually become problematic. I confess I'm really curious why this is occurring on your system and not on mine; my next step will be to test using the conda-packaged HDF5 instead of the version I compile myself. The only downside to running with the sanitizer is the (to be expected) additional overhead; it turned the 1,000,000-iteration run I was doing from a 2-minute test into a 16-minute test last night.
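For anyone unfamiliar with it, here is a minimal illustration (a toy example, not code from this issue; the build flags mirror the ones used elsewhere in this thread) of the kind of bug AddressSanitizer reports at the moment it happens rather than when it eventually corrupts something:

/* asan_demo.c -- toy example of a heap-use-after-free.
 * Build: gcc -g -fsanitize=address -fno-omit-frame-pointer asan_demo.c
 * Without ASan this usually appears to run fine and only misbehaves
 * later, if at all; with ASan it aborts immediately with a
 * heap-use-after-free report pointing at the offending line. */
#include <stdlib.h>
#include <string.h>

int main(void) {
    char *buf = malloc(16);
    strcpy(buf, "hello");
    free(buf);
    return buf[0];   /* use-after-free: ASan flags this read instantly */
}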
Frustratingly, adding
If it is helpful, setting up a sandbox with gcc 12.0.1 (or another version) and all necessary libraries (HDF5, zlib and libcurl) can be achieved with the following commands on a Linux system. You only need to have git pre-installed (which is likely :-)):

DIR=$PWD
wget -O - https://julialang-s3.julialang.org/bin/linux/x64/1.8/julia-1.8.0-linux-x86_64.tar.gz | tar -xzf -
$DIR/julia-1.8.0/bin/julia --eval 'using Pkg; Pkg.add("BinaryBuilder")'
git clone https://github.com/JuliaPackaging/Yggdrasil.git
cd $DIR/Yggdrasil/N/NetCDF
sed 's/preferred_gcc_version=v"5"/preferred_gcc_version=v"12"/' ./build_tarballs.jl > ./build_tarballs_gcc12.jl
$DIR/julia-1.8.0/bin/julia --color=yes ./build_tarballs_gcc12.jl x86_64-linux-gnu --debug=end

NetCDF is installed in /workspace/destdir. To reproduce the issue, one can use:

wget -O test_segfault6.c https://dox.ulg.ac.be/index.php/s/QI3R0UKx3QdKBra/download
wget https://github.com/Alexander-Barth/NCDatasets.jl/files/9393436/coords.zip
unzip coords.zip
gcc -g test_segfault6.c $(nc-config --cflags --libs) -fsanitize=address -fno-omit-frame-pointer && ./a.out
# -> should reproduce the error

Use Control-D to exit the sandbox. Some commands need to download large files and can take a while (a couple of minutes, for example). In any case, I would completely understand if you do not want to venture into using unfamiliar tools.
@Alexander-Barth This is great, actually; it will help a lot to be able to replicate the environment. I will take a look at this tomorrow; my day-to-day machine is ARM, so I will move over to an x86-64 machine to test this out. I will also try under emulation if it comes down to it. Thanks!
@Alexander-Barth So, I'm running the scripts you provided above. First, this is great; I will definitely spend some time unpicking the actual Julia scripts. Julia is one of those languages that's been on my radar/to-do list, but I haven't had the chance to explore it. This containerized build system is a big help, though. I'm running the provided scripts on my x86_64 dev system, under WSL (if that makes a difference); unfortunately, I'm still not able to reproduce the error. That doesn't necessarily mean there isn't a problem; after I've let this run for a while just to make sure, I'm going to run it through some additional memory profiling tools. Do you happen to know how much system RAM is available in the environment where this issue is being observed? What I suspect, at this point, is that something isn't being freed when it should be; if available RAM is relatively low, perhaps we're seeing an out-of-memory issue? Beyond that, I'm happy to continue helping diagnose this, particularly when it's so easy to use the "same" environment where the issue is being observed.
Ah, I suspect I'm not seeing the failure because of your PR from an hour ago. That might also explain it. Let me try this again and see if I can recreate it by stepping back to before that was merged.
OK, I am able to recreate this now, and I have a couple of leads on it. Thanks!
With some testing, I'm now observing the following: (where v8.8.8 is a temporary tag I've created in
Running through gdb, I'm observing the following:
It appears something unexpected is happening with
Note to self: I've confirmed that removing the
@Alexander-Barth So, this has been a valuable and interesting exercise. I believe that the fix you have adopted here, the removal of
A cursory Google search for
I'll wait to close this issue so that you have a chance to share your thoughts, @Alexander-Barth; feel free to close the issue yourself if you'd like, or I'll address it in the next couple of days. Thanks again!
OK, this is a very interesting find! So
What is surprising is that, according to
Despite the test using the option
For reference, there is previous discussion about this:
I am closing this issue because the option is not necessary anymore in NetCDF 4.9.0. After intensive testing by @sjdaines, all reported failure cases are fixed by dropping
(And I learned a lot too; above all, that C is really hard :-))
I recall now. It turns out that, as you note, a number of functions like strdup |
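(For later readers, here is a sketch of how an implicitly declared strdup can produce exactly this delayed-crash pattern. The -std=c99 flag below is my assumption, extrapolated from the strdup remark above: under a strict ISO C mode with no feature-test macros, glibc's <string.h> does not declare strdup, since it is POSIX rather than ISO C. gcc then falls back to an implicit declaration returning int, and on an LP64 system the 64-bit pointer is truncated to 32 bits, which stays harmless until allocations land above the 4 GiB boundary, i.e. only after very many iterations.)

/* implicit_strdup.c -- sketch; the -std=c99 flag is an assumption.
 * Build: gcc -std=c99 implicit_strdup.c
 * gcc warns "implicit declaration of function 'strdup'" and assumes
 * it returns int; the conversion back to char * silently truncates
 * the pointer on 64-bit systems.  Compiling without the strict mode,
 * or defining _GNU_SOURCE / _POSIX_C_SOURCE, restores the prototype. */
#include <stdio.h>
#include <string.h>   /* no strdup prototype under strict -std=c99 */

int main(void) {
    char *copy = strdup("hello");   /* implicit int, truncated to 32 bits */
    printf("%s\n", copy);           /* may crash once the heap is above 4 GiB */
    return 0;
}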
The Julia user @sjdaines reported this segmentation fault (JuliaGeo/NCDatasets.jl#187) when repeatedly opening a NetCDF 4 file, reading a vector, and closing the file. After doing this ~1,000,000 times, we get a segmentation fault. For the original use case, the error occurs much earlier.
NetCDF 4.9.0 with HDF5 1.12.1 on Linux 5.15.0 with gcc 5.2.0 or gcc 12.1.0.
NetCDF 4.9.0 is compiled with:
The segmentation fault can also be reproduced with the following C code:
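(The original listing did not survive extraction; the following is a minimal reconstruction sketch, consistent with the compile command and output shown below. The file name coords.nc and the variable name lon are assumptions, not taken from the original.)

/* test_segfault6.c (reconstruction sketch) -- repeatedly open the
 * file, read one coordinate vector, close, and print the iteration
 * count so the failing iteration is visible. */
#include <stdio.h>
#include <stdlib.h>
#include <netcdf.h>

static void check(int status) {
    if (status != NC_NOERR) {
        fprintf(stderr, "error: %s\n", nc_strerror(status));
        exit(EXIT_FAILURE);
    }
}

int main(void) {
    for (long niter = 0; ; niter++) {
        int ncid, varid, dimid;
        size_t len;
        if (niter % 1000 == 0)
            printf("niter: %ld\n", niter);
        check(nc_open("coords.nc", NC_NOWRITE, &ncid));
        check(nc_inq_varid(ncid, "lon", &varid));
        check(nc_inq_dimid(ncid, "lon", &dimid));
        check(nc_inq_dimlen(ncid, dimid, &len));
        double *data = malloc(len * sizeof *data);
        if (!data) { perror("malloc"); exit(EXIT_FAILURE); }
        check(nc_get_var_double(ncid, varid, data));
        free(data);
        check(nc_close(ncid));
    }
}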
Compiled with:
gcc -g test_segfault6.c $(nc-config --cflags --libs)
After niter: 944000, the output is Segmentation fault (core dumped). Running the program under gdb, we see the following stack trace:

On a different system with HDF5 1.10.0, this error could not be reproduced (tested up to 5,000,000 iterations).
The NetCDF file is available at:
https://github.com/Alexander-Barth/NCDatasets.jl/files/9393436/coords.zip and contains the following data: