-
-
Notifications
You must be signed in to change notification settings - Fork 79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Workaround for N_Vector segfaults #380
Workaround for N_Vector segfaults #380
Conversation
Issue #367 This fixes a problem seen with Julia >= 1.8, where a temporary NVector was created and the N_Vector pointer it contains passed to ccall, which is unsafe as the temporary NVector may be garbage collected. The fix is to remove convert(N_Vector, x) and replace with the combination of cconvert and unsafe_convert. An NVector can new be supplied as an argument to ccall with memory management then handled correctly ( and it is now impossible to explicitly create an N_Vector). Changes: - change to cconvert / unsafe_convert is in src/nvector_wrapper.jl . - A global edit (~400 changes) to the autogenerated code in lib/libsundials_api.jl. - small number of changes elsewhere, replacing N_Vector with NVector NB: The correct fix would be to update the code generator in gen/generate.jl, and regenerate the wrappers. NB: As the comments in src/nvector_wrapper.jl show this was an intentional and (now) incorrect use of temporary variables to avoid garbage collection, it is possible there are other similar problems (pointers from other types of temporary variables passed to ccall).
# TODO replace with use of cconvert / unsafe_convert | ||
# The code and comment below is incorrect and leads to memory errors with Julia >= 1.8, | ||
# as storing in a local variable DOES NOT protect against garbage collection. | ||
# Current workaround is to edit the generated files and apply a fix. | ||
# | ||
# Incorrect comment: 'first convert argument to NVector() and store in local var, | ||
# this guarantees that the wrapper and associated Sundials object (e.g. N_Vector) | ||
# is not removed by GC' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it would be good to handle this in full., but I think for now this comment is sufficient.
Codecov Report
@@ Coverage Diff @@
## master #380 +/- ##
==========================================
- Coverage 78.09% 78.05% -0.05%
==========================================
Files 11 11
Lines 1484 1481 -3
==========================================
- Hits 1159 1156 -3
Misses 325 325
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
Change identity(x) -> x (this was just a result of being lazy with the global edit)
CI failure with Julia 1.8.5 looks like the test runner is shut down after ~30mins, while somewhere in the middle of the preconditioner tests ? Perhaps hiiting some resource limit? Tests pass locally on 1.8.3 and 1.9.0-beta2, but do take ~60 mins. |
I set it to run again and see. It shouldn't be that big of a test to cause a resource failure, so I'd assume something is wrong with the memory still if it's still failing (it's been our smoking gun MWE) |
Maybe more than one problem here ? The earlier CI failures (Julia 1.8 with preconditioner tests) all ended with 'Error: Process completed with exit code 1', the failures with this fix have different exit codes as if the test runner was killed. Perhaps worth releasing this, with the preconditioners tests temporarily removed as before? This definitely fixes one problem seen in application code that otherwise makes CVODE / IDA unusable here with 1.9.0-beta2, |
Added some diagnostics to precs.jl tests and commented out the 'solve' calls at the end of the file to diagnose CI failure Sundials.jl/test/common_interface/precs.jl Lines 106 to 122 in 56added
Failure with Julia 1.8.5 still happens https://github.com/SciML/Sundials.jl/actions/runs/3903341325/jobs/6667576487 |
Add a |
precs.jl tests fail on github CI with Julia >= 1.8, before Sundials is even involved, ie failure looks to be unrelated to Sundials.
To be totally sure, I had commented out the solve calls (which is the first place where Sundials is even involved) and the failure still happened !! So I figured that was independent of any diagnostic output... |
I've just pushed a commit which omits the precs.jl tests for Julia >= 1.8 (as before), thinking it would be easier to release this and pursue the precs.jl issue in a different PR ? |
NVector is now a mutable struct with finalizer. (cf previous implementation with an immutable struct containing a Ref{NVector} to which a finalizer was attached)
I've opened new issues for:
|
Addresses #381 This is theoretically a breaking change, however any application code that was using the low-level API in this way (passing a Julia function into eg CVodeInit and relying on CVRhsFn_wrapper) would likely segfault randomly due to garbage collection of the wrapper. The problem with CVRhsFn_wrapper and similar is that a temporary Base.CFunction closure is returned by @cfunction($f, ...), which may be garbage collected hence it is not safe to return the enclosed pointer. This is somewhat similar to the problem with NVector addressed by #380, but more fundamental as the RHS function needs to persist for the whole lifetime of the solver use. Fix here is to remove these wrappers (which were only used by a few tests of the low-level interface), and require that the caller either uses the @cfunction(f, ...) form to generate a C pointer (ie f known at compile time and the runtime Julia RHS function passed in to Sundials C code as part of the opaque user data, this is what the high-level interface does), or manages the lifetime of the @cfunction explicitly.
This commit causes KiteModels.jl to fail with:
|
I'm not sure MVectors ever worked with Sundials? If it did that was unexpected and untested. Sundials is documented to only work with Array{Float64} specifically IIRC. We can add a conversion method for it though. Open an issue. |
Here it is: #392 |
Addresses SciML#435 and also fixes a potential problem with memory safety: 1) Allocations in convert(Ptr{T}, Handle{T}) Fix is as suggested in SciML#435 (Handle is now a mutable struct with a ptr field) 2) Memory safety robustness fix Remove convert and use paired Base.cconvert / Base.unsafe_convert to get the ptr field from a Handle object h, so that the h is preserved from GC across the ccall This fix is analogous to that for the NVector wrapper in PR SciML#380 (NB: there is no actual problem at least when Sundials is used with the SciML interface, as the Handle{CVODEMem} and similar objects are held by a persistent solver data structure, but this change should reduce the risk of something going wrong in the future, or for eg test harnesses that don't use the SciML interface)
Addresses #435 and also fixes a potential problem with memory safety: 1) Allocations in convert(Ptr{T}, Handle{T}) Fix is as suggested in #435 (Handle is now a mutable struct with a ptr field) 2) Memory safety robustness fix Remove convert and use paired Base.cconvert / Base.unsafe_convert to get the ptr field from a Handle object h, so that the h is preserved from GC across the ccall This fix is analogous to that for the NVector wrapper in PR #380 (NB: there is no actual problem at least when Sundials is used with the SciML interface, as the Handle{CVODEMem} and similar objects are held by a persistent solver data structure, but this change should reduce the risk of something going wrong in the future, or for eg test harnesses that don't use the SciML interface)
Workaround for issue #367
This fixes a problem seen with Julia >= 1.8, where a temporary NVector was created and the N_Vector pointer it contains passed to ccall, which is unsafe as the temporary NVector may be garbage collected.
The fix is to remove convert(N_Vector, x) and replace with the combination of cconvert and unsafe_convert. An NVector can now be supplied as an argument to ccall with memory management then handled correctly ( and it is now impossible to explicitly create an N_Vector).
Changes:
NB: The correct fix would be to update the code generator in gen/generate.jl, and regenerate the wrappers.
NB: As the comments in src/nvector_wrapper.jl show this was an intentional and (now) incorrect use of temporary variables to avoid garbage collection, it is possible there are other similar problems (pointers from other types of temporary variables passed to ccall).