Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault when running a test with custom pivot #105

Open
sphynx opened this issue Nov 13, 2020 · 8 comments
Open

Segfault when running a test with custom pivot #105

sphynx opened this issue Nov 13, 2020 · 8 comments

Comments

@sphynx
Copy link

sphynx commented Nov 13, 2020

On current master (which has the fix for the memory leak: 72d66b5 ) I'm getting a segfault when running standard cylp tests (from the file cylp/py/test_PySolve.py).

I've converted that test to a simple repro:

from cylp.cy import CyClpSimplex
from cylp.py.pivots import DantzigPivot

s = CyClpSimplex()
s.readMps('p0033.mps')  # or adlittle.mps
pivot = DantzigPivot(s)
s.setPivotMethod(pivot)
s.primal()
print(s.objectiveValue)

File p0033.mps comes from cylp/input directory.

I'm getting the following output:

$ python repro.py
Coin0001I At line 15 NAME          P0033
Coin0001I At line 16 ROWS
Coin0001I At line 34 COLUMNS
Coin0001I At line 109 RHS
Coin0001I At line 118 BOUNDS
Coin0001I At line 152 ENDATA
Coin0002I Problem P0033 has 16 rows, 33 columns and 98 elements
Clp0027I Model was imported from p0033.mps in 0.000493 seconds
Clp0006I 0  Obj 0 Primal inf 18.492499 (10) Dual inf 5.5987499e+11 (32)
Segmentation fault (core dumped)

I could reproduce this after building and installing cylp using pip install . on both ArchLinux (with system-wide coin-or-cbc package of version 2.10.5-4) and WSL and also when installing using conda.

Looking at core dump I can see that it failed in ClpSimplexPrimal::statusOfProblemInPrimal function (due to null pointer dereferencing):

#0  0x00007fabd5176671 in ClpSimplexPrimal::statusOfProblemInPrimal(int&, int, ClpSimplexProgress*, bool, int, ClpSimplex*) () from /usr/lib/libClp.so.1
#1  0x00007fabd517aba9 in ClpSimplexPrimal::primal(int, int) () from /usr/lib/libClp.so.1
#2  0x00007fabd54b5b76 in IClpSimplex::primal (this=<optimized out>, ifValuesPass=0, startFinishOptions=0) at cylp/cpp/IClpSimplex.cpp:1327
#3  0x00007fabd54ef95a in __pyx_pf_4cylp_2cy_12CyClpSimplex_12CyClpSimplex_118primal (__pyx_v_presolve=<optimized out>, __pyx_v_startFinishOptions=<optimized out>,
    __pyx_v_ifValuesPass=<optimized out>, __pyx_v_self=0x7fabcaf89820) at cylp/cy/CyClpSimplex.cpp:25122
#4  __pyx_pw_4cylp_2cy_12CyClpSimplex_12CyClpSimplex_119primal (__pyx_v_self=0x7fabcaf89820, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>)
    at cylp/cy/CyClpSimplex.cpp:24993

This problem does not happen if I do any of the following:

  1. Checkout and build cylp version before memory leak fixes, i.e. this commit: 875d5d5
  2. Do not set pivot method in the repro.
  3. Use initialSolve instead of primal in the repro.

I tried to understand what exactly is causing the null pointer, but the function ClpSimplexPrimal::statusOfProblemInPrimal is around 900 lines long, it's not easy to understand and debug.

In general, it looks like this memory leak fix (as described in #102 ) is causing subtle memory issues, so we are either getting the segfault or the memory leaks.

Could you please take a look at this?

Thank you!

@sphynx
Copy link
Author

sphynx commented Nov 13, 2020

/CCing @vbadanov and @BenChampion since it was their code which was fixing the memory leak (and introducing this segafult as an unpleasant side effect).

Could you guys please take a look and perhaps propose a solution to this problem?

Thank you!

@Quogu
Copy link

Quogu commented Nov 18, 2020

I've been looking into this today, and I've found a different segfault - my nullptr dereference is here:
https://github.com/coin-or/CyLP/blob/master/cylp/cpp/IClpSimplex.cpp#L1111

objective_ is null when we dereference it. The reason I get this is because I was running @sphynx's repro from the wrong folder and the input file was failing to load. Segfaulting when we fail to load the input file is still pretty bad, so I think that the error handling here should be improved - it seems safest to just check for objective_ being null and running the example through, fixing any more issues until the repro with a failed load doesn't segfault.

Once I fixed this and ran the repro from the correct directory, I can confirm that I can repro @sphynx's segfault on Windows 10 too, my callstack is:

 	CyClpSimplex.cp37-win_amd64.pyd!ClpSimplexPrimal::statusOfProblemInPrimal(int &,int,class ClpSimplexProgress *,bool,int,class ClpSimplex *)	C++
 	CyClpSimplex.cp37-win_amd64.pyd!ClpSimplexPrimal::primal(int,int)	C++
>	CyClpSimplex.cp37-win_amd64.pyd!IClpSimplex::primal(int ifValuesPass, int startFinishOptions) Line 1327	C++
 	CyClpSimplex.cp37-win_amd64.pyd!__pyx_pf_4cylp_2cy_12CyClpSimplex_12CyClpSimplex_118primal(__pyx_obj_4cylp_2cy_12CyClpSimplex_CyClpSimplex * __pyx_v_self, _object * __pyx_v_ifValuesPass, _object * __pyx_v_startFinishOptions, _object * __pyx_v_presolve) Line 25093	C++
 	CyClpSimplex.cp37-win_amd64.pyd!__pyx_pw_4cylp_2cy_12CyClpSimplex_12CyClpSimplex_119primal(_object * __pyx_v_self, _object * __pyx_args, _object * __pyx_kwds) Line 24964	C++
 	python37.dll!_PyMethodDef_RawFastCallKeywords(PyMethodDef * method, _object * self, _object * const * args, __int64 nargs, _object * kwnames) Line 693	C
 	python37.dll!_PyMethodDescr_FastCallKeywords(_object * descrobj, _object * const * args, __int64 nargs, _object * kwnames) Line 290	C
 	python37.dll!call_function(_object * * * pp_stack, __int64 oparg, _object * kwnames) Line 4593	C
 	python37.dll!_PyEval_EvalFrameDefault(_frame * f, int throwflag) Line 3111	C
 	[Inline Frame] python37.dll!PyEval_EvalFrameEx(_frame *) Line 547	C
 	python37.dll!_PyEval_EvalCodeWithName(_object * _co, _object * globals, _object * locals, _object * const * args, __int64 argcount, _object * const * kwnames, _object * const * kwargs, __int64 kwcount, int kwstep, _object * const * defs, __int64 defcount, _object * kwdefs, _object * closure, _object * name, _object * qualname) Line 3930	C
 	[Inline Frame] python37.dll!PyEval_EvalCodeEx(_object *) Line 3959	C
 	[Inline Frame] python37.dll!PyEval_EvalCode(_object *) Line 524	C
 	python37.dll!run_mod(_mod * mod, _object * filename, _object * globals, _object * locals, PyCompilerFlags * flags, _arena * arena) Line 1036	C
 	python37.dll!PyRun_InteractiveOneObjectEx(_iobuf * fp, _object * filename, PyCompilerFlags * flags) Line 257	C
 	python37.dll!PyRun_InteractiveLoopFlags(_iobuf * fp, const char * filename_str, PyCompilerFlags * flags) Line 120	C
 	python37.dll!PyRun_AnyFileExFlags(_iobuf * fp, const char * filename, int closeit, PyCompilerFlags * flags) Line 78	C
 	[Inline Frame] python37.dll!pymain_run_file(_iobuf * filename, const wchar_t *) Line 456	C
 	python37.dll!pymain_run_filename(_PyMain * pymain, PyCompilerFlags * cf) Line 1646	C
 	python37.dll!pymain_run_python(_PyMain * pymain) Line 2910	C
 	python37.dll!pymain_main(_PyMain * pymain) Line 3070	C
 	python37.dll!Py_Main(int argc, wchar_t * * argv) Line 3092	C

@sphynx
Copy link
Author

sphynx commented Nov 25, 2020

Any feedback on this, please?

@tkralphs
Copy link
Member

I finally got to tracking this down. Debugging auto-generated Cython code is painful to say the least. The proximal cause of the seg fault is that matrix_ suddenly gets set to zero inside a ClpSimplex object that is still in use and then that NULL pointer is later dereferenced. What I pieced together is that what is happening is that the ClpSimplex object is passed into a newly allocated CyClpSimplex object in the saveWeights method of CyPivotPythonBase here.

cdef void saveWeights(self, CppIClpSimplex * model, int mode):
cymodel = CyClpSimplex()
cymodel.setCppSelf(model)
self.pivotMethodObject.saveWeights(cymodel, mode)

Then because the pointer to that that CyClpSimplex is only stored in a local variable, it is deallocated again once that variable goes out of scope, which then deallocates the ClpSimplex object. But this object is still in use. It wasn't a problem when the ClpSimplex object wasn't being properly deallocated (and this is in fact probably the reason the original author commented out the deallocation). The exact place this happens in the code is line 5284 of cylp/cy/CyPivotPythonBase.cpp, which is in some cryptic auto-generated part of the code. It's hard to see what's going on, so I'm making an educated guess on all of this.

Obviously, one hacky fix would be just to undo the original patch and let the memory leak, which is what has been happening up until recently. Fixing this the right way would probably involve a lot of effort and some intricate knowledge of both Cython and Clp. What is there is already kind of hacky and I'm wondering how many people are actually using this capability of CyLP versus just wanting to be able to make all the tests pass. I can think of a few hacks to fix this, but so far, I don't see one obvious solution and my Cython knowledge is somewhat limited. If anyone has ideas, let me know.

For posterity, here is the backtrace at the exact moment in the execution when matrix_ becomes NULL. It is not until a bit later that the code actually crashes trying to dereference that NULL pointer.

#0  __Pyx_PyFunction_FastCallNoKw (co=0x7fffbda963f0, args=0x7ffffffea438, na=3, 
    globals=0x7fffbda9cec0) at cylp/cy/CyPivotPythonBase.cpp:5284
#1  0x00007fffbd2a002a in __Pyx_PyFunction_FastCallDict (func=0x7fffbda8d120, 
    args=0x7ffffffea420, nargs=3, kwargs=0x0)
    at cylp/cy/CyPivotPythonBase.cpp:5316
#2  0x00007fffbd29c8ad in __pyx_f_4cylp_2cy_17CyPivotPythonBase_17CyPivotPythonBase_saveWeights (__pyx_v_self=0x7ffffe226000, __pyx_v_model=0x88d6f20, 
    __pyx_v_mode=1) at cylp/cy/CyPivotPythonBase.cpp:3095
#3  0x00007fffff450c91 in __pyx_f_4cylp_2cy_26CyClpPrimalColumnPivotBase_RunSaveWeights (__pyx_v_ptr=0x7ffffe226000, __pyx_v_model=0x88d6f20, __pyx_v_mode=1)
    at cylp/cy/CyClpPrimalColumnPivotBase.cpp:2803
#4  0x00007fffbd29b319 in CppClpPrimalColumnPivotBase::saveWeights (
    this=0x85e8bf0, model=0x88d6f20, mode=1)
    at cylp/cpp/IClpPrimalColumnPivotBase.cpp:32
#5  0x00007fffbf3b69f9 in ClpSimplexPrimal::statusOfProblemInPrimal (
    this=0x88d6f20, lastCleaned=@0x7ffffffea794: 0, type=0, progress=0x88d74b8, 
    doFactorization=true, ifValuesPass=0, originalModel=0x0)
    at /mnt/c/Users/tkral/Documents/Projects/Clp/Clp/src/ClpSimplexPrimal.cpp:873
#6  0x00007fffbf3b5304 in ClpSimplexPrimal::primal (this=0x88d6f20, 
    ifValuesPass=0, startFinishOptions=0)
    at /mnt/c/Users/tkral/Documents/Projects/Clp/Clp/src/ClpSimplexPrimal.cpp:367
#7  0x00007fffc0033715 in IClpSimplex::primal (this=0x88d6f20, ifValuesPass=0, 
    startFinishOptions=0) at cylp/cpp/IClpSimplex.cpp:1327
#8  0x00007fffc009abab in __pyx_pf_4cylp_2cy_12CyClpSimplex_12CyClpSimplex_118primal (__pyx_v_self=0x7fffbd304190, __pyx_v_ifValuesPass=0x7fffff6700d0, 
    __pyx_v_startFinishOptions=0x7fffff6700d0, 
    __pyx_v_presolve=0x83dd660 <_Py_FalseStruct>)
    at cylp/cy/CyClpSimplex.cpp:25209
#9  0x00007fffc009a4de in __pyx_pw_4cylp_2cy_12CyClpSimplex_12CyClpSimplex_119primal (__pyx_v_self=0x7fffbd304190, __pyx_args=0x7fffff674070, __pyx_kwds=0x0)
    at cylp/cy/CyClpSimplex.cpp:25080
#10 0x00000000081bb9ca in method_vectorcall_VARARGS_KEYWORDS ()
    at /home/conda/feedstock_root/build_artifacts/python-split_1637374334978/work/Objects/stringlib/unicode_format.h:630
#11 0x00000000081e3360 in _PyObject_VectorcallTstate (kwnames=0x0, 
    nargsf=<optimized out>, args=0x7fffff699ba8, callable=0x7ffffe221d50, 
    tstate=0x84029e0)
    at /home/conda/feedstock_root/build_artifacts/python-split_1637374334978/work/I---Type <return> to continue, or q <return> to quit--- 
nclude/cpython/abstract.h:123
#12 PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, 
    args=0x7fffff699ba8, callable=0x7ffffe221d50)
    at /home/conda/feedstock_root/build_artifacts/python-split_1637374334978/work/Include/cpython/abstract.h:123
#13 call_function (kwnames=0x0, oparg=<optimized out>, 
    pp_stack=<synthetic pointer>, trace_info=0x7ffffffeae20, 
    tstate=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python-split_1637374334978/work/Python/ceval.c:5888
#14 _PyEval_EvalFrameDefault ()
    at /home/conda/feedstock_root/build_artifacts/python-split_1637374334978/work/Python/ceval.c:4206
#15 0x00000000081c4879 in _PyEval_EvalFrame (throwflag=0, f=0x7fffff699a40, 
    tstate=0x84029e0)
    at /home/conda/feedstock_root/build_artifacts/python-split_1637374334978/work/Include/internal/pycore_ceval.h:46
#16 _PyEval_Vector ()
    at /home/conda/feedstock_root/build_artifacts/python-split_1637374334978/work/Python/ceval.c:5073
#17 0x0000000008278107 in PyEval_EvalCode (co=0x7ffffe34b680, 
    globals=0x7ffffe34e000, locals=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python-split_1637374334978/work/Python/ceval.c:1134
#18 0x00000000082781c9 in run_eval_code_obj ()
    at /home/conda/feedstock_root/build_artifacts/python-split_1637374334978/work/Python/pythonrun.c:1289
#19 0x000000000829deb4 in run_mod ()
    at /home/conda/feedstock_root/build_artifacts/python-split_1637374334978/work/Python/pythonrun.c:1310
#20 0x00000000082a4369 in pyrun_file ()
    at /home/conda/feedstock_root/build_artifacts/python-split_1637374334978/work/Python/pythonrun.c:1206
#21 0x00000000082a451f in _PyRun_SimpleFileObject.localalias ()
    at /home/conda/feedstock_root/build_artifacts/python-split_1637374334978/work/Python/pythonrun.c:455
#22 0x00000000082a4623 in _PyRun_AnyFileObject.localalias ()
    at /home/conda/feedstock_root/build_artifacts/python-split_1637374334978/work/P---Type <return> to continue, or q <return> to quit---
ython/pythonrun.c:89
#23 0x00000000082a5538 in pymain_run_file_obj (
    skip_source_first_line=<optimized out>, filename=0x7ffffe3c4ce0, 
    program_name=0x7ffffe396830)
    at /home/conda/feedstock_root/build_artifacts/python-split_1637374334978/work/Modules/main.c:357
#24 pymain_run_file (config=0x83e6d50)
    at /home/conda/feedstock_root/build_artifacts/python-split_1637374334978/work/Modules/main.c:376
#25 pymain_run_python (exitcode=0x7ffffffeb1a0)
    at /home/conda/feedstock_root/build_artifacts/python-split_1637374334978/work/Modules/main.c:591
#26 Py_RunMain.localalias ()
    at /home/conda/feedstock_root/build_artifacts/python-split_1637374334978/work/Modules/main.c:670
#27 0x00000000082a5689 in Py_BytesMain (argc=<optimized out>, 
    argv=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python-split_1637374334978/work/Modules/main.c:1083
#28 0x00007ffffe431b97 in __libc_start_main (main=0x8123a80 <main>, argc=2, 
    argv=0x7ffffffeb3a8, init=<optimized out>, fini=<optimized out>, 
    rtld_fini=<optimized out>, stack_end=0x7ffffffeb398)
    at ../csu/libc-start.c:310
#29 0x0000000008210385 in _start ()
    at /home/conda/feedstock_root/build_artifacts/python-split_1637374334978/work/Parser/parser.c:22805

@rmcgibbo
Copy link

rmcgibbo commented Jan 8, 2022

it seems like some well-paced Py_INCREF calls in the cython code should be able to fix this. I tried throwing some in blindly and was not successful, but likely it wasn't in the right spot.

@tkralphs
Copy link
Member

tkralphs commented Jan 8, 2022

Yeah, I was thinking of something along those lines, but the problem is that anything like that would likely introduce another memory leak because you have to keep that CyClpSimplex object around until the ClpSimplex object is no longer referenced in Clp. But there's no way for CyLP to know when that Clp object is no longer referenced. So the CyClpSimplex object will just hang around and end up being leaked. Actually, looking at the code snippet where this happens

result = PyEval_EvalFrameEx(f,0);
++tstate->recursion_depth;
Py_DECREF(f);

I would guess that commenting out line 5286 would fix the issue, but then introduce a new leak.

A relatively clean way to fix this would be to clone model before passing it in, so that copy can be deallocated without affecting anything else, but I couldn't quickly see exactly how to do that.

It seems pretty clear that the model object is only being passed in to allow access some data stored there and the whole model isn't really needed. But fixing that would require changing Clp's implementation and I don't have the bandwidth to do that, especially since I don't know of anyone really using this capability of CyLP.

@rmcgibbo
Copy link

rmcgibbo commented Jan 8, 2022

(Understood. Putting on my linux packaging hat, I would prioritize getting the tests to pass in the default configuration, even if it means removing some likely unused functionality.)

@BrannonKing
Copy link

I just want to comment that this issue is HUGE! I just downloaded and installed CyLP for one single purpose: I wanted to test some custom pivot rules. There aren't many easy ways to jump right into pivot rules from Python. Alas, no luck.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants