Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

StackOverflowError in summarysize during misc test #19707

Closed
nalimilan opened this issue Dec 24, 2016 · 45 comments · Fixed by #20161
Closed

StackOverflowError in summarysize during misc test #19707

nalimilan opened this issue Dec 24, 2016 · 45 comments · Fixed by #20161
Labels
heisenbug This bug occurs unpredictably

Comments

@nalimilan
Copy link
Member

I see new failures when building Fedora/RHEL RPM nightlies. These have bee, introduced since d47f24b (from Dec. 22).

Error in testset misc:
Error During Test
  Test threw an exception of type StackOverflowError
  Expression: $(Expr(:escape, :(summarysize(Core)))) $(Expr(:escape, :>)) $(Expr(:escape, :(summarysize(Core.Inference)))) $(Expr(:escape, :>)) $(Expr(:escape, :(Core.sizeof(Core))))
  StackOverflowError:
Error in testset misc:
Error During Test
  Test threw an exception of type StackOverflowError
  Expression: summarysize(Base) > 10000 * sizeof(Int)
  StackOverflowError:
Error in testset ambiguous:
Error During Test
  Expression evaluated to non-Boolean
  Expression: ((x->begin 
        isempty(x) || println(x)
    end))(detect_ambiguities(Core,Base; imported=true))
       Value: nothing

https://copr-be.cloud.fedoraproject.org/results/nalimilan/julia-nightlies/fedora-25-x86_64/00492486-julia/build.log.gz

@ViralBShah ViralBShah added this to the 0.6.0 milestone Dec 24, 2016
@tkelman tkelman removed this from the 0.6.0 milestone Dec 24, 2016
@colbec
Copy link

colbec commented Dec 24, 2016

Also seen in openSUSE Leap 42.2.

@tkelman tkelman added the needs more info Clarification or a reproducible example is required label Dec 24, 2016
@tkelman
Copy link
Contributor

tkelman commented Dec 24, 2016

Does this happen in a default source build, or only when you're building against some system dependencies? Scroll up to where the ambiguity test ran and it should be printing the list of ambiguous methods.

@colbec
Copy link

colbec commented Dec 25, 2016

Not sure what you mean by "list of ambiguous methods." In:

Julia Version 0.6.0-dev.1698
Commit 4793e88* (2016-12-25 07:38 UTC)
Platform Info:
  OS: Linux (x86_64-suse-linux)
  CPU: Intel(R) Core(TM) i5-4460  CPU @ 3.20GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

terminal output that mentions "ambiguous" (selectively) is:

threads (2)               |   38.54  |  1.34  |  3.5 | 83.29      | 2158.79 
error (2)                 |    0.64  |  0.00  |  0.0 |  2.63      | 2158.79 
WARNING: Method definition ambig(Any, Integer) in module TestMain_ambiguous at /home/colin/Downloads/julia6/test/ambiguous.jl:7 overwritten at /home/colin/Downloads/julia6/test/ambiguous.jl:84.
boundscheck (5)           |    2.45  |  0.00  |  0.0 |  4.55      | 1654.04 
cartesian (5)             |    0.04  |  0.00  |  0.0 |  0.10      | 1656.18 
	From worker 2:	Skipping Base.<|
	From worker 2:	Skipping Base.>:
	From worker 2:	Skipping Base.active_repl
	From worker 2:	Skipping Base.active_repl_backend
asmvariant (5)            |    0.05  |  0.00  |  0.0 |  1.33      | 1658.32 
osutils (5)               |    0.02  |  0.00  |  0.0 |  0.24      | 1660.92 
ambiguous (2)             |    1.01  |  0.05  |  5.1 | 33.64      | 2158.79 
channels (5)              |    8.77  |  0.75  |  8.6 | 446.92     | 1877.34 
dsp (5)                   |    6.67  |  0.00  |  0.0 | 319.38     | 1877.34 
fft (2)                   |   16.24  |  0.79  |  4.9 | 767.73     | 2158.79 
examples (5)              |   12.14  |  0.00  |  0.0 | 679.32     | 2485.93 
parallel (3)              |   62.43  |  0.00  |  0.0 |  1.43      | 1100.48 
cmdlineargs (4)           |  102.76  |  0.05  |  0.1 | 50.24      | 1121.77 
	From worker 1:	Worker 3 failed running test misc:
Some tests did not pass: 1279452 passed, 0 failed, 2 errored, 0 broken.
 in finish at ./test.jl:524
 in macro expansion; at ./util.jl:288 [inlined]
 in macro expansion; at /home/colin/Downloads/julia6/test/testdefs.jl:0 [inlined]
 in anonymous at ./<missing>:?
 in eval at ./boot.jl:236
 in runtests at /home/colin/Downloads/julia6/test/testdefs.jl:16
 in runtests at /home/colin/Downloads/julia6/test/testdefs.jl:4
 in #586 at ./multi.jl:1455
 in run_work_thunk at ./multi.jl:1004
 in macro expansion at ./multi.jl:1455 [inlined]
 in #585 at ./event.jl:66misc: Error During Test
  Test threw an exception of type StackOverflowError
  Expression: $(Expr(:escape, :(summarysize(Core)))) $(Expr(:escape, :>)) $(Expr(:escape, :(summarysize(Core.Inference)))) $(Expr(:escape, :>)) $(Expr(:escape, :(Core.sizeof(Core))))
  StackOverflowError:
misc: Error During Test
  Test threw an exception of type StackOverflowError
  Expression: summarysize(Base) > 10000 * sizeof(Int)
  StackOverflowError:

Test Summary:           |     Pass  Error  Broken     Total
  Overall               | 20030733      2      29  20030764

followed by:

osutils               |       21                       21
ambiguous             |       42                       42
channels              |       32                       32

Let me know where I should be looking.

@tkelman
Copy link
Contributor

tkelman commented Dec 25, 2016

Looks like you only get the stack overflow then, not the Expression evaluated to non-Boolean from the ambiguity test. Does the stack overflow happen if you just run make test-misc on its own?

@colbec
Copy link

colbec commented Dec 25, 2016

I have seen the non-Boolean error as well.
Both make test-misc and make test-ambiguous run successfully without error when run by themselves.

@tkelman
Copy link
Contributor

tkelman commented Dec 25, 2016

Ah, the ambiguity is reproducible when running math and ambiguous on the same worker, it was caused by #19627 and probably just needs a *(::MockQuantity, ::Bool) method to be defined. I can't reproduce the misc failure on ubuntu with make testall1 so we'll need to keep debugging that. What does Base.summarysize(Base) return for you from a normal REPL? If you add @show summarysize(Base) around line 189 of test/misc.jl, what value does it output for test-misc by itself vs a failing run?

@colbec
Copy link

colbec commented Dec 25, 2016

julia> Base.summarysize(Base)
41035484

I inserted the @show statement after line with import Base.summarysize

> make test-misc
    JULIA test/misc
Test (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB)
summarysize(Base) = 41465530
misc (1)      |    8.42  |  0.86  | 10.3 | 1177.03    | 363.44  

Test Summary: |    Pass    Total
  Overall     | 1279454  1279454
    SUCCESS

Strangely, I can't see the string "summarysize" anywhere in the output from make testall which consistently fails with the same output.

@tkelman
Copy link
Contributor

tkelman commented Dec 25, 2016

probably because the show statement is triggering the stack overflow too, but only if some subset of tests have run previously on the same worker. running JULIA_CPU_CORES=1 ./julia test/runtests.jl foo bar etc with the list of tests that ran previously on the same worker as misc should reproduce it too? And possibly allow some reducing of the test case to a smaller subset?

@colbec
Copy link

colbec commented Dec 25, 2016

OK here is my progress: thinking that the error might have been triggered by something recent in the misc worker I ran

> JULIA_CPU_CORES=1 julia6 test/runtests.jl cmdlineargs enums markdown nullable grisu llvmcall repl show broadcast goto reflection mpfr pollfd resolve version read spawn math copy blas vecelement simdloop intfuncs random offsetarray staged iobuffer hashing dict triplequote char printf sparse/sparse

which, disappointingly, run successfully. I may be shooting myself in the foot by testing in roughly the reverse order of testall, I don't know. I'll keep expanding the list.

@tkelman
Copy link
Contributor

tkelman commented Dec 25, 2016

it should include misc at the end to reproduce the stack overflow, and may be order dependent

@colbec
Copy link

colbec commented Dec 25, 2016

Right, thanks for noticing the missing misc - rather important. Adding the full complement of tests which ran on the same worker and misc at the end triggers the failure. Order does not seem to be important since my tests ran in reverse order. I will now go back through previous tests and report back.

@colbec
Copy link

colbec commented Dec 25, 2016

Ok this is getting complicated.
> JULIA_CPU_CORES=1 julia6 test/runtests.jl math copy blas vecelement simdloop intfuncs random offsetarray staged iobuffer hashing dict triplequote char printf sparse/sparse linalg/matmul linalg/svd linalg/tridiag linalg/pinv linalg/givens linalg/arnoldi linalg/cholesky linalg/hessenberg misc
produces the error but
> JULIA_CPU_CORES=1 julia6 test/runtests.jl cmdlineargs enums markdown nullable grisu llvmcall repl show broadcast goto reflection mpfr pollfd resolve version read spawn math misc
does not.
So clearly I have to split the first (which failed), and I did this on the sparse test. Then ran both sides and both ran to success.
Please correct my thinking, which is that maybe there are two or more tests, at least one from each of both sides of the sparse test, which in combination with misc triggers the failure.

@tkelman
Copy link
Contributor

tkelman commented Dec 25, 2016

What I usually do with this kind of thing is (write a script to) remove one entry from the list at a time, if reproducing the failure happens without it, leave it off and try another. If there's no failure without one of the entries, mark it as required and don't try removing it again. Stop when removing any one more entry doesn't reproduce the failure any more.

@colbec
Copy link

colbec commented Dec 25, 2016

Just in case it is meaningful, while running these tests there is a quantity of output which does not occur when running make testall, here is a sample (this appears to be repeated over and over and fills my terminal buffer scrolling other information out of view):

in summarysize(::Any, ::ObjectIdDict, ::Type{T}) at ./interactiveutil.jl:741
   in summarysize(::Array{Any,1}, ::ObjectIdDict, ::Type{T}) at ./interactiveutil.jl:768
   in _summarysize(::Any, ::ObjectIdDict, ::Type{T}) at ./interactiveutil.jl:752
   in summarysize(::Any, ::ObjectIdDict, ::Type{T}) at ./interactiveutil.

@colbec
Copy link

colbec commented Dec 25, 2016

There's something odd about triplequote.
> JULIA_CPU_CORES=1 julia6 test/runtests.jl linalg/pinv linalg/matmul offsetarray linalg/tridiag linalg/givens linalg/arnoldi triplequote math linalg/cholesky sparse/sparse linalg/svd misc
fails, but if I remove triplequote then it succeeds. Mean anything?

@colbec
Copy link

colbec commented Dec 26, 2016

After an update to Version 0.6.0-dev.1701 (2016-12-26 06:25 UTC), make testall still fails but no longer mentions errors related to stack overflow.

parallel (5)              |   62.45  |  0.00  |  0.0 |  5.32      | 1579.21 
cmdlineargs (2)           |  102.13  |  0.10  |  0.1 | 53.52      | 3090.21 
	From worker 1:	Worker 3 failed running test ambiguous:
Some tests did not pass: 41 passed, 0 failed, 1 errored, 0 broken.
 in finish at ./test.jl:524
 in macro expansion; at ./util.jl:288 [inlined]
 in macro expansion; at /home/colin/Downloads/julia6/test/testdefs.jl:0 [inlined]
 in anonymous at ./<missing>:?
 in eval at ./boot.jl:236
 in runtests at /home/colin/Downloads/julia6/test/testdefs.jl:16
 in runtests at /home/colin/Downloads/julia6/test/testdefs.jl:4
 in #586 at ./multi.jl:1455
 in run_work_thunk at ./multi.jl:1004
 in macro expansion at ./multi.jl:1455 [inlined]
 in #585 at ./event.jl:66ambiguous: Error During Test
  Expression evaluated to non-Boolean
  Expression: ((x->begin 
        isempty(x) || println(x)
    end))(detect_ambiguities(Core,Base; imported=true))
       Value: nothing

Test Summary:           |     Pass  Error  Broken     Total
  Overall               | 20027342      1      29  20027372
...
  cmdlineargs           |      121                      121
  compile               |       61                       61
    FAILURE
Error in testset ambiguous:
Error During Test
  Expression evaluated to non-Boolean
  Expression: ((x->begin 
        isempty(x) || println(x)
    end))(detect_ambiguities(Core,Base; imported=true))
       Value: nothing
ERROR: LoadError: 
 in (::##33#41)() at /home/colin/Downloads/julia6/test/runtests.jl:195
...

FYI here is the new set of tests that produce the failure:
> JULIA_CPU_CORES=1 julia6 test/runtests.jl linalg/qr linalg/schur linalg/special linalg/bunchkaufman linalg/bidiag linalg/lu linalg/lq sparse/sparsevector dates/rounding dates/types dates/io dates/arithmetic core arrayops copy math spawn read resolve broadcast lineedit replcompletions replutil sets goto grisu meta stacktraces profile docs markdown base64 serialize threads libdl int intset floatfuncs inline boundscheck error cartesian asmvariant osutils channels fft ambiguous

@colbec
Copy link

colbec commented Dec 26, 2016

FWIW for me the above test set reduces to
JULIA_CPU_CORES=1 julia6 test/runtests.jl math ambiguous which fails.
Remove math and it runs successfully, make test-math is also successful.

@tkelman
Copy link
Contributor

tkelman commented Dec 26, 2016

Ah, the ambiguity is reproducible when running math and ambiguous on the same worker, it was caused by #19627 and probably just needs a *(::MockQuantity, ::Bool) method to be defined.

That one should be simple to fix. The stack overflow is the more interesting part of this report. make testall won't be exactly deterministic in terms of which tests run on which worker, since it's timing-dependent. If you run JULIA_CPU_CORES=1 julia6 test/runtests.jl linalg/pinv linalg/matmul offsetarray linalg/tridiag linalg/givens linalg/arnoldi triplequote math linalg/cholesky sparse/sparse linalg/svd misc again on latest master does the stack overflow still happen? Can anything else on that list be removed and still trigger the stack overflow? I'm guessing maybe something is interfering with inference of an element type somewhere and resulting in an Array{Any,1} that's actually supposed to have a more specific element type, but I don't know what would cause that.

@ViralBShah ViralBShah added the system:linux Affects only Linux label Dec 26, 2016
@colbec
Copy link

colbec commented Dec 26, 2016

The command:
JULIA_CPU_CORES=1 julia6 test/runtests.jl linalg/pinv linalg/matmul offsetarray linalg/tridiag linalg/givens linalg/arnoldi triplequote math linalg/cholesky sparse/sparse linalg/svd misc
now runs successfully on v. 1701.

@tkelman
Copy link
Contributor

tkelman commented Dec 26, 2016

What version did it last fail at? Did it happen reliably every time, or is it maybe non-deterministic?

@colbec
Copy link

colbec commented Dec 26, 2016

Last seen in v 1698 as reported above. Failed every time that I could see, but of course the sample size is very small.

@tkelman
Copy link
Contributor

tkelman commented Dec 26, 2016

So that was 4793e88, and there's only been one commit to master since then, changing gmp and curl versions.

@nalimilan
Copy link
Member Author

FWIW, I still get the stack overflow with latest master. Thanks for investigating this @colbec!

@colbec
Copy link

colbec commented Dec 27, 2016

@nalimilan Yes the latest master still fails with overflow when using the full test set.
Here is my latest single core set which causes fail:
> JULIA_CPU_CORES=1 julia6 test/runtests.jl linalg/matmul linalg/bunchkaufman linalg/bidiag linalg/symmetric linalg/lq sparse/sparsevector dates/ranges dates/arithmetic inference worlds keywordargs printf char triplequote dict hashing iobuffer staged offsetarray tuple random intfuncs simdloop blas copy math loading bigint spawn read mmap broadcast mod2pi show sets llvmcall grisu meta stacktraces profile docs markdown threads inline error cartesian asmvariant osutils channels dsp examples misc
The voluminous output from interactiveutil.jl is inconvenient in testing since it scrolls off useful information regarding testing times. I am working on the idea that if the testing reports that x test took the most time then I target that one for next removal, which in theory should speed up the next test.

@colbec
Copy link

colbec commented Dec 27, 2016

Here's another observation:
if I select the linalg tests from the above set and run them followed by misc then it succeeds (this is consistently repeatable).
The set of not(linalg/*) entries followed by misc also succeeds (this is consistently repeatable).
Now taking the not(linalg/*) set and adding back linalg/matmul causes fail (repeatable).

Edit: but adding back linalg/bunchkaufman but not matmul sometimes succeeds, sometimes fails, about 50-50 from a small sample.
Edit 2: when adding back a linalg test which subsequently fails, the testing process does not end with "FAILURE" message, but continues to flood output and stops at

ERROR: LoadError: 
 in (::##33#41)() at /home/colin/Downloads/julia6/test/runtests.jl:195
 in cd(::##33#41, ::String) at ./file.jl:69
 in include_from_node1(::String) at ./loading.jl:532
 in include(::String) at ./sysimg.jl:14
 in process_options(::Base.JLOptions) at ./client.jl:293
 in _start() at ./client.jl:357
while loading /home/colin/Downloads/julia6/test/runtests.jl, in expression starting on line 38

after the usual flood of interactiveutil.jl stuff.

@colbec
Copy link

colbec commented Dec 30, 2016

Update: I am now on version 1781 after significant update overnight and make testall still fails.
Excerpts:

	From worker 1:	Worker 1 failed running test broadcast:
UndefVarError: TestMain_broadcast not defined
 in deserialize(::Base.ClusterSerializer{TCPSocket}, ::Type{Module}) at ./serialize.jl:601
 in handle_deserialize(::Base.ClusterSerializer{TCPSocket}, ::Int32) at ./serialize.jl:580
... and repeats ... ...

then

 in #555 at ./event.jl:66misc: Error During Test
  Test threw an exception of type StackOverflowError
  Expression: $(Expr(:escape, :(summarysize(Core)))) $(Expr(:escape, :>)) $(Expr(:escape, :(summarysize(Core.Inference)))) $(Expr(:escape, :>)) $(Expr(:escape, :(Core.sizeof(Core))))
  StackOverflowError:
misc: Error During Test
  Test threw an exception of type StackOverflowError
  Expression: summarysize(Base) > 10000 * sizeof(Int)

@tkelman
Copy link
Contributor

tkelman commented Dec 30, 2016

the broadcast issue is unrelated and will be fixed by #19745

@colbec
Copy link

colbec commented Jan 1, 2017

@nalimilan @tkelman make testall succeeds in v 0.6 1838 on openSUSE. Close issue now?

@tkelman
Copy link
Contributor

tkelman commented Jan 1, 2017

I saw this a couple of times on win64 yesterday. It's intermittent, but I think it's still there.

@colbec
Copy link

colbec commented Jan 1, 2017

Yes, it is, I just ran the same version again and it failed. Pity I did not get a copy of the output, it might have been useful for comparison. I'll try again.

@tkelman tkelman removed the system:linux Affects only Linux label Jan 1, 2017
@tkelman tkelman changed the title New failures when building RPM nightlies StackOverflowError in summarysize during misc test Jan 1, 2017
@tkelman tkelman added heisenbug This bug occurs unpredictably and removed needs more info Clarification or a reproducible example is required labels Jan 1, 2017
@colbec
Copy link

colbec commented Jan 1, 2017

I ran three more make testall with the same version of v 0.6, there were 2 fails and the last succeeded.
FWIW here is the list of tests that ran on the same worker as misc on the successful run:
linalg/triangular sparse/higherorderfns sparse/spqr strings/io unicode/utf8 dates/query dates/periods inference worlds numbers abstractarray operators ccall bigfloat sorting version resolve pollfd mpfr broadcast combinatorics rounding euler lineedit replcompletions repl goto llvmcall grisu nullable base64 misc inline boundscheck cartesian asmvariant osutils channels dsp
I'm open to suggestions how this info might be useful in comparison with previous observed fails.

@tkelman
Copy link
Contributor

tkelman commented Jan 1, 2017

the dispatch loop you posted in #19707 (comment) is the best lead so far. Maybe we should try to find what inputs are causing that loop?

@maleadt
Copy link
Member

maleadt commented Jan 6, 2017

I'm also seeing this issue on a bot running testall1, but cannot reproduce with the sets of tests reported by @colbec in here. I'm running a bisect script now, lets hope it yields something useful over the weekend.

@colbec
Copy link

colbec commented Jan 6, 2017

I too am having troubles repeating some of my own tests, including the short test that generated the loop mentioned by @tkelman .
One that does fail pretty consistently including the loop is:

JULIA_CPU_CORES=1 julia6 test/runtests.jl linalg/matmul linalg/bunchkaufman linalg/bidiag linalg/symmetric linalg/lq sparse/sparsevector dates/ranges dates/arithmetic inference worlds keywordargs printf char triplequote dict hashing iobuffer staged offsetarray tuple random intfuncs simdloop blas copy math loading bigint spawn read mmap broadcast mod2pi show sets llvmcall grisu meta stacktraces profile docs markdown threads inline error cartesian asmvariant osutils channels dsp examples misc > miscfail.txt
WARNING: Method definition f265a(Any) in module TestMain_worlds at /home/colin/Downloads/julia6/test/worlds.jl:12 overwritten at /home/colin/Downloads/julia6/test/worlds.jl:17.
Warning: threaded loop executed in order
ERROR: LoadError: 
Stacktrace:
 [1] (::##33#41)() at /home/colin/Downloads/julia6/test/runtests.jl:195
 [2] cd(::##33#41, ::String) at ./file.jl:69
 [3] include_from_node1(::String) at ./loading.jl:532
 [4] include(::String) at ./sysimg.jl:14
 [5] process_options(::Base.JLOptions) at ./client.jl:308
 [6] _start() at ./client.jl:374
while loading /home/colin/Downloads/julia6/test/runtests.jl, in expression starting on line 38

I have saved the entire output of the fail to a file (size 12 MB) and can upload this if it is of use. It is extremely repetitive.
One new detail I had not seen before is

...
dsp (1)                 |    6.38  |  0.55  |  8.6 | 302.97     | 1006.70 
examples (1)            |   11.88  |  0.51  |  4.3 | 672.37     | 1006.70 
misc: Error During Test
  Test threw an exception of type StackOverflowError
  Expression: $(Expr(:escape, :(summarysize(Core)))) $(Expr(:escape, :>)) $(Expr(:escape, :(summarysize(
  StackOverflowError:
  Stacktrace:
   [1] setindex!(::ObjectIdDict, ::Any, ::Any) at ./associative.jl:310    
   [2] _summarysize(::Any, ::ObjectIdDict, ::Type{T}) at ./interactiveutil.jl:745
   [3] summarysize(::Any, ::ObjectIdDict, ::Type{T}) at ./interactiveutil.jl:741
...

with the reference to associative.jl.

@maleadt
Copy link
Member

maleadt commented Jan 6, 2017

Oh yeah that set reproduces it. I'll use it as a starting point for my bisect, let's hope I can shave some more tests of.

@barche
Copy link
Contributor

barche commented Jan 6, 2017

I get this on Arch linux too, using the following Make.user:

USE_SYSTEM_LLVM:=1
USE_SYSTEM_LIBUNWIND:=1
USE_SYSTEM_PCRE:=1
USE_SYSTEM_LIBM:=1
USE_SYSTEM_OPENLIBM:=1
USE_SYSTEM_OPENSPECFUN:=1
USE_SYSTEM_DSFMT:=0
USE_SYSTEM_BLAS:=0
USE_SYSTEM_LAPACK:=0
USE_SYSTEM_FFTW:=1
USE_SYSTEM_GMP:=1
USE_SYSTEM_MPFR:=1
USE_SYSTEM_ARPACK:=1
USE_SYSTEM_SUITESPARSE:=1
USE_SYSTEM_LIBUV:=0
USE_SYSTEM_UTF8PROC:=0
USE_SYSTEM_LIBGIT2:=1
USE_SYSTEM_PATCHELF:=1
USE_INTEL_MKL:=0
USE_BLAS64:=0
USE_LLVM_SHLIB:=0

@timholy
Copy link
Member

timholy commented Jan 6, 2017

StackOverflowErrors are often annoying. If it's easier than trapping a difficult-to-catch error...I have noticed that many arise from constructs like this:

foo{T}(x::T, y::T) = # the real implementation of foo
foo(x, y) = foo(promote(x, y)...)

The problem arises because promote(x, y) has a default of returning (x, y) when they can't be promoted to a common type, leading that second line to call itself recursively. I've recently become a big fan of "defensive programming" like this:

foo(x, y) = _foo(promote(x, y)...) # note the underscore
_foo{T}(x::T, y::T) = foo(x, y)
_foo(x, y) = throw_promoteerror(x, y)

@noinline function throw_promoteerror(x, y)
    throw(ArgumentError("$x::$(typeof(x)) and $y::$(typeof(y)) cannot be promoted to a common type"))
end

I'd argue we should do this systematically throughout Base. This may be one of those cases where it would be faster to find the bug by changing any possibly-problematic lines to the more defensive form, since the error message will tell you a lot about what's going on.

@maleadt
Copy link
Member

maleadt commented Jan 9, 2017

It seems to be nondeterministic even with 1 worker... The set below definitely does make it crash, but takes many attempts to reproduce (I'm testing on cbc6670):

$ JULIA_CPU_CORES=1 ./julia test/runtests.jl linalg/matmul linalg/bidiag linalg/symmetric sparse/sparsevector inference worlds printf dict offsetarray random math spawn broadcast stacktraces profile markdown dsp misc
Test (Worker)           | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB)
linalg/matmul (1)       |   67.16  |  0.81  |  1.2 | 2154.20    | 293.20
linalg/bidiag (1)       |   37.77  |  0.73  |  1.9 | 1197.66    | 362.35
linalg/symmetric (1)    |   15.78  |  0.25  |  1.6 | 469.39     | 374.47
sparse/sparsevector (1) |   81.53  |  1.19  |  1.5 | 2017.42    | 486.43
inference (1)           |    1.16  |  0.01  |  1.3 | 32.51      | 490.12
WARNING: Method definition f265a(Any) in module TestMain_worlds at /tmp/julia/test/worlds.jl:12 overwritten at /tmp/julia/test/worlds.jl:17.
worlds (1)              |    0.91  |  0.02  |  1.7 | 19.14      | 495.35
printf (1)              |   15.55  |  0.32  |  2.1 | 248.24     | 497.46
dict (1)                |    9.33  |  0.23  |  2.5 | 343.33     | 512.73
offsetarray (1)         |   39.76  |  0.96  |  2.4 | 1079.01    | 668.10
random (1)              |    9.13  |  0.20  |  2.2 | 273.29     | 685.34
math (1)                |   21.47  |  0.53  |  2.5 | 374.49     | 690.46
       [stdio passthrough ok]
spawn (1)               |   19.43  |  0.19  |  1.0 | 206.89     | 709.43
broadcast (1)           |   21.40  |  1.18  |  5.5 | 683.05     | 717.90
stacktraces (1)         |    1.42  |  0.04  |  2.9 | 31.47      | 721.82
profile (1)             |    3.08  |  0.06  |  2.0 | 59.57      | 723.25
markdown (1)            |    3.63  |  0.08  |  2.3 | 87.71      | 724.30
dsp (1)                 |    7.04  |  0.28  |  4.0 | 313.35     | 728.03
misc: Error During Test
  Test threw an exception of type StackOverflowError

There also seem to be 2 slightly different errors:

misc: Error During Test
  Test threw an exception of type StackOverflowError
  Expression: $(Expr(:escape, :(summarysize(Core)))) $(Expr(:escape, :>)) $(Expr(:escape, :(summarysize(Core.Inference)))) $(Expr(:escape, :>)) $(Expr(:escape, :(Core.sizeof(Core))))

and

misc: Error During Test
  Test threw an exception of type StackOverflowError
  Expression: summarysize(Base) > 10000 * sizeof(Int)

both when executing the misc test.

I guess I'll try and catch one with rr to see if I can gather some more details.

@timholy
Copy link
Member

timholy commented Jan 9, 2017

I played with this too. You were much more successful than I in trimming it down. (I think I was able to get rid of printf but I had many others, perhaps it's not a specific combination.) My one other observation is that it's coupled to the summarysize method for TypeMapEntry, and inserting

@show typeof(m)
flush(STDOUT)

sometimes results in

typeof(m) = ERROR...

so it seems to be some kind of object corruption.

@colbec
Copy link

colbec commented Jan 9, 2017

Fools rush in: to try to force an early error I tried multiple testing misc in the following way (edit3 this is a red herring, removing the references to warnonce and testonce from misc leads to success, but the overall failure persists):

JULIA_CPU_CORES=1 julia6 test/runtests.jl linalg/matmul misc linalg/bunchkaufman misc linalg/bidiag linalg/symmetric linalg/lq misc

Edit: in fact to get this I just have to say JULIA_CPU_CORES=1 julia6 test/runtests.jl misc misc
Edit2: JULIA_CPU_CORES=1 julia6 test/runtests.jl dsp dsp is successful, as is doubling linalg/matmul

I switch off line 5 in runtests.jl (which ensures that the tests array set is unique), but on second run of misc it is very unhappy and outputs:

Test (Worker)           | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB)
linalg/matmul (1)       |   59.26  |  0.68  |  1.1 | 2193.33    | 296.49  
OK to here
misc (1)                |    9.81  |  0.82  |  8.4 | 1219.46    | 406.40  
linalg/bunchkaufman (1) |   20.26  |  0.22  |  1.1 | 699.76     | 429.01  
WARNING: replacing module TestMain_misc
OK to here
misc: Test Failed
  Expression: contains(sprint((io->begin 
                Base.warn_once(io,"test")
            end)),"WARNING: test")
Stacktrace:
 [1] include_from_node1(::String) at ./loading.jl:532
 [2] include(::String) at ./sysimg.jl:14
 [3] macro expansion; at /home/colin/Downloads/julia6/test/testdefs.jl:13 [inlined]
...
Test Summary:         |    Pass  Fail    Total
  Overall             | 6403639    12  6403651
  linalg/matmul       |     527            527
  misc                | 1279505        1279505
  linalg/bunchkaufman |    2219           2219
  misc                | 1279502     3  1279505
  linalg/bidiag       |    1232           1232
  misc                | 1279502     3  1279505
  linalg/symmetric    |    1328           1328
  misc                | 1279502     3  1279505
  linalg/lq           |     820            820
  misc                | 1279502     3  1279505
    FAILURE

@colbec
Copy link

colbec commented Jan 9, 2017

In Version 0.6.0-dev.1997 I believe I see a consistent pattern of success/fail with the following set:

JULIA_CPU_CORES=1 julia6 test/runtests.jl linalg/matmul linalg/bunchkaufman linalg/bidiag linalg/symmetric linalg/lq sparse/sparsevector dates/ranges dates/arithmetic inference worlds keywordargs printf char triplequote dict hashing iobuffer staged offsetarray tuple random intfuncs simdloop blas copy math loading bigint spawn misc

This set fails, but if misc runs before spawn then it succeeds.
I will try digging into spawn to see if I can isolate a section which will allow a succeed.

@colbec
Copy link

colbec commented Jan 9, 2017

In my previous comment I noted an issue with the order of spawn and misc.
Here is my testing process for examining spawn: 1. place an end statement at the end of the file and then place an if false statement above it in a convenient location to wrap the end of the file in an if block to ensure it will not be executed, and 2. slide the if false up and down to try to find a sensitive spot.
As we know this is an intermittent error which maybe explains my bizarre results. At about line 80 in the spawn tests there is a section which begins @test success(truecmd); I can place the sliding if false before this line and the tests mostly pass. If I place the if false before the STDIN redirection a few lines later it mostly fails with the stackoverflow error. This is about as close as I can come so far.

I tried commenting out the @test_broken lines and it still failed. However I remain suspicious that there is maybe a problem in the first 80+ lines of this file after the cmd definitions. The individual commands all seem to pass in the REPL. Of course it could be something else entirely.

@vtjnash
Copy link
Member

vtjnash commented Jan 9, 2017

I wonder if it could be related to / fixed by #19590

@nalimilan
Copy link
Member Author

@vtjnash I've just tried, and I got the same error.

@nalimilan
Copy link
Member Author

Confirmed it's fixed on latest master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
heisenbug This bug occurs unpredictably
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants