StackOverflowError in summarysize during misc test #19707
Comments
Also seen in openSUSE Leap 42.2. |
Does this happen in a default source build, or only when you're building against some system dependencies? Scroll up to where the ambiguity test ran and it should be printing the list of ambiguous methods. |
Not sure what you mean by "list of ambiguous methods." In:
terminal output that mentions "ambiguous" (selectively) is:
followed by:
Let me know where I should be looking. |
Looks like you only get the stack overflow then, not the |
I have seen the non-Boolean error as well. |
Ah, the ambiguity is reproducible when running |
I inserted the @show statement after line with
Strangely, I can't see the string "summarysize" anywhere in the output from |
probably because the show statement is triggering the stack overflow too, but only if some subset of tests have run previously on the same worker. running |
OK here is my progress: thinking that the error might have been triggered by something recent in the misc worker I ran
which, disappointingly, ran successfully. I may be shooting myself in the foot by testing in roughly the reverse order of testall, I don't know. I'll keep expanding the list. |
it should include misc at the end to reproduce the stack overflow, and may be order dependent |
Right, thanks for noticing the missing misc - rather important. Adding the full complement of tests which ran on the same worker and misc at the end triggers the failure. Order does not seem to be important since my tests ran in reverse order. I will now go back through previous tests and report back. |
Ok this is getting complicated. |
What I usually do with this kind of thing is (write a script to) remove one entry from the list at a time. If the failure still reproduces without that entry, leave it off and try another. If the failure does not reproduce without it, mark the entry as required and don't try removing it again. Stop when removing any one more entry no longer reproduces the failure. |
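The one-at-a-time reduction described above can be sketched as follows. This is only an illustration under stated assumptions: `minimize_tests` and the `still_fails` predicate are hypothetical names, not part of Julia's test runner, and the sketch uses current Julia syntax (`popfirst!`). A real predicate would launch the test runner on the given list and report whether the stack overflow appeared; for an intermittent failure it would probably need to retry several times before answering `false`.

```julia
# Hypothetical helper: shrink a list of test names to a subset that still fails.
# `still_fails(tests)` is an assumed caller-supplied predicate that returns true
# when running exactly `tests` (in that order) reproduces the failure.
function minimize_tests(tests::Vector{String}, still_fails)
    required = String[]        # entries shown to be necessary
    candidates = copy(tests)   # entries not yet examined
    while !isempty(candidates)
        entry = popfirst!(candidates)
        # Everything except `entry`, in the original relative order.
        trial = vcat(required, candidates)
        if !still_fails(trial)
            # No failure without it: mark as required, never remove it again.
            push!(required, entry)
        end
        # Otherwise the failure reproduces without `entry`, so it stays off.
    end
    return required
end

# Toy predicate for demonstration only: pretend the failure needs both
# "spawn" and "misc" to be present.
fake_fails(tests) = ("spawn" in tests) && ("misc" in tests)

reduced = minimize_tests(["linalg/matmul", "spawn", "triplequote", "misc"], fake_fails)
println(reduced)
```

Because removed entries are dropped from the front and required ones are kept in their original order, a test that has to run last (such as misc here) stays last in every trial.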
Just in case it is meaningful, while running these tests there is a quantity of output which does not occur when running
|
There's something odd about triplequote. |
After an update to Version 0.6.0-dev.1701 (2016-12-26 06:25 UTC),
FYI here is the new set of tests that produce the failure: |
FWIW for me the above test set reduces to |
That one should be simple to fix. The stack overflow is the more interesting part of this report. |
The command: |
What version did it last fail at? Did it happen reliably every time, or is it maybe non-deterministic? |
Last seen in v 1698 as reported above. Failed every time that I could see, but of course the sample size is very small. |
So that was 4793e88, and there's only been one commit to master since then, changing gmp and curl versions. |
FWIW, I still get the stack overflow with latest master. Thanks for investigating this @colbec! |
@nalimilan Yes the latest master still fails with overflow when using the full test set. |
Here's another observation: Edit: but adding back linalg/bunchkaufman but not matmul sometimes succeeds, sometimes fails, about 50-50 from a small sample.
after the usual flood of interactiveutil.jl stuff. |
Update: I am now on version 1781 after significant update overnight and make testall still fails.
then
|
the broadcast issue is unrelated and will be fixed by #19745 |
@nalimilan @tkelman |
I saw this a couple of times on win64 yesterday. It's intermittent, but I think it's still there. |
Yes, it is, I just ran the same version again and it failed. Pity I did not get a copy of the output, it might have been useful for comparison. I'll try again. |
I ran three more |
the dispatch loop you posted in #19707 (comment) is the best lead so far. Maybe we should try to find what inputs are causing that loop? |
I'm also seeing this issue on a bot running |
I too am having troubles repeating some of my own tests, including the short test that generated the loop mentioned by @tkelman .
I have saved the entire output of the fail to a file (size 12 MB) and can upload this if it is of use. It is extremely repetitive.
with the reference to associative.jl. |
Oh yeah that set reproduces it. I'll use it as a starting point for my bisect, let's hope I can shave some more tests off. |
I get this on Arch linux too, using the following Make.user:
|
StackOverflowErrors are often annoying. If it's easier than trapping a difficult-to-catch error... I have noticed that many arise from constructs like this:

    foo{T}(x::T, y::T) = # the real implementation of foo
    foo(x, y) = foo(promote(x, y)...)

The problem arises because promote(x, y) is not guaranteed to return two values of the same type; when it doesn't, the generic method keeps calling itself. A more defensive form is:

    foo(x, y) = _foo(promote(x, y)...) # note the underscore
    _foo{T}(x::T, y::T) = foo(x, y)
    _foo(x, y) = throw_promoteerror(x, y)
    @noinline function throw_promoteerror(x, y)
        throw(ArgumentError("$x::$(typeof(x)) and $y::$(typeof(y)) cannot be promoted to a common type"))
    end

I'd argue we should do this systematically throughout Base. This may be one of those cases where it would be faster to find the bug by changing any possibly-problematic lines to the more defensive form, since the error message will tell you a lot about what's going on. |
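For readers who want to try the defensive pattern, here is a small self-contained sketch in current Julia syntax (`where {T}` instead of `foo{T}`). The names `area`, `_area`, and `loose_promote` are hypothetical and chosen only for illustration. `loose_promote` is an assumed stand-in for a promotion step that can hand its arguments back unchanged; recent Julia releases raise an error from `promote` itself in that situation, so the stand-in is needed to exercise the fallback.

```julia
# Assumed stand-in: promote real numbers, return anything else unchanged.
loose_promote(x::Real, y::Real) = promote(x, y)
loose_promote(x, y) = (x, y)

# The real implementation, reached only when both arguments share a type.
area(x::T, y::T) where {T<:Real} = x * y

# Generic entry point: promote, then go through the underscore helper.
area(x, y) = _area(loose_promote(x, y)...)

# Promotion succeeded: hand over to the real implementation.
_area(x::T, y::T) where {T<:Real} = area(x, y)

# Promotion did not produce a common type: fail loudly instead of recursing.
_area(x, y) = throw_promoteerror(x, y)

@noinline function throw_promoteerror(x, y)
    throw(ArgumentError("$x::$(typeof(x)) and $y::$(typeof(y)) cannot be promoted to a common type"))
end

println(area(2, 3.5))   # mixed Int/Float64 arguments are promoted first

try
    area(2, "a")        # no common type: ArgumentError rather than a stack overflow
catch err
    println(typeof(err))
end
```

With the non-defensive form (`area(x, y) = area(loose_promote(x, y)...)`), the second call would recurse until the stack overflowed, which is the failure mode the comment above describes.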
It seems to be nondeterministic even with 1 worker... The set below definitely does make it crash, but takes many attempts to reproduce (I'm testing on cbc6670):
There also seem to be 2 slightly different errors:
and
both when executing the I guess I'll try and catch one with |
I played with this too. You were much more successful than I in trimming it down. (I think I was able to get rid of @show typeof(m)
flush(STDOUT) sometimes results in
so it seems to be some kind of object corruption. |
Fools rush in: to try to force an early error, I tried running misc multiple times in the following way (edit3: this is a red herring; removing the references to warnonce and testonce from misc leads to success, but the overall failure persists):
Edit: in fact to get this I just have to say I switch off line 5 in runtests.jl (which ensures that the tests array set is unique), but on second run of misc it is very unhappy and outputs:
|
In Version 0.6.0-dev.1997 I believe I see a consistent pattern of success/fail with the following set:
This set fails, but if misc runs before spawn then it succeeds. |
In my previous comment I noted an issue with the order of spawn and misc. I tried commenting out the @test_broken lines and it still failed. However I remain suspicious that there is maybe a problem in the first 80+ lines of this file after the cmd definitions. The individual commands all seem to pass in the REPL. Of course it could be something else entirely. |
I wonder if it could be related to / fixed by #19590 |
@vtjnash I've just tried, and I got the same error. |
Confirmed it's fixed on latest master. |
I see new failures when building Fedora/RHEL RPM nightlies. These have been introduced since d47f24b (from Dec. 22).
https://copr-be.cloud.fedoraproject.org/results/nalimilan/julia-nightlies/fedora-25-x86_64/00492486-julia/build.log.gz