Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restore StackOverflow error message for repeated frames #39378

Conversation

NHDaly
Copy link
Member

@NHDaly NHDaly commented Jan 24, 2021

Fixes backtrace printing to display the number of times a frame is
repeated, if there is a frame that's duplicated several times.

julia> function foo() foo() end
foo (generic function with 1 method)

julia> foo()
ERROR: StackOverflowError:
Stacktrace:
 [1] foo() (repeats 79984 times)
   @ Main ./REPL[16]:1

Fixes #37587.

Before #36134, the old code used to print the number of times a frame was repeated, here:
https://github.com/JuliaLang/julia/pull/36134/files#diff-2ad5106fcaed75e73fad677ef1b853010aa244fe4ca3969bc056eab7b9340f63L650

It looks like this count, n, accidentally got stripped out, here:
https://github.com/JuliaLang/julia/pull/36134/files#diff-2ad5106fcaed75e73fad677ef1b853010aa244fe4ca3969bc056eab7b9340f63R778-R779

and then n was always hardcoded as 1:
https://github.com/JuliaLang/julia/pull/36134/files#diff-2ad5106fcaed75e73fad677ef1b853010aa244fe4ca3969bc056eab7b9340f63R579

This PR restores the original behavior to pass n through to print_stackframe(). :)

Fixes backtrace printing to display the number of times a frame is
repeated, if there is a frame that's duplicated several times.

```julia
julia> function foo() foo() end
foo (generic function with 1 method)

julia> foo()
ERROR: StackOverflowError:
Stacktrace:
 [1] foo() (repeats 79984 times)
   @ Main ./REPL[16]:1
```

Fixes JuliaLang#37587.
@NHDaly NHDaly requested a review from c42f January 24, 2021 02:01
@NHDaly NHDaly added bugfix This change fixes an existing bug error messages Better, more actionable error messages labels Jan 24, 2021
Copy link
Member

@Sacha0 Sacha0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Thanks Nathan! :)

@Sacha0
Copy link
Member

Sacha0 commented Jan 24, 2021

(Tentatively tagging for backport given this patch is a bugfix.)

@Sacha0 Sacha0 added the backport 1.6 Change should be backported to release-1.6 label Jan 24, 2021
@c42f
Copy link
Member

c42f commented Jan 25, 2021

Great description of the problem, thanks!

I'm not 100% sure this is the right fix — looking through the code, it seems the previous PR introduced show_reduced_backtrace which uses the n parameter correctly in print_stackframe when the trace is larger than BIG_STACKTRACE_SIZE:

if length(filtered) > BIG_STACKTRACE_SIZE

print_stackframe(io, frame_counter, frame, n, ndigits_max, modulecolordict, modulecolorcycler)

Problem is, it uses length(filtered) to determine this rather than length(trace), leading to the show_full_backtrace() code path further down which incorrectly uses the filtered rather than full list of frames (with misuse of n parameter, as you've fixed).

What I wonder: should the test on length(filtered) be changed so that show_reduced_backtrace is being called? What fancy machinery are we missing out on by not calling that?

@codecov-io
Copy link

codecov-io commented Jan 25, 2021

Codecov Report

Merging #39378 (dbaca8b) into master (0858864) will decrease coverage by 0.10%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #39378      +/-   ##
==========================================
- Coverage   87.29%   87.18%   -0.11%     
==========================================
  Files         388      389       +1     
  Lines       75292    74966     -326     
==========================================
- Hits        65724    65361     -363     
- Misses       9568     9605      +37     
Impacted Files Coverage Δ
base/opaque_closure.jl 22.22% <0.00%> (-77.78%) ⬇️
base/c.jl 55.17% <0.00%> (-13.03%) ⬇️
base/compiler/types.jl 72.72% <0.00%> (-3.28%) ⬇️
base/special/trig.jl 96.94% <0.00%> (-2.74%) ⬇️
stdlib/Logging/src/ConsoleLogger.jl 93.90% <0.00%> (-2.44%) ⬇️
base/compiler/ssair/slot2ssa.jl 85.53% <0.00%> (-2.36%) ⬇️
base/special/log.jl 94.35% <0.00%> (-2.15%) ⬇️
stdlib/Markdown/src/parse/parse.jl 89.36% <0.00%> (-2.13%) ⬇️
base/compiler/inferenceresult.jl 84.84% <0.00%> (-1.82%) ⬇️
base/compiler/ssair/verify.jl 54.81% <0.00%> (-1.49%) ⬇️
... and 110 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 11cbaf6...dced572. Read the comment docs.

@NHDaly
Copy link
Member Author

NHDaly commented Jan 25, 2021

Hmm, i think there are two separate things to discuss here.

The first is the one that I have (i believe) fixed in this PR:

  • The "full backtrace" printer was dropping the number of times any frame is repeated on the floor, and was always reporting them as occurring once.

Before #36134, the old code also had support for separate printing of short and long stack frames, and in the old code, they passed the number of repetitions, n, in both cases:

I believe that this PR restores this functionality for the new organization of the code, and maintains the behavior of the old code. Does that sound correct to you? :) thanks!


Second, you are saying that it uses length(filtered) instead of length(trace) when determining whether to use the reduced or the full backtrace printing:

Problem is, it uses length(filtered) to determine this rather than length(trace), leading to the show_full_backtrace() code path further down which incorrectly uses the filtered rather than full list of frames (with misuse of n parameter, as you've fixed).

I don't think this is incorrect.. It seems that the #36134 preserved the logic that was there before, which also used length(filtered) to determine whether or not to call show_full_backtrace().

I think that length(filtered) is indeed the correct thing to look at to determine whether to call show_full_backtrace(). filtered is the same as the original trace, but with (some?) c functions removed and repeated frames merged. I think this is exactly what we want to look at when determining how to print the backtrace, because e.g. collapsing repeated frames may bring the size under the threshold for full printing, and that's exactly what we want.


So I think #36134 handled the show_full_backtrace() decision correctly, and its only mistake was filtering out the length of repeated frames, and I think this PR fixes that by carrying it through again, which restores the behavior before that PR.

Does that make sense? Thanks @c42f!

@NHDaly NHDaly closed this Jan 25, 2021
@NHDaly NHDaly reopened this Jan 25, 2021
@c42f
Copy link
Member

c42f commented Jan 26, 2021

and its only mistake was filtering out the length of repeated frames, and I think this PR fixes that by carrying it through again, which restores the behavior before that PR.

Sure, thanks for verifying that.

Ok so I think this is fine in principle, it's just the failing tests which need some attention.

  • The linux32 failure seems like a real failure related in some way to the tests added in this PR (something added here, or an underlying problem in the stacktrace machinery?). Cleaned up, the printed stacktraces were:

      Stacktrace:
       [1] single_repeater()
         @ Main.Test94Main_errorshow /buildworker/worker/tester_linux32/build/share/julia/test/errorshow.jl:768
    
      Stacktrace:
       [1] pair_repeater_a()
         @ Main.Test94Main_errorshow /buildworker/worker/tester_linux32/build/share/julia/test/errorshow.jl:769
    
  • The linuxaarch64 failure seems real. Cleaned up, the stacktrace from there was:

      Stacktrace:
       [1] pair_repeater_a()
         @ Main.Test46Main_errorshow /buildworker/worker/tester_linuxaarch64/build/share/julia/test/errorshow.jl:769
       [2] pair_repeater_b()
         @ Main.Test46Main_errorshow /buildworker/worker/tester_linuxaarch64/build/share/julia/test/errorshow.jl:770
    
  • The win32 and win64 tests are a assertion failure from part of the Julia runtime, occuring when running errorshow.jl which seems related to this PR:

        From worker 5:	Assertion failed: lock->owner == jl_thread_self() && "Unlocking a lock in a different thread.", file /cygdrive/c/buildbot/worker/package_win64/build/src/locks.h, line 120
        From worker 5:	
        From worker 5:	signal (22): SIGABRT
        From worker 5:	in expression starting at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\test\errorshow.jl:772
        From worker 5:	crt_sig_handler at /cygdrive/c/buildbot/worker/package_win64/build/src\signals-win.c:93
        From worker 5:	raise at C:\Windows\System32\msvcrt.dll (unknown line)
        From worker 5:	abort at C:\Windows\System32\msvcrt.dll (unknown line)
        From worker 5:	assert at C:\Windows\System32\msvcrt.dll (unknown line)
        From worker 5:	jl_mutex_unlock_nogc at /cygdrive/c/buildbot/worker/package_win64/build/src\locks.h:120 [inlined]
        From worker 5:	jl_eh_restore_state at /cygdrive/c/buildbot/worker/package_win64/build/src\rtutils.c:254
        From worker 5:	eval_body at /cygdrive/c/buildbot/worker/package_win64/build/src\interpreter.c:522
        From worker 5:	eval_body at /cygdrive/c/buildbot/worker/package_win64/build/src\interpreter.c:502
        From worker 5:	eval_body at /cygdrive/c/buildbot/worker/package_win64/build/src\interpreter.c:502
        From worker 5:	jl_interpret_toplevel_thunk at /cygdrive/c/buildbot/worker/package_win64/build/src\interpreter.c:717
        From worker 5:	jl_toplevel_eval_flex at /cygdrive/c/buildbot/worker/package_win64/build/src\toplevel.c:884
        From worker 5:	jl_toplevel_eval_flex at /cygdrive/c/buildbot/worker/package_win64/build/src\toplevel.c:829
        From worker 5:	jl_toplevel_eval at /cygdrive/c/buildbot/worker/package_win64/build/src\toplevel.c:893 [inlined]
        From worker 5:	jl_toplevel_eval_in at /cygdrive/c/buildbot/worker/package_win64/build/src\toplevel.c:936
        From worker 5:	eval at .\boot.jl:369 [inlined]
        From worker 5:	include_string at .\loading.jl:1090
        From worker 5:	_jl_invoke at /cygdrive/c/buildbot/worker/package_win64/build/src\gf.c:2231 [inlined]
        From worker 5:	jl_apply_generic at /cygdrive/c/buildbot/worker/package_win64/build/src\gf.c:2413
        From worker 5:	_include at .\loading.jl:1144
        From worker 5:	include at .\Base.jl:386 [inlined]
        From worker 5:	macro expansion at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\test\testdefs.jl:24 [inlined]
        From worker 5:	macro expansion at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.7\Test\src\Test.jl:1152 [inlined]
        From worker 5:	macro expansion at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\test\testdefs.jl:23 [inlined]
        From worker 5:	macro expansion at .\timing.jl:356 [inlined]
        From worker 5:	#runtests#1 at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\test\testdefs.jl:21
        From worker 5:	runtests##kw at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\test\testdefs.jl:6 [inlined]
        From worker 5:	runtests##kw at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\test\testdefs.jl:6
        From worker 5:	unknown function (ip: 0000000060c23c08)
        From worker 5:	_jl_invoke at /cygdrive/c/buildbot/worker/package_win64/build/src\gf.c:2231 [inlined]
        From worker 5:	jl_apply_generic at /cygdrive/c/buildbot/worker/package_win64/build/src\gf.c:2413
        From worker 5:	jl_apply at /cygdrive/c/buildbot/worker/package_win64/build/src\julia.h:1735 [inlined]
        From worker 5:	do_apply at /cygdrive/c/buildbot/worker/package_win64/build/src\builtins.c:684
        From worker 5:	#106 at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.7\Distributed\src\process_messages.jl:278
        From worker 5:	run_work_thunk at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.7\Distributed\src\process_messages.jl:63
        From worker 5:	macro expansion at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.7\Distributed\src\process_messages.jl:278 [inlined]
        From worker 5:	#105 at .\task.jl:406
        From worker 5:	unknown function (ip: 0000000060c19a23)
        From worker 5:	_jl_invoke at /cygdrive/c/buildbot/worker/package_win64/build/src\gf.c:2231 [inlined]
        From worker 5:	jl_apply_generic at /cygdrive/c/buildbot/worker/package_win64/build/src\gf.c:2413
        From worker 5:	jl_apply at /cygdrive/c/buildbot/worker/package_win64/build/src\julia.h:1735 [inlined]
        From worker 5:	start_task at /cygdrive/c/buildbot/worker/package_win64/build/src\task.c:844
        From worker 5:	Allocations: 1019004694 (Pool: 1018102512; Big: 902182); GC: 1078
    
  • The freebsd64 test looks like a spurious sockets issue (KILLING BY SOCKETS TEST WATCHDOG)

@KristofferC KristofferC mentioned this pull request Jan 26, 2021
60 tasks
@NHDaly
Copy link
Member Author

NHDaly commented Jan 26, 2021

Thanks @c42f. For some reason, the tests weren't running before, until i closed and reopened so i hadn't seen the failures yet.

I'll have a look later. Cheers!

@KristofferC KristofferC mentioned this pull request Feb 1, 2021
10 tasks
@KristofferC KristofferC mentioned this pull request Feb 11, 2021
52 tasks
@NHDaly
Copy link
Member Author

NHDaly commented Feb 19, 2021

Eeeep, sorry I forgot about this PR - somehow it seemed like it was finished, in my mind. I've merged in the latest master.

I'm a bit baffled by the test failures. I'm not sure how to resolve them given that they're not failing on macOS (where I am).
Hmmm, do we somehow provide different Vectors for the stack trace to show_backtrace() on different operating systems? Is there a chance that we wouldn't produce a Vector with N duplicates for the stack frame on the OSes that were failing the tests?

@NHDaly
Copy link
Member Author

NHDaly commented Feb 19, 2021

If so, we can just delete the tests, or mark them macOS-only for now. Especially if this never worked, even on 1.5 on linux / windows.

Then we can merge this PR to restore the old behavior, and file an issue that it doesn't work for linux32, linuxaarch64, and windows?

@vchuravy vchuravy added this to the 1.6 blockers milestone Feb 19, 2021
@NHDaly
Copy link
Member Author

NHDaly commented Feb 19, 2021

Wowwww interesting! Yeah, on linuxaarch64, my debugging commit (df0e968) printed length: 2 in the buildbot:

errorshow                          (6) |        started at 2021-02-19T03:23:30.813
      From worker 6:	length: 2
      From worker 6:	length: 2
errorshow                          (6) |         failed at 2021-02-19T03:23:41.597
Test Failed at /buildworker/worker/tester_linuxaarch64/build/share/julia/test/errorshow.jl:788
  Expression: occursin(r"the last 2 lines are repeated \d+ more times", bt_str)
   Evaluated: occursin(r"the last 2 lines are repeated \d+ more times", "\nStacktrace:\n [1] pair_repeater_a()\n   @ Main.Test27Main_errorshow /buildworker/worker/tester_linuxaarch64/build/share/julia/test/errorshow.jl:769\n [2] pair_repeater_b()\n   @ Main.Test27Main_errorshow /buildworker/worker/tester_linuxaarch64/build/share/julia/test/errorshow.jl:770")

Whereas on macos64, it prints length: 79984:

errorshow                          (3) |        started at 2021-02-18T22:34:45.145
      From worker 3:	length: 79984
      From worker 3:	length: 79984
errorshow                          (3) |    34.41 |   0.17 |  0.5 |     525.74 |  2532.52

SO, I guess this is a problem upstream of how we display the back trace. Maybe this is why the original PR broke this behavior - perhaps the author was developing on Linux, and didn't see why we were handling a case of multiple stack frames that never occurs.

... Okay, in this case, I will adjust the newly added unit tests to only run on MacOS, for now, and then file an issue for why the backtrace return value is different on macos vs linux/windows. Does that sound alright with everyone?

Sorry again for my delay - and thanks for the help, @c42f!

@NHDaly NHDaly force-pushed the nhd-StackOverflow-error-message-repeats-n-times branch from 373828c to dced572 Compare February 22, 2021 20:37
Copy link
Member

@c42f c42f left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the linux 64 bit test was passing previously? It would be nice to re-enable that test, as it's a very common configuration.

Other than that, LGTM. Filing a separate issue for backtrace collection is a good idea.

@NHDaly
Copy link
Member Author

NHDaly commented Feb 23, 2021 via email

@jebej
Copy link
Contributor

jebej commented Feb 28, 2021

Not sure if it's useful but the test works on 1.5 on Windows:

julia> foo()
ERROR: StackOverflowError:
Stacktrace:
 [1] foo() at .\REPL[1]:1 (repeats 79984 times)

julia>     pair_repeater_a() = pair_repeater_b()
pair_repeater_a (generic function with 1 method)

julia>     pair_repeater_b() = pair_repeater_a()
pair_repeater_b (generic function with 1 method)

julia> pair_repeater_a()
ERROR: StackOverflowError:
Stacktrace:
 [1] pair_repeater_a() at .\REPL[3]:1
 [2] pair_repeater_b() at .\REPL[4]:1
 ... (the last 2 lines are repeated 39990 more times)
 [79983] pair_repeater_a() at .\REPL[3]:1

@JeffBezanson
Copy link
Member

Ah, good catch. How do I enable only linux 64 in the test itself from Julia?

Sys.islinux() && Sys.WORD_SIZE == 64

@c42f
Copy link
Member

c42f commented Mar 12, 2021

Ah I'd forgotten about this — sorry Nathan for not replying to your query. And thanks Jeff for pushing it through :-)

@KristofferC KristofferC removed the backport 1.6 Change should be backported to release-1.6 label Mar 14, 2021
@NHDaly NHDaly deleted the nhd-StackOverflow-error-message-repeats-n-times branch August 26, 2021 20:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugfix This change fixes an existing bug error messages Better, more actionable error messages
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The new stacktrace printing is missing "repeated N times" info for some StackOverflows
8 participants