Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Travis builds are hanging mid-print #4126

Closed
derekbruening opened this issue Feb 22, 2020 · 11 comments
Closed

Travis builds are hanging mid-print #4126

derekbruening opened this issue Feb 22, 2020 · 11 comments

Comments

@derekbruening
Copy link
Contributor

Filing this since it keeps happening. PR #4119 was my attempt to diagnose it.

The cross-compile job (the 3rd job) has been repeatedly failing ever since PR #4118 except in the diagnostic attempt. It looks like this:

663$ suite/runsuite_wrapper.pl travis $EXTRA_ARGS
664
665
666No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
667Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#build-times-out-because-no-output-was-received
668
669The build has been terminated
670

Very strange.

derekbruening added a commit that referenced this issue Feb 22, 2020
Fixes leaks in the api.dis-a64 and api.ir_negative AArch64 tests.

Adds extra printing to runsuite_wrapper to try and help with the
AArch64 Travis hang.

Issue: #4126
Fixes #4124
@derekbruening
Copy link
Contributor Author

I put the diagnostics into PR #4127 and it reproduces. The hang seems to most likely be at the fork call:

$ suite/runsuite_wrapper.pl travis $EXTRA_ARGS
667Forking child for stdout tee
668
669
670No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.

derekbruening added a commit that referenced this issue Feb 22, 2020
Fixes leaks in the api.dis-a64 and api.ir_negative AArch64 tests.
Works around an AArch64 unit_tests lock exit crash.

Adds extra printing to runsuite_wrapper to try and help with the
AArch64 Travis hang.

Issue: #4126
Fixes #4124
@derekbruening
Copy link
Contributor Author

Well, I added some prints before and around the fork, around the child exec, and I put in alarms in the parent while it tees the child stdout: and with that in there I can't reproduce the hang. Strange. Going to chalk it up to some perl bug and move on.

@derekbruening derekbruening reopened this Mar 7, 2020
@derekbruening
Copy link
Contributor Author

It looks like the problem is still there. I left in the -VV and 30s alarms, and we can see the build hang mid-print!

https://travis-ci.com/DynamoRIO/dynamorio/jobs/295543355

4342[ 84%] Building C object CMakeFiles/drmemorylib.dir/drmemory/slowpath.c.o
4343[ 84%] Building CXX object dynamorio/clients/drcachesim/CMakeFiles/drmemtrace_static.dir/tracer/instru_online.cpp.o
4344
4345xxxxxxxxxx 30s elapsed xxxxxxxxxxx
4346[ 85%] Building CXX object dynamorio/clients/drcachesim/CMakeFiles/drmemtrace_static.dir/tracer/physaddr.cpp.o
4347[ 85%] Building CXX object dynamorio/clients/drcachesim/CMakeFiles/drmemtrace_static.dir/tracer/func_trace.cpp.o
4348[ 85%] Building C object CMakeFiles/drmemorylib.dir/drmemory/fastpath.c.o
4349[ 85%] Building C object CMakeFiles/drmemorylib.dir/drmemory/stack.c.o
4350[ 85%] Building C object CMakeFiles/drmemorylib.dir/drmemory/shadow.c.o
4351[ 85%] Building CXX object dynamorio/clients/drcachesim/CMakeFiles/drmemtrace_static.dir/common/named_pipe_unix.cpp.o
4352[ 85%] Building C object CMakeFiles/drmemorylib.dir/drmemory/options.c.o
4353[ 85%] Building CXX object dynamorio/clients/drcachesim/CMakeFiles/drmemtrace_static.dir/common/options.cpp.o
4354[ 85%] Building CXX object dynamorio/clients/drcachesim/CMakeFiles/drmemtrace_static.dir/common/trace_entry.cpp.o
4355[ 86%] Building C object CMakeFiles/drmemorylib.dir/drmemory/pattern.c.o
4356Scanning dependencies of target drcachesim
4357[ 86%] Building CXX object dynamorio/clients/drcachesim/CMakeFiles/drcachesim.dir/launcher.cpp.o
4358[ 86%] Building CXX object dynamorio/clients/drcachesim/CMakeFiles/drcachesim.dir/analyzer.cpp.o
4359[ 86%] Building C object CMakeFiles/drmemorylib.dir/common/alloc.c.o
4360[ 86%] Building C object CMakeFiles/drmemorylib.dir/common/alloc_unopt.c.o
4361[ 86%] Building C object CMakeFiles/drmemorylib.dir/common/alloc_replace.c.o
4362[ 87%] Linking CXX static library ../lib64/release/libdrmemtrace_static.a
4363[ 87%] Built target drmemtrace_static
4364Scanning dependencies of target opcode_mix_launcher
4365[ 88%] Building CXX object dynamorio/clients/drcachesim/CMakeFiles/opcode_mix_launcher.dir/tools/opcode_mix_launcher.cpp.o
4366[ 88%] Building CXX object dynamorio/clients/drcachesim/CMakeFiles/drcachesim.dir/analyzer_multi.cpp.o
4367[ 88%] Building C object CMakeFiles/drmemorylib.dir/common/heap.c.o
4368[ 88%] Building C object CMakeFiles/drmemorylib.dir/common/callstack.c.o
4369[ 88%] Linking CXX executable ../bin64/opcode_mix_launcher
4370[ 89%] Building C object CMakeFiles/drmemorylib.dir/drmemory/alloc_drmem.c.o
4371[ 90%] Building CXX object dynamorio/clients/drcachesim/CMakeFiles/drcachesim.dir/common/named_pipe_unix.cpp.o
4372[ 90%] Building C object CMakeFiles/drmemorylib.dir/drmemory/syscall.c.o
4373[ 90%] Built target opcode_mix_launcher
4374Scanning dependencies of target drcovlib
4375[ 90%] Building CXX object dynamorio/clients/drcachesim/CMakeFiles/drcachesim.dir/common/options.cpp.o
4376[ 90%] Building C object dynamorio/ext/drcovlib/CMakeFiles/drcovlib.dir/drcovlib.c.o
4377[ 90%] Building C object CMakeFiles/drmemorylib.dir/drmemory/report.c.o
4378[ 90%] Building C object dynamorio/ext/drcovlib/CMakeFiles/drcovlib.dir/modules.c.o
4379[ 90%] Linking C shared library ../lib64/release/libdrcovlib.so
4380[ 90%] Built target drcovlib
4381[ 90%] Building C object CMakeFiles/drmemorylib.dir/drmemory/replace.c.o
4382[ 90%] Bu
4383
4384No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
4385Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#build-times-out-because-no-output-was-received
4386
4387The build has been terminated
4388

It seems like some Travis issue doesn't it? Our build is not going to do that on its own. It is strange that it has only been in the cross-compile job.

@derekbruening
Copy link
Contributor Author

Actually what I just pasted was from the x86 non-cross-compile package build: so it is not limited to cross-compiling.

@derekbruening
Copy link
Contributor Author

$ grep -c '30s' log.txt
20

Travis says "Ran for 21 min 25 sec". So it built for 10 mins and then mid-print it hung for 10 mins. ?!?

@derekbruening derekbruening changed the title AArch64 cross compile Travis builds are hanging Travis builds are hanging mid-print Mar 7, 2020
@derekbruening
Copy link
Contributor Author

Happened again, breaking the cronbuild -- grr, raising priority.

https://travis-ci.com/github/DynamoRIO/dynamorio/jobs/298058837

4356[ 94%] Built target drmemtrace_static
4357[ 94%] Building C object CMakeFiles/drmemorylib.dir/d
4358
4359No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.

Could it possibly be the perl signal handler interrupting a print and
trying to print?

Perl 5.8+ has "safe signals": waits until back in regular interp loop to
call handler.

https://docs.travis-ci.com/user/reference/xenial/#perl-support
Xenial has 5.22; can request up to 5.30.

So it has the safe signals.

@derekbruening
Copy link
Contributor Author

I did a re-build and it did the exact same thing. ?!#?

Searching doesn't show much. This seems to only happen to actual no-output cases. There is "travis_wait" but it doesn't seem like that's the problem here: but we could try it. It has no output until the end (kind of ironic....) but maybe we could live with that.

@derekbruening
Copy link
Contributor Author

It happens consistently at about this point: "Ran for 21 min 37 sec". And it consistently has 20 instances of "30s elapsed".

That does not sound like some perl alarm signal issue. It sounds like a Travis 10-minute-old process issue.

Yet the other builds include jobs that take >10 mins. E.g. this one took 15 mins and has 29 "30s elapsed" messages:
https://travis-ci.com/github/DynamoRIO/dynamorio/jobs/298230207?utm_medium=notification

So what's the difference between those >10min succeeding jobs and the ones that hang: the cross-compile that used to hang plus this cronbuild?

@derekbruening
Copy link
Contributor Author

Cronbuild failed again. This is really blocking us. I'm going to try to repro in a cronbuild where we can actually experiment: Travis refuses to deploy there but hopefully the same build will repro.

@derekbruening
Copy link
Contributor Author

I got it to repro in a custom PR job.

I then removed the perl signal: same behavior, so that is not the culprit.

@derekbruening
Copy link
Contributor Author

I tried prefixing the runsuite_wrapper command line with "travis_wait 45" and that succeeded:

https://travis-ci.com/github/DynamoRIO/dynamorio/builds/154398973

1082.29s$ travis_wait 45 suite/runsuite_wrapper.pl travis $EXTRA_ARGS
604
605
606Still running (18 of 45): suite/runsuite_wrapper.pl travis
607The command suite/runsuite_wrapper.pl travis exited with 0.
608
609Log:
610
611Forking child for stdout tee
612Parent tee-ing child stdout...
<...>

6975drmemory-release-64: build successful; no tests for this build
6976release-32: build successful; no tests for this build
6977release-64: build successful; no tests for this build
6978
6979Copying documentation into ./html
6980Looking for /home/travis/build/DynamoRIO/dynamorio/build_debug-32/_CPack_Packages/*/*/DynamoRIO-*/docs/html
6981/home/travis/.travis/functions: line 607:  5810 Terminated              travis_jigger "${!}" "${timeout}" "${cmd[@]}"
6982The command "travis_wait 45 suite/runsuite_wrapper.pl travis $EXTRA_ARGS" exited with 0.
6983
6984Skipping a deployment with the releases provider because the current build is a pull request.
6985Skipping a deployment with the pages provider because the current build is a pull request.
6986
6987Done. Your build exited with 0.

It prints nothing incremental is the downside. It would be simplest to put it in for all builds even though we only need it for the package build so I will probably just do that and we'll live without incremental output (again, pretty ironic).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant