Random CI failures rather intense right now #9763
The sets failure is an additional new failure mode, on top of #9544 (best guess), which has been causing fairly frequent assertion failures on OSX Travis and codegen segfaults on AppVeyor for a few weeks, plus longer-term pre-existing issues like #9176, #9501, and #7942-esque timeouts. All of this is adding up to CI being actively detrimental instead of helpful right now, when maybe 50% of commits or PRs fail for completely unrelated reasons that we can't reproduce locally. If we have a brave/confident volunteer, we can contact Travis and ask for ssh access to a worker VM for 24 hours to do as much debugging and information gathering as we can. |
I'm pretty sure we need to disable that |
I believe the set failures may be caused by
I suggest:
|
Interesting. That ends up with a 768MB array for storage. Big, but not ridiculous. Are there other big allocations in test? A quick grep for large exponentiations shows that linalg1 creates a 300MB array, but there may be others I missed. How much RAM do typical Travis machines have? It could explain why nobody is reproducing it at home if Travis machines have much less RAM than our computers typically do. I don't have a 32-bit system handy, but I'm pretty sure IntSets can handle elements up to Int64(2)^36 or so since there are 2^5 bits per element of the array. |
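To put rough numbers on that (back-of-the-envelope arithmetic only, not the actual IntSet implementation): a zero-based bitmask set needs about one bit per representable element, so the largest element pushed into it determines the size of the backing array. A hypothetical helper:

```julia
# Approximate storage for a zero-based bitmask set holding elements up to `maxelem`,
# assuming one bit per possible element packed into `chunkbits`-bit words.
bitmask_bytes(maxelem; chunkbits = 32) = cld(maxelem, chunkbits) * (chunkbits ÷ 8)

bitmask_bytes(2^33) / 2^20          # ≈ 1024 MiB: a max element near 2^33 costs about a gigabyte
bitmask_bytes(Int64(2)^36) / 2^30   # ≈ 8 GiB: around the practical ceiling mentioned above
```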
didn't mean to push this directly to master, but oh well, lots of changes here: including one pretty serious bug: |
This is very likely. A while back I had some instrumentation code that ran on AppVeyor and was showing our tests taking up all the memory (and timing out) there, but that was before they rolled out the higher-performance Pro environment that we've been using. If anyone wants to make an experimental branch/test PR and do the same basic thing on Travis (print remaining memory after each test file), that could be interesting to look at. |
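For anyone who wants to try that experiment, something along these lines in the test driver would do it (a minimal sketch only; the real runtests.jl distributes files across workers, and the file list here is just illustrative):

```julia
# Run test files one at a time and report how much system memory remains afterwards.
testfiles = ["core", "sets", "dicts", "linalg1"]   # illustrative subset, not the real list

for f in testfiles
    include(joinpath("test", "$f.jl"))
    free_mb  = Sys.free_memory() / 2^20
    total_mb = Sys.total_memory() / 2^20
    println("after $f: $(round(Int, free_mb)) MiB free of $(round(Int, total_mb)) MiB")
end
```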
IIRC there's a big allocation somewhere in reduce or reducedim. We went through a period where |
Was the laptop maybe overheating from pegging all cores at 100%? Mine does that, so I have |
I actually have |
core test still segfaulting https://travis-ci.org/JuliaLang/julia/jobs/46980567 |
With 9 workers, if some workers allocate 500 MByte, Julia tests combined may require several GByte. It would depend on timing coincidences whether they all require this memory at the same time. Can we introduce an environment variable that specifies the maximum amount of memory that Julia should use? This can be checked in the allocator, and we'd get a nice error message (with backtrace) if Julia uses too much memory. If the operating system's memory limit is reached, then the Julia process may be aborted before it can output a backtrace. |
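A userspace approximation of that check, to sketch the idea (the suggestion above is really about the C allocator; the environment variable name here is made up and the polling is coarse):

```julia
# Hypothetical JULIA_TEST_MEM_LIMIT_MB: raise an ordinary Julia error (with a backtrace)
# once the process's peak resident set size crosses the limit, instead of waiting for OOM.
const MEM_LIMIT_MB = parse(Int, get(ENV, "JULIA_TEST_MEM_LIMIT_MB", "2048"))

function check_memory(label::AbstractString)
    used_mb = Sys.maxrss() ÷ 2^20
    used_mb > MEM_LIMIT_MB &&
        error("$label exceeded the memory limit: $used_mb MiB > $MEM_LIMIT_MB MiB")
    return used_mb
end

# e.g. call check_memory("sets") after each test file, or from a periodic Timer
```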
That's a very Java-like thought. It's not very nice of the OS to accept our malloc requests but then send the OOM killer after our app when we try to use the memory. If that's really the problem, we might want to ask Travis to change the kernel's over-allocation allowance. @tkelman: I opened a new issue specifically for that assertion failure. I currently suspect it may be the convert-pocalypse. |
@vtjnash I've often seen crashes running julia in low-memory situations and trying to allocate a large array. Just a
And yes, I am a Java programmer, but I don't think the kernel knows that... yet. :) |
Interesting. I usually get MemoryError() when I have a buildbot with too |
I investigated a little more where my errors were coming from, and should probably clarify. Using the plain array constructor does throw a MemoryError(), but readcsv on a large file just gets Killed; with use_mmap=false it throws a MemoryError() too:

```
vagrant@vagrant-ubuntu-trusty-64:~$ julia/usr/bin/julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.4.0-dev+2558 (2015-01-08 07:21 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit 2bb647a* (6 days old master)
|__/                   |  x86_64-linux-gnu

julia> a=Array(Float64,1000,1000,1000)
ERROR: MemoryError()
in call at base.jl:260

julia> readcsv("input_SPECFUNC_BASELINE.csv")
Killed
vagrant@vagrant-ubuntu-trusty-64:~$

julia> readcsv("input_SPECFUNC_BASELINE.csv", use_mmap=false)
ERROR: MemoryError()
in call at datafmt.jl:148
in readdlm_string at datafmt.jl:249
in readdlm_string at datafmt.jl:273
in readdlm_auto at datafmt.jl:57
in readdlm at datafmt.jl:47
in readdlm at datafmt.jl:45
in readcsv at datafmt.jl:485
```
|
@vtjnash Once the operating system decides that it doesn't want to give us any more memory, the Julia process is in a bad state. It's not clear at all that it can still generate and output a backtrace at this point. It doesn't matter whether the OS tells the Julia process via a segfault or via malloc returning NULL.
To avoid this, the Julia process needs to abort itself before it runs out of memory. Hence an environment variable. Alternatively, call |
why does it help to abort yourself a random amount of time before you would be notified that the system does not want to honor your malloc request? |
@vtjnash It helps for two reasons. First, at this point you can still get a meaningful backtrace. If you can get a good backtrace when the OS or malloc complains, then that's better. Second, it helps to find out who the culprit is. If there are 10 workers running simultaneously and one of them requires much more memory than the others, then with an explicit check in Julia one can catch this one worker and track down where the memory allocation occurs. Otherwise, the OS will abort the first process that allocates memory once memory is exhausted, and that may not be the one using too much. If we can get a process map from the OS that tells us which processes were running and how much memory each had allocated when it aborts one of the workers, then that's better. |
If the OOM is killing the process, shouldn't we see a SIGKILL? |
We print a process exited exception, but we don't print the error code. Perhaps we should? With SIGKILL, the child process doesn't get time to clean up, but the parent process here could be more informative.
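For example, the parent could report the child's termination signal explicitly; a worker reaped by the OOM killer would show up as SIGKILL (signal 9). A sketch against the current Process API (the 0.4-era spawning code differed, and the sleeping child here is just a stand-in for a worker):

```julia
# Launch a child, wait for it, and say how it ended rather than just that it exited.
p = run(`$(Base.julia_cmd()) -e "sleep(3)"`, wait=false)
wait(p)
if p.termsignal != 0
    println("worker was killed by signal $(p.termsignal)")   # 9 would suggest the OOM killer
else
    println("worker exited with code $(p.exitcode)")
end
```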
That's not how the Linux OOM killer works, nor how malloc is supposed to work (you need to change a kernel flag, the vm.overcommit_memory setting, to make it work the way the POSIX spec says it should). |
If we are running
But maybe your argument isn't with the mechanism, but rather with the choice of whether one wants to impose a memory limit at all. I think we should impose one: self-tests should run with a "reasonable" amount of memory, we need to define what "reasonable" means, and we need to catch those tests that accidentally use more. I don't care about the mechanism used for this. But running two large-memory tests that may or may not run simultaneously, and that may or may not lead to OOM, is a situation that leads to random failures and is difficult to debug. My use case here is a test that used 540 MByte of memory to store a bit set that had 1 bit set. Maybe this was on purpose, maybe it was an accident because the user was not aware of how the bit set was stored. |
Yes, and/or we could have our travis script dump OOM messages from the log files: |
I think we still have at least one outstanding bug and a commented-out test that has yet to be moved to stress tests in perf. And we're starting to see a new-seeming intermittent failure more often than usual, see https://s3.amazonaws.com/archive.travis-ci.org/jobs/50792847/log.txt for an example. Parallel test failing due to |
Another odd parallel failure on Win64, ENOBUFS in https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.2773/job/kstlvdmelslggo3p |
It is probably due to these lines: https://github.com/JuliaLang/julia/blob/master/test/parallel.jl#L217-L233 They will require an extra 240MB (80 local, 80 remote, 80 for the returned result) at a minimum, possibly more. Do you think that could be a problem? |
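The factor of three comes from the same data existing in three places at once. A toy illustration, not the actual code at those lines, and using the current Distributed API rather than the 0.4-era one:

```julia
using Distributed
w = addprocs(1)[1]

a = rand(10^7)                          # ~80 MB of Float64 held on the master process
b = remotecall_fetch(identity, w, a)    # ~80 MB more while the worker holds its copy,
                                        # plus ~80 MB for the result fetched back into `b`
```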
Yeah, especially at the end of running the tests, the CI VMs could easily be hitting the OOM killer - we saw that earlier with a few tests that were commented out last month. For hugely memory-intensive tests we should move them out of the regular CI suite and into a not-run-as-often (but hopefully still once in a while, nightly or a few times a week?) stress test in perf. |
Any chance of getting beefier machines? I would really like to keep them in the regular CI, at least until we have a stress/perf suite running on a regular schedule. I'll tweak those numbers in parallel.jl a bit lower anyway. |
Good idea! Maybe, more like
But we are already near the time limit on Travis OSX; serial execution may just push it over. |
Time limits are a different problem; that can be fixed by improving performance, writing fewer tests, or prioritizing which tests we run in different situations. Most commits have a very small probability of breaking the high-memory tests anyway. We should at least not leave the decision to the OOM killer. |
Maybe workers should be killed and restarted for each test? IIUC memory usage grows because the tests are exercising many different code paths, which generates a lot of compiled functions in the cache. If instead of adding these up, we started from scratch for each test, the memory usage would probably be much lower. |
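In the driver that would look roughly like this (a sketch using the current Distributed API; the file list is illustrative, and the real runner would also need to reload the test helpers on each fresh worker):

```julia
using Distributed

testfiles = ["core", "sets", "dicts"]    # illustrative, not the real list

for f in testfiles
    p = addprocs(1)[1]                   # fresh worker: empty code cache, no leftover allocations
    remotecall_fetch(Base.include, p, Main, joinpath(pwd(), "test", "$f.jl"))
    rmprocs(p)                           # discard whatever the test compiled or allocated
end
```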
Aside from the OSX timeouts, things had been sorta stable for a while here. But the tuple change looks like it's introduced 3 new intermittent failures on CI: replcompletions #10875 |
The easiest and most obvious nondeterminism in our CI systems is the way in which the different tests get split between workers. Has anyone looked to see if some of these failure modes always happen with a certain combination of tests on the same worker? |
I'm getting the dict failure to happen reliably with a win32 source build via |
Linux 64bit, Sorry @tkelman, I don't have such a script, but |
Somewhat reduced the dict failure. Removing any one of the following list of tests causes it to pass.
|
This might be a stupid question, but where is |
In builtins.c. |
Ah, thanks. Do we have a "reflection doesn't work on builtins" issue? So if I add a few lines to
and rerun the same combination of preceding tests I get
I hope someone else finds a platform/commit combination where they can reliably reproduce this, because I'm stumped. Will see if I can get subsets of tests that cause the other failures. |
That sequence passes for me on 64-bit Linux (unfortunately). If the tests are failing on Travis, I know from experience that one can get 24-hour direct access. |
Aha - using this dockerfile #9153 (comment) to build a 32-bit Julia from 64-bit Ubuntu Precise (similar to what we do on Travis), I can reproduce the dict failure, so others should be able to as well. Since it uses system packages for everything, it should only take 10-15 minutes to build. edit: it fails with |
Got the same
|
@tkelman I'd love to get your dockerfile working. I've hit a point in the build where it's saying |
Is it right off the bat on the first |
Once docker is installed the actual steps I run are:
|
Excellent, thanks for the pointer. Giving docker a DNS server fixed it. |
Hmm, now I get this:
I'm guessing due to running a different version of ubuntu? |
The version of ubuntu that you run docker from shouldn't matter much (and fwiw I'm also using 14.04 as host), and the version that runs inside the container comes from the first line of the dockerfile so should always be 12.04 here. |
You might try running |
The dockerfile already does |
Running the container build with |
oh, right, docker union filesystem caching layers from broken previous runs. |
We have an evolving set of issues on CI. I'll close this in favor of more timely issues. |
(spawned from #9679 to get more eyes)
In #9679 I merged in some changes to the tests (splitting `test/collections.jl` into `test/sets.jl` and `test/dicts.jl`, and moving around some lines in the `base/sets.jl` file, mainly to group methods together). TravisCI Linux failed, but it worked locally, on AppVeyor, and on TravisCI OSX, so I merged anyway. Since then, perhaps by coincidence, we've been getting seemingly random CI failures; many of them seem to be occurring on the workers running the `sets` tests, but not all of them. No one, as far as I know, has been able to reproduce this locally, and I'm way out of my depth :D