Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade to LLVM 3.7.1 and switch over CI #14623

Merged
merged 7 commits into from
Jan 15, 2016
Merged

Upgrade to LLVM 3.7.1 and switch over CI #14623

merged 7 commits into from
Jan 15, 2016

Conversation

tkelman
Copy link
Contributor

@tkelman tkelman commented Jan 10, 2016

Thanks to @Keno for doing most of the preparation work, and @staticfloat for handling mac packaging. There's one ugly git line in the .travis.yml that can go away once staticfloat/homebrew-julia@9642485 and staticfloat/homebrew-julia@e6b8d2d are merged into master of homebrew-julia. gone

This probably calls for a PkgEval run before merging. Closes #9336.

Fixes: (not yet) #10595, #12671, #10444, #9222, #9085, #4905, #4418, #3596, #10301, #11037, #11083
^^ these should all be checked manually, and add tests before closing wherever possible

todo:

@tkelman
Copy link
Contributor Author

tkelman commented Jan 10, 2016

Both win32 and win64 are failing on appveyor with LLVM ERROR: Unable to allocate section memory!, so #11083 is indeed still open and a blocker here.

@staticfloat
Copy link
Member

Now merged into homebrew-julia's master branch.

@IainNZ
Copy link
Member

IainNZ commented Jan 10, 2016

Wow, huge!

How carefully have the perf implications been looked into for this? I recall a lot of work on bringing codegen times down to a relatively small regression, but how about the generated code on nontrivial code?

@ViralBShah
Copy link
Member

The plan is to get this in for now, and then tackle the rest going forward on master as undoubtedly we will find more issues.

@ViralBShah
Copy link
Member

Perhaps this is also where the benchmarking infrastructure will come in handy.

@jrevels
Copy link
Member

jrevels commented Jan 10, 2016

No time like the present?

runbenchmarks(ALL, vs = "JuliaLang/julia:master")

The CI tracker currently only has a subset of the array benchmarks in Base, so some manual perf testing would also be helpful.

@KristofferC
Copy link
Member

FWIW I see the same performance in my packages and some of them (NearestNeighbors.jl) have previously been quite good at detecting performance regressions, at least in the array code. Just one sample, but hey.

@tkelman
Copy link
Contributor Author

tkelman commented Jan 10, 2016

@staticfloat we're going to need a newer gtar on the linux buildbots, apparently they don't understand .xz https://build.julialang.org/builders/package_tarball64/builds/24/steps/make%20binary-dist/logs/stdio

@jrevels
Copy link
Member

jrevels commented Jan 10, 2016

https://github.com/JuliaCI/BaseBenchmarkReports/blob/master/54afb4e/54afb4e_vs_16301d2.md

Looks like some of the raw time regressions could be attributable to GC, but some might go away if johnmyleswhite/Benchmarks.jl#40 was in play. eig(rand(4, 4)) seems to be the only benchmark that demonstrates memory allocation regressions.

@Keno
Copy link
Member

Keno commented Jan 10, 2016

Ok, fine, I'll write us a better memory manager.

@Keno
Copy link
Member

Keno commented Jan 10, 2016

Also, do people know what other JITs do to avoid RWX pages? I looked at v8 and openjdk, but both seems to have RWX pages.

@ihnorton
Copy link
Member

@Keno
Copy link
Member

Keno commented Jan 10, 2016

Right, that's what LLVM 3.7 does too, but we run into some fragmentation problems which artificially bloats our memory usage.

@Keno
Copy link
Member

Keno commented Jan 10, 2016

What we could do is go RW->RX and put it pack into RW when we want to append. There is a problem with multi-threaded code, but I'm sure we could solve that just by holding the thread until we're done writing to the page. I'm not sure that's significantly better than just having a RWX page though. An attacker just have to time the writing to the page correctly. Also, I'm not necessarily arguing that it should be our job to protect against this, just exploring if there's a reasonable default.

@yuyichao
Copy link
Contributor

There is a problem with multi-threaded code, but I'm sure we could solve that just by holding the thread until we're done writing to the page.

We already relies on segfault for multithreading GC, we could easily add the logic for catching not executable fault as long as LLVM (or codegen) is able to tell whether an address is being written to by LLVM.

An attacker just have to time the writing to the page correctly.

I always thought the whole point is to make the attack window smaller? The attacker can always write to the same page when we are doing codegen (edit: for multi-threading), unless there's some verification pass after we set the page to not-writable, in which case we can also do that when we set the page back (edit: to RX).

@tkelman
Copy link
Contributor Author

tkelman commented Jan 10, 2016

dunno. I should remember #14430 (comment) and add in -DUSE_ORCJIT somewhere for osx travis since we know the homebrew bottle is patched

@Keno
Copy link
Member

Keno commented Jan 10, 2016

@yuyichao That was my thought as well (i.e. holding the thread on an NX fault). The fact that the attack window is already there is a fair point. The only thing that is worse here is that a potential attacker might more easily be able to determine the allocation address. In any case, I think I actually found a bug in my patch which caused us to waste some memory. Let's see if just fixing that is sufficient. We should still think about a better memory allocator in the future though.

@Keno
Copy link
Member

Keno commented Jan 10, 2016

We could checksum the portion of the page that already exists and crash if it was modified.

@yuyichao
Copy link
Contributor

If detecting the address is an issue, we can hide that by making two maps of the physical pages that we might still emit code into and never set the RX one back. Not sure if this is possible to do on windows or if it is supported by the llvm api though.

@Keno
Copy link
Member

Keno commented Jan 10, 2016

Also, my patch does indeed seem to fix the windows issue. Nice! Will commit upstream and add it to the patch set.

@Keno
Copy link
Member

Keno commented Jan 10, 2016

If detecting the address is an issue, we can hide that by making two maps of the physical pages that we might still emit code into and never set the RX one back. Not sure if this is possible to do on windows or if it is supported by the llvm api though.

Yeah, Windows allows it. LLVM's memory APIs may not currently support it, but that's what a custom memory manager is for.

@yuyichao
Copy link
Contributor

We could checksum the portion of the page that already exists and crash if it was modified.

A very minor issue is what if the attacher can modify the checksum on the stack.... probably not important compare to the one it solve.....

@tkelman
Copy link
Contributor Author

tkelman commented Jan 10, 2016

Is the bug Windows only or will it also be worth rebuilding a new revision of the homebrew bottle with the modified patch?

@Keno
Copy link
Member

Keno commented Jan 10, 2016

This bug is windows-only, but there was also a MachO patch that was forgotten (but that one is only a problem in LLVM_ASSERTIONS mode which the bottles may not be in?)

@tkelman
Copy link
Contributor Author

tkelman commented Jan 10, 2016

not sure whether bottles have assertions on, but I think that would be good to do on CI?

@Keno
Copy link
Member

Keno commented Jan 10, 2016

Yeah, probably.

@Keno
Copy link
Member

Keno commented Jan 10, 2016

If the attacker can write the stack you have generally lost because they can overwrite the return address.

@Keno
Copy link
Member

Keno commented Jan 10, 2016

Patch committed upstream as llvm-mirror/llvm@1f644cd.

@ihnorton
Copy link
Member

there was also a MachO patch that was forgotten

@tkelman: #14585 (comment)

If the attacker can write the stack you have generally lost because they can overwrite the return address.

Right now can't anybody just ccall(:mprotect, ... themselves?

(also, can we please, please add the llvm-shlib patch to our patchset so we can support USE_LLVM_SHLIB on Windows?)

Set CXXFLAGS=-DUSE_ORCJIT
(possibly temporary if we need these for something?)
@tkelman
Copy link
Contributor Author

tkelman commented Jan 15, 2016

Fewer failures this time around, but still more than you might hope: https://gist.github.com/cf1da5230c275eedaa67

MbedTLS and dependents look like they were broken by #14667, so these aren't all regressions relative to latest master. The nightly binary is old because the centos and osx buildbots have been constantly failing for 2 weeks.

There are some strange-looking bounds errors and segfaults in here that may be legit regressions though. (or more likely due to #14474)

Enough other things have changed on master in the last few weeks that it's probably best to just merge this and work on fixing things, bringing back osx testing, etc as we go.

@Keno
Copy link
Member

Keno commented Jan 15, 2016

+1 Let's get this merged and address failures as they come in

Keno added a commit that referenced this pull request Jan 15, 2016
Upgrade to LLVM 3.7.1 and switch over CI
@Keno Keno merged commit d4749d2 into master Jan 15, 2016
@tkelman tkelman deleted the llvm37 branch January 15, 2016 15:21
@JeffBezanson
Copy link
Member

🎆 Congrats, big milestone here!

@JeffBezanson
Copy link
Member

Am I reading correctly that this makes total travis CI time 2 hours longer? And to think I was holding off on merging jb/functions since it makes CI time 20 minutes longer...

@Keno
Copy link
Member

Keno commented Jan 15, 2016

No, something is going wrong with Travis.

@JeffBezanson
Copy link
Member

Ok, I suspected it must be something like that. A single AV build is ~20 minutes longer; is that expected?

@Keno
Copy link
Member

Keno commented Jan 15, 2016

Looks like it's recompiling OpenBLAS for some reason. @tkelman take a look? 20 minutes is possible though moderately more than expected.

@tkelman
Copy link
Contributor Author

tkelman commented Jan 15, 2016

It's how the docker caching works on travis now. We aren't using the ppa any more, we do a source build from scratch but cache the built deps. It only takes so long after clearing out the cache. When the cache is populated it's about half an hour. ok maybe more like 40-45 mins.

AppVeyor is slower as a real perf regression. OSX was so much slower it timed out constantly and we had to turn it off for the time being.

@vtjnash
Copy link
Member

vtjnash commented Jan 15, 2016

@JeffBezanson yes, i noted that prior to merging. total CI build time regression appeared to be about 2x

@vtjnash
Copy link
Member

vtjnash commented Jan 15, 2016

@Keno can you look into the int.jl test? it seems to be showing a 15x increase in runtime on CI runs

@Keno
Copy link
Member

Keno commented Jan 15, 2016

Will take a look.

@jiahao
Copy link
Member

jiahao commented Jan 15, 2016

🎉

@blakejohnson
Copy link
Contributor

There are some strange-looking bounds errors and segfaults in here that may be legit regressions though. (or more likely due to #14474)

@tkelman if you find particular examples where you suspect, #14474, please let me know.

@tkelman
Copy link
Contributor Author

tkelman commented Jan 16, 2016

If you could take a look at Brim, Gadfly, Mamba, OpenStreetMap, and/or Winston that'd be awesome. They're all hitting BoundsErrors.

JuMP, NamedTuples, and Persist are failing for reasons that don't make sense to me yet, but they don't look BoundsError related.

@blakejohnson
Copy link
Contributor

So, trying Pkg.test("Gadfly"), the first error I encounter is a BoundsError in Showoff.jl of the flavor:

a, b, c, d = (1,2,3)

and caused by 53ecbaa changing the number of arguments returned by Base.grisu from 4 to 3. If I fix that, then all Gadfly tests pass. This is probably something to add to Compat.jl

@dcjones

@tkelman
Copy link
Contributor Author

tkelman commented Jan 16, 2016

Ah good call, sorry about the false alarm. Does anyone else have commit access to Showoff.jl? If not we may need to redirect metadata to use a mirror/fork for a while if it becomes urgent.

@blakejohnson
Copy link
Contributor

Ah, I guess I am on 0201437, which doesn't include this PR. But, at that point Brim also passes all tests. I would say, generally, that if you see a BoundsError, it is actually quite unlikely to be #14474. The more likely signature would be a segfault, because we elided a bounds check that we shouldn't have.

@blakejohnson
Copy link
Contributor

There is definitely something pretty weird going on with Brim related to bounds checks. The test which fails is:

using Brim
A = [1 0; 0 1]
M = partition_lp(A)

When launched from a julia session with --check-bounds=yes it gives a bizarre BoundsError:

ERROR: BoundsError: attempt to access 4x4 Array{Int64,2}:
 0  0  0  0
 0  0  0  0
 0  0  0  0
 0  0  0  1
  at index [3,1]

Without that command line option it works with a deprecation warning.

@nalimilan
Copy link
Member

FWIW, I've just switched the Copr RPM nightlies to LLVM 3.7.1, and all the tests pass. The build is only slightly slower than before (~26 minutes vs. 18 minutes in the best cases). This should increase the testing of this code since there are about 500 downloads each week.

@jrevels jrevels removed the potential benchmark Could make a good benchmark in BaseBenchmarks label Jan 27, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Win32 std::bad_alloc during make testall1 activate LLVM37