Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

code with static arrays is very very slow to compile with the --release flag #2485

Open
alainravet opened this issue Apr 20, 2016 · 29 comments · May be fixed by #9486
Open

code with static arrays is very very slow to compile with the --release flag #2485

alainravet opened this issue Apr 20, 2016 · 29 comments · May be fixed by #9486
Labels
kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:compiler

Comments

@alainravet
Copy link

alainravet commented Apr 20, 2016

# Crystal version 15, MacBook Pro (Retina, 13-inch, Early 2015, 16GB RAM)
#0.5 seconds : crystal build  staticarrays.cr ; ./staticarrays
#55 seconds  : crystal build  --release staticarrays.cr ; ./staticarrays

# file: staticarrays.cr
class Wrap
  def initialize
    @elements = StaticArray(String|Nil, 2048).new {}
    @next_elem = 0
  end

  def add(value)
    @elements[@next_elem] = value
    @next_elem += 1
  end

  def values
    @elements.to_a[0..@next_elem-1].compact
  end
end

w = Wrap.new
w.add "un"
w.add "deux"

puts w.values.inspect

@asterite
Copy link
Member

I can't reproduce this. It does take a long time to compile because of the big static array (LLVM is slow with big static arrays), but eventually it compiles and runs fine.

When you say "FAIL", what do you mean? What happens?

@ozra
Copy link
Contributor

ozra commented Apr 20, 2016

Could be an issue with running out of RAM or similar? Just thinking loud...

@alainravet
Copy link
Author

You're right @asterite, it does work but it's very very slow
This sample takes about 0.5 sec without the release flag, and 50 seconds with the flag.
(My original code had two of those static arrays and I gave up after 2 minutes.)

Is there a way to improve this behaviour?

@alainravet alainravet changed the title this code does not compile with --release flag code with static arrays is very very slow to compile with the --release flag Apr 20, 2016
@asterite
Copy link
Member

@alainravet Yes, use a regular array. @elements = [] of String?. StaticArray has very few use cases, mostly low-level ones.

@alainravet
Copy link
Author

@asterite I'm trying to optimize CPU intensive code (a chess engine).
A "how to optimize your code" guide/wiki page would be useful.

@waterlink
Copy link
Contributor

Slowness of non-static arrays (vector) usually comes from size extension operations. Therefore significant speed-up should be observable when vectors are created with pre-defined size (that won't change in runtime). It will not be as fast as static array (with all optimizations applied), but it should be faster then blind usage of vectors.

BTW, how does the benchmark of current code with non-static arrays look like?

Would it be possible to share code (if it is open-source), so that we can play around with it and come up with some optimizations? (I would be interested myself in that).

@alainravet
Copy link
Author

alainravet commented Apr 21, 2016

@waterlink I'm porting a C code project to Crystal and, at first, I refactored to OO as much as possible. The code is better structured and more readable and more ... , but it also creates tons of objects and is slow (relatively speaking). I am now unfactoring the expensive parts, using pools of arrays to reuse instead of creating new ones, unwrapping primitives (as we can't extend the Int* structs 😁 ), ... It looks like you can't have your cake and eat it after all.

BTW, how does the benchmark of current code with non-static arrays look like?

It takes 2 minutes to explore 119_060_324 millions possible moves in 6 plies (Perft test). I'll share it next week, once it's cleaner/less messy. It could be useful to measure performance changes between Crystal versions.

@asterite
Copy link
Member

@alainravet I can't say how to optimize the program without profiling it, but another alternative to Array is Slice. The "problem" with Array is that if you do array[0..index], that is, you take a subrange of it, a new array is allocated. With a Slice, a view to the original slice is returned so no memory allocation is involved. However, a Slice cannot be resized. But it seems in your case you probably don't need to resize it.

In any case, once you share you'll code I'm sure we can all optimize together to be in the order of the C program.

@asterite
Copy link
Member

There's nothing we can do here. LLVM is known to be slow with big static arrays. These should only be used in very few cases. Array and Slice and most appropriate in most cases. So I'm closing this.

@mattrberry
Copy link
Contributor

mattrberry commented Jun 2, 2020

Edit: Much of the issue I've described below isn't necessarily related. However, this slow compilation is still 100% an issue.

@asterite I just had a similar problem and spent a couple of hours trying to figure out the cause. The error messages were also unhelpful, which I'll describe below. I simply added a StaticArray(Float32, 4096) as an instance variable, causing my --release compile time jumped from ~5 seconds to ~4 minutes. I think it would be useful to add some more specifics regarding the use-cases of StaticArrays to the API docs.

As for the error messages, I ctrl-c'd the shards build --release the first time through since it was taking so long. Then, when trying to build with shards build --release again, I received 3000+ lines of output, of what appears to be one line of output for every single time an instance variable was referenced in my program. The last 40 lines lines were these, with the 3000+ leading up to this point resembling the ones at the top:

/home/matt/code/github/CryBoy/src/cryboy/opcodes.cr:2062: undefined reference to `*CPU#a:UInt8'
/home/matt/code/github/CryBoy/src/cryboy/opcodes.cr:2062: undefined reference to `*CPU#f_c=<Bool>:UInt8'
/home/matt/code/github/CryBoy/src/cryboy/opcodes.cr:2063: undefined reference to `*CPU#f_n=<Bool>:UInt8'
/home/matt/code/github/CryBoy/src/cryboy/opcodes.cr:2068: undefined reference to `*CPU#pc=<UInt16>:UInt16'
/home/matt/code/github/CryBoy/src/cryboy/opcodes.cr:2069: undefined reference to `*CPU#sp=<UInt16>:UInt16'
/home/matt/code/github/CryBoy/src/cryboy/opcodes.cr:2070: undefined reference to `*Memory#[]=<UInt16, UInt16>:Nil'
/home/matt/code/github/CryBoy/src/cryboy/opcodes.cr:2071: undefined reference to `*CPU#pc=<UInt16>:UInt16'
_main.o: In function `raise':
/usr/share/crystal/src/raise.cr:191: undefined reference to `*CallStack::new:CallStack'
/usr/share/crystal/src/raise.cr:191: undefined reference to `*SDL::Error@Exception#callstack=<CallStack>:CallStack'
/usr/share/crystal/src/raise.cr:191: undefined reference to `*Pointer(LibUnwind::Exception)@Pointer(T)::malloc:Pointer(LibUnwind::Exception)'
/usr/share/crystal/src/raise.cr:191: undefined reference to `*UInt64@Number::zero:UInt64'
/usr/share/crystal/src/raise.cr:191: undefined reference to `*UInt64@Number::zero:UInt64'
_main.o: In function `->':
/home/matt/code/github/CryBoy/src/cryboy/motherboard.cr:18: undefined reference to `*SDL::quit:Nil'
_main.o: In function `at_exit':
/usr/share/crystal/src/kernel.cr:471: undefined reference to `*Crystal::AtExitHandlers::add<Proc(Int32, (Exception+ | Nil), Nil)>:Array(Proc(Int32, (Exception+ | Nil), Nil))'
_main.o: In function `->':
/usr/share/crystal/src/gc/boehm.cr:164: undefined reference to `*SDL::Window#finalize:Nil'
/usr/share/crystal/src/gc/boehm.cr:164: undefined reference to `*SDL::Renderer#finalize:Nil'
_main.o: In function `->':
/usr/share/crystal/src/callstack.cr:125: undefined reference to `*Pointer(Void)@Comparable(T)#==<Pointer(Void)>:Bool'
/usr/share/crystal/src/callstack.cr:126: undefined reference to `*CallStack::RepeatedFrame#incr:Int32'
/usr/share/crystal/src/callstack.cr:128: undefined reference to `*CallStack::print_frame<CallStack::RepeatedFrame>:Int32'
/usr/share/crystal/src/callstack.cr:129: undefined reference to `*CallStack::RepeatedFrame::new<Pointer(Void)>:CallStack::RepeatedFrame'
_main.o: In function `raise':
/usr/share/crystal/src/raise.cr:191: undefined reference to `*CallStack::new:CallStack'
/usr/share/crystal/src/raise.cr:191: undefined reference to `*OverflowError@Exception#callstack=<CallStack>:CallStack'
/usr/share/crystal/src/raise.cr:191: undefined reference to `*Pointer(LibUnwind::Exception)@Pointer(T)::malloc:Pointer(LibUnwind::Exception)'
/usr/share/crystal/src/raise.cr:191: undefined reference to `*UInt64@Number::zero:UInt64'
/usr/share/crystal/src/raise.cr:191: undefined reference to `*UInt64@Number::zero:UInt64'
/usr/share/crystal/src/raise.cr:191: undefined reference to `*CallStack::new:CallStack'
/usr/share/crystal/src/raise.cr:191: undefined reference to `*Exception+@Exception#callstack=<CallStack>:CallStack'
/usr/share/crystal/src/raise.cr:191: undefined reference to `*Pointer(LibUnwind::Exception)@Pointer(T)::malloc:Pointer(LibUnwind::Exception)'
/usr/share/crystal/src/raise.cr:191: undefined reference to `*UInt64@Number::zero:UInt64'
/usr/share/crystal/src/raise.cr:191: undefined reference to `*UInt64@Number::zero:UInt64'
/usr/bin/ld: /home/matt/code/github/CryBoy/bin/cryboy: No symbol version section for versioned symbol `*File@IO::Evented#resume_read<Bool>:Nil'
/usr/bin/ld: final link failed: Nonrepresentable section on output
collect2: error: ld returned 1 exit status
�[31;1mError: �[0m�[1mexecution of command failed with code: 1: `cc "${@}" -o '/home/matt/code/github/CryBoy/bin/cryboy'  -rdynamic  -lSDL2 -lpcre -lm /usr/bin/../lib/crystal/lib/libgc.a -lpthread /usr/share/crystal/src/ext/libcrystal.a -levent -lrt -ldl -L/usr/bin/../lib/crystal/lib -L/usr/bin/../lib/crystal/lib`�[0m

@jhass
Copy link
Member

jhass commented Jun 2, 2020

Since you Ctrl+C, the error messages are more due to #5890 / #9314 :)

@j8r
Copy link
Contributor

j8r commented Jun 15, 2020

StaticArray#new(&) code (which causes the slow compilation) reduced:

array = uninitialized Int32[4096]
4096.times do |i|
  array.to_unsafe[i] = 0
end
array

I have also found a work-around that compiles a lot faster:

struct StaticArray(T, N)
  def self.new_fast(& : Int32 -> T)
    array = uninitialized self
    buf = array.to_unsafe
    {% for i in 0...N %}
    buf[{{i.id}}] = yield {{i.id}}
    {% end %}
    array
  end
end

@jhass jhass added kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:compiler and removed status:needs-more-info labels Jun 15, 2020
@jhass
Copy link
Member

jhass commented Jun 15, 2020

So, we did some digging into this and apparently Crystal in --release mode invokes LLVM in a way that makes it really hard for LLVM to optimize that loop. Doing as little as dumping the unoptimized LLVM IR (with crystal build --emit=llvm-ir) and running llc -O3 on it, is substantially (orders of magnitude) faster. It only takes a couple seconds.

In both cases, llc -O3 and crystal build --release, LLVM ends up completely unrolling that loop. So @j8r macro version above, just pre-unrolls the loop in Crystal land and should lead to equivalent of the optimized code. So it seems that LLVM with the crystal build --release options somehow pays a much higher price during loop unrolling than it does for llc -O3.

The first candidates I could find from a quick glance here seem to be

builder.size_level = 0
builder.use_inliner_with_threshold = 275
but removing them makes no real difference in my tests.

So, if somebody can figure out what the heck is different in the optimizer configuration between llc and crystal here (it might be easier to see if you can get llc to be slow too, then figure out how to avoid this setting from the compiler), that might give a general compiler speed bump beyond this issue.

If not, we should at least apply @j8r's workaround for now.

@waj
Copy link
Member

waj commented Jun 19, 2020

@jhass you're just running llc which translates LLVM bitcode to machine code, but you need to run opt -O3 first to reproduce the same delay. In my test opt takes just a couple of seconds, but then that optimized output takes forever for llc to translate.

I guess this has something to do with returning the array as value from the function. Note that this cannot be reproduced in C so probably clang never faced this issue. The workaround still returns the array as value, but for some reason that combination trigger some optimization edge case.

I just tried replacing the new method with an initialize instead, that internally receives the struct as a pointer. It should be exactly the same but it doesn't trigger the problem. I'll perform more tests locally and send a PR.

@jhass
Copy link
Member

jhass commented Jun 19, 2020

Huh, that's confusing, why does llc take an -O3 option then? Also in my tests the llc -O3 generated assembly looked identical to the optimized LLVM IR, in that it fully unrolled the loop.

@waj
Copy link
Member

waj commented Jun 19, 2020

I know, and for a second I thought we had an opportunity for a huge performance improvement. But the result is not the same. The optimizations they perform are different. I just tested with samples/http_server.cr and running llc optimizations alone improves from debug mode, but not as much as running all the optimizations.

@jhass
Copy link
Member

jhass commented Jun 19, 2020

I see, well TIL. Oh dear LLVM, could be so easy...

@kefahi
Copy link

kefahi commented Feb 12, 2021

I have very similar observation when trying to build a simple test code with relatively large array. I found that code as part of some bench marking exercise that test code has an array with 10,000 entries.
Running crystal build --no-debug --progress --release crTest.cr is not slow it actually enters an infinite loop (it never ends) with 100% cpu utilization in the last stages.
If the size of the array is reduced, say to 1000 entries instead of 10,000, the code compiles successfully in 30 seconds.

I use Crystal 0.36.1 on Fedora.

@jwoertink
Copy link
Contributor

I'm also getting this using 1.0.0. I tried to create a StaticArray of String with this domain list (about 120,000 strings). The file looked like:

StaticArray[
"...",
"...",
"and so on for another 120k rows",
]

After 23minutes, my terminal died when my MBP hit 25GB usage on Crystal 😂

❯ crystal run --release --stats email.cr
Parse:                             00:00:00.111867105 (  30.85MB)
Semantic (top level):              00:00:01.072054227 ( 310.95MB)
Semantic (new):                    00:00:00.001057807 ( 310.95MB)
Semantic (type declarations):      00:00:00.080901451 ( 310.95MB)
Semantic (abstract def check):     00:00:00.004426140 ( 310.95MB)
Semantic (ivars initializers):     00:00:00.051636747 ( 310.95MB)
Semantic (cvars initializers):     00:00:00.092158086 ( 342.95MB)
Semantic (main):                   00:00:00.578983974 ( 407.00MB)
Semantic (cleanup):                00:00:00.003794497 ( 407.00MB)
Semantic (recursive struct check): 00:00:00.000463927 ( 407.00MB)
Codegen (crystal):                 00:00:01.779074775 ( 487.12MB)
zsh: killed
Crystal 1.0.0 [dd40a2442] (2021-03-22)

LLVM: 6.0.1
Default target: x86_64-apple-macosx

@kefahi
Copy link

kefahi commented Mar 25, 2021

I, too, confirm that the issue still exists with Crystal 1.0.0.

@jwoertink
Copy link
Contributor

Maybe StaticArray can do what Tuple does with the limit?

raise "tuple size cannot be greater than 300 (size is #{types.size})"
Or could at least be a "quick fix" until something better comes along. In this case it would at least tank the compiler instead of chilling in the CI (my CI) for an hour 😂

@asterite
Copy link
Member

No because it's valid to have static arrays of ten K elements. For example for a buffer in the stack.

The issue is initializing a large static array.

But in general, I would advise simply not using static array at all. Use Array.

@straight-shoota
Copy link
Member

Showing an error message early is certainly better than setting the compiler off to run forever.
The error could just apply to static array literals over a certain size. That should be fine to do.

I totally agree on the general note to just use array instead.
The use case for static array is to allocate stack memory in a performance-sensitive path where heap allocation would be costly.
But don't put program data in a static array. It makes not much sense.

@asterite
Copy link
Member

The use case for static array is to allocate stack memory in a performance-sensitive path

Yes, actually, the only reason we have StaticArray is to be able to get a buffer of bytes on the stack. Any other usage is pretty contrived in my opinion. And if that's the case, it might be good (for 2.0) to just remove the type and have a specific type for that.

@straight-shoota
Copy link
Member

remove the type and have a specific type for that.

What do you mean? I'd actually consider StaticArray already a special type for that. It's just technically possible to use it for other things, too...

@asterite
Copy link
Member

I mean, removing StaticArray(T) and replacing it with a non-generic type ByteBuffer. That way people can't use it to create a giant static array of strings, because such a type wouldn't exist.

@Fryguy
Copy link
Contributor

Fryguy commented Mar 25, 2021

I think part of the issue is the name StaticArray doesn't automatically trigger a "this is on the stack" thought in one's mind. Name-wise it just sounds like a normal Array but defined statically with a fixed size.

@kefahi
Copy link

kefahi commented Mar 27, 2021

Not sure I understand the argument against StaticArray, but I switched the test code to use it and it solved the problem for me.

speedTest ) time crystal build --stats  --no-debug --progress --warnings all --time --release --ll --verbose crTest.cr
Parse:                             00:00:00.006544409 (   3.27MB)
Semantic (top level):              00:00:00.257241234 (  58.52MB)
Semantic (new):                    00:00:00.000920566 (  58.52MB)
Semantic (type declarations):      00:00:00.018386054 (  74.52MB)
Semantic (abstract def check):     00:00:00.004317707 (  74.52MB)
Semantic (ivars initializers):     00:00:00.014386552 (  74.52MB)
Semantic (cvars initializers):     00:00:00.069815220 (  90.52MB)
Semantic (main):                   00:00:00.196561334 ( 106.58MB)
Semantic (cleanup):                00:00:00.003858058 ( 106.58MB)
Semantic (recursive struct check): 00:00:00.000710583 ( 106.58MB)
Codegen (crystal):                 00:00:00.291614535 ( 122.58MB)
Codegen (bc+obj):                  00:01:11.217718396 ( 122.58MB)
cc _main.o -o /home/kefah/Downloads/youtube-tutorial-scripts/speedTest/crTest  -rdynamic -L/usr/bin/../lib/crystal/lib -lpcre -lm -lgc -lpthread /usr/share/crystal/src/ext/libcrystal.a -levent  -lrt -ldl
Codegen (linking):                 00:00:00.619240754 ( 122.58MB)
                                          
Codegen (bc+obj):
 - no previous .o files were reused

________________________________________________________
Executed in   72.87 secs    fish           external
   usr time   70.54 secs    0.20 millis   70.54 secs
   sys time    1.91 secs    1.03 millis    1.91 secs

@darkstego
Copy link
Contributor

I do think StaticArray can be beneficial in cases like #11370. If I simply change to a regular Array performance drops by 33x for small strings to around 2x for long ones, which seems to indicate a significant increase in initialization times.

But I do believe that until a resolution is found it merits at least a compiler warning for StaticArrays over a certain size. I got bit by this one, but luckily it happened just after a single change so I was able to quickly isolate the problem. But I can imagine situations where someone is scratching their head trying to figure out why compile speeds suddenly got out of hand.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:compiler
Projects
None yet