Package scope and comparison to Turing #6
Hi Mohamed! I just added a README, where I briefly go into how the library works.
Cool! I remember once seeing benchmarks suggesting Turing was 10x slower than Stan. Is its performance now rather competitive?
julia> using PaddedMatrices, StaticArrays, BenchmarkTools, LinearAlgebra, Random
julia> pA = MutableFixedSizePaddedMatrix{7,7,Float64,7}(undef); randn!(pA); #to avoid padding
julia> pB = MutableFixedSizePaddedMatrix{7,7,Float64,7}(undef); randn!(pB); #to avoid padding
julia> pC = MutableFixedSizePaddedMatrix{7,7,Float64,7}(undef);
julia> sA = SMatrix{7,7}(pA);
julia> sB = SMatrix{7,7}(pB);
julia> @btime $sA * $sB
67.005 ns (0 allocations: 0 bytes)
7×7 SArray{Tuple{7,7},Float64,2,49} with indices SOneTo(7)×SOneTo(7):
2.96419 4.83028 -2.11728 1.00643 -5.69249 4.66337 2.24875
-0.446299 -1.35403 0.00484316 -0.323731 -0.155002 -1.72822 2.07241
1.43737 -0.531415 -3.50058 -0.626386 -3.66252 3.2198 0.504351
1.35284 -1.06179 -2.70543 -2.05564 -2.45683 2.49386 0.958187
1.14669 -1.48968 -3.22534 -1.49174 -2.56398 2.87688 -0.840148
1.33924 2.57827 1.07435 7.49071 6.52194 1.27062 -0.253103
-2.73455 -0.357343 3.97818 -4.22651 -0.354159 -4.7175 -1.29675
julia> @btime mul!($pC, $pA, $pB); pC
17.997 ns (0 allocations: 0 bytes)
7×7 MutableFixedSizePaddedArray{Tuple{7,7},Float64,2,7,49}:
⋮ (entries identical to the SMatrix result above)
julia> pA = MutableFixedSizePaddedMatrix{23,23,Float64,23}(undef); randn!(pA); #to avoid padding
julia> pB = MutableFixedSizePaddedMatrix{23,23,Float64,23}(undef); randn!(pB); #to avoid padding
julia> pC = MutableFixedSizePaddedMatrix{23,23,Float64,23}(undef);
julia> sA = SMatrix{23,23}(pA);
julia> sB = SMatrix{23,23}(pB);
julia> @btime $sA * $sB
1.404 μs (0 allocations: 0 bytes)
23×23 SArray{Tuple{23,23},Float64,2,529} with indices SOneTo(23)×SOneTo(23):
0.676137 -11.5472 -4.29015 -5.12626 -6.57682 7.23703 9.99614 1.76468 1.25541 … -2.59669 -5.0571 -0.214485 -0.683722 -1.23128 1.8082 -1.67359 7.54335
-4.13968 -5.58828 3.71343 -13.4674 4.54442 -1.58558 -3.78945 -0.200756 3.10997 5.18093 -5.25095 -5.6478 3.95834 -6.12274 -0.548574 -3.54803 -0.856037
-11.2188 -12.0339 -6.31261 5.65743 1.7753 -6.29977 -0.461274 10.0776 -5.18723 3.58942 -6.23296 -0.321895 -3.41774 12.1922 6.50792 -5.91061 2.43976
-0.184734 -5.50193 2.33278 -8.37733 1.12845 0.951583 -4.23231 -2.21098 -2.55599 6.56053 -4.02659 -10.9366 2.20055 -8.96011 0.196862 -4.49699 -0.793005
3.12704 2.14365 5.85391 3.2093 1.50542 -5.46851 1.81043 -3.89439 2.17621 1.07769 7.80411 13.9072 8.97308 10.7278 -7.19291 -2.52473 -6.58659
2.44971 0.73773 -1.79319 10.0336 -6.22129 0.31953 0.783538 -6.55799 1.31856 … -3.41797 -1.6813 5.77539 -3.08049 1.66235 1.25254 7.00309 2.6951
-11.154 6.43822 2.14318 5.81669 2.03786 1.15025 -2.11491 4.13416 8.00949 -5.1131 3.3923 0.778615 0.679363 1.86173 1.83339 1.46648 6.09545
5.98668 5.10285 5.45983 -1.18721 -3.70755 7.0925 5.103 -2.27621 3.84499 -0.781866 -3.23418 -2.3769 -2.37448 -3.30754 0.933589 -2.75913 0.364206
-13.4849 -2.86356 3.653 3.56056 0.322314 1.27395 1.4939 -1.7558 9.98444 3.24293 -15.4737 -9.07452 13.5957 -2.15837 3.18083 0.976493 4.45786
-1.0082 -3.38966 -4.02155 -5.5053 -5.37906 -0.376008 -0.725058 -3.56403 -1.8769 1.8387 -2.37666 -5.56774 -1.97787 0.797312 -4.13205 -0.787578 -1.83671
1.75811 8.66197 7.20415 -1.34933 -5.1398 -1.08476 -0.984257 -7.62245 6.51987 … -6.01005 10.8171 1.81885 11.1168 -3.42523 -4.10206 1.99251 -3.41132
0.112358 -6.47499 0.316 0.831317 -8.86483 4.47975 7.09558 -7.69581 5.89953 1.59308 -11.7264 -0.805044 3.32148 -1.31097 3.71042 2.3723 1.2146
-3.47975 -10.0731 2.89155 0.404549 1.46972 -4.84939 7.18921 5.47791 2.59437 -0.9747 -5.06334 3.17199 1.14994 0.0592599 10.4362 4.56883 -3.82712
-6.85618 -7.40577 -6.59401 5.38784 -5.41918 2.10527 2.45604 -0.0598362 0.916278 -0.824134 -4.72543 4.11253 -3.92668 6.14167 4.72191 0.0359086 -0.280309
1.18071 -4.55732 -4.25194 3.94757 6.57966 -3.3513 1.16814 -1.839 -7.05042 7.78813 -3.62933 -5.2294 -0.56637 -5.63994 0.243361 -0.88296 -0.852351
-5.1927 1.47331 1.0722 2.34575 -0.763959 -3.91154 2.47275 -1.25566 5.29109 … -3.73648 -3.51496 -1.11513 0.0217914 2.44424 -4.29189 1.04555 -0.91635
2.23114 -3.24984 3.38598 -6.09565 3.0708 -3.3247 0.311149 4.43763 -2.23465 0.223676 1.95088 4.8373 -3.34843 -3.04138 -1.6914 7.67875 1.10271
-5.53478 -4.13183 -3.21564 -0.276751 -0.600247 -1.34508 4.92285 -0.811928 3.37869 1.01024 -2.3416 -1.61964 2.57656 4.31843 1.71025 4.76718 -1.24788
5.00821 -11.9721 -9.81844 -0.501608 2.79921 -1.13599 7.97383 1.7627 -1.22191 -0.620024 5.66534 2.76723 6.48373 1.60909 -5.19443 -0.565687 6.81647
-3.93591 5.5759 2.85072 0.615663 2.40643 10.4825 4.17097 -0.922532 11.3389 -2.8813 -3.35798 2.70551 1.72967 3.04221 9.47624 -2.58001 10.8072
-4.28853 2.70421 -3.46872 -3.06659 6.95763 -6.55236 0.0519248 6.77244 -1.61432 … -5.24371 10.8945 15.6312 5.20904 11.029 1.48784 1.90943 -4.33497
0.603864 -8.96422 8.12744 -2.28824 -7.73699 1.27218 7.00939 -2.23641 7.92207 -2.33262 -2.84921 -0.131382 2.8406 -4.33235 -3.18134 0.814096 5.58209
-0.471804 -0.644547 -0.0565805 -5.42696 1.46038 -9.56554 -5.56412 -7.40805 -4.75961 6.8637 1.74527 -7.12773 6.29815 0.736255 -15.2387 -2.97371 -10.2457
julia> @btime mul!($pC, $pA, $pB); pC
273.590 ns (0 allocations: 0 bytes)
23×23 MutableFixedSizePaddedArray{Tuple{23,23},Float64,2,23,529}:
⋮ (entries identical to the SMatrix result above)
And it's not just matrix multiplication:

julia> mC = MMatrix{23,23,Float64}(undef);
julia> @benchmark @. $mC = exp($sA) + $sB
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 2.600 μs (0.00% GC)
median time: 2.605 μs (0.00% GC)
mean time: 2.611 μs (0.00% GC)
maximum time: 4.084 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 9
julia> @benchmark @. $pC = exp($pA) + $pB
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 517.563 ns (0.00% GC)
median time: 531.505 ns (0.00% GC)
mean time: 536.259 ns (0.00% GC)
maximum time: 2.023 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 192

Even without padding, the PaddedMatrices broadcast is roughly 5x faster here.
One of ProbabilityModels' performance optimizations is to (pre)allocate a chunk of memory on initialization, and then pass a pointer to this memory to supported functions. Those functions can use the pointer for working memory and/or for allocating the objects they return; in the latter case, they increment the pointer. If the pointer isn't incremented, that memory gets to pull double duty, potentially keeping the cache "hotter" than it would be if you defined some structure that holds every individual array/preallocated structure (a named tuple, for type stability?).

I can't argue with flexibility. But if you'd like to implement an optimizing front end, some integration between libraries like this helps, IMO.

As for other tricks: the two biggest are that it performs source-to-source autodiff (avoiding the overhead many AD libraries seem to incur), and that it defines lots of high-level adjoints. I see Turing already has a stdlib/distributions.jl, so I think that'd be a natural extension. Or do you already have something like it somewhere else?
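The preallocate-once, pass-a-pointer strategy can be sketched roughly like this. All names here are hypothetical; ProbabilityModels works with raw pointers, while this toy version uses an integer offset into a shared buffer:

```julia
# Rough sketch of the preallocated-workspace idea described above (assumed
# names, not the ProbabilityModels API). The buffer is allocated once at
# initialization; functions receive an offset into it.
const WORKSPACE = Vector{Float64}(undef, 1 << 12)

# Uses ws[offset+1 : offset+n] as scratch and returns the offset unchanged,
# so the next call reuses (and keeps hot) the same memory.
function sum_of_squares!(ws::Vector{Float64}, offset::Int, x)
    n = length(x)
    @inbounds for i in 1:n
        ws[offset + i] = x[i]^2   # scratch writes, no heap allocation
    end
    s = 0.0
    @inbounds for i in 1:n
        s += ws[offset + i]
    end
    # A function that "allocates" its return value in the workspace would
    # instead return offset + n, so later calls don't clobber the result.
    return s, offset
end

s, off = sum_of_squares!(WORKSPACE, 0, [1.0, 2.0, 3.0])
```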
Sure, I'm interested, and the same goes for working with cscherrer (whom I can't seem to tag, I guess because he hasn't participated in this repository). If we can agree on some strategy, not duplicating our efforts will help us all move forward. How easily do you think Turing could be adapted to take advantage of the performance tricks I outlined here?
We are still working on exact benchmarks, but the final benchmarks will be different after TuringLang/Turing.jl#826 and TuringLang/Turing.jl#793 are merged. Hopefully these will be ready by JuliaCon. But I can guarantee you we are way more than 10x faster than before!
This is a very pleasant surprise to me. I would definitely love to explore the use of other array types in Turing to optimize access patterns. See this issue for more details: https://github.com/TuringLang/Turing.jl/issues/825, and this post for my view of the main components of Turing: https://discourse.julialang.org/t/mcmc-landscape/25654/21. The book-keeping struct or what we call
We are still not there yet in Turing. Aggressive pre-allocation hasn't been done yet, mostly because it wasn't the biggest perf improvement we could work on, and because Turing has quite a few moving parts, e.g. AdvancedHMC. I think this also requires a bottom-up approach, where models defined by users make use of pre-allocation and immutables where possible, and sampling algorithms enable full pre-allocation; then the
Yes, we implement adjoints for various distributions, mostly as workarounds for C libraries called by Distributions.jl. More aggressive adjoints for high-level functions could also be included if they help performance. We currently support 2 AD backends: ForwardDiff and Tracker. @willtebbutt also recently used Zygote to define adjoints for
So there are 2 sides to the performance story in Turing:
Perhaps we can have a meeting with the other Turing members and Chad to discuss future plans. CC: @yebai, @xukai92, @willtebbutt, @trappmartin, @cpfiffer. Referring to https://discourse.julialang.org/t/mcmc-landscape/25654/21, it seems to me that you are experimenting with a different book-keeping strategy for models, which is definitely something you could implement directly in Turing or port over after it proves successful. An ideal Turing for me with respect to usability and performance:
So the goal is to enable the following 3 types of optimizations and not get in the way:
I hope this clarifies things from the Turing side.
CC: @cscherrer
I really like the idea of PPL-interested people joining forces. This way everyone can more easily focus on the parts they are interested in. I think it would be very beneficial if we could work together and have an initial Skype conference.
As a side note: we are extremely fortunate to have Mohamed on the team, as most of us are not necessarily very good at writing efficient Julia code, but are more familiar with Bayesian inference and advanced topics in Bayesian statistics. Thus, getting more people familiar with efficient Julia implementations involved with Turing would definitely be a great thing.
@trappmartin thanks for your kind words :)
That is great to hear.
I've done a whole lot of performance work on PaddedMatrices.jl.

julia> using PaddedMatrices, StaticArrays, BenchmarkTools
julia> pa = @btime @Mutable randn(32,32);
1.588 μs (1 allocation: 8.06 KiB)
julia> ma = @btime @MMatrix randn(32,32);
4.729 μs (1 allocation: 8.06 KiB)

They are both mutable structs wrapping an NTuple. They are fully compatible in terms of memory layout, so we can even just

julia> pa.data = ma.data;
julia> [pa[i,j] == ma[i,j] for i in 1:32, j in 1:32 ] # broadcasting isn't reliable at the moment...
32×32 Array{Bool,2}:
 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
 ⋮ (all 1024 entries are 1; the two arrays hold identical values)

Yet there is a massive difference in performance, even for such a simple function:

julia> function my_sum(x)
s = zero(eltype(x))
@inbounds @fastmath for i in 1:length(x)
s += x[i]
end
s
end
my_sum (generic function with 1 method)
julia> @btime my_sum($ma)
876.833 ns (0 allocations: 0 bytes)
21.33859997410549
julia> @btime my_sum($pa)
31.660 ns (0 allocations: 0 bytes)
21.33859997410542

As implied by the times, we get very different assembly:

julia> @code_native my_sum(ma)
.text
; ┌ @ REPL[31]:2 within `my_sum'
vxorpd %xmm0, %xmm0, %xmm0
xorl %eax, %eax
nopw %cs:(%rax,%rax)
; │ @ REPL[31]:4 within `my_sum'
; │┌ @ fastmath.jl:161 within `add_fast'
L16:
vaddsd (%rdi,%rax,8), %xmm0, %xmm0
; │└
; │┌ @ range.jl:595 within `iterate'
; ││┌ @ promotion.jl:403 within `=='
addq $1, %rax
cmpq $1024, %rax # imm = 0x400
; │└└
jne L16
; │ @ REPL[31]:6 within `my_sum'
retq
nopw %cs:(%rax,%rax)
nopl (%rax)
; └
julia> @code_native my_sum(pa)
.text
; ┌ @ REPL[31]:2 within `my_sum'
vxorpd %xmm0, %xmm0, %xmm0
xorl %eax, %eax
vxorpd %xmm1, %xmm1, %xmm1
vxorpd %xmm2, %xmm2, %xmm2
vxorpd %xmm3, %xmm3, %xmm3
nopw %cs:(%rax,%rax)
nopl (%rax)
; │ @ REPL[31]:4 within `my_sum'
; │┌ @ fastmath.jl:161 within `add_fast'
L32:
vaddpd (%rdi,%rax,8), %zmm0, %zmm0
vaddpd 64(%rdi,%rax,8), %zmm1, %zmm1
vaddpd 128(%rdi,%rax,8), %zmm2, %zmm2
vaddpd 192(%rdi,%rax,8), %zmm3, %zmm3
addq $32, %rax
cmpq $1024, %rax # imm = 0x400
jne L32
vaddpd %zmm0, %zmm1, %zmm0
vaddpd %zmm0, %zmm2, %zmm0
vaddpd %zmm0, %zmm3, %zmm0
vextractf64x4 $1, %zmm0, %ymm1
vaddpd %zmm1, %zmm0, %zmm0
vextractf128 $1, %ymm0, %xmm1
vaddpd %zmm1, %zmm0, %zmm0
vpermilpd $1, %xmm0, %xmm1 # xmm1 = xmm0[1,0]
vaddpd %xmm1, %xmm0, %xmm0
; │└
; │ @ REPL[31]:6 within `my_sum'
vzeroupper
retq
nopw %cs:(%rax,%rax)
nop
Summing a regular Array, with the loop length known at compile time:

julia> aa = Array(pa);
julia> @btime my_sum1024($aa) # replaced `for i in 1:length(x)` with `for i in 1:1024`
31.105 ns (0 allocations: 0 bytes)
21.33859997410542

julia> @code_native my_sum1024(aa)
.text
; ┌ @ REPL[37]:2 within `my_sum1024'
movq (%rdi), %rax
vxorpd %xmm0, %xmm0, %xmm0
xorl %ecx, %ecx
vxorpd %xmm1, %xmm1, %xmm1
vxorpd %xmm2, %xmm2, %xmm2
vxorpd %xmm3, %xmm3, %xmm3
nopw %cs:(%rax,%rax)
nop
; │ @ REPL[37]:4 within `my_sum1024'
; │┌ @ fastmath.jl:161 within `add_fast'
L32:
vaddpd (%rax,%rcx,8), %zmm0, %zmm0
vaddpd 64(%rax,%rcx,8), %zmm1, %zmm1
vaddpd 128(%rax,%rcx,8), %zmm2, %zmm2
vaddpd 192(%rax,%rcx,8), %zmm3, %zmm3
addq $32, %rcx
cmpq $1024, %rcx # imm = 0x400
jne L32
vaddpd %zmm0, %zmm1, %zmm0
vaddpd %zmm0, %zmm2, %zmm0
vaddpd %zmm0, %zmm3, %zmm0
vextractf64x4 $1, %zmm0, %ymm1
vaddpd %zmm1, %zmm0, %zmm0
vextractf128 $1, %ymm0, %xmm1
vaddpd %zmm1, %zmm0, %zmm0
vpermilpd $1, %xmm0, %xmm1 # xmm1 = xmm0[1,0]
vaddpd %xmm1, %xmm0, %xmm0
; │└
; │ @ REPL[37]:6 within `my_sum1024'
vzeroupper
retq
nopw %cs:(%rax,%rax)
nop
; └ So this at least is nothing special about
I've heard that a bunch of inlining isn't necessarily optimal for performance, and I intend to move away from that route once this issue is resolved.

Regarding the book-keeping struct: I think my equivalent is that defining a model also defines a struct of that name. One can then create an instance of the struct, using keyword arguments to define each of the unknowns as either particular objects (e.g. observed data) or unknown parameters.

Given a vector with length equal to the total length of all parameters, the generated logdensity / logdensity-plus-gradient functions simply load all the parameters from the vector (adding required Jacobians and their gradients), assigning them to variables with the parameter's name:

mangled_name = @inbounds mangled_param_vector_name[mangled_some_index]
σ = Base.exp(mangled_name)
target += mangled_name # log jacobian

(Mangling here just refers to ensuring hygiene; target is the same as in Stan.) Julia's compiler obviously has no problem figuring out types from code like this, or optimizing it. That the mental model of what is going on in the generated functions is so familiar -- i.e., exactly what goes on in code we write directly -- is why I keep gravitating toward them.

It would be an overstatement to say ProbabilityModels only supports static models at the moment, since saying that would imply it works reliably enough to say it supports anything.

julia> Meta.@lower for i ∈ 1:5
target += p[i]
end
:($(Expr(:thunk, CodeInfo(
@ none within `top-level scope`
1 ─ %1 = 1:5
│ #s9 = Base.iterate(%1)
│ %3 = #s9 === nothing
│ %4 = Base.not_int(%3)
└── goto #4 if not %4
2 ┄ %6 = #s9
│ i = Core.getfield(%6, 1)
│ %8 = Core.getfield(%6, 2)
│ └
│ @ REPL[20]:2 within `top-level scope`
│ %9 = target
│ %10 = Base.getindex(p, i)
│ target = %9 + %10
│ #s9 = Base.iterate(%1, %8)
│ %13 = #s9 === nothing
│ %14 = Base.not_int(%13)
└── goto #4 if not %14
3 ─ goto #2
4 ┄ return
But I plan on letting someone hide dynamic pieces in another function, and just
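For concreteness, here is a hand-written sketch of what the generated log-density code described above looks like for a single positive parameter σ stored as log σ. The function name and the model term are made-up illustrations, not ProbabilityModels output:

```julia
# Sketch of a generated log-density function (hypothetical; hand-written).
# The sampler works in unconstrained space, so σ is stored as log(σ);
# loading it applies exp and adds the log-Jacobian of the transform.
function logdensity_sketch(θ::AbstractVector{Float64})
    target = 0.0
    logσ = @inbounds θ[1]   # unconstrained parameter from the flat vector
    σ = exp(logσ)           # constrained value
    target += logσ          # log |dσ/d logσ| = logσ  (Jacobian term)
    # a pretend model term: Normal(0, σ) log-likelihood for one datum,
    # dropping the additive constant -0.5*log(2π)
    x = 1.5
    target += -log(σ) - 0.5 * (x / σ)^2
    return target
end
```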
Regarding PPL people joining forces, how about also looping in the Gen.jl team? @marcoct They are doing some cool stuff, but taking a less conventional approach, so there may or may not be enough intersection here; if there is, it would be very powerful (and simplifying for end users).
Who in the Julia∩PPL community will be at JuliaCon? I'll be there for the week; it should be a great time to sync.
I think @cpfiffer and @willtebbutt are attending. Maybe @xukai92 too.
There's also going to be a Gen.jl talk.
Could make sense to move this to Discourse?
@chriselrod thanks for the elaborate response. Sorry it took me a while to get back to you; I am officially on holiday! I am definitely sold on PaddedMatrices.jl. I will try to incorporate it in Turing's
The main complexity of

We also store the distribution of each random variable to enable the transformation of parameters from the constrained parameter space to the Euclidean space and back for Hamiltonian sampling. The distribution also helps us reconstruct the parameter value in its true form from the linearized values, since we store the values in a single vector per symbol. Because of its breadth,

Your pre-allocation approach is interesting and somewhat non-traditional as far as Julia packages are concerned, IIUC. Ideally, I would like to have some form of pre-allocation but without having to manage the pointers explicitly, as that may increase the complexity of the code. I would love to read ProbabilityModels.jl to understand your approach better. It may not be that hard after all! Thanks for your great package once again, and good luck!
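The constrained-to-unconstrained bookkeeping mentioned above can be illustrated minimally for a positive-constrained parameter. The helper names here are made up; Turing's actual transform machinery is more general:

```julia
# Toy version of the transform bookkeeping for one positive parameter
# (hypothetical helper names, not Turing's API).
to_unconstrained(σ) = log(σ)          # constrained -> Euclidean space
to_constrained(y)   = exp(y)          # Euclidean -> constrained space
logabsdetjac(y)     = y               # log |d exp(y)/dy| = y

y = to_unconstrained(2.0)             # the sampler works with y
σ = to_constrained(y)                 # reconstruct the true value
```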
Hi Chris!
Thanks for this cool package. I wonder what the intended scope of this package is, and how it compares to something like Turing.jl. Also, if you have some performance-improvement ideas for Turing.jl, I would be interested to hear them. Note that there is now a lot of interest in the Turing team in improving its performance beyond simple type stability and Julia gotchas, and your work here seems quite relevant. Turing is now very close to being fully type stable, and Chris Rackauckas already reports a 15x speedup in his DiffEqBayes, which uses Turing.
Eliminating all unnecessary allocations and enabling the full use of static arrays is also on my radar, but I am curious whether there are other tricks you would suggest. I see you rely a lot on PaddedMatrices.jl, but how much of this do you think should be on the Bayesian-inference side, versus the user side via their choice of input array types that dispatch to the appropriate methods? Finally, would you be interested in joining forces?