Speed up `group_by_color` (#116)

Conversation
Codecov Report: all modified and coverable lines are covered by tests ✅

```
@@ Coverage Diff @@
##              main     #116   +/- ##
=========================================
  Coverage   100.00%  100.00%
=========================================
  Files           12       12
  Lines          878      884     +6
=========================================
+ Hits           878      884     +6
```

View full report in Codecov by Sentry.
Benchmark Results
On second thought, it is a kind of bucket sort; maybe we could just use that.
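For reference, the grouping pass is essentially a counting sort (a bucket sort with unit-width buckets). Here is a minimal standalone sketch of the idea, with hypothetical names, assuming integer keys in `1:kmax`:

```julia
# Counting sort: bucket integer keys in 1:kmax, then emit them in order.
function counting_sort(keys::Vector{Int}, kmax::Int)
    counts = zeros(Int, kmax)
    for k in keys
        counts[k] += 1          # first pass: histogram of the keys
    end
    sorted = Vector{Int}(undef, length(keys))
    i = 1
    for k in 1:kmax, _ in 1:counts[k]
        sorted[i] = k           # second pass: emit each key counts[k] times
        i += 1
    end
    return sorted
end

counting_sort([3, 1, 2, 1], 3)  # -> [1, 1, 2, 3]
```

The two functions below follow the same two-pass pattern: one pass to count group sizes, one pass to place elements.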
An example of the benefits, for the following code:

```julia
using SparseMatrixColorings
using SparseArrays

problem = ColoringProblem(; structure=:nonsymmetric, partition=:column)
algo = GreedyColoringAlgorithm(; decompression=:direct)
A = sprand(Bool, 1000, 1000, 0.02)
coloring(A, problem, algo)
@profview_allocs for _ in 1:10000; coloring(A, problem, algo); end
```

Before and after: allocation count and allocation size (flame graph screenshots).
@gdalle Your argument is that we get a better flame graph? Sincerely, I don't think we should do this. I think we should focus on more important issues for now and potentially revisit this PR later if you really want to keep it.
First of all, let's remain civil please.

No, it's a combination of things (profiling was just one of those):

- Benchmarks are noisy and they don't tell the whole story, otherwise your manual transposition would have been clearly superior in #107. On certain use cases this approach is faster.
- I haven't seen you make a strong case for this either: what are your arguments?

Here's the thing though: the right way to decide about important performance issues is to profile the code to find the bottlenecks. If the profile is hard to read because one source of allocation takes up all the space, that makes our life harder for no reason.
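As an illustration of that workflow, Julia's built-in `Profile` stdlib can locate hot spots before deciding what to optimize; a minimal sketch with a hypothetical workload function `work`:

```julia
using Profile

# Hypothetical workload whose hot spots we want to locate.
work(n) = sum(sqrt(i) for i in 1:n)

work(10)                     # run once first so compilation is not profiled
Profile.clear()
@profile work(10^7)
Profile.print(; maxdepth=8)  # inspect the hottest call paths
```

Tools like `@profview_allocs` (used earlier in this thread) apply the same idea to allocations instead of time.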
TLDR: my approach is slightly worse when there are very few colors (around 3), and faster otherwise (>= 10). Here's a benchmark taking only the grouping into account:

```julia
using BenchmarkTools

function compute_group_sizes(colors::Vector{Int})
    cmax = maximum(colors)
    group_sizes = zeros(Int, cmax)
    for c in colors
        group_sizes[c] += 1
    end
    return group_sizes
end
```
```julia
function split_vecvec(colors::Vector{Int})
    group_sizes = compute_group_sizes(colors)
    # one vector allocated per color group
    groups = [Vector{Int}(undef, group_sizes[c]) for c in eachindex(group_sizes)]
    fill!(group_sizes, 0)
    for (k, c) in enumerate(colors)
        group_sizes[c] += 1
        pos = group_sizes[c]
        groups[c][pos] = k
    end
    return groups
end
```
```julia
function split_vecview(colors::Vector{Int})
    group_sizes = compute_group_sizes(colors)
    group_offsets = cumsum(group_sizes)
    groups_flat = similar(colors)
    for (k, c) in enumerate(colors)
        i = group_offsets[c] - group_sizes[c] + 1
        groups_flat[i] = k
        group_sizes[c] -= 1
    end
    TV = typeof(view(groups_flat, 1:1))
    groups = Vector{TV}(undef, length(group_sizes))  # allocation 4, size cmax
    for c in eachindex(group_sizes)
        i = 1 + (c == 1 ? 0 : group_offsets[c - 1])
        j = group_offsets[c]
        groups[c] = view(groups_flat, i:j)
    end
    return groups
end
```

And the benchmarking results (a ratio > 1 means the approach with views is better):

```julia
julia> for n in 10 .^ (2, 3, 4, 5), cmax in (3, 10, 30, 100)
           yield()
           bench_vecvec = @benchmark split_vecvec(_colors) setup = (_colors = rand(1:($cmax), $n))
           bench_vecview = @benchmark split_vecview(_colors) setup = (
               _colors = rand(1:($cmax), $n)
           )
           ratios = (
               time=minimum(bench_vecvec).time / minimum(bench_vecview).time,
               memory=minimum(bench_vecvec).memory / minimum(bench_vecview).memory,
               allocs=minimum(bench_vecvec).allocs / minimum(bench_vecview).allocs,
           )
           @info "Vecvec / vecview ratios - n=$n, cmax=$cmax" ratios.time ratios.memory ratios.allocs
       end
```
Vecvec / vecview ratios (time and memory rounded to two decimals):

| n      | cmax | time | memory | allocs |
|-------:|-----:|-----:|-------:|-------:|
| 100    | 3    | 0.99 | 0.92   | 1.25   |
| 100    | 10   | 1.43 | 0.97   | 3.0    |
| 100    | 30   | 1.98 | 1.10   | 7.5    |
| 100    | 100  | 2.65 | 1.24   | 23.0   |
| 1000   | 3    | 1.12 | 1.01   | 1.25   |
| 1000   | 10   | 1.10 | 1.01   | 3.0    |
| 1000   | 30   | 1.19 | 1.03   | 8.0    |
| 1000   | 100  | 1.53 | 1.12   | 25.5   |
| 10000  | 3    | 0.95 | 1.00   | 1.6    |
| 10000  | 10   | 1.05 | 1.01   | 2.4    |
| 10000  | 30   | 1.25 | 1.03   | 6.4    |
| 10000  | 100  | 1.26 | 1.05   | 20.4   |
| 100000 | 3    | 0.96 | 1.00   | 1.6    |
| 100000 | 10   | 1.05 | 1.00   | 4.4    |
| 100000 | 30   | 1.03 | 1.00   | 12.4   |
| 100000 | 100  | 1.02 | 1.01   | 20.4   |
@amontoison thoughts on this one? When we benchmark the grouping function on its own, as you see above, the benefits are clear as soon as we go beyond a few colors (`cmax >= 10`).
Closing temporarily because bicoloring will require rethinking this grouping function. We can reoptimize it afterwards.

Actually, I figured out a way to keep the same grouping behavior in the bicoloring branch, so we can merge this one.
@amontoison this was a very quick review, and while the global benchmarks don't show much of a difference, the specific benchmarks in this comment strongly support this change. What do you think?
Merged 😉
In `group_by_color`, instead of allocating one vector per color group, allocate a single common vector and return one view for each group. That way we only need a constant number of allocations, independent of the number of colors.
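As a self-contained illustration of that layout, here is a compact sketch of the same idea (with hypothetical names, not the package's actual implementation):

```julia
# Flat-vector + views grouping: a constant number of allocations
# regardless of how many colors there are.
function group_by_views(colors::Vector{Int})
    cmax = maximum(colors)
    sizes = zeros(Int, cmax)              # allocation 1: group sizes
    for c in colors
        sizes[c] += 1
    end
    offsets = cumsum(sizes)               # allocation 2: prefix sums
    flat = similar(colors)                # allocation 3: one shared buffer
    for (k, c) in enumerate(colors)
        # place index k at the next free slot of its color's segment
        flat[offsets[c] - sizes[c] + 1] = k
        sizes[c] -= 1
    end
    # allocation 4: one small vector holding a view per color
    return [view(flat, (c == 1 ? 1 : offsets[c - 1] + 1):offsets[c]) for c in 1:cmax]
end

groups = group_by_views([2, 1, 2, 3, 1])
# groups[1] == [2, 5], groups[2] == [1, 3], groups[3] == [4]
```

Each returned view aliases the shared buffer, so the caller gets the same "vector of index groups" interface as before without one heap allocation per color.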