Speed up group_by_color #116

Merged
amontoison merged 7 commits into main from gd/better_grouping on Oct 7, 2024

Conversation

gdalle
Owner

@gdalle gdalle commented Sep 26, 2024

  • In group_by_color, instead of allocating one vector per color group, allocate a single common vector and return one view for each group. That way we only need $4$ allocations instead of $O(c_{\max})$.
  • Adjust docstrings because the type of the returned groups has changed from vector to view (but that does not break any public API).
  • Add tests for grouping.
  • Activate memory benchmarking in addition to time.
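The tests added by the PR are not shown in this conversation. As a minimal sketch of the kind of property a grouping test could check, the hypothetical helper `is_valid_grouping` below (a name introduced here, not part of SparseMatrixColorings) verifies that every index lands in exactly one group and that group `c` only contains indices of color `c`. It accepts vectors and views alike, since it only iterates over the groups.

```julia
# Hypothetical helper (not part of SparseMatrixColorings): checks that `groups`
# is a valid grouping of the indices of `colors`.
function is_valid_grouping(groups, colors::AbstractVector{<:Integer})
    seen = falses(length(colors))
    for (c, group) in enumerate(groups)
        for k in group
            # every index must carry the color of its group and appear only once
            (colors[k] == c && !seen[k]) || return false
            seen[k] = true
        end
    end
    return all(seen)  # every index must belong to some group
end

colors = [2, 1, 2, 3, 1]
# Naive reference grouping: one freshly allocated vector per color.
reference = [findall(==(c), colors) for c in 1:maximum(colors)]
is_valid_grouping(reference, colors)
```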

@gdalle gdalle added the benchmark Run benchmarks on PR label Sep 26, 2024
@gdalle gdalle requested a review from amontoison September 26, 2024 22:15

codecov bot commented Sep 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (c28490d) to head (0832e31).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #116   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           12        12           
  Lines          878       884    +6     
=========================================
+ Hits           878       884    +6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Contributor

github-actions bot commented Sep 26, 2024

Benchmark Results

Time benchmarks (columns: benchmark, main, dc514b9..., ratio main/dc514b923d33a4...)
coloring/nonsymmetric/column/direct/n=1000/p=0.002 0.0861 ± 0.0041 ms 0.0858 ± 0.0039 ms 1
coloring/nonsymmetric/column/direct/n=1000/p=0.005 0.188 ± 0.0076 ms 0.187 ± 0.0077 ms 1
coloring/nonsymmetric/column/direct/n=1000/p=0.01 0.396 ± 0.013 ms 0.39 ± 0.014 ms 1.01
coloring/nonsymmetric/column/direct/n=100000/p=0.0001 0.0736 ± 0.002 s 0.0736 ± 0.0044 s 1
coloring/nonsymmetric/column/direct/n=100000/p=2.0e-5 13.5 ± 0.17 ms 13.5 ± 0.19 ms 1.01
coloring/nonsymmetric/column/direct/n=100000/p=5.0e-5 0.0324 ± 0.00061 s 0.0325 ± 0.0005 s 0.997
coloring/nonsymmetric/row/direct/n=1000/p=0.002 0.103 ± 0.0045 ms 0.103 ± 0.0047 ms 0.993
coloring/nonsymmetric/row/direct/n=1000/p=0.005 0.213 ± 0.0084 ms 0.214 ± 0.0089 ms 0.996
coloring/nonsymmetric/row/direct/n=1000/p=0.01 0.439 ± 0.015 ms 0.441 ± 0.016 ms 0.995
coloring/nonsymmetric/row/direct/n=100000/p=0.0001 0.0827 ± 0.0022 s 0.0817 ± 0.0017 s 1.01
coloring/nonsymmetric/row/direct/n=100000/p=2.0e-5 16 ± 0.22 ms 16 ± 0.23 ms 0.998
coloring/nonsymmetric/row/direct/n=100000/p=5.0e-5 0.0371 ± 0.00091 s 0.0372 ± 0.00098 s 0.999
coloring/symmetric/column/direct/n=1000/p=0.002 0.271 ± 0.014 ms 0.273 ± 0.014 ms 0.995
coloring/symmetric/column/direct/n=1000/p=0.005 0.74 ± 0.029 ms 0.739 ± 0.028 ms 1
coloring/symmetric/column/direct/n=1000/p=0.01 1.81 ± 0.06 ms 1.81 ± 0.055 ms 1
coloring/symmetric/column/direct/n=100000/p=0.0001 0.381 ± 0.031 s 0.448 ± 0.029 s 0.852
coloring/symmetric/column/direct/n=100000/p=2.0e-5 0.0372 ± 0.0011 s 0.0387 ± 0.0022 s 0.959
coloring/symmetric/column/direct/n=100000/p=5.0e-5 0.126 ± 0.0046 s 0.131 ± 0.0093 s 0.962
coloring/symmetric/column/substitution/n=1000/p=0.002 0.614 ± 0.031 ms 0.624 ± 0.032 ms 0.983
coloring/symmetric/column/substitution/n=1000/p=0.005 1.62 ± 0.062 ms 1.62 ± 0.059 ms 0.999
coloring/symmetric/column/substitution/n=1000/p=0.01 3.5 ± 0.1 ms 3.51 ± 0.098 ms 0.998
coloring/symmetric/column/substitution/n=100000/p=0.0001 0.775 ± 0.023 s 0.882 ± 0.017 s 0.879
coloring/symmetric/column/substitution/n=100000/p=2.0e-5 0.0968 ± 0.01 s 0.0979 ± 0.0077 s 0.989
coloring/symmetric/column/substitution/n=100000/p=5.0e-5 0.311 ± 0.021 s 0.313 ± 0.022 s 0.996
decompress/nonsymmetric/column/direct/n=1000/p=0.002 4.48 ± 0.36 μs 4.24 ± 0.32 μs 1.06
decompress/nonsymmetric/column/direct/n=1000/p=0.005 9.21 ± 0.63 μs 9.21 ± 0.78 μs 1
decompress/nonsymmetric/column/direct/n=1000/p=0.01 19.5 ± 1.1 μs 27.4 ± 7.2 μs 0.714
decompress/nonsymmetric/column/direct/n=100000/p=0.0001 5.16 ± 0.43 ms 5.09 ± 0.4 ms 1.01
decompress/nonsymmetric/column/direct/n=100000/p=2.0e-5 1.03 ± 0.14 ms 1.02 ± 0.11 ms 1.01
decompress/nonsymmetric/column/direct/n=100000/p=5.0e-5 2.51 ± 0.24 ms 2.51 ± 0.23 ms 0.998
decompress/nonsymmetric/row/direct/n=1000/p=0.002 4.32 ± 0.3 μs 4.2 ± 0.33 μs 1.03
decompress/nonsymmetric/row/direct/n=1000/p=0.005 8.57 ± 0.53 μs 7.92 ± 0.59 μs 1.08
decompress/nonsymmetric/row/direct/n=1000/p=0.01 16.6 ± 0.93 μs 17.7 ± 1.7 μs 0.934
decompress/nonsymmetric/row/direct/n=100000/p=0.0001 2.06 ± 0.39 ms 2.04 ± 0.16 ms 1.01
decompress/nonsymmetric/row/direct/n=100000/p=2.0e-5 0.401 ± 0.05 ms 0.39 ± 0.057 ms 1.03
decompress/nonsymmetric/row/direct/n=100000/p=5.0e-5 0.94 ± 0.11 ms 0.934 ± 0.17 ms 1.01
decompress/symmetric/column/direct/n=1000/p=0.002 4.17 ± 0.31 μs 4.1 ± 0.3 μs 1.01
decompress/symmetric/column/direct/n=1000/p=0.005 8.32 ± 0.46 μs 8.07 ± 0.52 μs 1.03
decompress/symmetric/column/direct/n=1000/p=0.01 17.2 ± 0.98 μs 16.8 ± 1.1 μs 1.02
decompress/symmetric/column/direct/n=100000/p=0.0001 4.26 ± 0.19 ms 5.42 ± 2.4 ms 0.786
decompress/symmetric/column/direct/n=100000/p=2.0e-5 0.827 ± 0.1 ms 0.825 ± 0.11 ms 1
decompress/symmetric/column/direct/n=100000/p=5.0e-5 2.17 ± 0.64 ms 2.09 ± 0.32 ms 1.04
decompress/symmetric/column/substitution/n=1000/p=0.002 0.0638 ± 0.0038 ms 0.0649 ± 0.0041 ms 0.983
decompress/symmetric/column/substitution/n=1000/p=0.005 0.158 ± 0.0068 ms 0.16 ± 0.0075 ms 0.991
decompress/symmetric/column/substitution/n=1000/p=0.01 0.34 ± 0.014 ms 0.343 ± 0.014 ms 0.993
decompress/symmetric/column/substitution/n=100000/p=0.0001 0.0637 ± 0.002 s 0.0657 ± 0.00054 s 0.969
decompress/symmetric/column/substitution/n=100000/p=2.0e-5 12.6 ± 0.3 ms 12.8 ± 0.49 ms 0.986
decompress/symmetric/column/substitution/n=100000/p=5.0e-5 29.7 ± 0.49 ms 31.1 ± 0.93 ms 0.953
time_to_load 0.216 ± 0.0013 s 0.216 ± 0.002 s 1
Memory benchmarks (columns: benchmark, main, dc514b9..., ratio main/dc514b923d33a4...)
coloring/nonsymmetric/column/direct/n=1000/p=0.002 13 allocs: 0.0593 MB 9 allocs: 0.0585 MB 1.02
coloring/nonsymmetric/column/direct/n=1000/p=0.005 23 allocs: 0.103 MB 11 allocs: 0.103 MB 1
coloring/nonsymmetric/column/direct/n=1000/p=0.01 0.042 k allocs: 0.178 MB 11 allocs: 0.178 MB 0.996
coloring/nonsymmetric/column/direct/n=100000/p=0.0001 0.074 k allocs: 18.2 MB 15 allocs: 18.3 MB 0.998
coloring/nonsymmetric/column/direct/n=100000/p=2.0e-5 27 allocs: 6.08 MB 15 allocs: 6.08 MB 1
coloring/nonsymmetric/column/direct/n=100000/p=5.0e-5 0.042 k allocs: 10.6 MB 15 allocs: 10.6 MB 1
coloring/nonsymmetric/row/direct/n=1000/p=0.002 21 allocs: 0.0944 MB 17 allocs: 0.0935 MB 1.01
coloring/nonsymmetric/row/direct/n=1000/p=0.005 31 allocs: 0.183 MB 19 allocs: 0.182 MB 1
coloring/nonsymmetric/row/direct/n=1000/p=0.01 0.049 k allocs: 0.332 MB 19 allocs: 0.33 MB 1
coloring/nonsymmetric/row/direct/n=100000/p=0.0001 0.083 k allocs: 0.0334 GB 24 allocs: 0.0334 GB 1
coloring/nonsymmetric/row/direct/n=100000/p=2.0e-5 0.037 k allocs: 9.86 MB 24 allocs: 9.87 MB 1
coloring/nonsymmetric/row/direct/n=100000/p=5.0e-5 0.051 k allocs: 19 MB 24 allocs: 19 MB 0.998
coloring/symmetric/column/direct/n=1000/p=0.002 1.08 k allocs: 0.275 MB 1.07 k allocs: 0.275 MB 1
coloring/symmetric/column/direct/n=1000/p=0.005 2.66 k allocs: 0.425 MB 2.65 k allocs: 0.426 MB 0.999
coloring/symmetric/column/direct/n=1000/p=0.01 5.69 k allocs: 1.1 MB 5.71 k allocs: 1.1 MB 1
coloring/symmetric/column/direct/n=100000/p=0.0001 0.603 M allocs: 0.104 GB 0.604 M allocs: 0.105 GB 0.999
coloring/symmetric/column/direct/n=100000/p=2.0e-5 0.117 M allocs: 23.7 MB 0.117 M allocs: 23.7 MB 1
coloring/symmetric/column/direct/n=100000/p=5.0e-5 0.286 M allocs: 0.0517 GB 0.287 M allocs: 0.0517 GB 0.999
coloring/symmetric/column/substitution/n=1000/p=0.002 3.46 k allocs: 0.667 MB 3.44 k allocs: 0.67 MB 0.997
coloring/symmetric/column/substitution/n=1000/p=0.005 6.59 k allocs: 1.19 MB 6.73 k allocs: 1.2 MB 0.989
coloring/symmetric/column/substitution/n=1000/p=0.01 12.6 k allocs: 2.65 MB 12.6 k allocs: 2.65 MB 0.999
coloring/symmetric/column/substitution/n=100000/p=0.0001 1.29 M allocs: 0.231 GB 1.29 M allocs: 0.231 GB 0.999
coloring/symmetric/column/substitution/n=100000/p=2.0e-5 0.39 M allocs: 0.0626 GB 0.388 M allocs: 0.0625 GB 1
coloring/symmetric/column/substitution/n=100000/p=5.0e-5 0.699 M allocs: 0.122 GB 0.702 M allocs: 0.123 GB 0.995
decompress/nonsymmetric/column/direct/n=1000/p=0.002 3 allocs: 0.0353 MB 3 allocs: 0.0354 MB 0.997
decompress/nonsymmetric/column/direct/n=1000/p=0.005 5 allocs: 0.0787 MB 5 allocs: 0.0792 MB 0.994
decompress/nonsymmetric/column/direct/n=1000/p=0.01 5 allocs: 0.153 MB 5 allocs: 0.153 MB 0.997
decompress/nonsymmetric/column/direct/n=100000/p=0.0001 6 allocs: 16 MB 6 allocs: 16 MB 1
decompress/nonsymmetric/column/direct/n=100000/p=2.0e-5 6 allocs: 3.79 MB 6 allocs: 3.79 MB 0.999
decompress/nonsymmetric/column/direct/n=100000/p=5.0e-5 6 allocs: 8.36 MB 6 allocs: 8.35 MB 1
decompress/nonsymmetric/row/direct/n=1000/p=0.002 3 allocs: 0.0349 MB 3 allocs: 0.0355 MB 0.983
decompress/nonsymmetric/row/direct/n=1000/p=0.005 5 allocs: 0.0791 MB 5 allocs: 0.0791 MB 1
decompress/nonsymmetric/row/direct/n=1000/p=0.01 5 allocs: 0.154 MB 5 allocs: 0.155 MB 0.997
decompress/nonsymmetric/row/direct/n=100000/p=0.0001 6 allocs: 16 MB 6 allocs: 16 MB 1
decompress/nonsymmetric/row/direct/n=100000/p=2.0e-5 6 allocs: 3.79 MB 6 allocs: 3.79 MB 1
decompress/nonsymmetric/row/direct/n=100000/p=5.0e-5 6 allocs: 8.36 MB 6 allocs: 8.37 MB 0.999
decompress/symmetric/column/direct/n=1000/p=0.002 3 allocs: 0.0347 MB 3 allocs: 0.0352 MB 0.986
decompress/symmetric/column/direct/n=1000/p=0.005 5 allocs: 0.0792 MB 5 allocs: 0.078 MB 1.02
decompress/symmetric/column/direct/n=1000/p=0.01 5 allocs: 0.152 MB 5 allocs: 0.152 MB 1
decompress/symmetric/column/direct/n=100000/p=0.0001 6 allocs: 16 MB 6 allocs: 16 MB 1
decompress/symmetric/column/direct/n=100000/p=2.0e-5 6 allocs: 3.79 MB 6 allocs: 3.79 MB 0.998
decompress/symmetric/column/direct/n=100000/p=5.0e-5 6 allocs: 8.36 MB 6 allocs: 8.37 MB 0.999
decompress/symmetric/column/substitution/n=1000/p=0.002 3 allocs: 0.0348 MB 3 allocs: 0.0352 MB 0.99
decompress/symmetric/column/substitution/n=1000/p=0.005 5 allocs: 0.0789 MB 5 allocs: 0.0797 MB 0.991
decompress/symmetric/column/substitution/n=1000/p=0.01 5 allocs: 0.154 MB 5 allocs: 0.154 MB 1
decompress/symmetric/column/substitution/n=100000/p=0.0001 6 allocs: 16 MB 6 allocs: 16 MB 1
decompress/symmetric/column/substitution/n=100000/p=2.0e-5 6 allocs: 3.79 MB 6 allocs: 3.8 MB 0.999
decompress/symmetric/column/substitution/n=100000/p=5.0e-5 6 allocs: 8.37 MB 6 allocs: 8.36 MB 1
time_to_load 0.153 k allocs: 14.5 kB 0.153 k allocs: 14.5 kB 1

@gdalle
Owner Author

gdalle commented Sep 26, 2024

On second thought, this is a kind of bucket sort, so maybe we could just use sort from Base.
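A rough sketch of what that could look like, assuming colors are numbered from 1 to the maximum color. The function name `group_by_sortperm` is made up for this illustration and is not what the PR ends up doing. It relies on `sortperm` returning the indices of equal elements in ascending order, and on `searchsorted` returning the (possibly empty) range of positions holding a given value. A general comparison sort costs more than a single counting pass, so this trades some speed for brevity.

```julia
function group_by_sortperm(colors::Vector{Int})
    perm = sortperm(colors)        # indices ordered by color, ties in ascending index order
    sorted_colors = colors[perm]   # colors in nondecreasing order
    # each group is a view into `perm`, delimited by the positions of color c
    return [view(perm, searchsorted(sorted_colors, c)) for c in 1:maximum(colors)]
end

groups = group_by_sortperm([2, 1, 2, 3, 1])
```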

@gdalle
Owner Author

gdalle commented Sep 26, 2024

An example of the benefits, for the following code:

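using SparseMatrixColorings
using SparseArrays

# Column coloring of a nonsymmetric sparse matrix, with direct decompression
problem = ColoringProblem(; structure=:nonsymmetric, partition=:column)
algo = GreedyColoringAlgorithm(; decompression=:direct)
A = sprand(Bool, 1000, 1000, 0.02)

coloring(A, problem, algo)  # first call, so that compilation is not part of the profile
# @profview_allocs is not in Base: it comes from a profiling frontend
# such as the Julia VS Code extension or ProfileCanvas.jl
@profview_allocs for _ in 1:10000; coloring(A, problem, algo); end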

Before

Allocation count

image

Allocation size

image

After

Allocation count

image

Allocation size

image

@gdalle gdalle removed the benchmark Run benchmarks on PR label Sep 26, 2024
@amontoison
Collaborator

@gdalle Your argument is that we have a better flame graph?
Honestly, I've rarely seen such a weak argument—it has no real benefit for the user, and on our side, we profile routine by routine if readability really becomes an issue.

Sincerely, I don't think we should do this.
I prefer a Vector{Vector{Int}} as output, and your approach doesn't yield any memory savings.
If the number of allocations is significant, it should be reflected in the execution time.
Otherwise, it's all for nothing.
So far, we haven't seen any gains, not even with row or column coloring.

I think we should focus on more important issues for now and potentially revisit this PR later if you really want to keep it.

@gdalle
Owner Author

gdalle commented Sep 27, 2024

Your argument is that we have a better flame graph?

No, it's a combination of things (profiling was just one of those):

  • The number of allocations is reduced, and from what I understand each allocation is costly regardless of its size (see this Discourse thread)
  • Yes, it makes profiling and benchmarking easier because we can now exactly count the number of allocations for column and row coloring. This can in turn become part of the test suite.
  • Memory locality is improved with a single flat vector compared to a vector of vectors.
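To illustrate the allocation-counting argument, here is a small self-contained sketch. Both grouping functions are illustrative stand-ins written for this example (their names are hypothetical, and they are not the package's code). The comparison uses `@allocations`, which requires Julia 1.9 or later. Exact counts depend on the Julia version, so the sketch only compares the two counts rather than asserting specific numbers.

```julia
# Baseline: one freshly allocated vector per color (allocations grow with cmax).
group_vecvec(colors) = [findall(==(c), colors) for c in 1:maximum(colors)]

# Illustrative flat-vector variant (hypothetical name, not the package's code).
function group_views(colors::Vector{Int})
    counts = zeros(Int, maximum(colors))
    for c in colors
        counts[c] += 1
    end
    stops = cumsum(counts)             # last position of each group
    starts = stops .- counts .+ 1      # first position of each group
    next = copy(starts)                # next free slot of each group
    flat = similar(colors)
    for (k, c) in enumerate(colors)
        flat[next[c]] = k
        next[c] += 1
    end
    return [view(flat, starts[c]:stops[c]) for c in eachindex(counts)]
end

# `@allocations` requires Julia 1.9 or later.
count_allocs(f, x) = @allocations f(x)

colors = [mod1(k, 100) for k in 1:10_000]   # 100 colors, 100 indices each
count_allocs(group_vecvec, colors)           # warm-up calls, so that compilation
count_allocs(group_views, colors)            # is not included in the counts
allocs_vecvec = count_allocs(group_vecvec, colors)
allocs_views = count_allocs(group_views, colors)
@show allocs_vecvec allocs_views
```

A check of this kind (for instance, that the count of the flat variant stays below that of the baseline, or does not grow with the number of colors) is the sort of thing that could go into a test suite.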

Honestly, I've rarely seen such a weak argument

First of all, let's remain civil please.

If the number of allocations is significant, it should be reflected in the execution time. Otherwise, it's all for nothing. So far, we haven't seen any gains, not even with row or column coloring.

Benchmarks are noisy and they don't tell the whole story, otherwise your manual transposition would have been clearly superior in #107. In certain use cases this view approach might be faster; in most cases it probably won't make much of a difference, but even then, what's the harm? It's 10 lines of code to get a better quantitative understanding of how many allocations happen.

I prefer a Vector{Vector{Int}} as output

I haven't seen you make a strong case for this either: what are your arguments?

  • Performance-wise, if you are not okay with using things like view(x, i:j) where x is a plain vector, then we need to rethink the whole graph structure, because I don't see a way around it for neighbor enumeration. It's about as fast as it gets, which is why it is also used in SimpleWeightedGraphs.jl.
  • Usage-wise, you can enumerate and iterate over each group just fine, and you can even modify it in-place without trouble.
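A minimal illustration of both points, using made-up data rather than actual output of the package: views created with `view(x, i:j)` on a plain vector support `length`, iteration and indexing like any other vector, and in-place modifications write through to the shared parent vector.

```julia
flat = [2, 5, 1, 3, 4]   # made-up flat storage shared by all groups
groups = [view(flat, 1:2), view(flat, 3:4), view(flat, 5:5)]

# Enumerate and iterate exactly as with a vector of vectors.
for (c, group) in enumerate(groups)
    println("color ", c, ": ", collect(group))
end
sizes = [length(group) for group in groups]

# In-place modification goes through the view into the parent vector.
groups[1][1] = 20
sort!(groups[2]; rev=true)
flat
```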

I think we should focus on more important issues for now and potentially revisit this PR later if you really want to keep it.

Here's the thing though: the right way to decide about important issues on performance is to profile the code to find the bottlenecks. If the profile is shitty to read because there's one source of allocation taking up all the space, this makes our life harder for no reason.

@gdalle
Owner Author

gdalle commented Sep 27, 2024

TLDR: my approach is slightly worse when there are very very few colors (3), and faster otherwise (>=10).

Here's a benchmark taking only the grouping into account:

using BenchmarkTools

function compute_group_sizes(colors::Vector{Int})
    cmax = maximum(colors)
    group_sizes = zeros(Int, cmax)
    for c in colors
        group_sizes[c] += 1
    end
    return group_sizes
end

function split_vecvec(colors::Vector{Int})
    group_sizes = compute_group_sizes(colors)
    groups = [Vector{Int}(undef, group_sizes[c]) for c in eachindex(group_sizes)]
    fill!(group_sizes, 0)
    for (k, c) in enumerate(colors)
        group_sizes[c] += 1
        pos = group_sizes[c]
        groups[c][pos] = k
    end
    return groups
end

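function split_vecview(colors::Vector{Int})
    group_sizes = compute_group_sizes(colors)  # allocation 1, size cmax
    group_offsets = cumsum(group_sizes)  # allocation 2, size cmax (last position of each group)
    groups_flat = similar(colors)  # allocation 3, size n
    for (k, c) in enumerate(colors)
        # group_sizes[c] now counts the free slots left in group c,
        # so each group is filled from its first position onwards
        i = group_offsets[c] - group_sizes[c] + 1
        groups_flat[i] = k
        group_sizes[c] -= 1
    end
    TV = typeof(view(groups_flat, 1:1))
    groups = Vector{TV}(undef, length(group_sizes))  # allocation 4, size cmax
    for c in eachindex(group_sizes)
        # group c occupies positions i:j of the flat vector
        i = 1 + (c == 1 ? 0 : group_offsets[c - 1])
        j = group_offsets[c]
        groups[c] = view(groups_flat, i:j)
    end
    return groups
end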

And the benchmarking results (> 1 means the approach with views is better):

julia> for n in 10 .^ (2, 3, 4, 5), cmax in (3, 10, 30, 100)
           yield()
           bench_vecvec = @benchmark split_vecvec(_colors) setup = (_colors = rand(1:($cmax), $n))
           bench_vecview = @benchmark split_vecview(_colors) setup = (
               _colors = rand(1:($cmax), $n)
           )
           ratios = (
               time=minimum(bench_vecvec).time / minimum(bench_vecview).time,
               memory=minimum(bench_vecvec).memory / minimum(bench_vecview).memory,
               allocs=minimum(bench_vecvec).allocs / minimum(bench_vecview).allocs,
           )
           @info "Vecvec / vecview ratios - n=$n, cmax=$cmax" ratios.time ratios.memory ratios.allocs
       end
┌ Info: Vecvec / vecview ratios - n=100, cmax=3
│   ratios.time = 0.9921126179496853
│   ratios.memory = 0.922077922077922
└   ratios.allocs = 1.25
┌ Info: Vecvec / vecview ratios - n=100, cmax=10
│   ratios.time = 1.4329145256099618
│   ratios.memory = 0.9714285714285714
└   ratios.allocs = 3.0
┌ Info: Vecvec / vecview ratios - n=100, cmax=30
│   ratios.time = 1.9765313592357387
│   ratios.memory = 1.1
└   ratios.allocs = 7.5
┌ Info: Vecvec / vecview ratios - n=100, cmax=100
│   ratios.time = 2.646371976647206
│   ratios.memory = 1.243718592964824
└   ratios.allocs = 23.0
┌ Info: Vecvec / vecview ratios - n=1000, cmax=3
│   ratios.time = 1.1245857661353953
│   ratios.memory = 1.00945179584121
└   ratios.allocs = 1.25
┌ Info: Vecvec / vecview ratios - n=1000, cmax=10
│   ratios.time = 1.099305752912996
│   ratios.memory = 1.007181328545781
└   ratios.allocs = 3.0
┌ Info: Vecvec / vecview ratios - n=1000, cmax=30
│   ratios.time = 1.1895478015959944
│   ratios.memory = 1.0316957210776545
└   ratios.allocs = 8.0
┌ Info: Vecvec / vecview ratios - n=1000, cmax=100
│   ratios.time = 1.5270439790191834
│   ratios.memory = 1.1164383561643836
└   ratios.allocs = 25.5
┌ Info: Vecvec / vecview ratios - n=10000, cmax=3
│   ratios.time = 0.9527100549951475
│   ratios.memory = 0.9990047770700637
└   ratios.allocs = 1.6
┌ Info: Vecvec / vecview ratios - n=10000, cmax=10
│   ratios.time = 1.050451626197754
│   ratios.memory = 1.0096991290577988
└   ratios.allocs = 2.4
┌ Info: Vecvec / vecview ratios - n=10000, cmax=30
│   ratios.time = 1.2483034681177096
│   ratios.memory = 1.0349200156067109
└   ratios.allocs = 6.4
┌ Info: Vecvec / vecview ratios - n=10000, cmax=100
│   ratios.time = 1.2643422354104847
│   ratios.memory = 1.0528372093023255
└   ratios.allocs = 20.4
┌ Info: Vecvec / vecview ratios - n=100000, cmax=3
│   ratios.time = 0.9579135843188993
│   ratios.memory = 0.9999000479769711
└   ratios.allocs = 1.6
┌ Info: Vecvec / vecview ratios - n=100000, cmax=10
│   ratios.time = 1.0529008105578626
│   ratios.memory = 0.9999800207783904
└   ratios.allocs = 4.4
┌ Info: Vecvec / vecview ratios - n=100000, cmax=30
│   ratios.time = 1.0253449503339322
│   ratios.memory = 1.0005785420739737
└   ratios.allocs = 12.4
┌ Info: Vecvec / vecview ratios - n=100000, cmax=100
│   ratios.time = 1.0237008826453011
│   ratios.memory = 1.0133598014888336
└   ratios.allocs = 20.4

@gdalle
Owner Author

gdalle commented Sep 29, 2024

@amontoison thoughts on this one? When we benchmark the grouping function on its own, as you see above, the benefits are clear as soon as we go beyond cmax=3, especially for rather small matrices.

@gdalle
Owner Author

gdalle commented Oct 6, 2024

Closing temporarily because bicoloring will require rethinking this grouping function. We can reoptimize it afterwards.

@gdalle gdalle closed this Oct 6, 2024
@gdalle gdalle reopened this Oct 7, 2024
@gdalle
Owner Author

gdalle commented Oct 7, 2024

Actually figured out a way to keep the same grouping behavior in the bicoloring branch, so we can merge this one

@gdalle gdalle changed the title from "Constant number of allocations in group_by_color" to "Speed up group_by_color" Oct 7, 2024
@gdalle
Owner Author

gdalle commented Oct 7, 2024

@amontoison this is a very quick review and while the global benchmarks don't show much of a difference, the specific benchmarks in this comment strongly support this change. What do you think?

@amontoison amontoison merged commit e349f50 into main Oct 7, 2024
7 checks passed
@amontoison amontoison deleted the gd/better_grouping branch October 7, 2024 18:57
@amontoison
Collaborator

Merged 😉

2 participants