Count number of connected components more efficiently than length(connected_components(g)) #407

Open · thchr wants to merge 10 commits into master

Conversation

thchr (Contributor) commented Nov 14, 2024

This adds a new function count_connected_components, which returns the same value as length(connected_components(g)) but substantially faster, by avoiding unnecessary allocations: connected_components materializes per-component vertex vectors that are not needed merely to count the components. Similar reasoning allows a small optimization of is_connected, which is also included.

While I was there, I also improved connected_components! slightly: previously, it allocated a new queue for every new "starting vertex" in the search; but the queue is always empty by the time a new starting vertex is added, so there is no point in instantiating a new vector.

To let users who call connected_components! many times in a row reduce allocations further (I am one such user), I also made it possible to pass this queue as an optional argument.

Finally, connected_components! is very useful and would make sense to export, so I've done that here.
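The core idea can be sketched outside the package internals (the names and the plain adjacency-list representation below are illustrative, not the PR's actual code): label vertices by component during a traversal that reuses a single queue, and count labels instead of collecting component vectors.

```julia
# Illustrative sketch (not the PR's code): count components by labeling
# vertices during a flood fill, reusing one queue, and never materializing
# per-component vertex vectors. `adj` is a plain adjacency list.
function count_components_sketch(adj::Vector{Vector{Int}})
    n = length(adj)
    label = zeros(Int, n)          # 0 means "not yet visited"
    queue = Int[]                  # reused across starting vertices
    ncomponents = 0
    for s in 1:n
        label[s] != 0 && continue  # already assigned to a component
        ncomponents += 1
        label[s] = ncomponents
        push!(queue, s)
        while !isempty(queue)      # flood-fill the component containing s
            u = pop!(queue)
            for v in adj[u]
                if label[v] == 0
                    label[v] = ncomponents
                    push!(queue, v)
                end
            end
        end
    end
    return ncomponents             # no component vectors were allocated
end
```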

Cc @gdalle, if you have time to review.

thchr (Contributor, Author) commented Nov 14, 2024

For the doctest example of g = Graph(Edge.([1=>2, 2=>3, 3=>1, 4=>5, 5=>6, 6=>4, 7=>8])), count_connected_components is about twice as fast as length∘connected_components (179 ns vs. 290 ns). Using the buffers, it is faster still (105 ns).
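For reference, the example graph uses Graphs.jl's public API; count_connected_components is the function this PR adds, so only the existing length(connected_components(g)) form is shown here:

```julia
using Graphs

# Two triangles (1-2-3 and 4-5-6) plus one isolated edge (7-8):
g = Graph(Edge.([1 => 2, 2 => 3, 3 => 1, 4 => 5, 5 => 6, 6 => 4, 7 => 8]))

length(connected_components(g))  # → 3
```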

codecov bot commented Nov 14, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.31%. Comparing base (24539fd) to head (ead687e).

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #407   +/-   ##
=======================================
  Coverage   97.30%   97.31%           
=======================================
  Files         117      117           
  Lines        6948     6963   +15     
=======================================
+ Hits         6761     6776   +15     
  Misses        187      187           

@@ -1,26 +1,32 @@
# Parts of this code were taken / derived from Graphs.jl. See LICENSE for
# licensing details.
"""
    connected_components!(label, g)
    connected_components!(label, g, [search_queue])
Member commented:

I am all for performance improvements, but I am a bit skeptical whether it is worth making the interface more complicated.

Almost all graph algorithms need some kind of work buffer, so we could have something like this in all algorithms; but in the end it should be the job of Julia's allocator to find a suitable piece of memory lying around. We can help it by using sizehint! with a suitable heuristic.
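As an illustration of that suggestion (the heuristic below is made up for the sketch, not anything the package does): sizehint! lets the Set reserve an expected capacity once instead of growing incrementally.

```julia
# Illustrative only: pre-size the `seen` set with a guessed component count
# (here the square root of the number of labels, an arbitrary heuristic)
# so the hash table can reserve memory up front.
function count_labels_hinted(label::AbstractVector{T}) where {T}
    seen = Set{T}()
    sizehint!(seen, isqrt(length(label)) + 1)  # heuristic guess, not exact
    c = 0
    for l in label
        if l ∉ seen
            push!(seen, l)
            c += 1
        end
    end
    return c
end
```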

thchr (Contributor, Author) commented:

I agree that this will usually not be relevant; in my case it is, though, and it is the main reason I made the changes. I also agree that there is a trade-off between performance improvements and complicating the API. On the other hand, I think passing such work buffers as optional arguments is a good solution to that trade-off: most users can safely ignore the extra argument, and it shouldn't complicate their lives much.

As you say, there are potentially many algorithms in Graphs.jl that could take a work buffer; in light of that, maybe this could be more palatable if we settled on a unified name for these optional buffers, reducing the complexity by standardizing across methods. Maybe just work_buffer (and, if there are multiple, work_buffer1, work_buffer2, etc.)?

gdalle (Member) commented Nov 21, 2024:

If we do this then all functions should take exactly one work_buffer (possibly a tuple) and have an appropriate function to initialize the buffer. I think it is a major change which should be discussed separately.
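For concreteness, the convention being discussed might look like the following (all names here are hypothetical illustrations, not proposed API):

```julia
# Hypothetical sketch of "one work_buffer per algorithm, plus an initializer".
# A NamedTuple bundles whatever scratch storage the algorithm needs.
init_components_buffer(n::Int) = (queue = sizehint!(Int[], n), label = zeros(Int, n))

# An algorithm would then accept the buffer as a single optional argument, e.g.
#   count_connected_components(g; work_buffer = init_components_buffer(nv(g)))
# and reset it internally via empty!(work_buffer.queue) / fill!(work_buffer.label, 0).
```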

Member commented:

So I think if this is really important for your use case you can either

  • create a version that uses a buffer in the Experimental submodule (currently we don't guarantee semantic versioning there, which allows us to remove things in the future without breaking the API);
  • or, since this code is very simple, just copy it into your own repository.

But just to clarify: your problem is not that you are building graphs by adding edges until they are connected? Because if that is the issue, there is a much better algorithm.
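The better algorithm alluded to for the add-edges-until-connected setting is presumably incremental union-find (disjoint-set union), which maintains the component count as each edge arrives. A minimal sketch (not package code; names are illustrative):

```julia
# Minimal disjoint-set union with path halving; tracks the number of
# components as edges are added incrementally.
mutable struct DisjointSets
    parent::Vector{Int}
    ncomponents::Int
end
DisjointSets(n::Int) = DisjointSets(collect(1:n), n)

function find_root!(d::DisjointSets, x::Int)
    while d.parent[x] != x
        d.parent[x] = d.parent[d.parent[x]]  # path halving
        x = d.parent[x]
    end
    return x
end

# Add edge (a, b); returns true iff it merged two components.
function add_edge_dsu!(d::DisjointSets, a::Int, b::Int)
    ra, rb = find_root!(d, a), find_root!(d, b)
    ra == rb && return false
    d.parent[ra] = rb
    d.ncomponents -= 1
    return true
end
```

With this, "add edges until connected" is just checking d.ncomponents == 1 after each insertion, at near-constant amortized cost per edge instead of a full traversal.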

```
"""
function count_connected_components(
Member commented:

I am a bit undecided whether we should call this count_connected_components or num_connected_components. Currently we have both conventions, namely num_self_loops and Graphs.Experimental.count_isomorph.

Ideally we would use the same word everywhere. @gdalle Do you have an opinion on that?

thchr (Contributor, Author) commented:

There's also nv(g) for the number of vertices. Maybe just nconnected_components?

Member commented:

If I had to pick, I'd rather use count than num or n, because it is a complete word.

Member commented:

Definitely no to nconnected_components; nv and ne might be exceptions, as they are used all the time, but we might rename them one day.

I don't mind abbreviations from time to time, but let's go with count_connected_components then; after all, we also have a count function in Julia's Base.

Comment on lines 192 to 200:

    seen = Set{T}()
    c = 0
    for l in label
        if l ∉ seen
            push!(seen, l)
            c += 1
        end
    end
    return c
Member commented:

Suggested change:

    - seen = Set{T}()
    - c = 0
    - for l in label
    -     if l ∉ seen
    -         push!(seen, l)
    -         c += 1
    -     end
    - end
    - return c
    + return length(Set(label))

thchr (Contributor, Author) commented Nov 21, 2024:

That's less performant than the explicitly looped version, though:

    julia> label_small = rand(1:3, 20)
    julia> @b count_unique($label_small)
    150.851 ns (4 allocs: 320 bytes)               # loop
    174.412 ns (4 allocs: 464 bytes)               # length(Set(label))

    julia> label_big = rand(1:50, 5000)
    julia> @b count_unique($label_big)
    23.385 μs (11 allocs: 3.312 KiB)               # loop
    32.719 μs (6 allocs: 72.172 KiB)               # length(Set(label))

    julia> label_huge = rand(1:5000, 500000)
    julia> @b count_unique($label_huge)
    3.499 ms (25 allocs: 192.625 KiB)              # loop
    4.876 ms (6 allocs: 9.000 MiB, 2.51% gc time)  # length(Set(label))
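For context, the two variants being benchmarked are presumably along these lines (the count_unique harness name comes from the comment; the exact definitions are assumptions):

```julia
# Explicit-loop variant: check membership before push!-ing, so already-seen
# labels never touch the Set's insertion path.
function count_unique_loop(label)
    seen = Set{eltype(label)}()
    c = 0
    for l in label
        if l ∉ seen
            push!(seen, l)
            c += 1
        end
    end
    return c
end

# One-liner variant: let the Set constructor deduplicate everything.
count_unique_set(label) = length(Set(label))
```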

thchr (Contributor, Author) commented Nov 21, 2024:

It's indeed not great that the length(Set(label)) version is slower, though. The reason seems to be that Set(itr) assumes most elements of itr will be unique and goes ahead and sizehint!s the to-be-filled Set to the full length of itr; but that is very unlikely to ever be the case in this scenario: there will usually be far fewer connected components than vertices.

A related observation is that push!(seen, l) is somehow slower than l ∉ seen && push!(seen, l). That seems like a Base issue.

thchr (Contributor, Author) commented:

Actually, it is not really an "issue" in Base per se: rather, Set is optimized under the assumption that most things push!ed into it are new, unique elements. When that assumption doesn't hold, it is faster to check membership before push!ing. Here it is very safe to assume that label will usually contain far fewer unique values than its length, so we might as well exploit that.

Member commented:

That's interesting, I did not know that. By the way, if we try to be really efficient here: would using BitSet instead of Set be even more efficient?
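Since component labels are small positive integers (1 through the number of components), a BitSet variant would avoid hashing entirely; a sketch, with no performance claim made here:

```julia
# BitSet variant: labels are small positive Ints, so membership tests become
# bit operations rather than hash lookups. BitSet is in Base.
count_unique_bitset(label::AbstractVector{<:Integer}) = length(BitSet(label))
```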

src/connectivity.jl: outdated review thread (resolved)
simonschoelly (Member) commented:

> For the doctest example of g = Graph(Edge.([1=>2, 2=>3, 3=>1, 4=>5, 5=>6, 6=>4, 7=>8])), count_connected_components is about twice as fast as length∘connected_components (179 ns vs. 290 ns). Using the buffers, it is faster still (105 ns).

We should not do benchmarks on such small graphs unless the algorithm has a huge complexity and is slow even on very small graphs. Otherwise the benchmark is way too noisy and also does not really reflect the situations where this library is used.

thchr (Contributor, Author) commented Nov 21, 2024

> We should not do benchmarks on such small graphs unless the algorithm has a huge complexity and is slow even on very small graphs. Otherwise the benchmark is way too noisy and also does not really reflect the situations where this library is used.

What are some good go-to defaults for testing? I feel I run up against this frequently: I am not sure which graphs to test against, and anything beyond small toy examples is not easily accessible via convenience constructors in Graphs.

For context, in my situation the graphs are rarely larger than 50-100 vertices; my challenge is that I need to consider a huge number of permutations of such graphs, so performance in the small-graph case is relevant to me.

gdalle (Member) commented Nov 21, 2024

> What are some good go-to defaults for testing? This is a thing I'm running up against frequently, I feel: I am not sure which graphs to test against, and anything beyond small toy examples are not easily accessible via convenience constructors in Graphs.

I have opened this issue to discuss further:

3 participants