This repository has been archived by the owner on Oct 21, 2021. It is now read-only.

pathcounts and allparents in DijkstraStates #155

Closed

sbromberger
Contributor

SEE COMMENT BELOW FOR FURTHER EDITS.

Hi,

I needed the number of shortest (u,v) paths from Dijkstra, so I added a new vector called pathcounts that stores this information. It's a simple accumulator, so it shouldn't add to the time complexity (it will increase memory by sizeof(Int) * num_vertices).

Here's an example:

julia> g = simple_graph(4)
Directed Graph (4 vertices, 0 edges)

julia> add_edge!(g,1,2); add_edge!(g,1,3); add_edge!(g,2,3); add_edge!(g,3,4); add_edge!(g,2,4)
edge [5]: 2 -- 4

julia> z = dijkstra_shortest_paths(g,1)
Graphs.DijkstraStates{Int64,Float64,DataStructures.MutableBinaryHeap{Graphs.DijkstraHEntry{Int64,Float64},DataStructures.LessThan},Int64}([1,1,1,2],Bool[true,true,true,true],[0.0,1.0,1.0,2.0],[2,2,2,2],[1,1,1,2],MutableBinaryHeap(),[0,1,2,3])

julia> z.pathcounts
4-element Array{Int64,1}:
 1 # source always has one path to itself
 1 # (1--2)
 1 # (1--3)
 2 # (1--2--4), (1--3--4)
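
For readers who want to see the accumulator idea in isolation, here is a minimal sketch in plain Julia. It does not use Graphs.jl and is not the PR's code: the function name `dijkstra_with_counts`, the adjacency-list layout, and the linear scan in place of a heap are all assumptions made for illustration. It also assumes strictly positive edge weights and exact equality for ties.

```julia
# Hypothetical helper, not part of Graphs.jl: Dijkstra with a shortest-path counter.
# `adj[u]` holds (neighbor, weight) pairs; weights are assumed strictly positive.
function dijkstra_with_counts(adj::Vector{Vector{Tuple{Int,Float64}}}, source::Int)
    n = length(adj)
    dists = fill(Inf, n)
    pathcounts = zeros(Int, n)
    done = falses(n)
    dists[source] = 0.0
    pathcounts[source] = 1              # the source has one path to itself

    for _ in 1:n
        # Linear scan for the closest unfinished vertex (a heap is the usual choice).
        u = 0
        for v in 1:n
            if !done[v] && (u == 0 || dists[v] < dists[u])
                u = v
            end
        end
        (u == 0 || isinf(dists[u])) && break
        done[u] = true

        for (v, w) in adj[u]
            d = dists[u] + w
            if d < dists[v]
                dists[v] = d
                pathcounts[v] = pathcounts[u]    # strictly shorter: restart the count
            elseif d == dists[v]
                pathcounts[v] += pathcounts[u]   # tie: accumulate
            end
        end
    end
    return dists, pathcounts
end

# Same directed graph as in the example above, with unit weights.
adj = [Tuple{Int,Float64}[] for _ in 1:4]
for (u, v) in [(1, 2), (1, 3), (2, 3), (3, 4), (2, 4)]
    push!(adj[u], (v, 1.0))
end
dists, pathcounts = dijkstra_with_counts(adj, 1)
```

On this graph the sketch yields `pathcounts == [1, 1, 1, 2]`, matching the output above. The only extra work per relaxed edge is one assignment or one addition.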

@coveralls

Coverage decreased (-13.14%) when pulling 21a1d52 on sbromberger:pathcount into c9b4317 on JuliaLang:master.

@coveralls

Coverage decreased (-13.05%) when pulling 196d7cf on sbromberger:pathcount into c9b4317 on JuliaLang:master.

@sbromberger
Contributor Author

As it turns out, I needed the explicit parent lists as well, so I made further modifications. An optional all_paths parameter (defaulting to false) is present on each Dijkstra core function (that is, set_source!, process_neighbors!, and all dijkstra_shortest_paths functions). When set to true, the DijkstraStates for a graph G = (V, E) and a given source u will record, for every v ∈ V, all parents of v: every vertex p with (p, v) ∈ E that directly precedes v on some shortest path from u.

Here's an example:

julia> vertices(g)
1:5

julia> edges(g)
5-element Array{Graphs.Edge{Int64},1}:
 edge [1]: 1 -- 2
 edge [2]: 2 -- 3
 edge [3]: 2 -- 4
 edge [4]: 3 -- 5
 edge [5]: 4 -- 5

julia> z = dijkstra_shortest_paths(g,1; all_paths=true)
Graphs.DijkstraStates{Int64,Float64,DataStructures.MutableBinaryHeap{Graphs.DijkstraHEntry{Int64,Float64},DataStructures.LessThan},Int64}([1,1,2,2,3],Bool[true,true,true,true,true],[0.0,1.0,2.0,2.0,3.0],[2,2,2,2,2],[1,1,1,1,2],[[1],[1],[2],[2],[3,4]],MutableBinaryHeap(),[0,1,2,3,4])

julia> z.allparents
5-element Array{Array{Int64,1},1}:
 [1]
 [1]
 [2]
 [2]
 [3,4]

julia> z.pathcounts
5-element Array{Int64,1}:
 1
 1
 1
 1
 2

This should have little effect on running time but may increase memory use (it could potentially store (nv-1)^2 vertices), so it defaults to false.
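
As a rough illustration of the optional tracking, here is a sketch in plain Julia with no Graphs.jl dependency. It is not the PR's implementation: `dijkstra_all_parents`, the adjacency-list layout, and the linear scan instead of a heap are assumptions for the example, and edge weights are assumed strictly positive.

```julia
# Hypothetical sketch: Dijkstra that optionally records every shortest-path parent.
function dijkstra_all_parents(adj::Vector{Vector{Tuple{Int,Float64}}}, source::Int;
                              all_paths::Bool=false)
    n = length(adj)
    dists = fill(Inf, n)
    pathcounts = zeros(Int, n)
    done = falses(n)
    # Parent lists are only allocated when the caller asks for them.
    allparents = [Int[] for _ in 1:(all_paths ? n : 0)]
    dists[source] = 0.0
    pathcounts[source] = 1
    all_paths && push!(allparents[source], source)  # source listed as its own parent

    for _ in 1:n
        u = 0
        for v in 1:n                    # linear scan for the closest unfinished vertex
            if !done[v] && (u == 0 || dists[v] < dists[u])
                u = v
            end
        end
        (u == 0 || isinf(dists[u])) && break
        done[u] = true
        for (v, w) in adj[u]
            d = dists[u] + w
            if d < dists[v]             # strictly shorter: discard older parents
                dists[v] = d
                pathcounts[v] = pathcounts[u]
                if all_paths
                    empty!(allparents[v])
                    push!(allparents[v], u)
                end
            elseif d == dists[v]        # tie: u is an additional parent
                pathcounts[v] += pathcounts[u]
                all_paths && push!(allparents[v], u)
            end
        end
    end
    return dists, pathcounts, allparents
end

# The five-vertex graph from the example above, with unit weights.
adj = [Tuple{Int,Float64}[] for _ in 1:5]
for (u, v) in [(1, 2), (2, 3), (2, 4), (3, 5), (4, 5)]
    push!(adj[u], (v, 1.0))
end
dists, pathcounts, allparents = dijkstra_all_parents(adj, 1; all_paths=true)
```

In this sketch each edge is relaxed at most once, so the total number of stored parents never exceeds the number of edges.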

Once folks are ok with this, I'll also push an update to the docs.

@sbromberger
Contributor Author

@yeesian - we might also consider the (positive) impact this would have on something like enumerate_paths, since we now have access to all the parents - one fewer loop?
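
To make the idea concrete, here is a small sketch of what parent lists allow. It is not the existing enumerate_paths; `all_shortest_paths` is a hypothetical helper that walks the parent lists backwards from a target and returns every shortest path, using the allparents output from the example above as input.

```julia
# Hypothetical helper: list every shortest path from `source` to `target`
# by recursing through the parent lists.
function all_shortest_paths(allparents::Vector{Vector{Int}}, source::Int, target::Int)
    target == source && return [[source]]
    paths = Vector{Int}[]
    for p in allparents[target]
        for prefix in all_shortest_paths(allparents, source, p)
            push!(paths, vcat(prefix, target))   # extend each path that reaches p
        end
    end
    return paths                                 # empty if target is unreachable
end

allparents = [[1], [1], [2], [2], [3, 4]]
paths = all_shortest_paths(allparents, 1, 5)
```

Here `paths` is `[[1, 2, 3, 5], [1, 2, 4, 5]]`. The number of shortest paths can grow exponentially with graph size, so this is only sensible for small results.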

@sbromberger changed the title from "pathcounts in DijkstraStates" to "pathcounts and allparents in DijkstraStates" on Jan 9, 2015
@sbromberger
Contributor Author

Some timings on a reasonably sized random graph:

julia> num_vertices(g)
10000

julia> num_edges(g)
174882

julia> @time z = dijkstra_shortest_paths(g,1; all_paths=false);
elapsed time: 0.010353756 seconds (3810816 bytes allocated)

julia> @time z = dijkstra_shortest_paths(g,1; all_paths=false);
elapsed time: 0.0112202 seconds (3810816 bytes allocated)

julia> @time z = dijkstra_shortest_paths(g,1; all_paths=false);
elapsed time: 0.009402285 seconds (3810816 bytes allocated)

julia> @time z = dijkstra_shortest_paths(g,1; all_paths=true);
elapsed time: 0.011581051 seconds (5400160 bytes allocated)

julia> @time z = dijkstra_shortest_paths(g,1; all_paths=true);
elapsed time: 0.012658101 seconds (5400160 bytes allocated)

julia> @time z = dijkstra_shortest_paths(g,1; all_paths=true);
elapsed time: 0.012146354 seconds (5400160 bytes allocated)

@yeesian
Contributor

yeesian commented Jan 9, 2015

Sorry @sbromberger, but I'm not sure I'm on the same page.. Have you tried it on graphs with a million nodes/edges? I think those are instances that people care about, and that the README promises performance for.

I'm of the understanding that

  • enumerate_paths is a convenience function for working with the results of a shortest_path algorithm (not the modification of the behavior of a [standard] shortest path algorithm), and
  • hasparent is a hack around the lack of Nullable for julia 0.3, and not an invitation to add to the list.

Although the keyword arguments are optional, the bloat in DijkstraState is not; is this a standard practice in Boost or networkx?

I can imagine this to be useful on occasion, but not for the 99% of applications that might use Graphs.jl -- perhaps make it a separate function, and introduce its own result type? Actually, I don't know if it counts as a standard graph algorithm either.. might be worth introducing a new package for such functionality?

@sbromberger
Contributor Author

It's actually vital for efficient betweenness_centrality and is present in NetworkX. It has no appreciable impact on million-node graphs:

julia> @time z = dijkstra_shortest_paths(g,1; all_paths=true); # with parent tracing
elapsed time: 2.60640678 seconds (454528680 bytes allocated, 19.78% gc time)

julia> @time z = dijkstra_shortest_paths(g,1; all_paths=false);  # the original
elapsed time: 1.789843458 seconds (324088848 bytes allocated, 23.36% gc time)

julia> num_vertices(g)
1000000

julia> num_edges(g)
10000000

What's the concern with optionally tracking parents? It's off by default but can be used in all sorts of path analysis with very little overhead. Having to compute all_paths parents after the fact is an expensive operation; it's much better to do it during the path discovery itself.

Also: this does not modify the Dijkstra algorithm one bit. It just keeps track of extra information as dijkstra does its thing.
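
For context on why betweenness centrality wants this data, here is a sketch of the per-source dependency-accumulation step from Brandes' algorithm, which consumes exactly the distances, path counts, and parent lists discussed here. The function name `accumulate_dependencies` is hypothetical, and the inputs are hard-coded from the five-vertex example earlier in this thread rather than produced by Graphs.jl.

```julia
# Hypothetical sketch of Brandes-style dependency accumulation for one source.
function accumulate_dependencies(dists::Vector{Float64}, pathcounts::Vector{Int},
                                 allparents::Vector{Vector{Int}}, source::Int)
    n = length(dists)
    delta = zeros(Float64, n)
    # Visit vertices farthest-first so delta[w] is final before it is passed on.
    for w in sortperm(dists, rev=true)
        (w == source || isinf(dists[w])) && continue
        for v in allparents[w]
            # v carries the fraction of w's shortest paths that pass through v.
            delta[v] += pathcounts[v] / pathcounts[w] * (1 + delta[w])
        end
    end
    delta[source] = 0.0    # the source does not count toward its own betweenness
    return delta
end

dists = [0.0, 1.0, 2.0, 2.0, 3.0]
pathcounts = [1, 1, 1, 1, 2]
allparents = [[1], [1], [2], [2], [3, 4]]
delta = accumulate_dependencies(dists, pathcounts, allparents, 1)
```

For source 1 this gives `delta == [0.0, 3.0, 0.5, 0.5, 0.0]`: vertex 2 lies on every shortest path to vertices 3, 4, and 5, while vertices 3 and 4 each carry half of the paths to vertex 5. Summing these values over all sources gives the unnormalized betweenness.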

@sbromberger
Contributor Author

cc @lindahua

@yeesian
Contributor

yeesian commented Jan 9, 2015

I think it'll help for you to provide the performance between

dijkstra_shortest_paths(g,1) # before the PR

and

dijkstra_shortest_paths(g,1; all_paths=false) # after the PR

for comparison
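
One possible way to make such a comparison less sensitive to compilation and GC noise is to take the fastest of several runs. The helper below is a hypothetical sketch (`best_time` is not part of Graphs.jl or Base) and assumes a current Julia version where `GC.gc()` is available.

```julia
# Hypothetical benchmarking helper: report the fastest of several runs.
function best_time(f; runs::Int=5)
    f()                          # warm-up call so compilation is not timed
    best = Inf
    for _ in 1:runs
        GC.gc()                  # start each run from a freshly collected heap
        t = @elapsed f()
        best = min(best, t)
    end
    return best
end

# Stand-in workload; in this thread the call would be something like
# best_time(() -> dijkstra_shortest_paths(g, 1))
t = best_time(() -> sum(sqrt, 1:1_000_000))
```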

@coveralls

Coverage decreased (-13.02%) when pulling 379af5c on sbromberger:pathcount into c9b4317 on JuliaLang:master.

tests

bugfix

added allparents, made accumulation optional and default to false

added comments and fixed upstream pathcounts

removed println for debug
@coveralls

Coverage decreased (-13.03%) when pulling 353a0dc on sbromberger:pathcount into c9b4317 on JuliaLang:master.

@sbromberger
Contributor Author

I used the exact same randomly generated {1M, 10M} graph (1M vertices, 10M edges) for all tests:

julia> g
Directed Graph (1000000 vertices, 10000000 edges)

Pre-PR code:

julia> @time z = dijkstra_shortest_paths(g,2);
elapsed time: 1.674417784 seconds (180111408 bytes allocated, 28.63% gc time)

julia> @time z = dijkstra_shortest_paths(g,2);
elapsed time: 1.6798163 seconds (180111360 bytes allocated, 29.37% gc time)

julia> @time z = dijkstra_shortest_paths(g,2);
elapsed time: 1.666871282 seconds (180111360 bytes allocated, 29.00% gc time)

julia> @time z = dijkstra_shortest_paths(g,2);
elapsed time: 1.679200498 seconds (180111360 bytes allocated, 29.20% gc time)

Post-PR code, all_paths=false:

julia> @time z = dijkstra_shortest_paths(g,2);
elapsed time: 1.840191832 seconds (324088144 bytes allocated, 25.90% gc time)

julia> @time z = dijkstra_shortest_paths(g,2);
elapsed time: 2.048471379 seconds (324088144 bytes allocated, 35.44% gc time)

julia> @time z = dijkstra_shortest_paths(g,2);
elapsed time: 2.067335427 seconds (324088144 bytes allocated, 34.32% gc time)

julia> @time z = dijkstra_shortest_paths(g,2);
elapsed time: 2.046543133 seconds (324088144 bytes allocated, 34.16% gc time)

Post-PR code, all_paths=true:

julia> @time z = dijkstra_shortest_paths(g,2, all_paths=true);
elapsed time: 2.821628908 seconds (469875528 bytes allocated, 16.53% gc time)

julia> @time z = dijkstra_shortest_paths(g,2, all_paths=true);
elapsed time: 2.480601804 seconds (454000952 bytes allocated, 15.52% gc time)

julia> @time z = dijkstra_shortest_paths(g,2, all_paths=true);
elapsed time: 2.53800812 seconds (454000952 bytes allocated, 18.26% gc time)

julia> @time z = dijkstra_shortest_paths(g,2, all_paths=true);
elapsed time: 2.145515473 seconds (454000952 bytes allocated)

The PR code saves about 15 seconds (roughly 55%) for betweenness_centrality on a {10k, 80k} graph relative to NetworkX (NetworkX takes about 27 seconds; the PR code ran in 12). Without the PR code, betweenness_centrality runs roughly 20% slower than NetworkX (almost 33 seconds).


dists::Vector{D} = state.dists
parents::Vector{V} = state.parents
hasparent::Vector{Bool} = state.hasparent
colormap::Vector{Int} = state.colormap
pathcounts::Vector{Int} = state.pathcounts
Contributor

Could you wrap these two lines in an if-block? AIUI they will currently cause unnecessary allocation and GC in the all_paths=false case, but I haven't tested this.

Contributor Author

Wrapping lines 124 and 125 in an if all_paths block actually worsened the time for both cases:

julia> @time z = dijkstra_shortest_paths(g,2, all_paths=true);
elapsed time: 3.926816859 seconds (653550664 bytes allocated, 20.23% gc time)

julia> @time z = dijkstra_shortest_paths(g,2, all_paths=true);
elapsed time: 3.349007774 seconds (636758152 bytes allocated, 15.47% gc time)

julia> @time z = dijkstra_shortest_paths(g,2, all_paths=true);
elapsed time: 3.31105659 seconds (636758152 bytes allocated, 15.53% gc time)

julia> @time z = dijkstra_shortest_paths(g,2);
elapsed time: 2.567676538 seconds (324112744 bytes allocated, 48.72% gc time)

julia> @time z = dijkstra_shortest_paths(g,2);
elapsed time: 2.169036004 seconds (324088144 bytes allocated, 37.58% gc time)

julia> @time z = dijkstra_shortest_paths(g,2);
elapsed time: 1.998513853 seconds (324088144 bytes allocated, 34.89% gc time)

julia> @time z = dijkstra_shortest_paths(g,2);
elapsed time: 2.081961072 seconds (324088144 bytes allocated, 34.56% gc time)

Contributor

Huh, I am surprised. I'd have thought the JIT and/or branch predictor would have eliminated most of the overhead with a million-node graph. Evidently not - thanks for checking!

Contributor Author

No worries - I was hoping it would improve performance also.

@pozorvlak
Contributor

Wow, those are some pretty compelling numbers for your use-case. But IMHO a 9% slowdown is not trivial, and I'd prefer to reduce that before merging this. Hopefully that'll be as simple as adding the if-statement I suggested :-)

@sbromberger
Contributor Author

That if statement increased the time (see inline). If a 9% slowdown on a million-node graph with an average degree of 10 is concerning, perhaps the better approach is to create a separate DijkstraStates object in Centrality with its own dijkstra_shortest_paths that will only provide parent info for Centrality measures. The downsides are as follows:

  1. the code will not easily be available as a general Graphs method, where someone might like to access parent information (ex: I believe I can get a much more efficient enumerate_paths with this data, and there are undoubtedly other use cases, which is why NetworkX provides this);
  2. the eventual integration of Centrality into Graphs becomes that much more complex; and
  3. code divergence may be an issue.

@sbromberger
Contributor Author

Closing this out. I'll move this into Centrality.jl as dijkstra_predecessor_and_distance to be consistent with NetworkX - I need to get cracking on more centrality measures, and I can't seem to close the performance gap (slight though it appears to me):

julia> g = Centrality.readgraph("$(Pkg.dir("Centrality"))/test/testdata/graph-1000000-10000000.csv")
Directed Graph (1000000 vertices, 10000000 edges)

julia> @time z = dijkstra_shortest_paths(g,2);
elapsed time: 1.478243315 seconds (185808104 bytes allocated, 14.86% gc time)

julia> @time z = dijkstra_shortest_paths(g,2);
elapsed time: 1.423671749 seconds (180111360 bytes allocated, 15.87% gc time)

julia> @time z = dijkstra_shortest_paths(g,2);
elapsed time: 1.13893359 seconds (180111360 bytes allocated)

julia> @time y = dijkstra_predecessor_and_distance(g,2);
elapsed time: 2.217978906 seconds (422024216 bytes allocated, 15.71% gc time)

julia> @time y = dijkstra_predecessor_and_distance(g,2);
elapsed time: 2.365467454 seconds (422024216 bytes allocated, 18.53% gc time)

julia> @time y = dijkstra_predecessor_and_distance(g,2);
elapsed time: 1.93348047 seconds (422024216 bytes allocated)

I'm happy to revisit rolling it back into Graphs.jl - just let me know.

5 participants