Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add logic to check orphan nodes and attempt to fix parents of the orphan nodes in best effort to avoid dangling nodes #3

Closed
wants to merge 1 commit into from

Conversation

JianFangAtRai
Copy link
Contributor

This PR is related to the issue (#1).

From heavy testing and debugging, it seems the current sampling logic could introduce dangling nodes, i.e., orphan nodes without the path to the uber root node. Even we attempted to resolve this issue by removing the orphan nodes' children in this PR (#2), unfortunately, it is not enough due to the fact that the snapshot is actually a very complex graph with many object cross references instead of a tree structure.

To resolve this issue, this PR proposed a best effort way to fix the orphan nodes' parents as follows:

  1. introduce logic to check orphan nodes from the node and edge arrays
  2. check the input heapsnap shot file for orphan nodes, if they are presented, attempt to remove them. This could happen for earlier julia version without this fix (heap snapshot: add gc roots and gc finalist roots to fix unrooted nodes JuliaLang/julia#52618)
  3. attempt to fix the parents of the orphan nodes during filter_nodes operation, abort if the tries reach a given threshold or no changes could be made
  4. printout a warning message of the generated file contains some orphan nodes
  5. added new unit tests and updated existing unit tests accordingly

From my tests, the orphan nodes are eventually gone for snapshots that I took use the latest Julia 1.9.2+RAI.

For example, here is the output for a snapshot of an idle Julia process:

julia> subsample_snapshot("jf-julia.heapsnapshot") do node_type, node_self_size, node_name
                node_type in (1,2,3) && node_self_size >= 10
              end
[ Info: Reading snapshot from "jf-julia.heapsnapshot"
┌ Info: BEFORE:  Snapshot contains 1388694 nodes, 4418313 edges, and 1389044 strings.
│ Total size of nodes: 127.077 MiB
│     synthetic: 39 bytes (42)
│     jl_task_t: 4.359 KiB (9)
│   jl_module_t: 82.266 KiB (135)
│    jl_array_t: 10.411 MiB (227439)
│        object: 92.288 MiB (683106)
│        String: 10.886 MiB (178355)
│ jl_datatype_t: 3.226 MiB (65954)
│     jl_svec_t: 539.687 KiB (26768)
└      jl_sym_t: 9.654 MiB (206886)
[ Info: Attempting to fix parents for 135455 orphan nodes...
[ Info: Attempting to fix parents for 69091 orphan nodes...
[ Info: Attempting to fix parents for 26753 orphan nodes...
[ Info: Attempting to fix parents for 8613 orphan nodes...
[ Info: Attempting to fix parents for 7626 orphan nodes...
[ Info: Attempting to fix parents for 1178 orphan nodes...
[ Info: Attempting to fix parents for 684 orphan nodes...
[ Info: Attempting to fix parents for 372 orphan nodes...
[ Info: Attempting to fix parents for 199 orphan nodes...
[ Info: Attempting to fix parents for 66 orphan nodes...
[ Info: Attempting to fix parents for 32 orphan nodes...
[ Info: Attempting to fix parents for 16 orphan nodes...
[ Info: Attempting to fix parents for 11 orphan nodes...
[ Info: Attempting to fix parents for 8 orphan nodes...
[ Info: Attempting to fix parents for 7 orphan nodes...
[ Info: Attempting to fix parents for 6 orphan nodes...
[ Info: Attempting to fix parents for 5 orphan nodes...
[ Info: Attempting to fix parents for 4 orphan nodes...
[ Info: Attempting to fix parents for 4 orphan nodes...
[ Info: Attempting to fix parents for 4 orphan nodes...
[ Info: Attempting to fix parents for 4 orphan nodes...
[ Info: Attempting to fix parents for 3 orphan nodes...
[ Info: Attempting to fix parents for 3 orphan nodes...
[ Info: Attempting to fix parents for 3 orphan nodes...
[ Info: Attempting to fix parents for 1 orphan nodes...
[ Info: Attempting to fix parents for 1 orphan nodes...
[ Info: Attempting to fix parents for 1 orphan nodes...
[ Info: Attempting to fix parents for 1 orphan nodes...
[ Info: Attempting to fix parents for 1 orphan nodes...
[ Info: Attempting to fix parents for 1 orphan nodes...
[ Info: Attempting to fix parents for 1 orphan nodes...
[ Info: Attempting to fix parents for 1 orphan nodes...
┌ Info: AFTER:   Snapshot contains 360355 nodes, 697237 edges, and 14483 strings.
│ Total size of nodes: 26.865 MiB
│     synthetic: 25 bytes (28)
│     jl_task_t: 2.906 KiB (6)
│   jl_module_t: 60.938 KiB (100)
│    jl_array_t: 6.389 MiB (139563)
│        object: 18.211 MiB (205786)
│        String: 736.625 KiB (11786)
│ jl_datatype_t: 0 bytes (0)
│     jl_svec_t: 0 bytes (0)
└      jl_sym_t: 1.483 MiB (3086)
[ Info: Writing snapshot to "subsampled_jf-julia.heapsnapshot"
"subsampled_jf-julia.heapsnapshot"

The output of 11GB heap snapshot of our RAI engine:

julia> subsample_snapshot("rai-engine-assembled.heapsnapshot") do node_type, node_self_size, node_name
                node_type in (1,2,3) || (node_self_size >= 64) || occursin(r"Tuple"i, node_name)
              end
[ Info: Reading snapshot from "rai-engine-assembled.heapsnapshot"
┌ Info: BEFORE:  Snapshot contains 84256621 nodes, 300739929 edges, and 84061771 strings.
│ Total size of nodes: 8.728 GiB
│     synthetic: 195.065 KiB (199750)
│     jl_task_t: 3.338 MiB (9115)
│   jl_module_t: 411.328 KiB (675)
│    jl_array_t: 973.289 MiB (21261816)
│        object: 7.591 GiB (59311441)
│        String: 83.535 MiB (1368630)
│ jl_datatype_t: 18.550 MiB (274654)
│     jl_svec_t: 80.930 MiB (1684463)
└      jl_sym_t: 3.646 MiB (146077)
[ Info: Attempting to fix parents for 14336673 orphan nodes...
[ Info: Attempting to fix parents for 10221485 orphan nodes...
[ Info: Attempting to fix parents for 5176743 orphan nodes...
[ Info: Attempting to fix parents for 2082608 orphan nodes...
[ Info: Attempting to fix parents for 1670971 orphan nodes...
[ Info: Attempting to fix parents for 468824 orphan nodes...
[ Info: Attempting to fix parents for 285041 orphan nodes...
[ Info: Attempting to fix parents for 77561 orphan nodes...
[ Info: Attempting to fix parents for 37186 orphan nodes...
[ Info: Attempting to fix parents for 9271 orphan nodes...
[ Info: Attempting to fix parents for 2035 orphan nodes...
[ Info: Attempting to fix parents for 644 orphan nodes...
[ Info: Attempting to fix parents for 201 orphan nodes...
[ Info: Attempting to fix parents for 189 orphan nodes...
[ Info: Attempting to fix parents for 128 orphan nodes...
[ Info: Attempting to fix parents for 123 orphan nodes...
[ Info: Attempting to fix parents for 121 orphan nodes...
[ Info: Attempting to fix parents for 119 orphan nodes...
[ Info: Attempting to fix parents for 94 orphan nodes...
[ Info: Attempting to fix parents for 94 orphan nodes...
[ Info: Attempting to fix parents for 93 orphan nodes...
[ Info: Attempting to fix parents for 93 orphan nodes...
[ Info: Attempting to fix parents for 88 orphan nodes...
[ Info: Attempting to fix parents for 88 orphan nodes...
[ Info: Attempting to fix parents for 88 orphan nodes...
[ Info: Attempting to fix parents for 88 orphan nodes...
[ Info: Attempting to fix parents for 88 orphan nodes...
[ Info: Attempting to fix parents for 88 orphan nodes...
[ Info: Attempting to fix parents for 88 orphan nodes...
[ Info: Attempting to fix parents for 88 orphan nodes...
[ Info: Attempting to fix parents for 87 orphan nodes...
[ Info: Attempting to fix parents for 87 orphan nodes...
[ Info: Attempting to fix parents for 88 orphan nodes...
[ Info: Attempting to fix parents for 88 orphan nodes...
[ Info: Attempting to fix parents for 88 orphan nodes...
[ Info: Attempting to fix parents for 87 orphan nodes...
[ Info: Attempting to fix parents for 86 orphan nodes...
[ Info: Attempting to fix parents for 85 orphan nodes...
[ Info: Attempting to fix parents for 85 orphan nodes...
[ Info: Attempting to fix parents for 84 orphan nodes...
[ Info: Attempting to fix parents for 83 orphan nodes...
[ Info: Attempting to fix parents for 82 orphan nodes...
[ Info: Attempting to fix parents for 2 orphan nodes...
[ Info: Attempting to fix parents for 1 orphan nodes...
┌ Info: AFTER:   Snapshot contains 68178480 nodes, 159823102 edges, and 1170210 strings.
│ Total size of nodes: 8.339 GiB
│     synthetic: 185.146 KiB (189592)
│     jl_task_t: 3.327 MiB (9084)
│   jl_module_t: 334.547 KiB (549)
│    jl_array_t: 927.274 MiB (20256599)
│        object: 7.319 GiB (46281214)
│        String: 72.947 MiB (1195158)
│ jl_datatype_t: 9.942 MiB (40599)
│     jl_svec_t: 30.458 MiB (202893)
└      jl_sym_t: 343.322 KiB (2792)
[ Info: Writing snapshot to "subsampled_rai-engine-assembled.heapsnapshot"

…han nodes in best effort to avoid dangling nodes
@JianFangAtRai JianFangAtRai requested a review from Drvi February 1, 2024 21:09
@JianFangAtRai JianFangAtRai self-assigned this Feb 1, 2024
@Drvi
Copy link
Member

Drvi commented Feb 2, 2024

@JianFangAtRai Could the thing we try to achieve here be described like this (using a little bit of graph lingo):
a) Identify the connected component of the GC graph in the heap snapshot, that contains the root
b) Remove any node that is not a part of this connected component
?

If yes, it feels like iterative filtering shouldn't be needed.

If not, maybe we should rethink how we filter out nodes so that we minimize this effort of adding nodes back.

@JianFangAtRai
Copy link
Contributor Author

JianFangAtRai commented Feb 5, 2024

@JianFangAtRai Could the thing we try to achieve here be described like this (using a little bit of graph lingo): a) Identify the connected component of the GC graph in the heap snapshot, that contains the root b) Remove any node that is not a part of this connected component ?

If yes, it feels like iterative filtering shouldn't be needed.

If not, maybe we should rethink how we filter out nodes so that we minimize this effort of adding nodes back.

Thanks for the suggestions. The current sampling logic is not based on graph and it does not preserve the parents, even the uber root could be filtered out based on the filtering function. As we can see, this logic could introduce a huge number of disconnected subgraphs, for example, 14336673 subgraphs for the 11GB snapshot data.

The idea in this PR is to add the parent nodes back so that we could have the path to the uber root. The iteration is not to filter out nodes, but to add the parents back. Due to the nature of the graph, the adding back process takes multiple iterations in this PR since one node back in the graph could affect multiple paths due to cross references of objects.

Just like you to be aware that I start to work on SPCS projects and might not have time to continue working on this.

@Drvi
Copy link
Member

Drvi commented Feb 5, 2024

@JianFangAtRai Thanks for explaining. When I tried this approach on my 10 GiB snapshot, It took ~6 minutes to finish, which made me wonder if we could save all this back-and-forth and instead of removing parents in the filter step and re-adding them iteratively later, we could filter out whole subgraphs.

It also seems that there is not just a single root -- I see 1405 connected components in the snapshot you added in this PR:

using Graphs, HeapSnapshotUtils
nodes, strings, _... = HeapSnapshotUtils.parse_nodes("test/data/julia-test.heapsnapshot");

function heap_graph(nodes)
    g = DiGraph{UInt32}()
    j = 0
    add_vertices!(g, length(nodes))
    for i in 1:length(nodes)
        for _ in 1:nodes.edge_count[i]
            j += 1
            e = nodes.edges.to_pos[j]
            add_edge!(g, i, e)
        end
    end
    return g
end

ccs = connected_components(heap_graph(nodes))
sort!(ccs, by=length, rev=true)

@info length(ccs)
for cc in first(ccs, 20)
    println(length(cc), " => ", strings[nodes.name_index[first(cc, 20)] .+ 1])
end
[ Info: 1405

352696 => ["", "GC roots", "GC finalizer list roots", "Task", "Main", "UnionAll", "Core.TypeMapEntry", "Core.TypeMapEntry", "Core.TypeMapEntry", "Core.TypeMapEntry", "Core.TypeMapEntry", "Core.TypeMapEntry", "Core.TypeMapEntry", "Core.TypeMapEntry", "Core.TypeMapEntry", "Core.TypeMapEntry", "Core.TypeMapEntry", "Core.TypeMapEntry", "Core.TypeMapEntry", "Core.TypeMapEntry"]
32 => ["typeof(Core.Compiler.Iterators.prod_iteratorsize)", "Core.TypeName", "Core.MethodTable", "Core.TypeMapEntry", "Core.TypeMapEntry", "Method", "Array{UInt8, 1}", "Array{Any, 1}", "Core.TypeMapEntry", "Core.TypeMapEntry", "Method", "Array{UInt8, 1}", "Array{Any, 1}", "Core.TypeMapEntry", "Core.TypeMapEntry", "Method", "Array{UInt8, 1}", "Array{Any, 1}", "Core.TypeMapEntry", "Core.TypeMapEntry"]
22 => ["typeof(Core.Compiler.Iterators.zip_iteratorsize)", "Core.TypeName", "Core.MethodTable", "Core.TypeMapEntry", "Core.TypeMapEntry", "Method", "Array{UInt8, 1}", "Array{Any, 1}", "Core.TypeMapEntry", "Core.TypeMapEntry", "Method", "Array{UInt8, 1}", "Array{Any, 1}", "Core.TypeMapEntry", "Core.TypeMapEntry", "Method", "Array{UInt8, 1}", "Array{Any, 1}", "Core.TypeMapEntry", "Method"]
22 => ["typeof(Random.XoshiroSimd._bits2float)", "Core.TypeName", "Core.MethodTable", "Core.TypeMapEntry", "Core.TypeMapEntry", "Method", "Array{UInt8, 1}", "Array{Any, 1}", "Core.TypeMapEntry", "Core.TypeMapEntry", "Method", "Array{UInt8, 1}", "Array{Any, 1}", "Core.TypeMapEntry", "Core.TypeMapEntry", "Method", "Array{UInt8, 1}", "Array{Any, 1}", "Core.TypeMapEntry", "Method"]
17 => ["typeof(Base._RepeatInnerOuter.resolve)", "Core.TypeName", "Core.MethodTable", "Core.TypeMapEntry", "Core.TypeMapEntry", "Method", "Array{UInt8, 1}", "Array{Any, 1}", "Core.TypeMapEntry", "Core.TypeMapEntry", "Method", "Array{UInt8, 1}", "Array{Any, 1}", "Core.TypeMapEntry", "Method", "Array{UInt8, 1}", "Array{Any, 1}"]
17 => ["typeof(Random.XoshiroSimd._lshr)", "Core.TypeName", "Core.MethodTable", "Core.TypeMapEntry", "Core.TypeMapEntry", "Method", "Array{UInt8, 1}", "Array{Any, 1}", "Core.TypeMapEntry", "Core.TypeMapEntry", "Method", "Array{UInt8, 1}", "Array{Any, 1}", "Core.TypeMapEntry", "Method", "Array{UInt8, 1}", "Array{Any, 1}"]
16 => ["typeof(REPL.Terminals.cmove_left)", "Core.TypeName", "Core.MethodTable", "Core.TypeMapEntry", "Core.TypeMapEntry", "Method", "SimpleVector", "Array{UInt8, 1}", "Array{UInt8, 1}", "Array{Any, 1}", "Core.MethodInstance", "Core.CodeInstance", "Core.TypeMapEntry", "Method", "Array{UInt8, 1}", "Array{Any, 1}"]
15 => ["UnionAll", "UnionAll", "UnionAll", "UnionAll", "UnionAll", "UnionAll", "UnionAll", "UnionAll", "Pkg.Operations.var\"#130#134\"{coverage, julia_args, test_args, test_fn, ctx, pkgs_errored, source_path, var\"#15397#pkg\"}", "Core.TypeName", "Core.MethodTable", "Core.TypeMapEntry", "Method", "Array{UInt8, 1}", "Array{Any, 1}"]
13 => ["typeof(REPL.Terminals.cmove_down)", "Core.TypeName", "Core.MethodTable", "Core.TypeMapEntry", "Array{Any, 1}", "Core.TypeMapEntry", "Method", "Array{UInt8, 1}", "Array{Any, 1}", "Core.TypeMapEntry", "Method", "Array{UInt8, 1}", "Array{Any, 1}"]
13 => ["typeof(SparseArrays.LibSuiteSparse.include)", "Core.TypeName", "Core.MethodTable", "Core.TypeMapEntry", "Array{Any, 1}", "Core.TypeMapEntry", "Core.MethodInstance", "Method", "Core.CodeInstance", "SimpleVector", "Array{UInt8, 1}", "Array{UInt8, 1}", "Array{Any, 1}"]
13 => ["typeof(LibCURL.curl_version)", "Core.TypeName", "Core.MethodTable", "Core.TypeMapEntry", "Array{Any, 1}", "Core.TypeMapEntry", "Core.MethodInstance", "Method", "Core.CodeInstance", "SimpleVector", "Array{UInt8, 1}", "Array{UInt8, 1}", "Array{Any, 1}"]
13 => ["typeof(LibCURL.include)", "Core.TypeName", "Core.MethodTable", "Core.TypeMapEntry", "Array{Any, 1}", "Core.TypeMapEntry", "Core.MethodInstance", "Method", "Core.CodeInstance", "SimpleVector", "Array{UInt8, 1}", "Array{UInt8, 1}", "Array{Any, 1}"]
13 => ["typeof(TOML.Internals.Printer.include)", "Core.TypeName", "Core.MethodTable", "Core.TypeMapEntry", "Array{Any, 1}", "Core.TypeMapEntry", "Core.MethodInstance", "Method", "Core.CodeInstance", "SimpleVector", "Array{UInt8, 1}", "Array{UInt8, 1}", "Array{Any, 1}"]
13 => ["typeof(Base.PCRE.include)", "Core.TypeName", "Core.MethodTable", "Core.TypeMapEntry", "Array{Any, 1}", "Core.TypeMapEntry", "Core.MethodInstance", "Method", "Core.CodeInstance", "SimpleVector", "Array{UInt8, 1}", "Array{UInt8, 1}", "Array{Any, 1}"]
12 => ["typeof(REPL.Terminals.cmove_line_up)", "Core.TypeName", "Core.MethodTable", "Core.TypeMapEntry", "Core.TypeMapEntry", "Method", "Array{UInt8, 1}", "Array{Any, 1}", "Core.TypeMapEntry", "Method", "Array{UInt8, 1}", "Array{Any, 1}"]
12 => ["typeof(REPL.Terminals.cmove_line_down)", "Core.TypeName", "Core.MethodTable", "Core.TypeMapEntry", "Core.TypeMapEntry", "Method", "Array{UInt8, 1}", "Array{Any, 1}", "Core.TypeMapEntry", "Method", "Array{UInt8, 1}", "Array{Any, 1}"]
12 => ["typeof(Core.Compiler.Iterators.drop_iteratorsize)", "Core.TypeName", "Core.MethodTable", "Core.TypeMapEntry", "Core.TypeMapEntry", "Method", "Array{UInt8, 1}", "Array{Any, 1}", "Core.TypeMapEntry", "Method", "Array{UInt8, 1}", "Array{Any, 1}"]
12 => ["typeof(Core.Compiler.Iterators.zip_iteratoreltype)", "Core.TypeName", "Core.MethodTable", "Core.TypeMapEntry", "Core.TypeMapEntry", "Method", "Array{UInt8, 1}", "Array{Any, 1}", "Core.TypeMapEntry", "Method", "Array{UInt8, 1}", "Array{Any, 1}"]
12 => ["typeof(Core.Compiler.Iterators._diff_length)", "Core.TypeName", "Core.MethodTable", "Core.TypeMapEntry", "Core.TypeMapEntry", "Method", "Array{UInt8, 1}", "Array{Any, 1}", "Core.TypeMapEntry", "Method", "Array{UInt8, 1}", "Array{Any, 1}"]
12 => ["typeof(Core.Compiler.Iterators.and_iteratorsize)", "Core.TypeName", "Core.MethodTable", "Core.TypeMapEntry", "Core.TypeMapEntry", "Method", "Array{UInt8, 1}", "Array{Any, 1}", "Core.TypeMapEntry", "Method", "Array{UInt8, 1}", "Array{Any, 1}"]
...

While the this PR assumed everything is connected to the very first node in the graph, correct? Or maybe I'm missing something?

@JianFangAtRai
Copy link
Contributor Author

JianFangAtRai commented Feb 5, 2024

Thanks for checking, the test file is a downsampled heapsnapshot using this subsampling logic from the original snapshot, which is over 100MB and is too big to check into the git repository. As a result, it isn't the original full graph.

For, 352696 => ["", "GC roots", "GC finalizer list roots", "" is the uber root, "GC roots" and "GC finalizer list roots" are the subroots under the uber root.

@Drvi
Copy link
Member

Drvi commented Feb 7, 2024

@JianFangAtRai Thanks for clarifying! I went ahead and tried to improve the filtering logic so that no iterative step with re-adding parents is necessary -- can you please review #4? I think its a viable alternative to the approach in this PR and has some nice performance improvements.

@JianFangAtRai
Copy link
Contributor Author

Busy with SPCS project now. Will take a look once I have some free time. Thanks.

@Drvi
Copy link
Member

Drvi commented Feb 19, 2024

I'm going to close this, since we're using the alternative approach, at least for now. Thanks for looking into this!

@Drvi Drvi closed this Feb 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants