CSM init endless cycle MWE fail case #754

Closed · dehann opened this issue Jun 20, 2020 · 47 comments · Fixed by #936

dehann (Member) commented Jun 20, 2020

Found an easy fail case in CSM; it relates to an existing test:

fg = generateCanonicalFG_CaesarRing1D(graphinit=false)

MWE

Cliques 5 and 6 cycle endlessly in their solves if the following clique solve sequence is used:

fg = generateCanonicalFG_CaesarRing1D(graphinit=false)
getSolverParams(fg).graphinit = false
getSolverParams(fg).limititers = 50
# assumes the Bayes tree `tree` has already been built from `fg`
TS = Vector{Task}(undef, 6)
TS[5] = solveCliq!(fg, tree, :x5, async=true)  # , recordcliq=true
TS[6] = solveCliq!(fg, tree, :x3, async=true)  # , recordcliq=true
dehann added this to the v0.0.x milestone Jun 20, 2020
dehann changed the title CSM endless cycle fail case → CSM endless cycle MWE fail case Jun 20, 2020
dehann modified the milestones: v0.0.x, v0.16.0 Sep 19, 2020
dehann closed this as completed Sep 19, 2020
dehann (Member Author) commented Sep 19, 2020

Also added as a new test in IIF: testBasicTreeInit.jl.

Affie (Member) commented Sep 22, 2020

Another basic endless-cycle fail case (after #459):

using IncrementalInference
fg = generateCanonicalFG_lineStep(5; 
                                  poseEvery=1, 
                                  landmarkEvery=5, 
                                  posePriorsAt=[0,2], 
                                  sightDistance=4,
                                  solverParams=SolverParams(algorithms=[:default, :parametric]))
                                  
getSolverParams(fg).graphinit = false
getSolverParams(fg).treeinit = true
getSolverParams(fg).limititers = 50
smtasks = Task[]
tree, smt, hist = solveTree!(fg; smtasks=smtasks, verbose=true, timeout=50, recordcliqs=ls(fg));

[image]

Affie reopened this Sep 22, 2020
Affie (Member) commented Sep 23, 2020

When clique 1 hits limititers = 50:
[image]

Giving clique 2 a chance to solve results in an endless cycle in this state:
[image]

dehann (Member Author) commented Sep 25, 2020

To debug, the packages were at DFG v0.10.5 and IIF at commit 3a41031.


I'm going to debug a little by adding:

mkpath(getLogPath(fg))
fid = open(joinLogPath(fg,"csmVerbose.log"), "w")
#solveTree!(...;  verbosefid=fid, verbose=true, ...)
tree, smt, hist = solveTree!(fg; smtasks=smtasks, verbose=true, verbosefid=fid, timeout=50, recordcliqs=ls(fg));

# after finished
close(fid)


open(joinLogPath(fg, "csmLogicalReconstructMax.log"),"w") do io
  IIF.reconstructCSMHistoryLogical(getLogPath(fg), fid=io)
end

dehann (Member Author) commented Sep 25, 2020

Here it is:
csmLogicalReconstructMax.log

dehann (Member Author) commented Sep 25, 2020

So this is the loop on (1):

trafficRe null
maybeNeed null
determine null
blockUnti null

While the children are (2)=>init and (3)=>null.

The first question for me is why blockUntilChildrenHaveStatus does not stop the loop, since one child is null...
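
Purely as an illustrative sketch (not the actual IIF implementation), one way a "block until children have status" step can fail to stop a loop is a check-then-wait on a status flag that has already moved past :null by the time it is checked -- the step then never blocks and the parent just keeps cycling:

# Illustrative Julia sketch only, not IIF code: a check-then-wait race on a status flag.
# The "parent" only waits while the child status is :null; once the child has flipped
# to :needdownmsg, the parent never blocks again and simply spins through its loop.
status = Ref(:null)
cond = Condition()

child = @async begin
    sleep(0.01)                  # child transitions almost immediately
    status[] = :needdownmsg
    notify(cond)
end

parent = @async begin
    for i in 1:3
        if status[] == :null     # stale check: status may already have changed
            wait(cond)           # only blocks if we got here before the flip
        end
        @info "parent loop pass $i, child status = $(status[])"
        sleep(0.02)
    end
end

wait(child); wait(parent)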

dehann (Member Author) commented Sep 25, 2020

We can also look at the log on (1):
log.txt

dehann (Member Author) commented Sep 25, 2020

Okay, so the log from (1) shows something else: it says the endless cycle occurs with (3)=>needdownmsg -- not null as shown earlier:

┌ Info: 23:22:02.713 | 1---1| x2 @ null | 4e, blockUntilChildrenHaveStatus_StateMachine, maybe wait cliq=2, child status=upsolved.
└ @ IncrementalInference /home/dehann/.julia/dev/IncrementalInference/src/CliqStateMachine.jl:63
┌ Info: 23:22:02.713 | 1---1| x2 @ null | 4e, blockUntilChildrenHaveStatus_StateMachine, maybe wait cliq=3, child status=needdownmsg.
└ @ IncrementalInference /home/dehann/.julia/dev/IncrementalInference/src/CliqStateMachine.jl:63
┌ Info: 23:22:02.721 | 1---1| x2 @ null | 4b, trafficRedirectConsolidate459_StateMachine, cliqst=null
└ @ IncrementalInference /home/dehann/.julia/dev/IncrementalInference/src/CliqStateMachine.jl:63

Note the runaway cycle happens within a few milliseconds after 23:22:02.713.

The next place to look is back in the Logical.log at what happened with (3) between null and needdownmsg:

39m trafficRe null
40m maybeNeed null
95m blockSibl need
96m waitChang need

The jump from global sequential step 40 to 95 is the part of interest. I'm checking out maybeNeedDwnMsg_StateMachine.

dehann (Member Author) commented Sep 25, 2020

Yup, here is the call in (3) that sets off the cycle in (1) -- i.e. the last step from any neighboring CSMs:

prepPutCliqueStatusMsgUp!(csmc, :needdownmsg)

So now we can look at how to resolve the process inside only CSM (1) with the pre-knowledge from (2) and (3)...

dehann (Member Author) commented Sep 25, 2020

So we need a way to redirect the loop (#754 (comment)) towards one of the slow steps -- either:

  • slowIfChildrenNotUpsolved_StateMachine -- "Delay loop if waiting on upsolves to complete.", or
  • slowWhileInit_StateMachine -- "Function to iterate through while initializing various child cliques that start off needdownmsg."

dehann (Member Author) commented Sep 25, 2020

Let's start with slowWhileInit_StateMachine as the most promising.

function slowWhileInit_StateMachine(csmc::CliqStateMachineContainer)
  if doAnyChildrenNeedDwnMsg(csmc.tree, csmc.cliq)
    infocsm(csmc, "7e, slowWhileInit_StateMachine, must wait for new child messages.")
    # wait for THIS clique to be notified (PUSH NOTIFICATION FROM CHILDREN at `prepPutCliqueStatusMsgUp!`)
    wait(getSolveCondition(csmc.cliq))
  end
  # go to 8f
  return prepInitUp_StateMachine
end

Yup, that looks pretty good. The question now is which of the cycle members (#754 (comment)) should divert out in this case. The best approach is to read all the code in those four cycle functions and see which part is closest to the current case...
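
For reference, the wait(getSolveCondition(...)) push-notification mechanism that slowWhileInit_StateMachine relies on is Julia's standard Condition wait/notify pattern; a minimal standalone sketch (not IIF code):

# Minimal wait/notify sketch using Base.Condition (illustration only, not IIF code).
cond = Condition()

waiter = @async begin
    @info "parent clique: blocking until a child pushes a status update"
    wait(cond)                   # suspends this task until notified
    @info "parent clique: woken up, re-checking child statuses"
end

sleep(0.1)                       # let the waiter block first
@info "child clique: pushing status update (the push notification)"
notify(cond)                     # wakes every task currently waiting on cond

wait(waiter)

Note that a notify which fires before the corresponding wait is simply lost, so the ordering of these status pushes matters.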

dehann added a commit that referenced this issue Oct 1, 2020
dehann reopened this Oct 1, 2020
dehann (Member Author) commented Oct 1, 2020

Okay, more information -- in the "good case", clique 3 waits to go from :needdownmsg --> :initialized only after sibling clique 2 gets to :upsolved. In the bad case, clique 3 changes to :initialized much earlier, while clique 2 is still :initialized for quite some time.

Good case:

3.14   10:57:18.148 dwnInitSiblingWait   | needdownmsg P 1null        C 7needdownmsg |S| 2needdownmsg 5needdownmsg 
3.15   10:57:19.329 waitChangeOnParent   | needdownmsg P 1initialized C 7needdownmsg |S| 2needdownmsg 5needdownmsg 

3.19   10:57:49.605 dwnInitSiblingWait   | needdownmsg P 1initialized C 7needdownmsg |S| 2upsolved 5needdownmsg 
3.20   10:57:50.279 tryDwnInitCliq       | needdownmsg P 1initialized C 7needdownmsg |S| 2upsolved 5needdownmsg 
3.21   10:57:52.131 rmMsgLikelihoodsAf   | initialized P 1initialized C 7needdownmsg |S| 2upsolved 5needdownmsg 

Bad case:

3.14   10:53:11.628 dwnInitSiblingWait   | needdownmsg P 1null        C 7needdownmsg |S| 2needdownmsg 5needdownmsg 
3.15   10:53:12.641 tryDwnInitCliq       | needdownmsg P 1initialized C 7needdownmsg |S| 2needdownmsg 5needdownmsg 
3.16   10:53:15.991 rmMsgLikelihoodsAf   | initialized P 1initialized C 7needdownmsg |S| 2initialized 5needdownmsg 

So why, in the bad case, does 3.14 dwnInitSiblingWait go directly to tryDwnInitCliq, while in the good case (with seemingly similar neighboring statuses) it first goes to waitChangeOnParent?

dehann (Member Author) commented Oct 1, 2020

CSM log on clique 3:

Good case:

┌ Info: 10:57:18.689 | 3---3| l1 @ needdownmsg | 8j, dwnInitSiblingWaitOrder_StateMachine, dwinmsgs keys=[:x0]
┌ Info: 10:57:18.69 | 3---3| l1 @ needdownmsg | 8j, dwnInitSiblingWaitOrder_StateMachine, sdims=Dict(:x0 => 3.0,:x7 => 0.0,:l1 => 0.0)
┌ Info: cliq 3, updateCliqSolvableDims! -- cleared solvableDims
┌ Info: cliq 3, updateCliqSolvableDims! -- updated
┌ Info: getSiblingsDelayOrder -- number siblings=3, sibidx=2
┌ Info: getSiblingsDelayOrder -- allinters=[0 1 1; 0 0 1; 0 0 0]
┌ Info: getSiblingsDelayOrder -- rows=[0 1 2]
┌ Info: getSiblingsDelayOrder -- rows=[2; 1; 0]
┌ Info: getSiblingsDelayOrder -- all blocking: sum(remainingmask) == length(stat), stat=[:needdownmsg, :needdownmsg, :needdownmsg]
┌ Info: getSiblingsDelayOrder -- candidates=Bool[1, 1, 0], maxcan=[1, 2], rows=[0 1 2]
┌ Info: getSiblingsDelayOrder -- not a max and should block
┌ Info: 10:57:19.329 | 3---3| l1 @ needdownmsg | 8j, dwnInitSiblingWaitOrder_StateMachine, prioritize
┌ Info: getCliqSiblingsPriorityInitOrder, idx=1 of 3, x2 length solvableDims=1
┌ Info: getCliqSiblingsPriorityInitOrder, finished idx=1 of 3, length solvableDims=1
┌ Info: getCliqSiblingsPriorityInitOrder, idx=2 of 3, l1 length solvableDims=1
┌ Info: getCliqSiblingsPriorityInitOrder, finished idx=2 of 3, length solvableDims=1
┌ Info: getCliqSiblingsPriorityInitOrder, idx=3 of 3, x6 length solvableDims=1
┌ Info: getCliqSiblingsPriorityInitOrder, finished idx=3 of 3, length solvableDims=1
┌ Info: getCliqSiblingsPriorityInitOrder, done p=[1, 2, 3]
┌ Info: 10:57:19.329 | 3---3| l1 @ needdownmsg | 8j, dwnInitSiblingWaitOrder_StateMachine, 1, true, true, solord = [2, 3, 5]

Bad case

┌ Info: 10:53:12.036 | 3---3| l1 @ needdownmsg | 8j, dwnInitSiblingWaitOrder_StateMachine, dwinmsgs keys=[:x0]
┌ Info: 10:53:12.445 | 3---3| l1 @ needdownmsg | 8j, dwnInitSiblingWaitOrder_StateMachine, sdims=Dict(:x0 => 3.0,:x7 => 0.0,:l1 => 0.0)
┌ Info: cliq 3, updateCliqSolvableDims! -- cleared solvableDims
┌ Info: cliq 3, updateCliqSolvableDims! -- updated
┌ Info: getSiblingsDelayOrder -- number siblings=3, sibidx=2
┌ Info: getSiblingsDelayOrder -- allinters=[0 1 1; 0 0 1; 0 0 0]
┌ Info: getSiblingsDelayOrder -- rows=[0 1 2]
┌ Info: getSiblingsDelayOrder -- rows=[2; 1; 0]
┌ Info: getSiblingsDelayOrder -- all blocking: sum(remainingmask) == length(stat), stat=[:needdownmsg, :needdownmsg, :needdownmsg]
┌ Info: getSiblingsDelayOrder -- candidates=Bool[1, 1, 0], maxcan=[1, 2], rows=[0 1 2]
┌ Info: getSiblingsDelayOrder -- not a max and should block
┌ Info: 10:53:12.641 | 3---3| l1 @ needdownmsg | 8j, dwnInitSiblingWaitOrder_StateMachine, prioritize
┌ Info: getCliqSiblingsPriorityInitOrder, idx=1 of 3, x2 length solvableDims=1
┌ Info: getCliqSiblingsPriorityInitOrder, finished idx=1 of 3, length solvableDims=1
┌ Info: getCliqSiblingsPriorityInitOrder, idx=2 of 3, l1 length solvableDims=1
┌ Info: getCliqSiblingsPriorityInitOrder, finished idx=2 of 3, length solvableDims=1
┌ Info: getCliqSiblingsPriorityInitOrder, idx=3 of 3, x6 length solvableDims=1
┌ Info: getCliqSiblingsPriorityInitOrder, finished idx=3 of 3, length solvableDims=1
┌ Info: getCliqSiblingsPriorityInitOrder, done p=[2, 1, 3]
┌ Info: 10:53:12.641 | 3---3| l1 @ needdownmsg | 8j, dwnInitSiblingWaitOrder_StateMachine, 1, true, true, solord = [3, 2, 5]

So the difference is in the solve order: good [2, 3, 5] vs. bad [3, 2, 5].

dehann (Member Author) commented Oct 1, 2020

function getCliqSiblingsPriorityInitOrder( tree::AbstractBayesTree,
                                           prnt::TreeClique,
                                           logger=ConsoleLogger() )::Vector{Int}
  #
  sibs = getChildren(tree, prnt)
  len = length(sibs)
  tdims = Vector{Int}(undef, len)
  sidx = Vector{Int}(undef, len)
  for idx in 1:len
    cliqd = getCliqueData(sibs[idx])
    with_logger(logger) do
      @info "getCliqSiblingsPriorityInitOrder, idx=$idx of $len, $(cliqd.frontalIDs[1]) length solvableDims=$(length(cliqd.solvableDims.data))"
    end
    flush(logger.stream)
    sidims = fetchCliqSolvableDims(sibs[idx])
    sidx[idx] = sibs[idx].index
    tdims[idx] = sum(collect(values(sidims)))
    with_logger(logger) do
      @info "getCliqSiblingsPriorityInitOrder, finished idx=$idx of $len, length solvableDims=$(length(cliqd.solvableDims.data))"
    end
    flush(logger.stream)
  end
  p = sortperm(tdims, rev=true)
  with_logger(logger) do
    @info "getCliqSiblingsPriorityInitOrder, done p=$p"
  end
  return sidx[p]
end

The most troubling part is the separate fetch:

sidims = fetchCliqSolvableDims(sibs[idx])

Which probably means that this priority issue ties back to #910.
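
To make the effect of that fetch concrete, here is a tiny standalone sketch (hypothetical dims values, but the same sortperm logic as above): whatever solvable-dims snapshot happens to be fetched at that instant determines the returned sibling order.

# Illustration only: the priority order is just sortperm over the solvable-dims
# totals fetched at that instant, so a timing-dependent snapshot flips the order.
sidx = [2, 3, 5]                          # sibling clique ids, as in the logs

tdims_good = [3, 1, 0]                    # hypothetical snapshot in the "good" run
tdims_bad  = [1, 3, 0]                    # hypothetical snapshot in the "bad" run

p_good = sortperm(tdims_good, rev=true)   # -> [1, 2, 3]
p_bad  = sortperm(tdims_bad,  rev=true)   # -> [2, 1, 3]

@show sidx[p_good]                        # [2, 3, 5], the "good" solve order
@show sidx[p_bad]                         # [3, 2, 5], the "bad" solve order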

dehann (Member Author) commented Oct 4, 2020

Okay, with current master I still get solve order as the main issue; however, the hex example can be forced to complete by adding the following simple delay:

idb = [6=>(determineCliqNeedDownMsg_StateMachine=>10);]

The difference in this case is in the sibling init-solve order between the "good" and "bad" cases. The "good" case solves in order:
CSM:

6.25   02:15:31.771 determineCliqNeedD   | needdownmsg P 2initialized C ---- |S| 5upsolved 
6.26   02:15:31.776 dwnInitSiblingWait   | needdownmsg P 2initialized C ---- |S| 5upsolved 

Info: 01:08:12.68 | 6---6| x3 @ needdownmsg | 8j, dwnInitSiblingWaitOrder_StateMachine, 2, true, false, solord = [6, 5]

and the "bad" case:

6.25   01:10:30.424 determineCliqNeedD   | needdownmsg P 2initialized C ---- |S| 5initialized 
6.26   01:10:30.429 dwnInitSiblingWait   | needdownmsg P 2initialized C ---- |S| 5initialized 

Info: 01:10:30.463 | 6---6| x3 @ needdownmsg | 8j, dwnInitSiblingWaitOrder_StateMachine, 2, true, false, solord = [5, 6]

The CSM sequencing for clique 6 either has sibling 5 as :initialized (bad) or :upsolved (good):
[image: Screenshot from 2020-10-04 11-27-05]

The forced delay shown above ensures that the "good" sequence occurs (for debug purposes only). On the first run, compile time adds enough delay to clique 6 to induce the error; consecutive runs solve fine. The forced delay on 6 is long enough for both the precompile and already-compiled cases to have CSM complete its solve.

dehann (Member Author) commented Oct 4, 2020

My current view of the problem: the joint synchronization between solvableDims as a channel (directly from siblings) and the message/status information passed down from the parent is producing a deadlock condition. Solving #910 is the best-suited way to get a deterministic solution that removes any possibility of this deadlock occurring.
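
As a generic illustration (not IIF code) of how two independent synchronization paths can deadlock, here is a minimal Julia sketch where each of two tasks blocks on a channel that only the other task would fill, and only after its own take! returns:

# Minimal deadlock sketch (illustration only, not IIF code):
# each task is blocked on a take! that the other task would only satisfy
# after its own take! returns, so neither can make progress.
chA = Channel{Symbol}(1)   # stand-in for "solvableDims from a sibling"
chB = Channel{Symbol}(1)   # stand-in for "status passed down from the parent"

t1 = @async begin
    take!(chA)             # waits for the sibling-style message first
    put!(chB, :status)     # would unblock t2, but is never reached
end

t2 = @async begin
    take!(chB)             # waits for the parent-style message first
    put!(chA, :dims)       # would unblock t1, but is never reached
end

sleep(0.5)
@info "tasks finished?" istaskdone(t1) istaskdone(t2)   # false, false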

dehann added a commit that referenced this issue Oct 7, 2020
dehann changed the title CSM endless cycle MWE fail case → CSM init endless cycle MWE fail case Oct 8, 2020
dehann (Member Author) commented Oct 8, 2020

Workaround fix for the current first-compile test failure in IIF on current master (hex test in testBasicTreeInit.jl): I'm locally using a delay hack while working on #910,

injectDelayBefore = [6=>(determineCliqNeedDownMsg_StateMachine=>10);]

See hack in code here:

idb = [6=>(determineCliqNeedDownMsg_StateMachine=>10);]
# mkpath(getLogPath(fg))
# verbosefid = open(joinLogPath(fg, "csmVerbose.log"),"w")
tree, smt, hist = solveTree!(fg, timeout=70, injectDelayBefore=idb ) # , verbose=true, verbosefid=verbosefid)

dehann (Member Author) commented Oct 8, 2020

Ah, not great: this issue has been re-introduced with #958.

See example here:
#958 (comment)

dehann (Member Author) commented Oct 8, 2020

We should add that test to RoME to help catch it in the future.

Affie (Member) commented Oct 8, 2020

Perhaps I should add this to tree init tests?

# linear octo
N = 8
fg = generateCanonicalFG_lineStep(N;
                                  graphinit=false,
                                  poseEvery=1,
                                  landmarkEvery=N+1,
                                  posePriorsAt=[0],
                                  landmarkPriorsAt=[],
                                  sightDistance=N+1)
deleteFactor!.(fg, [Symbol("x$(i)lm0f1") for i=1:(N-1)])
getSolverParams(fg).graphinit = false
getSolverParams(fg).treeinit = true
smtasks = Task[]
tree, smt, hists = IIF.solveTree_X!(fg; smtasks=smtasks);
for var in sortDFG(ls(fg))
  sppe = getVariable(fg, var) |> getPPE |> IIF.getSuggestedPPE
  println("Testing ", var, ": ", sppe)
  @test isapprox(sppe[1], parse(Int, string(var)[end]), atol=0.15)
end

It is the same structure as generateCanonicalFG_Circle(8)

EDIT: see #959

Affie (Member) commented Oct 8, 2020

It doesn't seem to be #958; I tested whether it was #958 or #957, and it fails before both of them.

dehann (Member Author) commented Oct 12, 2020

Obsolete with the resolution of #855 and the decision to only use xstroke-take for the final consolidation of CSMs.
