CSM init endless cycle MWE fail case #754

Closed · dehann opened this issue Jun 20, 2020 · 47 comments · Fixed by #936

dehann (Member) commented Jun 20, 2020

Found an easy fail case in CSM; it relates to an existing test:

fg = generateCanonicalFG_CaesarRing1D(graphinit=false)

MWE

Cliques 5 and 6 cycle endlessly in their solves if the following clique solve sequence is used:

fg = generateCanonicalFG_CaesarRing1D(graphinit=false)
getSolverParams(fg).graphinit = false
getSolverParams(fg).limititers = 50
# assumes the Bayes tree `tree` has already been built from `fg`
TS = Vector{Task}(undef, 6)
TS[5] = solveCliq!(fg, tree, :x5, async=true)  # , recordcliq=true
TS[6] = solveCliq!(fg, tree, :x3, async=true)  # , recordcliq=true
dehann added this to the v0.0.x milestone Jun 20, 2020
dehann changed the title CSM endless cycle fail case → CSM endless cycle MWE fail case Jun 20, 2020
dehann modified the milestones: v0.0.x, v0.16.0 Sep 19, 2020
dehann closed this as completed Sep 19, 2020
dehann (Member Author) commented Sep 19, 2020

Also added as a new test in IIF: testBasicTreeInit.jl.

Affie (Member) commented Sep 22, 2020

Another basic endless-cycle fail case (after #459):

using IncrementalInference
fg = generateCanonicalFG_lineStep(5; 
                                  poseEvery=1, 
                                  landmarkEvery=5, 
                                  posePriorsAt=[0,2], 
                                  sightDistance=4,
                                  solverParams=SolverParams(algorithms=[:default, :parametric]))
                                  
getSolverParams(fg).graphinit = false
getSolverParams(fg).treeinit = true
getSolverParams(fg).limititers = 50
smtasks = Task[]
tree, smt, hist = solveTree!(fg; smtasks=smtasks, verbose=true, timeout=50, recordcliqs=ls(fg));

[image]

Affie reopened this Sep 22, 2020
Affie (Member) commented Sep 23, 2020

When clique 1 hits limititers = 50:
[image]

Giving clique 2 a chance to solve results in an endless cycle in this state:
[image]

dehann (Member Author) commented Sep 25, 2020

To debug, the packages were at DFG v0.10.5 and IIF at commit 3a41031.


I'm going to debug a little by adding:

mkpath(getLogPath(fg))
fid = open(joinLogPath(fg,"csmVerbose.log"), "w")
#solveTree!(...;  verbosefid=fid, verbose=true, ...)
tree, smt, hist = solveTree!(fg; smtasks=smtasks, verbose=true, verbosefid=fid, timeout=50, recordcliqs=ls(fg));

# after finished
close(fid)


open(joinLogPath(fg, "csmLogicalReconstructMax.log"),"w") do io
  IIF.reconstructCSMHistoryLogical(getLogPath(fg), fid=io)
end

dehann (Member Author) commented Sep 25, 2020

Here it is:
csmLogicalReconstructMax.log

dehann (Member Author) commented Sep 25, 2020

So this is the loop on (1):

trafficRe null
maybeNeed null
determine null
blockUnti null

While the children are (2)=>init and (3)=>null.

The first question for me is why blockUntilChildrenHaveStatus does not stop the loop, since one child is null...
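
Purely as an illustrative sketch (not the actual IIF implementation), one way a "block until children have status" step can fail to stop a loop is a check-then-wait on a status flag that has already moved past :null by the time it is checked -- the step then never blocks and the parent just keeps cycling:

# Illustrative Julia sketch only, not IIF code: a check-then-wait race on a status flag.
# The "parent" only waits while the child status is :null; once the child has flipped
# to :needdownmsg, the parent never blocks again and simply spins through its loop.
status = Ref(:null)
cond = Condition()

child = @async begin
    sleep(0.01)                  # child transitions almost immediately
    status[] = :needdownmsg
    notify(cond)
end

parent = @async begin
    for i in 1:3
        if status[] == :null     # stale check: status may already have changed
            wait(cond)           # only blocks if we got here before the flip
        end
        @info "parent loop pass $i, child status = $(status[])"
        sleep(0.02)
    end
end

wait(child); wait(parent)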

dehann (Member Author) commented Sep 25, 2020

We can also look at the log on (1):
log.txt

dehann (Member Author) commented Sep 25, 2020

Okay, so the log from (1) shows something else: it says the endless cycle occurs with (3)=>needdownmsg -- not null as shown earlier:

┌ Info: 23:22:02.713 | 1---1| x2 @ null | 4e, blockUntilChildrenHaveStatus_StateMachine, maybe wait cliq=2, child status=upsolved.
└ @ IncrementalInference /home/dehann/.julia/dev/IncrementalInference/src/CliqStateMachine.jl:63
┌ Info: 23:22:02.713 | 1---1| x2 @ null | 4e, blockUntilChildrenHaveStatus_StateMachine, maybe wait cliq=3, child status=needdownmsg.
└ @ IncrementalInference /home/dehann/.julia/dev/IncrementalInference/src/CliqStateMachine.jl:63
┌ Info: 23:22:02.721 | 1---1| x2 @ null | 4b, trafficRedirectConsolidate459_StateMachine, cliqst=null
└ @ IncrementalInference /home/dehann/.julia/dev/IncrementalInference/src/CliqStateMachine.jl:63

Note the runaway cycle happens within a few milliseconds after 23:22:02.713.

The next place to look is back in the Logical.log at what happened with (3) between null and needdownmsg:

39m trafficRe null
40m maybeNeed null
95m blockSibl need
96m waitChang need

The jump from global sequential step 40 to 95 is the part of interest. I'm checking out maybeNeedDwnMsg_StateMachine.

dehann (Member Author) commented Sep 25, 2020

Yup, here is the call in (3) that sets off the cycle in (1) -- i.e. the last step from any neighboring CSMs:

prepPutCliqueStatusMsgUp!(csmc, :needdownmsg)

So now we can look at how to resolve the process inside only CSM (1) with the pre-knowledge from (2) and (3)...

dehann (Member Author) commented Sep 25, 2020

So we need a way to redirect the loop (#754 (comment)) towards one of the slow steps -- either:

  • slowIfChildrenNotUpsolved_StateMachine -- "Delay loop if waiting on upsolves to complete.", or
  • slowWhileInit_StateMachine -- "Function to iterate through while initializing various child cliques that start off needdownmsg."

dehann (Member Author) commented Sep 25, 2020

Let's start with slowWhileInit_StateMachine as the most promising.

function slowWhileInit_StateMachine(csmc::CliqStateMachineContainer)
  if doAnyChildrenNeedDwnMsg(csmc.tree, csmc.cliq)
    infocsm(csmc, "7e, slowWhileInit_StateMachine, must wait for new child messages.")
    # wait for THIS clique to be notified (PUSH NOTIFICATION FROM CHILDREN at `prepPutCliqueStatusMsgUp!`)
    wait(getSolveCondition(csmc.cliq))
  end
  # go to 8f
  return prepInitUp_StateMachine
end

Yup, that looks pretty good. The question now is which of the cycle members (#754 (comment)) should divert out in this case. The best approach is to read all the code in those four cycle functions and see which part is closest to the current case...
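
For reference, the wait(getSolveCondition(...)) push-notification mechanism that slowWhileInit_StateMachine relies on is Julia's standard Condition wait/notify pattern; a minimal standalone sketch (not IIF code):

# Minimal wait/notify sketch using Base.Condition (illustration only, not IIF code).
cond = Condition()

waiter = @async begin
    @info "parent clique: blocking until a child pushes a status update"
    wait(cond)                   # suspends this task until notified
    @info "parent clique: woken up, re-checking child statuses"
end

sleep(0.1)                       # let the waiter block first
@info "child clique: pushing status update (the push notification)"
notify(cond)                     # wakes every task currently waiting on cond

wait(waiter)

Note that a notify which fires before the corresponding wait is simply lost, so the ordering of these status pushes matters.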

dehann added a commit that referenced this issue Oct 1, 2020
dehann reopened this Oct 1, 2020
dehann (Member Author) commented Oct 1, 2020

Okay, more information -- in the "good case", clique 3 waits to go from :needdownmsg --> :initialized only after sibling clique 2 gets to :upsolved. In the bad case, clique 3 changes to :initialized much earlier, while clique 2 is still :initialized for quite some time.

Good case:

3.14   10:57:18.148 dwnInitSiblingWait   | needdownmsg P 1null        C 7needdownmsg |S| 2needdownmsg 5needdownmsg 
3.15   10:57:19.329 waitChangeOnParent   | needdownmsg P 1initialized C 7needdownmsg |S| 2needdownmsg 5needdownmsg 

3.19   10:57:49.605 dwnInitSiblingWait   | needdownmsg P 1initialized C 7needdownmsg |S| 2upsolved 5needdownmsg 
3.20   10:57:50.279 tryDwnInitCliq       | needdownmsg P 1initialized C 7needdownmsg |S| 2upsolved 5needdownmsg 
3.21   10:57:52.131 rmMsgLikelihoodsAf   | initialized P 1initialized C 7needdownmsg |S| 2upsolved 5needdownmsg 

Bad case:

3.14   10:53:11.628 dwnInitSiblingWait   | needdownmsg P 1null        C 7needdownmsg |S| 2needdownmsg 5needdownmsg 
3.15   10:53:12.641 tryDwnInitCliq       | needdownmsg P 1initialized C 7needdownmsg |S| 2needdownmsg 5needdownmsg 
3.16   10:53:15.991 rmMsgLikelihoodsAf   | initialized P 1initialized C 7needdownmsg |S| 2initialized 5needdownmsg 

So why, in the bad case, does 3.14 dwnInitSiblingWait go directly to tryDwnInitCliq, while in the good case (with seemingly similar neighboring statuses) it first goes to waitChangeOnParent?

dehann (Member Author) commented Oct 1, 2020

CSM log on clique 3:

Good case:

┌ Info: 10:57:18.689 | 3---3| l1 @ needdownmsg | 8j, dwnInitSiblingWaitOrder_StateMachine, dwinmsgs keys=[:x0]
┌ Info: 10:57:18.69 | 3---3| l1 @ needdownmsg | 8j, dwnInitSiblingWaitOrder_StateMachine, sdims=Dict(:x0 => 3.0,:x7 => 0.0,:l1 => 0.0)
┌ Info: cliq 3, updateCliqSolvableDims! -- cleared solvableDims
┌ Info: cliq 3, updateCliqSolvableDims! -- updated
┌ Info: getSiblingsDelayOrder -- number siblings=3, sibidx=2
┌ Info: getSiblingsDelayOrder -- allinters=[0 1 1; 0 0 1; 0 0 0]
┌ Info: getSiblingsDelayOrder -- rows=[0 1 2]
┌ Info: getSiblingsDelayOrder -- rows=[2; 1; 0]
┌ Info: getSiblingsDelayOrder -- all blocking: sum(remainingmask) == length(stat), stat=[:needdownmsg, :needdownmsg, :needdownmsg]
┌ Info: getSiblingsDelayOrder -- candidates=Bool[1, 1, 0], maxcan=[1, 2], rows=[0 1 2]
┌ Info: getSiblingsDelayOrder -- not a max and should block
┌ Info: 10:57:19.329 | 3---3| l1 @ needdownmsg | 8j, dwnInitSiblingWaitOrder_StateMachine, prioritize
┌ Info: getCliqSiblingsPriorityInitOrder, idx=1 of 3, x2 length solvableDims=1
┌ Info: getCliqSiblingsPriorityInitOrder, finished idx=1 of 3, length solvableDims=1
┌ Info: getCliqSiblingsPriorityInitOrder, idx=2 of 3, l1 length solvableDims=1
┌ Info: getCliqSiblingsPriorityInitOrder, finished idx=2 of 3, length solvableDims=1
┌ Info: getCliqSiblingsPriorityInitOrder, idx=3 of 3, x6 length solvableDims=1
┌ Info: getCliqSiblingsPriorityInitOrder, finished idx=3 of 3, length solvableDims=1
┌ Info: getCliqSiblingsPriorityInitOrder, done p=[1, 2, 3]
┌ Info: 10:57:19.329 | 3---3| l1 @ needdownmsg | 8j, dwnInitSiblingWaitOrder_StateMachine, 1, true, true, solord = [2, 3, 5]

Bad case

┌ Info: 10:53:12.036 | 3---3| l1 @ needdownmsg | 8j, dwnInitSiblingWaitOrder_StateMachine, dwinmsgs keys=[:x0]
┌ Info: 10:53:12.445 | 3---3| l1 @ needdownmsg | 8j, dwnInitSiblingWaitOrder_StateMachine, sdims=Dict(:x0 => 3.0,:x7 => 0.0,:l1 => 0.0)
┌ Info: cliq 3, updateCliqSolvableDims! -- cleared solvableDims
┌ Info: cliq 3, updateCliqSolvableDims! -- updated
┌ Info: getSiblingsDelayOrder -- number siblings=3, sibidx=2
┌ Info: getSiblingsDelayOrder -- allinters=[0 1 1; 0 0 1; 0 0 0]
┌ Info: getSiblingsDelayOrder -- rows=[0 1 2]
┌ Info: getSiblingsDelayOrder -- rows=[2; 1; 0]
┌ Info: getSiblingsDelayOrder -- all blocking: sum(remainingmask) == length(stat), stat=[:needdownmsg, :needdownmsg, :needdownmsg]
┌ Info: getSiblingsDelayOrder -- candidates=Bool[1, 1, 0], maxcan=[1, 2], rows=[0 1 2]
┌ Info: getSiblingsDelayOrder -- not a max and should block
┌ Info: 10:53:12.641 | 3---3| l1 @ needdownmsg | 8j, dwnInitSiblingWaitOrder_StateMachine, prioritize
┌ Info: getCliqSiblingsPriorityInitOrder, idx=1 of 3, x2 length solvableDims=1
┌ Info: getCliqSiblingsPriorityInitOrder, finished idx=1 of 3, length solvableDims=1
┌ Info: getCliqSiblingsPriorityInitOrder, idx=2 of 3, l1 length solvableDims=1
┌ Info: getCliqSiblingsPriorityInitOrder, finished idx=2 of 3, length solvableDims=1
┌ Info: getCliqSiblingsPriorityInitOrder, idx=3 of 3, x6 length solvableDims=1
┌ Info: getCliqSiblingsPriorityInitOrder, finished idx=3 of 3, length solvableDims=1
┌ Info: getCliqSiblingsPriorityInitOrder, done p=[2, 1, 3]
┌ Info: 10:53:12.641 | 3---3| l1 @ needdownmsg | 8j, dwnInitSiblingWaitOrder_StateMachine, 1, true, true, solord = [3, 2, 5]

So the difference is in the solve order: good [2, 3, 5] vs. bad [3, 2, 5].

dehann (Member Author) commented Oct 1, 2020

function getCliqSiblingsPriorityInitOrder( tree::AbstractBayesTree,
                                           prnt::TreeClique,
                                           logger=ConsoleLogger() )::Vector{Int}
  #
  sibs = getChildren(tree, prnt)
  len = length(sibs)
  tdims = Vector{Int}(undef, len)
  sidx = Vector{Int}(undef, len)
  for idx in 1:len
    cliqd = getCliqueData(sibs[idx])
    with_logger(logger) do
      @info "getCliqSiblingsPriorityInitOrder, idx=$idx of $len, $(cliqd.frontalIDs[1]) length solvableDims=$(length(cliqd.solvableDims.data))"
    end
    flush(logger.stream)
    sidims = fetchCliqSolvableDims(sibs[idx])
    sidx[idx] = sibs[idx].index
    tdims[idx] = sum(collect(values(sidims)))
    with_logger(logger) do
      @info "getCliqSiblingsPriorityInitOrder, finished idx=$idx of $len, length solvableDims=$(length(cliqd.solvableDims.data))"
    end
    flush(logger.stream)
  end
  p = sortperm(tdims, rev=true)
  with_logger(logger) do
    @info "getCliqSiblingsPriorityInitOrder, done p=$p"
  end
  return sidx[p]
end

The most troubling part is the separate fetch:

sidims = fetchCliqSolvableDims(sibs[idx])

Which probably means that this priority issue ties back to #910.
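
To make the effect of that fetch concrete, here is a tiny standalone sketch (hypothetical dims values, but the same sortperm logic as above): whatever solvable-dims snapshot happens to be fetched at that instant determines the returned sibling order.

# Illustration only: the priority order is just sortperm over the solvable-dims
# totals fetched at that instant, so a timing-dependent snapshot flips the order.
sidx = [2, 3, 5]                          # sibling clique ids, as in the logs

tdims_good = [3, 1, 0]                    # hypothetical snapshot in the "good" run
tdims_bad  = [1, 3, 0]                    # hypothetical snapshot in the "bad" run

p_good = sortperm(tdims_good, rev=true)   # -> [1, 2, 3]
p_bad  = sortperm(tdims_bad,  rev=true)   # -> [2, 1, 3]

@show sidx[p_good]                        # [2, 3, 5], the "good" solve order
@show sidx[p_bad]                         # [3, 2, 5], the "bad" solve order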

dehann (Member Author) commented Oct 4, 2020

Okay, with current master I still get solve order as the main issue; however, the hex example can be forced to complete by adding the following simple delay:

idb = [6=>(determineCliqNeedDownMsg_StateMachine=>10);]

The difference in this case is in the sibling init-solve order between the "good" and "bad" cases. The "good" case solves in order:
CSM:

6.25   02:15:31.771 determineCliqNeedD   | needdownmsg P 2initialized C ---- |S| 5upsolved 
6.26   02:15:31.776 dwnInitSiblingWait   | needdownmsg P 2initialized C ---- |S| 5upsolved 

Info: 01:08:12.68 | 6---6| x3 @ needdownmsg | 8j, dwnInitSiblingWaitOrder_StateMachine, 2, true, false, solord = [6, 5]

and the "bad" case:

6.25   01:10:30.424 determineCliqNeedD   | needdownmsg P 2initialized C ---- |S| 5initialized 
6.26   01:10:30.429 dwnInitSiblingWait   | needdownmsg P 2initialized C ---- |S| 5initialized 

Info: 01:10:30.463 | 6---6| x3 @ needdownmsg | 8j, dwnInitSiblingWaitOrder_StateMachine, 2, true, false, solord = [5, 6]

The CSM sequencing for clique 6 either has sibling 5 as :initialized (bad) or :upsolved (good):
[image: Screenshot from 2020-10-04 11-27-05]

The forced delay shown above ensures that the "good" sequence occurs (for debug purposes only). On the first run, compile time adds enough delay to clique 6 to induce the error; consecutive runs solve fine. The forced delay on 6 is long enough for both the precompile and already-compiled cases to have CSM complete its solve.

dehann (Member Author) commented Oct 4, 2020

My current view of the problem: the joint synchronization between solvableDims as a channel (directly from siblings) and the message/status information passed down from the parent is producing a deadlock condition. Solving #910 is the best-suited way to get a deterministic solution that removes any possibility of this deadlock occurring.
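
As a generic illustration (not IIF code) of how two independent synchronization paths can deadlock, here is a minimal Julia sketch where each of two tasks blocks on a channel that only the other task would fill, and only after its own take! returns:

# Minimal deadlock sketch (illustration only, not IIF code):
# each task is blocked on a take! that the other task would only satisfy
# after its own take! returns, so neither can make progress.
chA = Channel{Symbol}(1)   # stand-in for "solvableDims from a sibling"
chB = Channel{Symbol}(1)   # stand-in for "status passed down from the parent"

t1 = @async begin
    take!(chA)             # waits for the sibling-style message first
    put!(chB, :status)     # would unblock t2, but is never reached
end

t2 = @async begin
    take!(chB)             # waits for the parent-style message first
    put!(chA, :dims)       # would unblock t1, but is never reached
end

sleep(0.5)
@info "tasks finished?" istaskdone(t1) istaskdone(t2)   # false, false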

dehann added a commit that referenced this issue Oct 7, 2020
dehann changed the title CSM endless cycle MWE fail case → CSM init endless cycle MWE fail case Oct 8, 2020
dehann (Member Author) commented Oct 8, 2020

Workaround fix for the current first-compile test failure in IIF on current master (hex test in testBasicTreeInit.jl): I'm locally using a delay hack while working on #910,

injectDelayBefore = [6=>(determineCliqNeedDownMsg_StateMachine=>10);]

See hack in code here:

idb = [6=>(determineCliqNeedDownMsg_StateMachine=>10);]
# mkpath(getLogPath(fg))
# verbosefid = open(joinLogPath(fg, "csmVerbose.log"),"w")
tree, smt, hist = solveTree!(fg, timeout=70, injectDelayBefore=idb ) # , verbose=true, verbosefid=verbosefid)

dehann (Member Author) commented Oct 8, 2020

Ah, not great: this issue has been re-introduced with #958.

See example here:
#958 (comment)

dehann (Member Author) commented Oct 8, 2020

We should add that test to RoME to help catch it in the future.

Affie (Member) commented Oct 8, 2020

Perhaps I should add this to tree init tests?

# linear octo
N = 8
fg = generateCanonicalFG_lineStep(N;
                                  graphinit=false,
                                  poseEvery=1,
                                  landmarkEvery=N+1,
                                  posePriorsAt=[0],
                                  landmarkPriorsAt=[],
                                  sightDistance=N+1)
deleteFactor!.(fg, [Symbol("x$(i)lm0f1") for i=1:(N-1)])
getSolverParams(fg).graphinit = false
getSolverParams(fg).treeinit = true
smtasks = Task[]
tree, smt, hists = IIF.solveTree_X!(fg; smtasks=smtasks);
for var in sortDFG(ls(fg))
  sppe = getVariable(fg, var) |> getPPE |> IIF.getSuggestedPPE
  println("Testing ", var, ": ", sppe)
  @test isapprox(sppe[1], parse(Int, string(var)[end]), atol=0.15)
end

It is the same structure as generateCanonicalFG_Circle(8)

EDIT: see #959

Affie (Member) commented Oct 8, 2020

It doesn't seem to be #958; I tested whether it was #958 or #957, and it fails before both of them.

dehann (Member Author) commented Oct 12, 2020

Obsolete with the resolution of #855 and the decision to only use xstroke-take for the final consolidation of CSMs.
