Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

solver: prevent edge merge to inactive states #4887

Merged
merged 3 commits into from
May 31, 2024

Conversation

sipsma
Copy link
Collaborator

@sipsma sipsma commented Apr 26, 2024

Before this, it was possible for an edge merge to happen to a target
edge/state that is no longer in the actives map, e.g.

  1. Job A solves some edges
  2. Job B solves some edges that get merged with job A's edges
  3. Job A is discarded
  4. Job C solves some edges that get merged with job B's edges, which
    recursively end up merging to job A's, which no longer exist in the
    actives map.

While this doesn't always result in an error, given the right state and
order of operations this can result in getState or getEdge being
called on the "stale" edges that are inactive and an error. E.g. if a
stale dep transitions from desired state cache-fast to cache-slow,
the slow cache logic will call getState on it and return a compute cache error.

I also suspect this is the same root cause behind inconsistent graph state errors still seen occasionally, but have not repro'd that error
locally so can't be 100% sure yet.

The fix here updates state.setEdge to register all the jobs in the
source edge's state with the target edge's state. This works out
because:

  1. edges that are the target of a merge will not have their state
    removed from the actives map if only the original job creating them
    is discarded
  2. those edges are still removed eventually, but only when all jobs
    referencing them (now including jobs referencing them via edge
    merges) are discarded

I previously was able to repro a failed to compute cache key: failed to get state for index 0 error very consistently locally but no longer can with this change. It's hard to integ test since it involves a very specific set of state's and ordering, but added a unit test for the change made here to not discard jobs that are still referenced via edge merges.


Also have some updates to logs that were useful for parsing them: eab04dc

In case it's ever useful to anyone reading this, the code I wrote for parsing, filtering and displaying the scheduler debugs logs is here: https://gist.github.com/sipsma/e67119d96cebe06e69e69f9ba9be2e78

  • Very far from polished or complete, but still ended up being extremely useful for debugging here when I had 1.4GB of logs to sift through 😄

Draft Until:

  • I noticed that buildkitd after this change had 5000+ goroutines left around stuck on progress-related code. Need to check if pre-existing issue or related to this change.
    • Seems to be a pre-existing issue, same thing happens without the fix here. Dagger also just went through a lot of changes with how progress is handled so it's entirely possible this doesn't even happen in vanilla buildkitd.

sipsma added 2 commits April 26, 2024 12:07
Before this, it was possible for an edge merge to happen to a target
edge/state that is no longer in the actives map, e.g.
1. Job A solves some edges
2. Job B solves some edges that get merged with job A's edges
3. Job A is discarded
4. Job C solves some edges that get merged with job B's edges, which
   recursively end up merging to job A's, which no longer exist in the
   actives map.

While this doesn't always result in an error, given the right state and
order of operations this can result in `getState` or `getEdge` being
called on the "stale" edges that are inactive and an error. E.g. if a
stale dep transitions from desired state `cache-fast` to `cache-slow`,
the slow cache logic will call `getState` on it and return a `compute
cache` error.

I also *suspect* this is the same root cause behind `inconsistent graph
state` errors still seen occasionally, but have not repro'd that error
locally so can't be 100% sure yet.

The fix here updates `state.setEdge` to register all the jobs in the
source edge's state with the target edge's state. This works out
because:
1. edges that are the target of a merge will not have their state
   removed from the actives map if only the original job creating them
   is discarded
1. those edges are still removed eventually, but only when all jobs
   referencing them (now including jobs referencing them via edge
   merges) are discarded

Signed-off-by: Erik Sipsma <[email protected]>
While debugging solver bugs with scheduler debug logs I ended up with so
many logs that I needed to write a program that parsed them and
extracted relevant information so it could be summarized more
comprehensibly.

That was greatly simplified by updating some of the old scheduler debug
logs to use logrus fields rather than one-off `fmt.Sprintf`s. All the
same information as before is present, just formatted more consistently
for easier parsability.

I also ended up removing one of the logs I recently added that printed
each job for a vertex in `loadUnlocked`. I realized that information was
parsable from the rest of the logs and thus was mostly just extra noise.

Signed-off-by: Erik Sipsma <[email protected]>
@tonistiigi
Copy link
Member

In the case where job is released and it releases all the state except the one that was targetState for another job's merged edge, is it possible that this lone remaining state wants to access its parents and would still get an error because they have been released?

@sipsma
Copy link
Collaborator Author

sipsma commented Apr 27, 2024

In the case where job is released and it releases all the state except the one that was targetState for another job's merged edge, is it possible that this lone remaining state wants to access its parents and would still get an error because they have been released?

Oh yeah, good catch, that sounds highly plausible. Will fix by just adding jobs recursively to the state and it's ancestors + cover in unit test

@sipsma
Copy link
Collaborator Author

sipsma commented Apr 27, 2024

@tonistiigi Pushed a fix for that: ffdbd86

I realized in practice I'm not sure how that case could actually get hit since it would require that one edge is mergeable but it has ancestors that aren't mergeable. But I left the fix in anyways since there's a high probability I'm either not thinking of a case currently possible or future modifications will make that possible somehow, in which case we don't want to be on shaky ground.

@tonistiigi
Copy link
Member

I realized in practice I'm not sure how that case could actually get hit since it would require that one edge is mergeable but it has ancestors that aren't mergeable.

If the merge happens based on content checksum (cache-slow case) then the parent chains can be completely different.

@tonistiigi
Copy link
Member

If the merge happens based on content checksum (cache-slow case) then the parent chains can be completely different.

Although, in this case if the cache-slow has been computed that should mean the parent is already in completed state and likely will not be called into again.

But I don't think there is a guarantee the parent has merged even without cache-slow, even if the practical case is hard to think of for current LLB ops.

@jedevc
Copy link
Member

jedevc commented Apr 29, 2024

I might be misunderstanding the recursive part of this here - but essentially we're taking all ancestors of a node and adding the job to each state here?

Isn't state.jobs also used for other things though? Just glancing through, it appears it's used as a SessionIterator (so maybe we could end up accidentally leaking sessions into ancestors that shouldn't have access to it? e.g. if job B merges to job A, should vertices executed in this context have access to the session from B - not sure, but I do think it's a subtle behavior change). It's also used in connectProgressFromState (which might end connecting more progress than it should?).

Not quite sure about this, so take it with a grain of salt 😆

@sipsma
Copy link
Collaborator Author

sipsma commented Apr 29, 2024

I might be misunderstanding the recursive part of this here - but essentially we're taking all ancestors of a node and adding the job to each state here?

Isn't state.jobs also used for other things though? Just glancing through, it appears it's used as a SessionIterator (so maybe we could end up accidentally leaking sessions into ancestors that shouldn't have access to it? e.g. if job B merges to job A, should vertices executed in this context have access to the session from B - not sure, but I do think it's a subtle behavior change). It's also used in connectProgressFromState (which might end connecting more progress than it should?).

Not quite sure about this, so take it with a grain of salt 😆

Good question, on one hand I agree that attaching session resources to those ancestor states feels bad, on the other hand if we don't then you can get into a situation where those vertices execute and no longer have any sessions attached and thus are guaranteed to fail... And even if we do go with the current approach it's reasonably unlikely the client of the source of the edge merge has all the right attachables, so still could fail.

Same idea for progress, but rather than failing you would end up being blocked on execution that you get no progress updates about.

I'm not seeing any good immediate answers here. It might feel reasonable to just never do an edge merge into an edge that has uncompleted vertices with session attachables in itself or in an ancestor, but I don't think there's any plumbing right now to identify vertices with session attachables on this level of the solver (yet).

Wdyt @tonistiigi ? Does tracking vertices with session attachables and updating the edge merge logic as described above feel reasonable to you?

I would personally also be okay with just going back to not including ancestors for now and thinking about this in followups. This issue is causing tons of problems for us now so getting the fix in ASAP is preferable even if it's not theoretically 100% comprehensive. It seems to solve the immediate problems for us as is, so it's still a step forward. But if we agree on an approach to deal with this and it's not enormously time consuming I can squash it now.

@jedevc
Copy link
Member

jedevc commented Apr 29, 2024

if we don't then you can get into a situation where those vertices execute and no longer have any sessions attached and thus are guaranteed to fail...

Hm, I hadn't thought of that side - I think I'd rather have something that "leaks" in a maybe slightly unexpected way, instead of erroring out, or not edge merging where we might expect.

I think as implemented, this looks right to me then - already, when edge merging, we merge together sessions in this way, this doesn't feel like a significantly larger issue to me (though in general, how we handle this does feel like it has implications for how we're using/planning to use sessions in dagger - but that's more a downstream kind of question 😆)

@sipsma
Copy link
Collaborator Author

sipsma commented Apr 29, 2024

if we don't then you can get into a situation where those vertices execute and no longer have any sessions attached and thus are guaranteed to fail...

Hm, I hadn't thought of that side - I think I'd rather have something that "leaks" in a maybe slightly unexpected way, instead of erroring out, or not edge merging where we might expect.

I think as implemented, this looks right to me then - already, when edge merging, we merge together sessions in this way, this doesn't feel like a significantly larger issue to me (though in general, how we handle this does feel like it has implications for how we're using/planning to use sessions in dagger - but that's more a downstream kind of question 😆)

Generally agree, honestly if there was a "userspace" option on LLB vertexes to say "never merge this" that would solve a lot of these cases for us.

@tonistiigi
Copy link
Member

I think as implemented, this looks right to me then - already, when edge merging, we merge together sessions in this way,

That's already how all the requests are combined in the graph, it is is connected directly via LLB digest on load or later via
"merged edges" doesn't make a difference. Single buildkit instance combines requests, and does not provide isolation between them.

@tonistiigi
Copy link
Member

It's also used in connectProgressFromState (which might end connecting more progress than it should?).

It should not connect progress from the parents, but yes it would get tricky if there are parents that are not fully loaded yet. Based on #4887 (comment) maybe we can assume that this can't happen in LLB atm.

It might feel reasonable to just never do an edge merge into an edge that has uncompleted vertices with session attachables in itself or in an ancestor, but I don't think there's any plumbing right now to identify vertices with session attachables on this level of the solver (yet).

Any estimation how many merged edges would behave like this. Most of the cases come from multiple local sources tracking the same files. So these do have sessions associated with them but all of them should be completed already because the cache key based on content has been computed.

Another approach: should be the in merged edge set the sharedOp from discarded edge as a fallback to the new target. So that if the target fails (eg. because it was cancelled and session is gone) then it would fallback to the original one.

@sipsma
Copy link
Collaborator Author

sipsma commented Apr 29, 2024

Any estimation how many merged edges would behave like this. Most of the cases come from multiple local sources tracking the same files. So these do have sessions associated with them but all of them should be completed already because the cache key based on content has been computed.

ExecOps with secrets also come to mind, but I think they are in the same boat in that they should already have ran if they are part of an edge merge (sure there's some case that's not true somehow but probably small subset, and the only downside is that the slight optimization of edge merge doesn't happen).

Another approach: should be the in merged edge set the sharedOp from discarded edge as a fallback to the new target. So that if the target fails (eg. because it was cancelled and session is gone) then it would fallback to the original one.

To confirm I'm following: if edge transitions to a state where it would actually execute the sharedOp but the original job is gone, "unmerge" and use a different src edge with a job still around?

That makes conceptual sense if so. TBH my only concern with it is that it seems like it would involve tricky intricate changes to the scheduler, which is hard to modify without breaking something else unintentionally. Maybe it's not as bad as I'm imagining though?

  • Side note: this reminded me of the long standing issue where an op ends up lazy due to remote cache but then gets unlazied later and there's a failure pulling layers and the whole build fails. Maybe completely separate, but feels related somehow in that it's another case where you want to decide to the scheduler to decide to execute an op differently than it previously decided. Feel free to ignore if irrelevant.

@tonistiigi
Copy link
Member

"unmerge" and use a different src edge with a job still around?

I don't think we would need "full unmerge". The possible issues could come sharedOp when calling CacheMap() or Exec() , right? So we could just fallback to the old implementation of sharedOp for these.

For something like mounting secrets during exec I'm not sure if this is even possible. We just need to make sure that the job/session from JobB is added to the merged edge and when it runs it uses sessiongroup.Any so even if JobA is gone, it should not cause an issue.

If it is possible that the parent gets called then it would be different, but then we don't really have a way to "unmerge" anyway because of the step failing is not the merged step, but it just happens to have a child step later that is merged. Not sure if this is possible in LLB though.

}
inputState.addJobs(srcState, memo)

// tricky case: if the inputState's edge was *already* merged we should
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this case actually produces inputState != mergedInputState, is there even a need to bother with the inputState?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Provided that the whole codebase strictly uses getEdge/getState (and thus always uses the merge target if a merge happened), then yeah you could probably skip inputState.addJobs and just unconditionally call inputState.getEdge and use the return of that.

I'm not sure if that's true (haven't audited everything), do you happen to know if it is?

@sipsma
Copy link
Collaborator Author

sipsma commented Apr 30, 2024

"unmerge" and use a different src edge with a job still around?

I don't think we would need "full unmerge". The possible issues could come sharedOp when calling CacheMap() or Exec() , right? So we could just fallback to the old implementation of sharedOp for these.

For something like mounting secrets during exec I'm not sure if this is even possible. We just need to make sure that the job/session from JobB is added to the merged edge and when it runs it uses sessiongroup.Any so even if JobA is gone, it should not cause an issue.

If it is possible that the parent gets called then it would be different, but then we don't really have a way to "unmerge" anyway because of the step failing is not the merged step, but it just happens to have a child step later that is merged. Not sure if this is possible in LLB though.

@tonistiigi I just looked into the details of this:

  1. The getState in CalcSlowCache that was returning the error for us is here: https://github.com/sipsma/buildkit/blob/ffdbd863b34ad0c6505949ebbda5e61722c939ad/solver/jobs.go#L934-L934
  2. It's using s.st.vtx, which means that s.st.vtx is not a source of a merge, it's the target.
    • Basically, this bug actually requires there to be two levels of stale states. s.st has to be not in the actives map and s.st.vtx.Inputs()[index] also has to not be in the actives map.

So AFAICT the fix would require that sharedOp track all the states that have been merged into it (as opposed to tracking one extra state for the state that it was merged into). I guess that's possible, though there are many usages of sharedOp.st that I'd need look through and see if they should be impacted.

Is that what you were imagining?

ludamad added a commit to AztecProtocol/aztec-packages that referenced this pull request May 2, 2024
Until something like moby/buildkit#4887 lands upstream, looks like this is can cause the cache to not build further
ludamad added a commit to AztecProtocol/aztec-packages that referenced this pull request May 2, 2024
Until something like moby/buildkit#4887 lands
upstream, looks like this is can cause the cache to not build further
@tonistiigi tonistiigi added this to the v0.14.0 milestone May 2, 2024
@sipsma
Copy link
Collaborator Author

sipsma commented May 4, 2024

ping @tonistiigi on the questions asked here #4887 (comment)

@tonistiigi
Copy link
Member

So AFAICT the fix would require that sharedOp track all the states that have been merged into it (as opposed to tracking one extra state for the state that it was merged into). I guess that's possible, though there are many usages of sharedOp.st that I'd need look through and see if they should be impacted.

I think we first need to figure out if this case is even possible. Otherwise, we might be complicating things unnecessarily. So are there cases where 1) edge is merged with target from jobA, 2) that target still needs to resolve its own inputs or call other sharedOp methods 3) when the original jobA cancels it would cause a sharedOp method to error as well (normally convered by sessionmanager.Any).

For 3 to appear it would mean that different tasks from different sessions resolved in the same cache key (it needs to be tasks that only works on one session and not the other for the Any to not work). But in that case it should be content based cachekey and already resolved so the 2 should not apply.

If we have a case then we can look what are the constraints for "unmerge" for that case. It could be possible for example that it can exist but only for roots, to the getState to input that you linked in your comment does not apply in that case.

@sipsma
Copy link
Collaborator Author

sipsma commented May 7, 2024

One quick update, it does appear that some form of the more complex change (where we add jobs to all ancestor states) is needed due to the fact that walkProvenance looks at the actives map for all ancestors: https://github.com/dagger/buildkit/blob/63352b0fb96b6231686023fb83d4c708bd27d430/solver/jobs.go#L735-L735

Still looking through what cases are possible in practice and whether adding jobs to all ancestors is sufficient or if more is needed.

@tonistiigi
Copy link
Member

That walkProvenance case is interesting indeed. In that case you would expect the original LLB that actually matches the vertex digest, but if it calls into op.Pin() then that op should be the one that actually ran(merged target). Case to test that would be to do llb.Image(alpine:latest) and llb.Image(alpine:3.19) in parallel and check the provenance afterwards (this case can not exist in Dockerfile because the FROM commands are resolved to digest before LLB is converted).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants