-
Notifications
You must be signed in to change notification settings - Fork 503
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
build: fix issues with leaving invoke containers running #1257
Conversation
@@ -643,7 +643,16 @@ func Invoke(ctx context.Context, cfg ContainerConfig) error { | |||
return errors.Errorf("result must be provided") | |||
} | |||
c, res := cfg.ResultCtx.Client, cfg.ResultCtx.Res | |||
_, err := c.Build(ctx, client.SolveOpt{}, "buildx", func(ctx context.Context, c gateway.Client) (*gateway.Result, error) { | |||
|
|||
mainCtx := ctx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth looking into this case in buildkit client later on so hacks like this are not required. It looks like that if this context gets called in a certain timeframe the /solve
grpc call is canceled before the callback returns, causing some calls inside callback to fail with "no such job". Build()
should never return if the callback has not returned, even if the grpc already goes down.
@@ -673,7 +682,8 @@ func Invoke(ctx context.Context, cfg ContainerConfig) error { | |||
if err != nil { | |||
return nil, err | |||
} | |||
defer ctr.Release(ctx) | |||
defer ctr.Release(context.TODO()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise, if ctx
is already canceled then this does nothing.
go func() { | ||
m.rollback(ctx, containerConfig) | ||
}() | ||
m.rollback(ctx, containerConfig) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really part of the fix but seemed useless goroutine.
@@ -52,6 +50,10 @@ func RunMonitor(ctx context.Context, containerConfig build.ContainerConfig, relo | |||
go func() { | |||
defer close(doneCh) | |||
defer in.Close() | |||
go func() { | |||
<-ctx.Done() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to handle appcontext
canceling from sigint
. Otherwise this goroutine will keep running although everything else shuts down. Worth thinking through this goroutine structure again to make the cancel path more clear and avoid parallel goroutines with no defined shutdown order. Atm there is a chance for a small race if this happens when the initial Invoke()
call is also in the middle of processing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you.
Signed-off-by: Tonis Tiigi <[email protected]>
2f9f951
to
75ddc5b
Compare
Currently, when running
--invoke
, there exist the following issues:When invoke container is not shut down by gracefully killing the container process itself then the container is left running in buildkit. This means that it also holds onto resources and cache records in
buildx du
do not become releasable. To reproduce, switch IO to monitor mode and then type "exit" orctrl-D
/SIGINT.When switching IO to monitor mode and issuing "rollback" command the buildx process goes to 100% CPU (200% on repeat). Looks like there is a busy loop.
The release problem should be fixed on buildkit side as well as buildkit should handle misbehaving clients. I plan to fix it but we need this for old daemons anyway.
Tbh, I never figured out where the 100% CPU was coming from. Profiling CPU/Blocking was quite useless showing lots of work in stdlib
runtime.selectgo
but no call path. I tried to debug the newfor-select
code in monitor, but I couldn't confirm that it started to loop. But after I fixed the release bug the 100% CPU was gone as well.@ktock
Signed-off-by: Tonis Tiigi [email protected]