Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gracefull shutdown RPC even when build is in error #4384

Merged

Conversation

tejal29
Copy link
Contributor

@tejal29 tejal29 commented Jun 24, 2020

Fixes: #3970
Maybe fixes #3991
Related: Relevant tracking issues, for context
Merge before/after: Dependent or prerequisite PRs

Description
Another attempt to gracefully shutdown RPC.

  1. This is mostly bringing back what Gracefully shutdown RPC servers. #4010 did however,
    GracefulStop and server.Shutdown wait for all idle connections to close.
    IDES RPC client connections never terminate and go into idle state hence the above functions never return.
    In this PR, the GracefulStop and server.Shutdown are wrapped in a timeout context.
    They wait for max 1 second which is enough to read the last events on the channel.

  2. shutdownAPIServer was called in PersistedPostRun.
    However, PersistentPostRun is only called if a command's RunE does not result into any error.
    https://github.com/spf13/cobra/blob/master/command.go#L843

To make sure, the servers shut down gracefully when an error occurs, we need to call the shutdownAPIServer in RunE command to execute.

User facing changes (remove if N/A)
No.

** events API changes **
On any example, please run a command where build would fail.
e.g. skaffold dev --cache-artifacts=false (remove global default repo setting)

On master

*** Before ***

curl localhost:50052/v1/event_log
....
{"result":{"timestamp":"2020-06-24T05:18:10.565560Z","
        event":{"devLoopEvent":{"status":"In Progress"}},"entry":"DevInit Iteration 0 in progress"}}
{"result":{"timestamp":"2020-06-24T05:18:10.612157Z",
    "event":{"buildEvent":{"artifact":"leeroy-web","status":"In Progress"}},"entry":"Build started for artifact leeroy-web"}}
curl: (18) transfer closed with outstanding read data remaining

You never see the build failed and devloop iteration failed events for a failed build.

On the branch

curl localhost:50052/v1/event_log
....
{"result":{"timestamp":"2020-06-24T05:21:35.823218Z",
   "event":{"devLoopEvent":{"status":"In Progress"}},"entry":"DevInit Iteration 0 in progress"}}
{"result":{"timestamp":"2020-06-24T05:21:35.864723Z",
    "event":{"buildEvent":{"artifact":"leeroy-web","status":"In Progress"}},"entry":"Build started for artifact leeroy-web"}}
{"result":{"timestamp":"2020-06-24T05:21:42.684603Z",
    "event":{"buildEvent":{
                 "artifact":"leeroy-web",
                 "status":"Failed",
                 "err":"could not push image \"leeroy-web:v1.11.0-99-g473026664\": denied: requested access to the resource is denied",
                "errCode":"BUILD_UNKNOWN",
                 "actionableErr":{"errCode":"BUILD_UNKNOWN","message":"could not push image \"leeroy-web:v1.11.0-99-g473026664\": denied: requested access to the resource is denied"}}},"entry":"Build failed for artifact leeroy-web"}}

{"result":{"timestamp":"2020-06-24T05:21:42.684624Z",
      "event":{
             "devLoopEvent":{
             "status":"Failed",
             "err":{
              "errCode":"BUILD_UNKNOWN",
              "message":"couldn't build \"leeroy-web\": could not push image \"leeroy-web:v1.11.0-99-g473026664\": denied: requested access to the resource is denied"}}},
              "entry":"DevInit Iteration 0 failed with error code BUILD_UNKNOWN"}
}

On the branch you will see failure events.

Follow-up Work (remove if N/A)

@tejal29 tejal29 requested a review from dgageot June 24, 2020 05:31
@tejal29 tejal29 requested a review from a team as a code owner June 24, 2020 05:31
@tejal29
Copy link
Contributor Author

tejal29 commented Jun 24, 2020

@dgageot Can you please take a look at this?

Thx!

@codecov
Copy link

codecov bot commented Jun 24, 2020

Codecov Report

Merging #4384 into master will increase coverage by 0.02%.
The diff coverage is 79.31%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4384      +/-   ##
==========================================
+ Coverage   71.75%   71.77%   +0.02%     
==========================================
  Files         325      325              
  Lines       12572    12613      +41     
==========================================
+ Hits         9021     9053      +32     
- Misses       2976     2982       +6     
- Partials      575      578       +3     
Impacted Files Coverage Δ
cmd/skaffold/app/cmd/cmd.go 66.91% <ø> (-0.73%) ⬇️
cmd/skaffold/app/cmd/commands.go 87.23% <50.00%> (-7.89%) ⬇️
pkg/skaffold/server/server.go 53.77% <90.47%> (+4.88%) ⬆️
cmd/skaffold/app/cmd/schema/list.go 91.30% <0.00%> (-8.70%) ⬇️
cmd/skaffold/app/cmd/schema.go 82.60% <0.00%> (+2.60%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 46badeb...5357cf4. Read the comment docs.

@tejal29 tejal29 force-pushed the another_graceful_shutdown branch from 5467e82 to 6649a5f Compare June 24, 2020 05:55
Copy link
Contributor

@balopat balopat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but please put the two explanatory comments to the two places for clarity - it is not obvious at all why these design choices were made just from reading the code - and while someone might look at this PRs description, it's best to have this kind of context in the code. Thank you for fixing this!!!

@tejal29 tejal29 force-pushed the another_graceful_shutdown branch from 893cae9 to 24862cf Compare June 24, 2020 18:43
@tejal29 tejal29 force-pushed the another_graceful_shutdown branch from 24862cf to 5357cf4 Compare June 24, 2020 18:44
@tejal29
Copy link
Contributor Author

tejal29 commented Jun 24, 2020

LGTM but please put the two explanatory comments to the two places for clarity - it is not obvious at all why these design choices were made just from reading the code - and while someone might look at this PRs description, it's best to have this kind of context in the code. Thank you for fixing this!!!

Done!

@tejal29 tejal29 merged commit 286fa95 into GoogleContainerTools:master Jun 24, 2020
@briandealwis
Copy link
Member

I've been digging into this worrying message from curl:

curl: (18) transfer closed with outstanding read data remaining

The grpc-gateway uses Transfer-Encoding: chunked with HTTP/1.1 connections. With chunked, each message is sent prefixed with the byte size of the message and a CR-LF:

$ curl -i --raw localhost:50052/v1/events
HTTP/1.1 200 OK
Content-Type: application/json
Grpc-Metadata-Content-Type: application/grpc
Date: Mon, 13 Jul 2020 15:04:10 GMT
Transfer-Encoding: chunked

1b5
{"result":{"timestamp":"2020-07-13T15:04:10.200435Z","event":{"metaEvent":{"entry":"Starting Skaffold: \u0026{Version: ConfigVersion:skaffold/v2beta6 GitVersion: GitCommit: GitTreeState: BuildDate: GoVersion:go1.14.3 Compiler:gc Platform:darwin/amd64}","metadata":{"build":{"numberOfArtifacts":1,"builders":[{"type":"BUILDPACKS","count":1}],"type":"LOCAL"},"deploy":{"deployers":[{"type":"KUBECTL","count":1}],"cluster":"MINIKUBE"}}}}}}

93
{"result":{"timestamp":"2020-07-13T15:04:10.201334Z","event":{"devLoopEvent":{"status":"In Progress"}},"entry":"DevInit Iteration 0 in progress"}}

c1
{"result":{"timestamp":"2020-07-13T15:04:10.299620Z","event":{"buildEvent":{"artifact":"skaffold-buildpacks","status":"In Progress"}},"entry":"Build started for artifact skaffold-buildpacks"}}

201
{"result":{"timestamp":"2020-07-13T15:04:11.777416Z","event":{"buildEvent":{"artifact":"skaffold-buildpacks","status":"Failed","err":"failed to fetch builder image 'gcr.io/buildpacks/builder:v1': error getting credentials - err: exit status 1, out: ``","errCode":"BUILD_UNKNOWN","actionableErr":{"errCode":"BUILD_UNKNOWN","message":"failed to fetch builder image 'gcr.io/buildpacks/builder:v1': error getting credentials - err: exit status 1, out: ``"}}},"entry":"Build failed for artifact skaffold-buildpacks"}}

173
{"result":{"timestamp":"2020-07-13T15:04:11.777773Z","event":{"devLoopEvent":{"status":"Failed","err":{"errCode":"BUILD_UNKNOWN","message":"couldn't build \"skaffold-buildpacks\": failed to fetch builder image 'gcr.io/buildpacks/builder:v1': error getting credentials - err: exit status 1, out: ``"}}},"entry":"DevInit Iteration 0 failed with error code BUILD_UNKNOWN"}}

curl: (18) transfer closed with outstanding read data remaining

A chunked connection is normally closed by sending a 0-byte message. But calling net.http.Server.Shutdown() to gracefully shutdown the server closes the underlying connection and does not use the chunked writer's close. Since curl doesn't receive the 0-byte message, it complains that there was outstanding data pending. I think a case could be made that this is not a graceful shutdown.

@tejal29 tejal29 deleted the another_graceful_shutdown branch April 15, 2021 07:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Skaffold event API failed to send a failure reason when project not found. Gracefully shutdown RPC server.
4 participants