
Handle panics in plugins #26694

Merged: 3 commits, Oct 26, 2020
Conversation

@jbardin (Member) commented Oct 23, 2020

In prior versions of terraform, the combined stderr log stream would cause provider crashes to trigger panicwrap, which printed a formidable message about terraform crashing and how to report it. While the updated log handling now prevents the plugin stack traces from reaching panicwrap, those traces were also lost completely unless logging was enabled. This leaves the user with only a rather unhelpful grpc error from a crashing provider:

Error: rpc error: code = Unavailable desc = transport is closing

Since there is no direct way for core to handle a crash in a provider, we catch the output by wrapping the plugin loggers to look for panic: or fatal error: and record the output that follows. This relies on the behavior of go-plugin, which sends any unstructured stderr output to a Debug log by default. If in the future the plugin SDK were to install a recovery middleware, which could capture the panic traceback and return a gRPC error, it would not affect this codepath. This also still serves to catch panics that happen outside the handler goroutine, as well as fatal errors, which cannot be recovered from.
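As a rough illustration of that recording step (not the PR's literal code), a callback bound to each plugin could watch every forwarded stderr line for those markers and buffer whatever follows, keyed by plugin name. The names panicRecorder, newPanicRecorder, and maxLines below are illustrative assumptions:

// Hypothetical sketch of the recording side; names are illustrative,
// not necessarily those used in this PR.
package logging

import (
	"strings"
	"sync"
)

// panicRecorder accumulates apparent crash output per plugin.
type panicRecorder struct {
	mu sync.Mutex

	// panics maps a plugin name to the lines of its captured traceback.
	panics map[string][]string

	// maxLines caps how much of each traceback is kept, so the useful
	// leading lines survive without flooding the scrollback later on.
	maxLines int
}

func newPanicRecorder() *panicRecorder {
	return &panicRecorder{
		panics:   map[string][]string{},
		maxLines: 100,
	}
}

// record returns a callback bound to a single plugin. The wrapped plugin
// logger calls it with each unstructured stderr line that go-plugin
// forwards as a Debug entry.
func (p *panicRecorder) record(name string) func(string) {
	capturing := false
	return func(line string) {
		p.mu.Lock()
		defer p.mu.Unlock()

		// A Go crash starts with "panic: " or "fatal error: "; once we
		// see either marker we keep everything that follows.
		if !capturing {
			if !strings.HasPrefix(line, "panic: ") && !strings.HasPrefix(line, "fatal error: ") {
				return
			}
			capturing = true
		}

		if len(p.panics[name]) < p.maxLines {
			p.panics[name] = append(p.panics[name], line)
		}
	}
}

Keying the buffer by plugin name lets a single run collect crashes from more than one provider, and the line cap keeps the eventual report within a usable scrollback.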

Once we have recorded any possible tracebacks, the main package can look up any panics that may have happened via logging.PluginPanics() when there was an error in execution (a rough sketch of that lookup follows the example output below). These records should already be complete, since the plugin process will have exited by this point. We limit the length of the traceback to preserve the terminal scrollback (the important information is often within the first few lines), and format these in a similar way to the original terraform panic output:

Stack trace from the terraform-provider-null plugin:

panic: crasher!

goroutine 27 [running]:
github.com/terraform-providers/terraform-provider-null/null.resourceRead(0xc000078e80, 0x0, 0x0, 0x1cdcd10, 0x1804778)
    [... continued stack trace ...]

Error: The terraform-provider-null plugin crashed!

This is always indicative of a bug within the plugin. It would
be immensely helpful if you could please report the crash with
that plugin's maintainers so that it can be fixed. The output
above should help diagnose the issue.
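The lookup on the main side could be as small as the sketch below. logging.PluginPanics() is the function named above; its []string return type, the reportPluginPanics helper, and the exitStatus plumbing are assumptions made for illustration, not Terraform's actual main flow:

// Hypothetical helper for Terraform's main package: print any recorded
// plugin crashes once the command has failed.
package main

import (
	"fmt"
	"os"

	"github.com/hashicorp/terraform/internal/logging"
)

// reportPluginPanics writes each captured traceback to stderr when the
// command exited with an error, so the crash appears alongside the
// normal diagnostics.
func reportPluginPanics(exitStatus int) {
	if exitStatus == 0 {
		return
	}
	for _, traceback := range logging.PluginPanics() {
		fmt.Fprintln(os.Stderr, traceback)
	}
}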

On top of the panic handler, this PR adds a function to annotate some gRPC error codes that we may encounter. The primary one is code = Unavailable, which happens when the provider process crashes. Rather than a raw rpc error code, the new diagnostics contain a textual description and the name of the calling method:

Error: Provider did not respond

The provider encountered an error, and failed to respond to the ReadResource
call.

Unfortunately we don't yet have the configured provider name available at the point where the error is created, but an error log entry has been added to help correlate the error with an actual provider binary and version.
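For a rough idea of the annotation (the PR's version lives in plugin/grpc_error.go per the Codecov report, and produces diagnostics rather than a plain error), a helper could inspect the gRPC status code and name the calling method via the runtime package. The grpcErr name and its exact shape here are assumptions:

// Hypothetical sketch of translating opaque gRPC transport errors into
// something more presentable.
package plugin

import (
	"fmt"
	"path"
	"runtime"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// grpcErr inspects err for a gRPC status code and, for codes that
// indicate the plugin process itself failed, returns a message naming
// the calling method instead of the raw rpc error string.
func grpcErr(err error) error {
	if err == nil {
		return nil
	}

	// Find the calling method name to include in the message.
	method := "unknown method"
	if pc, _, _, ok := runtime.Caller(1); ok {
		method = path.Base(runtime.FuncForPC(pc).Name())
	}

	switch status.Code(err) {
	case codes.Unavailable:
		// The provider process most likely crashed before responding.
		return fmt.Errorf("the plugin encountered an error, and failed to respond to the %s call", method)
	default:
		return fmt.Errorf("plugin error in %s: %s", method, err)
	}
}

Keying on codes.Unavailable matches the "transport is closing" error quoted above, since that is the status grpc reports when the provider process has gone away mid-call.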

Terraform does not use rpc errors for any error communication, so these
are always something that went wrong outside of the plugin protocol.
The most common example of which is a provider crash, which would return
"rpc error: code = Unavailable desc = transport is closing". Replace
these error codes with something a little more presentable for the user,
and insert the calling method name to help correlate it to the
operation that failed.
@codecov bot commented Oct 23, 2020

Codecov Report

Merging #26694 into master will decrease coverage by 0.03%.
The diff coverage is 31.63%.

Impacted Files Coverage Δ
internal/logging/logging.go 30.15% <0.00%> (-0.99%) ⬇️
plugin/grpc_error.go 0.00% <0.00%> (ø)
plugin/grpc_provisioner.go 36.70% <0.00%> (ø)
main.go 37.86% <50.00%> (+0.11%) ⬆️
plugin/grpc_provider.go 55.18% <50.00%> (ø)
internal/logging/panic.go 39.53% <65.38%> (+39.53%) ⬆️
internal/providercache/dir.go 65.30% <0.00%> (-6.13%) ⬇️
terraform/eval_apply.go 72.86% <0.00%> (-0.59%) ⬇️
states/statefile/version4.go 58.19% <0.00%> (+0.23%) ⬆️
... and 2 more

@jbardin requested a review from a team on October 23, 2020
@jbardin force-pushed the jbardin/plugin-panics branch 4 times, most recently from f4972bd to 17fc1de on October 25, 2020
@mildwonkey (Contributor) left a comment:

This looks really great! I left a couple of nitpicky comments and suggestions but they are all very minor; ignore as you see fit.

internal/logging/panic_test.go (outdated review thread, resolved)
@@ -146,7 +148,7 @@ func (p *GRPCProvider) GetSchema() (resp providers.GetSchemaResponse) {
 	resp.Provider = convert.ProtoToProviderSchema(protoResp.Provider)
 	if protoResp.ProviderMeta == nil {
-		log.Printf("[TRACE] No provider meta schema returned")
+		logger.Debug("No provider meta schema returned")

Contributor comment: did you mean to change this from TRACE to DEBUG? (I have no argument if you did this on purpose)

plugin/grpc_error.go (two outdated review threads, resolved)
Create a logger that will record any apparent crash output for later
processing.

If the cli command returns with a non-zero exit status, check for any
recorded crashes and add those to the output.
Extract a better function name and make the errors generic for different
plugin types.
@ghost commented Nov 26, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost locked as resolved and limited conversation to collaborators on Nov 26, 2020