csi: panic at server during volume claim release #7332

Closed
tgross opened this issue Mar 12, 2020 · 2 comments
tgross (Member) commented Mar 12, 2020

During volume claim release, we're throwing a panic that in turn corrupts state such that we can no longer schedule volumes.

2020-03-12T20:27:44.594Z [DEBUG] worker: dequeued evaluation: eval_id=860013f2-7ee0-32f9-17ea-c5f608a79c2d
 panic: reflect: call of reflect.Value.Set on zero Value
 goroutine 102 [running]:
 reflect.flag.mustBeAssignable(0x0)
         /opt/goenv/versions/1.12.15/src/reflect/value.go:227 +0xb7
 reflect.Value.Set(0x0, 0x0, 0x0, 0x2cbd3c0, 0xc00051f1c0, 0x199)
         /opt/goenv/versions/1.12.15/src/reflect/value.go:1467 +0x2f
 github.com/hashicorp/nomad/helper/codec.(*InmemCodec).ReadRequestBody(0xc00051f200, 0x0, 0x0, 0xc000202bc0, 0x0)
         /go/src/github.com/hashicorp/nomad/helper/codec/inmem.go:25 +0x181
 net/rpc.(*Server).readRequest(0xc0006d4050, 0x3849f80, 0xc00051f200, 0xc00051f200, 0x40, 0x40, 0xc0001b3180, 0x7f585a9d2008, 0x0, 0x38, ...
         /opt/goenv/versions/1.12.15/src/net/rpc/server.go:556 +0xd4
 net/rpc.(*Server).ServeRequest(0xc0006d4050, 0x3849f80, 0xc00051f200, 0x2a39ac0, 0x1)
         /opt/goenv/versions/1.12.15/src/net/rpc/server.go:493 +0x77
 github.com/hashicorp/nomad/nomad.(*Server).RPC(0xc0003ed200, 0x30edb20, 0x20, 0x29b1500, 0xc00051f1c0, 0x2716700, 0x5ddd278, 0x1, 0x203000)
         /go/src/github.com/hashicorp/nomad/nomad/server.go:1514 +0xb9
 github.com/hashicorp/nomad/nomad.(*Server).controllerUnpublishVolume(0xc0003ed200, 0xc0011274a0, 0xc000d83a10, 0x24, 0x5ddf5c0, 0x0)
         /go/src/github.com/hashicorp/nomad/nomad/csi_endpoint.go:595 +0x340
 github.com/hashicorp/nomad/nomad.(*CoreScheduler).volumeClaimReap(0xc000730c20, 0xc000a09d80, 0x1, 0x1, 0xc0015963c0, 0x24, 0xc00000d680, 0
         /go/src/github.com/hashicorp/nomad/nomad/core_sched.go:818 +0x7a7
 github.com/hashicorp/nomad/nomad.(*CoreScheduler).csiVolumeClaimGC(0xc000730c20, 0xc000e6de00, 0x13, 0x1)
         /go/src/github.com/hashicorp/nomad/nomad/core_sched.go:738 +0x292
 github.com/hashicorp/nomad/nomad.(*CoreScheduler).Process(0xc000730c20, 0xc000e6de00, 0xa, 0x3891940)
         /go/src/github.com/hashicorp/nomad/nomad/core_sched.go:57 +0x35f
 github.com/hashicorp/nomad/nomad.(*Worker).invokeScheduler(0xc000588fc0, 0xc0016ec7b0, 0xc000e6de00, 0xc001596270, 0x24, 0x0, 0x0)
         /go/src/github.com/hashicorp/nomad/nomad/worker.go:268 +0x40a
 github.com/hashicorp/nomad/nomad.(*Worker).run(0xc000588fc0)
         /go/src/github.com/hashicorp/nomad/nomad/worker.go:129 +0x2e6
 created by github.com/hashicorp/nomad/nomad.NewWorker
         /go/src/github.com/hashicorp/nomad/nomad/worker.go:81 +0x153

The eval status for this is:

▶ nomad eval status 86001
ID                 = 860013f2
Create Time        = 32m34s ago
Modify Time        = 25m45s ago
Status             = failed
Status Description = evaluation reached delivery limit (3)
Type               = _core
TriggeredBy        = alloc-stop
Priority           = 200
Placement Failures = false

After this happens, you'll get errors like the following:

▶ nomad job run volume2.nomad
==> Monitoring evaluation "c19b6557"
    Evaluation triggered by job "example2"
    Evaluation within deployment: "9f0988d5"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "c19b6557" finished with status "complete" but failed to place all allocations:
    Task Group "cache" (failed to place 1 allocation):
      * Constraint "missing CSI plugins" filtered 2 nodes
    Evaluation "d5719861" waiting for additional capacity to place remainder
tgross (Member, Author) commented Mar 13, 2020

With a bit of work with -coverprofile I was able to reproduce this with a new unit test that covers the controller-unpublish part of volume claim GC. The panic happens because #7218 isn't implemented yet.

Unfortunately, because RPC names are just strings, the compiler won't save you from calling an RPC that hasn't been implemented, and in that scenario you get this silly stack trace because of all the reflection in net/rpc. The test snippet below exercises the same failure mode:

// In package nomad, alongside the other server tests.
func TestNonsenseRPC(t *testing.T) {
	srv, shutdown := TestServer(t, func(c *Config) { c.NumSchedulers = 0 })
	defer shutdown()

	var req struct{}
	var resp struct{}
	// "Nonsense" is not a registered RPC, but instead of returning an
	// error the call panics inside net/rpc's reflection-based dispatch.
	err := srv.RPC("Nonsense", req, resp)
	require.NoError(t, err)
}
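
For what it's worth, the zero-Value panic can be reproduced without any Nomad machinery. Here's a minimal sketch of the failure mode as I read the trace (not Nomad code, just the same reflect misuse): net/rpc discards the body of a request it can't route by calling ReadRequestBody(nil), and a codec that copies into its destination via reflection then ends up calling Set on a zero reflect.Value:

package main

import (
	"fmt"
	"reflect"
)

func main() {
	defer func() { fmt.Println("recovered:", recover()) }()

	// Stand-in for the nil destination that net/rpc hands the codec
	// when it discards the body of an unroutable request.
	var dst interface{}

	// reflect.ValueOf(nil) is the zero Value, and reflect.Indirect
	// leaves a zero Value unchanged, so Set panics with:
	//   reflect: call of reflect.Value.Set on zero Value
	reflect.Indirect(reflect.ValueOf(dst)).Set(reflect.ValueOf(42))
}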

github-actions (bot) commented
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 11, 2022