
Fixed metrics gauge callback deadlock #350

Merged · 1 commit merged into streamnative:main from gauge-deadlock · Jun 12, 2023

Conversation

merlimat
Collaborator

When a leader election happens while a Prometheus metrics collection is in progress, there is potential for a deadlock between these two goroutines (a distilled sketch of the inverted lock ordering follows the two stack traces below):

Shard director gauge callback

This goroutine holds the metrics registry mutex and tries to acquire the read lock on the shards director's RWMutex:

1 @ 0x4481f6 0x4594fe 0x4594d5 0x475965 0x1a67dfb 0x1a67dd7 0xa56d5e 0xa35aa6 0xa38667 0xa3198e 0xa51a82 0x8a3253 0x479a01
# labels: {"oxia":"metrics"}
#	0x475964	sync.runtime_SemacquireMutex+0x24						/usr/local/go/src/runtime/sema.go:77
#	0x1a67dfa	sync.(*RWMutex).RLock+0x5a							/usr/local/go/src/sync/rwmutex.go:71
#	0x1a67dd6	oxia/server.NewShardsDirector.func1+0x36					/src/oxia/server/shards_director.go:71
#	0xa56d5d	oxia/common/metrics.NewGauge.func1+0x5d						/src/oxia/common/metrics/gauge.go:58
#	0xa35aa5	go.opentelemetry.io/otel/sdk/metric.(*meter).RegisterCallback.func1+0x85	/go/pkg/mod/go.opentelemetry.io/otel/sdk/[email protected]/meter.go:263
#	0xa38666	go.opentelemetry.io/otel/sdk/metric.(*pipeline).produce+0x326			/go/pkg/mod/go.opentelemetry.io/otel/sdk/[email protected]/pipeline.go:144
#	0xa3198d	go.opentelemetry.io/otel/sdk/metric.(*manualReader).Collect+0xed		/go/pkg/mod/go.opentelemetry.io/otel/sdk/[email protected]/manual_reader.go:139
#	0xa51a81	go.opentelemetry.io/otel/exporters/prometheus.(*collector).Collect+0x81		/go/pkg/mod/go.opentelemetry.io/otel/exporters/[email protected]/exporter.go:119
#	0x8a3252	github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1+0xf2	/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:457

WAL closing

This goroutine holds the shards director mutex and tries to acquire the metrics registry mutex to unregister a gauge callback:

1 @ 0x4481f6 0x4594fe 0x4594d5 0x475965 0x4906a5 0xa3820d 0xa381e1 0xa4ef0f 0xa568c2 0x1846545 0x184642b 0x1a513d9 0x1a512d9 0x1a68602 0x1a58ece 0x99d498 0x98bc27 0x99d358 0x974dd0 0x979e8f 0x9726b8 0x479a01
# labels: {"bind":"[::]:6649", "oxia":"internal"}
#	0x475964	sync.runtime_SemacquireMutex+0x24								/usr/local/go/src/runtime/sema.go:77
#	0x4906a4	sync.(*Mutex).lockSlow+0x164									/usr/local/go/src/sync/mutex.go:171
#	0xa3820c	sync.(*Mutex).Lock+0x4c										/usr/local/go/src/sync/mutex.go:90
#	0xa381e0	go.opentelemetry.io/otel/sdk/metric.(*pipeline).addMultiCallback.func1+0x20			/go/pkg/mod/go.opentelemetry.io/otel/sdk/[email protected]/pipeline.go:116
#	0xa4ef0e	go.opentelemetry.io/otel/sdk/metric.unregisterFuncs.Unregister+0x6e				/go/pkg/mod/go.opentelemetry.io/otel/sdk/[email protected]/pipeline.go:521
#	0xa568c1	oxia/common/metrics.(*gauge).Unregister+0x21							/src/oxia/common/metrics/gauge.go:36
#	0x1846544	oxia/server/wal.(*wal).close+0x44								/src/oxia/server/wal/wal_impl.go:215
#	0x184642a	oxia/server/wal.(*wal).Close+0x6a								/src/oxia/server/wal/wal_impl.go:210
#	0x1a513d8	oxia/server.(*followerController).close+0x38							/src/oxia/server/follower_controller.go:193
#	0x1a512d8	oxia/server.(*followerController).Close+0xd8							/src/oxia/server/follower_controller.go:186
#	0x1a68601	oxia/server.(*shardsDirector).GetOrCreateLeader+0x121						/src/oxia/server/shards_director.go:137
#	0x1a58ecd	oxia/server.(*internalRpcServer).BecomeLeader+0x34d						/src/oxia/server/internal_rpc_server.go:140
#	0x99d497	oxia/proto._OxiaCoordination_BecomeLeader_Handler.func1+0x77					/src/oxia/proto/replication_grpc.pb.go:225
#	0x98bc26	github.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).UnaryServerInterceptor.func1+0x86	/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/server_metrics.go:107
#	0x99d357	oxia/proto._OxiaCoordination_BecomeLeader_Handler+0x137						/src/oxia/proto/replication_grpc.pb.go:227
#	0x974dcf	google.golang.org/grpc.(*Server).processUnaryRPC+0xdef						/go/pkg/mod/google.golang.org/[email protected]/server.go:1345
#	0x979e8e	google.golang.org/grpc.(*Server).handleStream+0xa2e						/go/pkg/mod/google.golang.org/[email protected]/server.go:1722
#	0x9726b7	google.golang.org/grpc.(*Server).serveStreams.func1.2+0x97					/go/pkg/mod/google.golang.org/[email protected]/server.go:966
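
At its core this is a lock-ordering inversion: the scrape goroutine takes the metrics pipeline mutex and then the director lock, while the BecomeLeader goroutine takes the director lock and then the pipeline mutex. Below is a minimal sketch of the pattern, using stand-in names rather than the actual oxia code:

package main

import "sync"

var (
	registryMu sync.Mutex   // stands in for the OTel pipeline / Prometheus registry lock
	directorMu sync.RWMutex // stands in for the shards director lock
)

// Goroutine 1: metrics scrape. Holds the registry lock for the whole
// collection and runs the gauge callback, which reads director state.
func collect() {
	registryMu.Lock()
	defer registryMu.Unlock()

	directorMu.RLock() // blocks once a writer is holding directorMu
	defer directorMu.RUnlock()
	// ... observe len(leaders), len(followers) ...
}

// Goroutine 2: BecomeLeader -> followerController.Close -> wal.close.
// Holds the director lock and unregisters the gauge callback, which
// needs the registry lock.
func becomeLeader() {
	directorMu.Lock()
	defer directorMu.Unlock()

	registryMu.Lock() // blocks while collect() holds registryMu -> deadlock
	defer registryMu.Unlock()
	// ... drop the callback from the pipeline ...
}

func main() {
	go collect()
	becomeLeader() // with unlucky timing this wedges exactly like the stacks above
}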

Modifications

Converted the shards director gauge into an up/down counter so that no mutex has to be acquired at metrics collection time.
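
A rough before/after sketch of the two instrument styles, written against a recent OpenTelemetry Go metric API; the struct shape, field names, and metric name below are illustrative, not the actual oxia/common/metrics wrappers. The asynchronous gauge pulls its value from a callback that the SDK runs while holding its pipeline mutex, whereas the up/down counter is bumped synchronously on the mutation path, where the director lock is already held, so collection never touches server state and there is no callback to unregister on close.

package server

import (
	"context"
	"sync"

	"go.opentelemetry.io/otel/metric"
)

// Illustrative shape only; the real director stores follower/leader controllers.
type shardsDirector struct {
	sync.RWMutex
	followers        map[int64]struct{}
	followersCounter metric.Int64UpDownCounter
}

// Before: asynchronous gauge. The callback runs while the SDK holds its
// pipeline mutex and must take the director lock, which is the first half
// of the deadlock above.
func registerFollowersGauge(meter metric.Meter, s *shardsDirector) (metric.Int64ObservableGauge, error) {
	return meter.Int64ObservableGauge("oxia_server_followers_count",
		metric.WithInt64Callback(func(_ context.Context, o metric.Int64Observer) error {
			s.RLock()
			defer s.RUnlock()
			o.Observe(int64(len(s.followers)))
			return nil
		}))
}

// After: synchronous up/down counter, updated while the director lock is
// already held on the mutation path. There is no callback to run at scrape
// time and nothing to unregister when a controller or WAL is closed.
func (s *shardsDirector) removeFollower(ctx context.Context, shardId int64) {
	s.Lock()
	defer s.Unlock()

	if _, ok := s.followers[shardId]; ok {
		s.followersCounter.Add(ctx, -1)
	}
	delete(s.followers, shardId)
}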

@@ -138,14 +135,18 @@ func (s *shardsDirector) GetOrCreateLeader(namespace string, shardId int64) (Lea
return nil, err
}

delete(s.followers, shardId)
if _, ok := s.followers[shardId]; ok {
s.followersCounter.Dec()
@mattisonchao (Member) commented on Jun 12, 2023:

Would it be better to call this line after the delete?

@merlimat (Collaborator, Author) replied:

I think it should be OK either way. We already checked that the key is in the map, and the delete will not return an error either.
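
(Side note for readers: delete on a Go map is a no-op when the key is absent and does not return anything, so with the director mutex held the membership check and the delete are safe in either order. A tiny runnable illustration:)

package main

import "fmt"

func main() {
	m := map[int64]string{1: "follower"}

	delete(m, 42) // absent key: no-op, no panic, nothing returned

	// The membership check only decides whether a counter should be
	// decremented; the delete itself never fails.
	if _, ok := m[1]; ok {
		delete(m, 1)
	}
	fmt.Println(len(m)) // 0
}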

@@ -169,14 +170,18 @@ func (s *shardsDirector) GetOrCreateFollower(namespace string, shardId int64) (F
return nil, err
}

delete(s.leaders, shardId)
if _, ok := s.leaders[shardId]; ok {
s.leadersCounter.Dec()
@mattisonchao (Member) commented:

Would it be better to call this line after the delete?

merlimat merged commit 95bab49 into streamnative:main on Jun 12, 2023
merlimat deleted the gauge-deadlock branch on Jun 12, 2023 at 15:39