
panic: concurrent map writes in watch/dynamic.go #193

Closed · Fixed by #194

tomasaschan opened this issue Nov 24, 2021 · 4 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.

Comments

tomasaschan (Member) commented Nov 24, 2021

We're seeing a panic on `dw.lastRV[key] = rv` with the following stack trace:

```
"concurrent map writes"
Stack:
	2  0x000000000309d871 in sigs.k8s.io/kubebuilder-declarative-pattern/pkg/patterns/declarative/pkg/watch.(*dynamicWatch).watchUntilClosed
	    at /home/tomasl/go/pkg/mod/ghe.spotify.net/declarative-infra/[email protected]/pkg/patterns/declarative/pkg/watch/dynamic.go:150
	3  0x000000000309c71a in sigs.k8s.io/kubebuilder-declarative-pattern/pkg/patterns/declarative/pkg/watch.(*dynamicWatch).Add.func1
	    at /home/tomasl/go/pkg/mod/ghe.spotify.net/declarative-infra/[email protected]/pkg/patterns/declarative/pkg/watch/dynamic.go:86
```

It seems to happen every time our operator starts, but for different keys every time. I don't grok what goes on under the hood well enough to understand what different paths could be updating this simultaneously; what am I looking for in troubleshooting this? Or is it just a case of having too many resources? (This operator is looking at 1000+ instances of the CR, which each creates a couple of RoleBindings and an IAMPartialPolicy in the namespace where the CR is, plus a RoleBinding in a common namespace.)

If there's anything I can help out with to contribute a workaround/fix here, please let me know.
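
For reference, this is what that message means at the language level: writes to a plain Go map from multiple goroutines without synchronization are detected by the runtime, which aborts the program. A minimal standalone sketch (unrelated to the operator code) that usually reproduces it:

```go
package main

// Two goroutines write to the same plain map with no synchronization;
// the Go runtime detects the overlapping writes and aborts with
// "concurrent map writes".
func main() {
	m := make(map[int]string)
	done := make(chan struct{})
	for i := 0; i < 2; i++ {
		go func() {
			for j := 0; j < 10000; j++ {
				m[j] = "x" // unguarded write from two goroutines
			}
			done <- struct{}{}
		}()
	}
	<-done
	<-done
}
```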

@tomasaschan added the kind/bug label on Nov 24, 2021
justinsb (Contributor) commented

Oops - thanks for reporting!

This should be a relatively simple fix - I forgot a mutex to guard lastRV. The reason I forgot it is that I thought there would only be a single goroutine, but in fact there can be several if we are watching multiple types. Because of that, lastRV should also be keyed by kind (or each kind needs its own map).

I can send a fix in an hour or two... Adding a mutex is probably sufficient to avoid the crashes, but we should fix the map keying as well.
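
A minimal sketch of that direction (the rvKey type and isDuplicate helper are hypothetical names, not the actual change that landed in #194): guard the map with a mutex and include the GroupVersionKind in the key, so the per-type watch goroutines neither race nor collide on identical namespace/name pairs.

```go
package watch

import (
	"sync"

	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
)

// rvKey combines the watched kind with the object's namespace/name so
// that two kinds containing an object with the same namespace/name
// don't overwrite each other's cached resource version.
type rvKey struct {
	gvk schema.GroupVersionKind
	nn  types.NamespacedName
}

type dynamicWatch struct {
	// mu guards lastRV; one watch goroutine runs per watched type,
	// so several goroutines can touch the map concurrently.
	mu     sync.Mutex
	lastRV map[rvKey]string
	// ... other fields elided ...
}

// isDuplicate records rv and reports whether it matches the last
// resource version seen for this object, so the caller can skip
// sending spurious events.
func (dw *dynamicWatch) isDuplicate(gvk schema.GroupVersionKind, nn types.NamespacedName, rv string) bool {
	dw.mu.Lock()
	defer dw.mu.Unlock()
	if dw.lastRV == nil {
		dw.lastRV = make(map[rvKey]string)
	}
	key := rvKey{gvk: gvk, nn: nn}
	if prev, found := dw.lastRV[key]; found && prev == rv {
		return true
	}
	dw.lastRV[key] = rv
	return false
}
```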

tomasaschan (Member, Author) commented

Thanks!

We have a local fork of this repo where we've added a few other things: most notably a POC implementation of #191, but also #192 and a helper AddError on CommonObjects that adds an error string to the status if it's not already present. I'll probably submit PRs back here for most or all of those eventually 😄

In that fork, I did the following to make the map thread-safe; do you think it will be sufficient for now? I don't think we'll have any resources with the same namespace/name but different kinds (yet).

```diff
diff --git a/pkg/patterns/declarative/pkg/watch/dynamic.go b/pkg/patterns/declarative/pkg/watch/dynamic.go
index 5f69575..537d618 100644
--- a/pkg/patterns/declarative/pkg/watch/dynamic.go
+++ b/pkg/patterns/declarative/pkg/watch/dynamic.go
@@ -19,6 +19,7 @@ package watch
 import (
 	"context"
 	"fmt"
+	"sync"
 	"time"
 
 	"k8s.io/apimachinery/pkg/api/meta"
@@ -56,7 +57,7 @@ type dynamicWatch struct {
 
 	// lastRV caches the last reported resource version.
 	// This helps us avoid sending duplicate events (e.g. on a rewatch)
-	lastRV map[types.NamespacedName]string
+	lastRV sync.Map
 }
 
 func (dw *dynamicWatch) newDynamicClient(gvk schema.GroupVersionKind, namespace string) (dynamic.ResourceInterface, error) {
@@ -79,7 +80,7 @@ func (dw *dynamicWatch) Add(trigger schema.GroupVersionKind, options metav1.List
 		return fmt.Errorf("creating client for (%s): %v", trigger.String(), err)
 	}
 
-	dw.lastRV = make(map[types.NamespacedName]string)
+	dw.lastRV = sync.Map{}
 
 	go func() {
 		for {
@@ -140,14 +141,14 @@ func (dw *dynamicWatch) watchUntilClosed(client dynamic.ResourceInterface, trigg
 		switch clientEvent.Type {
 		case watch.Deleted:
 			// stop lastRV growing indefinitely
-			delete(dw.lastRV, key)
+			dw.lastRV.Delete(key)
 			// We always send the delete notification
 		case watch.Added, watch.Modified:
-			if previousRV, found := dw.lastRV[key]; found && previousRV == rv {
+			if previousRV, found := dw.lastRV.Load(key); found && previousRV == rv {
 				// Don't send spurious invalidations
 				continue
 			}
-			dw.lastRV[key] = rv
+			dw.lastRV.Store(key, rv)
 		}
 
 		log.WithValues("type", clientEvent.Type).WithValues("kind", trigger.String()).WithValues("name", key.Name, "namespace", key.Namespace).Info("broadcasting event")
```
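
One subtlety in the diff above: sync.Map.Load returns (any, bool), so previousRV == rv compares an interface value against a string. That behaves like the original plain-map comparison, because comparing an interface against a concrete value compares the dynamic type and value. A tiny standalone sketch:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var m sync.Map
	m.Store("default/foo", "12345") // stored as any, dynamic type string

	// Load returns (any, bool); == against a string compares the
	// interface's dynamic type and value, so this matches the old
	// map[types.NamespacedName]string behavior.
	if rv, found := m.Load("default/foo"); found && rv == "12345" {
		fmt.Println("resource version unchanged; skipping event")
	}
}
```

As a side note, sync.Map's zero value is ready to use, so the `dw.lastRV = sync.Map{}` reassignment in Add could probably be dropped; go vet's copylocks check may also flag assigning a sync.Map by value.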

justinsb (Contributor) commented

Thanks - looking forward to those PRs :-)

And yes, that should be sufficient. This is "just" an optimization, and resourceVersions are (usually) globally unique, so as long as we're not crashing it's an improvement!

tomasaschan (Member, Author) commented

I'll file this one right away (it was easy enough to put in place!). The others will come once we've been using them for a little while, so we can still make breaking changes to the APIs if we find better ways to do things.

tomasaschan added a commit to tomasaschan/kubebuilder-declarative-pattern that referenced this issue Nov 24, 2021

justinsb added a commit to justinsb/kubebuilder-declarative-pattern that referenced this issue Nov 25, 2021

This avoids the race when we were (mistakenly) sharing the lastRV map.

Issue kubernetes-sigs#193