Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Seeing errors/panics when trying to upgrade past 2.14.1 #12010

Closed
alpeb opened this issue Jan 29, 2024 Discussed in #11927 · 4 comments · Fixed by #12022 or #12053
Closed

Seeing errors/panics when trying to upgrade past 2.14.1 #12010

alpeb opened this issue Jan 29, 2024 Discussed in #11927 · 4 comments · Fixed by #12022 or #12053

Comments

@alpeb
Copy link
Member

alpeb commented Jan 29, 2024

Discussed in #11927

Originally posted by nealharris January 12, 2024
This feels like a bug to me, but I'm not exactly sure how to reproduce it, so I'm not yet willing to open this as an issue in the project.

We're currently running Linkerd 2.14.1 in production. We try to keep up to date, and always run the latest stable release. However, for each stable release after 2.14.1, we've had some trouble when trying to upgrade.

Specifically, when we upgrade in our staging environment, after a few days, we notice restarts in the destination controller. When looking at logs, I see things like

ERROR 2024-01-12T08:00:46.599129659Z [resource.labels.containerName: destination] fatal error: concurrent map iteration and map write
ERROR 2024-01-12T08:00:46.602373609Z [resource.labels.containerName: destination] {}
ERROR 2024-01-12T08:00:46.602422320Z [resource.labels.containerName: destination] goroutine 824961 [running]:
ERROR 2024-01-12T08:00:46.602427273Z [resource.labels.containerName: destination] github.com/linkerd/linkerd2/controller/api/destination.(*endpointTranslator).add(...)
ERROR 2024-01-12T08:00:46.602430571Z [resource.labels.containerName: destination] /linkerd-build/controller/api/destination/endpoint_translator.go:192
ERROR 2024-01-12T08:00:46.602434365Z [resource.labels.containerName: destination] github.com/linkerd/linkerd2/controller/api/destination.(*endpointTranslator).processUpdate(0xc0043d8ea0, {0x1b12480?, 0xc000e1e2e8?})
ERROR 2024-01-12T08:00:46.602436814Z [resource.labels.containerName: destination] /linkerd-build/controller/api/destination/endpoint_translator.go:183 +0x12d
ERROR 2024-01-12T08:00:46.602439176Z [resource.labels.containerName: destination] github.com/linkerd/linkerd2/controller/api/destination.(*endpointTranslator).Start.func1()
ERROR 2024-01-12T08:00:46.602441799Z [resource.labels.containerName: destination] /linkerd-build/controller/api/destination/endpoint_translator.go:167 +0x37
ERROR 2024-01-12T08:00:46.602444278Z [resource.labels.containerName: destination] created by github.com/linkerd/linkerd2/controller/api/destination.(*endpointTranslator).Start
ERROR 2024-01-12T08:00:46.602446869Z [resource.labels.containerName: destination] /linkerd-build/controller/api/destination/endpoint_translator.go:163 +0x56
ERROR 2024-01-12T08:00:46.602449332Z [resource.labels.containerName: destination] {}
ERROR 2024-01-12T08:00:46.602479161Z [resource.labels.containerName: destination] goroutine 1 [chan receive, 2169 minutes]:
ERROR 2024-01-12T08:00:46.602482727Z [resource.labels.containerName: destination] github.com/linkerd/linkerd2/controller/cmd/destination.Main({0xc0000500e0, 0xa, 0xa})
ERROR 2024-01-12T08:00:46.602486313Z [resource.labels.containerName: destination] /linkerd-build/controller/cmd/destination/main.go:162 +0xf65
ERROR 2024-01-12T08:00:46.602496863Z [resource.labels.containerName: destination] main.main()
ERROR 2024-01-12T08:00:46.602499788Z [resource.labels.containerName: destination] /linkerd-build/controller/cmd/main.go:23 +0x191
ERROR 2024-01-12T08:00:46.602502201Z [resource.labels.containerName: destination] {}
ERROR 2024-01-12T08:00:46.602505639Z [resource.labels.containerName: destination] goroutine 48 [select]:
ERROR 2024-01-12T08:00:46.602508155Z [resource.labels.containerName: destination] go.opencensus.io/stats/view.(*worker).start(0xc000146200)
ERROR 2024-01-12T08:00:46.602513801Z [resource.labels.containerName: destination] /go/pkg/mod/[email protected]/stats/view/worker.go:292 +0xad
ERROR 2024-01-12T08:00:46.602516690Z [resource.labels.containerName: destination] created by go.opencensus.io/stats/view.init.0
ERROR 2024-01-12T08:00:46.602521661Z [resource.labels.containerName: destination] /go/pkg/mod/[email protected]/stats/view/worker.go:34 +0x8d
ERROR 2024-01-12T08:00:46.602524146Z [resource.labels.containerName: destination] {}
ERROR 2024-01-12T08:00:46.602526628Z [resource.labels.containerName: destination] goroutine 185 [sync.Cond.Wait, 10 minutes]:
ERROR 2024-01-12T08:00:46.602533518Z [resource.labels.containerName: destination] sync.runtime_notifyListWait(0xc000612028, 0xa90)
ERROR 2024-01-12T08:00:46.602543572Z [resource.labels.containerName: destination] /usr/local/go/src/runtime/sema.go:517 +0x14c
ERROR 2024-01-12T08:00:46.602548176Z [resource.labels.containerName: destination] sync.(*Cond).Wait(0xc0042b0800?)
ERROR 2024-01-12T08:00:46.602551037Z [resource.labels.containerName: destination] /usr/local/go/src/sync/cond.go:70 +0x8c
ERROR 2024-01-12T08:00:46.602558126Z [resource.labels.containerName: destination] k8s.io/client-go/tools/cache.(*DeltaFIFO).Pop(0xc000612000, 0xc000616010)
ERROR 2024-01-12T08:00:46.602560684Z [resource.labels.containerName: destination] /go/pkg/mod/k8s.io/[email protected]/tools/cache/delta_fifo.go:575 +0x256
ERROR 2024-01-12T08:00:46.602569140Z [resource.labels.containerName: destination] k8s.io/client-go/tools/cache.(*controller).processLoop(0xc000618000)
ERROR 2024-01-12T08:00:46.602571907Z [resource.labels.containerName: destination] /go/pkg/mod/k8s.io/[email protected]/tools/cache/controller.go:192 +0x36
ERROR 2024-01-12T08:00:46.602592218Z [resource.labels.containerName: destination] k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x40d91f?)
ERROR 2024-01-12T08:00:46.602612454Z [resource.labels.containerName: destination] /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:226 +0x3e
ERROR 2024-01-12T08:00:46.602621037Z [resource.labels.containerName: destination] k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xb331c5?, {0x22a2200, 0xc000614030}, 0x1, 0x0)
ERROR 2024-01-12T08:00:46.602624141Z [resource.labels.containerName: destination] /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:227 +0xb6
ERROR 2024-01-12T08:00:46.602627223Z [resource.labels.containerName: destination] k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000618078?, 0x3b9aca00, 0x0, 0x0?, 0x8bb2c97000?)
ERROR 2024-01-12T08:00:46.602629739Z [resource.labels.containerName: destination] /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:204 +0x89
ERROR 2024-01-12T08:00:46.602632922Z [resource.labels.containerName: destination] k8s.io/apimachinery/pkg/util/wait.Until(...)
ERROR 2024-01-12T08:00:46.602635384Z [resource.labels.containerName: destination] /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:161
ERROR 2024-01-12T08:00:46.602649519Z [resource.labels.containerName: destination] k8s.io/client-go/tools/cache.(*controller).Run(0xc000618000, 0x0)
ERROR 2024-01-12T08:00:46.602666866Z [resource.labels.containerName: destination] /go/pkg/mod/k8s.io/[email protected]/tools/cache/controller.go:163 +0x385
ERROR 2024-01-12T08:00:46.602693511Z [resource.labels.containerName: destination] k8s.io/client-go/tools/cache.(*sharedIndexInformer).Run(0xc00051a210, 0xc000129fd0?)
ERROR 2024-01-12T08:00:46.602705904Z [resource.labels.containerName: destination] /go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:503 +0x542
ERROR 2024-01-12T08:00:46.602710075Z [resource.labels.containerName: destination] k8s.io/client-go/informers.(*sharedInformerFactory).Start.func1()
ERROR 2024-01-12T08:00:46.602713741Z [resource.labels.containerName: destination] /go/pkg/mod/k8s.io/[email protected]/informers/factory.go:150 +0x6b
ERROR 2024-01-12T08:00:46.602717251Z [resource.labels.containerName: destination] created by k8s.io/client-go/informers.(*sharedInformerFactory).Start
ERROR 2024-01-12T08:00:46.602721222Z [resource.labels.containerName: destination] /go/pkg/mod/k8s.io/[email protected]/informers/factory.go:148 +0x22a

The above are from our most recent attempt to upgrade, which is for 2.14.8.

@nealharris
Copy link

Thanks for moving this over @alpeb.

A minor update here, we've tried again with an upgrade in our staging environment, this time to 2.14.9. This was on January 24. So far, we haven't seen any restarts in the the destination pods. I don't know if the changes in 2.14.9 might have fixed what I was seeing, or if I just haven't waited long enough to observe a restart. But I thought it was worth reporting that. (I think other upgrade attempts usually had a restart after a few days, but it's possible some took a week or more.)

alpeb added a commit that referenced this issue Jan 31, 2024
Fixes #12010

## Problem

We're observing crashes in the destination controller in some scenarios, due to data race as described in #12010.

## Cause

The problem is the same instance of the `AddressSet.Addresses` map is getting mutated in the endpoints watcher Server [informer handler](https://github.com/linkerd/linkerd2/blob/edge-24.1.3/controller/api/destination/watcher/endpoints_watcher.go#L1309), and iterated over in the endpoint translator [queue loop](https://github.com/linkerd/linkerd2/blob/edge-24.1.3/controller/api/destination/endpoint_translator.go#L197-L211), which run in different goroutines and the map is not guarded. I believe this doesn't result in Destination returning stale data; it's more of a correctness issue.

## Solution

Added a new mutex for `AddressSet`, used whenever its fields are accessed/mutated. As mutexes cannot be copied, we can only pass a pointer around, and so any use of `AddressSet` ended up being a pointer.

## Repro

Install linkerd core and injected emojivoto and patch the endpoint translator to include a sleep call that will help surfacing the race (don't install the patch in the cluster; we'll only use it locally below):

<details>
  <summary>endpoint_translator.go diff</summary>

```diff
diff --git a/controller/api/destination/endpoint_translator.go b/controller/api/destination/endpoint_translator.go
index d1018d5f9..7d5abd638 100644
--- a/controller/api/destination/endpoint_translator.go
+++ b/controller/api/destination/endpoint_translator.go
@@ -5,6 +5,7 @@ import (
        "reflect"
        "strconv"
        "strings"
+       "time"

        pb "github.com/linkerd/linkerd2-proxy-api/go/destination"
        "github.com/linkerd/linkerd2-proxy-api/go/net"
@@ -195,7 +196,9 @@ func (et *endpointTranslator) processUpdate(update interface{}) {
 }

 func (et *endpointTranslator) add(set watcher.AddressSet) {
        for id, address := range set.Addresses {
+               time.Sleep(1 * time.Second)
                et.availableEndpoints.Addresses[id] = address
        }
```
</details>

Then create these two Server manifests:

<details>
  <summary>emoji-web-server.yml</summary>

```yaml
apiVersion: policy.linkerd.io/v1beta2
kind: Server
metadata:
  namespace: emojivoto
  name: web-http
  labels:
    app.kubernetes.io/part-of: emojivoto
    app.kubernetes.io/name: web
    app.kubernetes.io/version: v11
spec:
  podSelector:
    matchLabels:
      app: web-svc
  port: http
  proxyProtocol: HTTP/1
```
</details>

<details>
  <summary>emoji-web-server-opaque.yml</summary>

```yaml
apiVersion: policy.linkerd.io/v1beta2
kind: Server
metadata:
  namespace: emojivoto
  name: web-http
  labels:
    app.kubernetes.io/part-of: emojivoto
    app.kubernetes.io/name: web
    app.kubernetes.io/version: v11
spec:
  podSelector:
    matchLabels:
      app: web-svc
  port: http
  proxyProtocol: opaque
```
</details>

In separate consoles run the patched destination service and a destination client:

```bash
HOSTNAME=foobar go run -race ./controller/cmd/main.go destination -enable-h2-upgrade=true -enable-endpoint-slices=true -cluster-domain=cluster.local -identity-trust-domain=cluster.local -default-opaque-ports=25,587,3306,4444,5432,6379,9300,11211
```

```bash
go run ./controller/script/destination-client -path web-svc.emojivoto.svc.cluster.local:80
```

And run this to continuously switch the `proxyProtocol` field:

```bash
while true; do kubectl apply -f ~/src/k8s/sample_yamls/emoji-web-server.yml; kubectl apply -f ~/src/k8s/sample_yamls/emoji-web-server-opaque.yml ; done
```

Tou'll see the following data race report in the Destination controller logs:

<details>
  <summary>destination logs</summary>

```console
==================
WARNING: DATA RACE
Write at 0x00c0006d30e0 by goroutine 178:
  github.com/linkerd/linkerd2/controller/api/destination/watcher.(*portPublisher).updateServer()
      /home/alpeb/pr/destination-race/linkerd2/controller/api/destination/watcher/endpoints_watcher.go:1310 +0x772
  github.com/linkerd/linkerd2/controller/api/destination/watcher.(*servicePublisher).updateServer()
      /home/alpeb/pr/destination-race/linkerd2/controller/api/destination/watcher/endpoints_watcher.go:711 +0x150
  github.com/linkerd/linkerd2/controller/api/destination/watcher.(*EndpointsWatcher).addServer()
      /home/alpeb/pr/destination-race/linkerd2/controller/api/destination/watcher/endpoints_watcher.go:514 +0x173
  github.com/linkerd/linkerd2/controller/api/destination/watcher.(*EndpointsWatcher).updateServer()
      /home/alpeb/pr/destination-race/linkerd2/controller/api/destination/watcher/endpoints_watcher.go:528 +0x26f
  github.com/linkerd/linkerd2/controller/api/destination/watcher.(*EndpointsWatcher).updateServer-fm()
      <autogenerated>:1 +0x64
  k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/tools/cache/controller.go:246 +0x81
  k8s.io/client-go/tools/cache.(*ResourceEventHandlerFuncs).OnUpdate()
      <autogenerated>:1 +0x1f
  k8s.io/client-go/tools/cache.(*processorListener).run.func1()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:970 +0x1f4
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:226 +0x41
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:227 +0xbe
  k8s.io/apimachinery/pkg/util/wait.JitterUntil()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:204 +0x10a
  k8s.io/apimachinery/pkg/util/wait.Until()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:161 +0x9b
  k8s.io/client-go/tools/cache.(*processorListener).run()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:966 +0x38
  k8s.io/client-go/tools/cache.(*processorListener).run-fm()
      <autogenerated>:1 +0x33
  k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:72 +0x86

Previous read at 0x00c0006d30e0 by goroutine 360:
  github.com/linkerd/linkerd2/controller/api/destination.(*endpointTranslator).add()
      /home/alpeb/pr/destination-race/linkerd2/controller/api/destination/endpoint_translator.go:200 +0x1ab
  github.com/linkerd/linkerd2/controller/api/destination.(*endpointTranslator).processUpdate()
      /home/alpeb/pr/destination-race/linkerd2/controller/api/destination/endpoint_translator.go:190 +0x166
  github.com/linkerd/linkerd2/controller/api/destination.(*endpointTranslator).Start.func1()
      /home/alpeb/pr/destination-race/linkerd2/controller/api/destination/endpoint_translator.go:174 +0x45
```
</details>

## Extras

This also removes the unused method `func (as *AddressSet) WithPort(port Port) AddressSet` in endpoints_watcher.go
@nealharris
Copy link

@alpeb thanks for opening a PR with the fix!

Do you happen to know if this bug was introduced after 2.14.1? I'm trying to understand if I just happen to not have noticed it in our production environment yet, but that the bug was still there (and so I might as well upgrade to the latest stable anyway, since those are no worse than what I'm running wrt this issue).

@alpeb
Copy link
Member Author

alpeb commented Feb 5, 2024

I confirmed the issue was introduced in 2.14.2 via #11491. We're still reviewing the fix, but I'm wondering why it hadn't been broadly surfaced elsewhere... Do you have an environment with a high churn of resources (Pods, Endpoints or linkerd Server resources)?

@nealharris
Copy link

nealharris commented Feb 5, 2024

@alpeb Our environment does have a fair amount of churn in our linkerd data plane. Lots of pods being created due to HPAs. Though to be clear, that refers to our production environment. While our staging environment also has those HPAs, the level of churn there is significantly lower than our production environment. And we never deployed past 2.14.1 in production, because we observed this issue in our staging environment.

I'm sorry for not reporting this sooner! Next time, I'll file something right away.

alpeb added a commit that referenced this issue Feb 7, 2024
Fixes #12010

## Problem

We're observing crashes in the destination controller in some scenarios, due to data race as described in #12010.

## Cause

The problem is the same instance of the `AddressSet.Addresses` map is getting mutated in the endpoints watcher Server [informer handler](https://github.com/linkerd/linkerd2/blob/edge-24.1.3/controller/api/destination/watcher/endpoints_watcher.go#L1309), and iterated over in the endpoint translator [queue loop](https://github.com/linkerd/linkerd2/blob/edge-24.1.3/controller/api/destination/endpoint_translator.go#L197-L211), which run in different goroutines and the map is not guarded. I believe this doesn't result in Destination returning stale data; it's more of a correctness issue.

## Solution

Make a shallow copy of `pp.addresses` in the endpoints watcher and only pass that to the listeners. It's a shallow copy because we avoid making copies of the pod reference in there, knowing it won't get mutated.

## Repro

Install linkerd core and injected emojivoto and patch the endpoint translator to include a sleep call that will help surfacing the race (don't install the patch in the cluster; we'll only use it locally below):

<details>
  <summary>endpoint_translator.go diff</summary>

```diff
diff --git a/controller/api/destination/endpoint_translator.go b/controller/api/destination/endpoint_translator.go
index d1018d5f9..7d5abd638 100644
--- a/controller/api/destination/endpoint_translator.go
+++ b/controller/api/destination/endpoint_translator.go
@@ -5,6 +5,7 @@ import (
        "reflect"
        "strconv"
        "strings"
+       "time"

        pb "github.com/linkerd/linkerd2-proxy-api/go/destination"
        "github.com/linkerd/linkerd2-proxy-api/go/net"
@@ -195,7 +196,9 @@ func (et *endpointTranslator) processUpdate(update interface{}) {
 }

 func (et *endpointTranslator) add(set watcher.AddressSet) {
        for id, address := range set.Addresses {
+               time.Sleep(1 * time.Second)
                et.availableEndpoints.Addresses[id] = address
        }
```
</details>

Then create these two Server manifests:

<details>
  <summary>emoji-web-server.yml</summary>

```yaml
apiVersion: policy.linkerd.io/v1beta2
kind: Server
metadata:
  namespace: emojivoto
  name: web-http
  labels:
    app.kubernetes.io/part-of: emojivoto
    app.kubernetes.io/name: web
    app.kubernetes.io/version: v11
spec:
  podSelector:
    matchLabels:
      app: web-svc
  port: http
  proxyProtocol: HTTP/1
```
</details>

<details>
  <summary>emoji-web-server-opaque.yml</summary>

```yaml
apiVersion: policy.linkerd.io/v1beta2
kind: Server
metadata:
  namespace: emojivoto
  name: web-http
  labels:
    app.kubernetes.io/part-of: emojivoto
    app.kubernetes.io/name: web
    app.kubernetes.io/version: v11
spec:
  podSelector:
    matchLabels:
      app: web-svc
  port: http
  proxyProtocol: opaque
```
</details>

In separate consoles run the patched destination service and a destination client:

```bash
HOSTNAME=foobar go run -race ./controller/cmd/main.go destination -enable-h2-upgrade=true -enable-endpoint-slices=true -cluster-domain=cluster.local -identity-trust-domain=cluster.local -default-opaque-ports=25,587,3306,4444,5432,6379,9300,11211
```

```bash
go run ./controller/script/destination-client -path web-svc.emojivoto.svc.cluster.local:80
```

And run this to continuously switch the `proxyProtocol` field:

```bash
while true; do kubectl apply -f ~/src/k8s/sample_yamls/emoji-web-server.yml; kubectl apply -f ~/src/k8s/sample_yamls/emoji-web-server-opaque.yml ; done
```

You'll see the following data race report in the Destination controller logs:

<details>
  <summary>destination logs</summary>

```console
==================
WARNING: DATA RACE
Write at 0x00c0006d30e0 by goroutine 178:
  github.com/linkerd/linkerd2/controller/api/destination/watcher.(*portPublisher).updateServer()
      /home/alpeb/pr/destination-race/linkerd2/controller/api/destination/watcher/endpoints_watcher.go:1310 +0x772
  github.com/linkerd/linkerd2/controller/api/destination/watcher.(*servicePublisher).updateServer()
      /home/alpeb/pr/destination-race/linkerd2/controller/api/destination/watcher/endpoints_watcher.go:711 +0x150
  github.com/linkerd/linkerd2/controller/api/destination/watcher.(*EndpointsWatcher).addServer()
      /home/alpeb/pr/destination-race/linkerd2/controller/api/destination/watcher/endpoints_watcher.go:514 +0x173
  github.com/linkerd/linkerd2/controller/api/destination/watcher.(*EndpointsWatcher).updateServer()
      /home/alpeb/pr/destination-race/linkerd2/controller/api/destination/watcher/endpoints_watcher.go:528 +0x26f
  github.com/linkerd/linkerd2/controller/api/destination/watcher.(*EndpointsWatcher).updateServer-fm()
      <autogenerated>:1 +0x64
  k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/tools/cache/controller.go:246 +0x81
  k8s.io/client-go/tools/cache.(*ResourceEventHandlerFuncs).OnUpdate()
      <autogenerated>:1 +0x1f
  k8s.io/client-go/tools/cache.(*processorListener).run.func1()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:970 +0x1f4
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:226 +0x41
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:227 +0xbe
  k8s.io/apimachinery/pkg/util/wait.JitterUntil()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:204 +0x10a
  k8s.io/apimachinery/pkg/util/wait.Until()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:161 +0x9b
  k8s.io/client-go/tools/cache.(*processorListener).run()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:966 +0x38
  k8s.io/client-go/tools/cache.(*processorListener).run-fm()
      <autogenerated>:1 +0x33
  k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:72 +0x86

Previous read at 0x00c0006d30e0 by goroutine 360:
  github.com/linkerd/linkerd2/controller/api/destination.(*endpointTranslator).add()
      /home/alpeb/pr/destination-race/linkerd2/controller/api/destination/endpoint_translator.go:200 +0x1ab
  github.com/linkerd/linkerd2/controller/api/destination.(*endpointTranslator).processUpdate()
      /home/alpeb/pr/destination-race/linkerd2/controller/api/destination/endpoint_translator.go:190 +0x166
  github.com/linkerd/linkerd2/controller/api/destination.(*endpointTranslator).Start.func1()
      /home/alpeb/pr/destination-race/linkerd2/controller/api/destination/endpoint_translator.go:174 +0x45
```
</details>

## Extras

This also removes the unused method `func (as *AddressSet) WithPort(port Port) AddressSet` in endpoints_watcher.go
mateiidavid added a commit that referenced this issue Feb 9, 2024
This release addresses some issues in the destination service that could cause
it to behave unexpectedly when processing updates.

* Fixed a race condition in the destination service that could cause panics
  under very specific conditions ([#12022]; fixes [#12010])
* Changed how updates to a `Server` selector are handled in the destination
  service. When a `Server` that marks a port as opaque no longer selects a
  resource, the resource's opaqueness will reverted to default settings
  ([#12031]; fixes [#11995])
* Introduced Helm configuration values for liveness and readiness probe
  timeouts and delays ([#11458]; fixes [#11453]) (thanks @jan-kantert!)

[#12010]: #12010
[#12022]: #12022
[#11995]: #11995
[#12031]: #12031
[#11453]: #11453
[#11458]: #11458

Signed-off-by: Matei David <[email protected]>
adleong pushed a commit that referenced this issue Feb 17, 2024
Fixes #12010

## Problem

We're observing crashes in the destination controller in some scenarios, due to data race as described in #12010.

## Cause

The problem is the same instance of the `AddressSet.Addresses` map is getting mutated in the endpoints watcher Server [informer handler](https://github.com/linkerd/linkerd2/blob/edge-24.1.3/controller/api/destination/watcher/endpoints_watcher.go#L1309), and iterated over in the endpoint translator [queue loop](https://github.com/linkerd/linkerd2/blob/edge-24.1.3/controller/api/destination/endpoint_translator.go#L197-L211), which run in different goroutines and the map is not guarded. I believe this doesn't result in Destination returning stale data; it's more of a correctness issue.

## Solution

Make a shallow copy of `pp.addresses` in the endpoints watcher and only pass that to the listeners. It's a shallow copy because we avoid making copies of the pod reference in there, knowing it won't get mutated.

## Repro

Install linkerd core and injected emojivoto and patch the endpoint translator to include a sleep call that will help surfacing the race (don't install the patch in the cluster; we'll only use it locally below):

<details>
  <summary>endpoint_translator.go diff</summary>

```diff
diff --git a/controller/api/destination/endpoint_translator.go b/controller/api/destination/endpoint_translator.go
index d1018d5f9..7d5abd638 100644
--- a/controller/api/destination/endpoint_translator.go
+++ b/controller/api/destination/endpoint_translator.go
@@ -5,6 +5,7 @@ import (
        "reflect"
        "strconv"
        "strings"
+       "time"

        pb "github.com/linkerd/linkerd2-proxy-api/go/destination"
        "github.com/linkerd/linkerd2-proxy-api/go/net"
@@ -195,7 +196,9 @@ func (et *endpointTranslator) processUpdate(update interface{}) {
 }

 func (et *endpointTranslator) add(set watcher.AddressSet) {
        for id, address := range set.Addresses {
+               time.Sleep(1 * time.Second)
                et.availableEndpoints.Addresses[id] = address
        }
```
</details>

Then create these two Server manifests:

<details>
  <summary>emoji-web-server.yml</summary>

```yaml
apiVersion: policy.linkerd.io/v1beta2
kind: Server
metadata:
  namespace: emojivoto
  name: web-http
  labels:
    app.kubernetes.io/part-of: emojivoto
    app.kubernetes.io/name: web
    app.kubernetes.io/version: v11
spec:
  podSelector:
    matchLabels:
      app: web-svc
  port: http
  proxyProtocol: HTTP/1
```
</details>

<details>
  <summary>emoji-web-server-opaque.yml</summary>

```yaml
apiVersion: policy.linkerd.io/v1beta2
kind: Server
metadata:
  namespace: emojivoto
  name: web-http
  labels:
    app.kubernetes.io/part-of: emojivoto
    app.kubernetes.io/name: web
    app.kubernetes.io/version: v11
spec:
  podSelector:
    matchLabels:
      app: web-svc
  port: http
  proxyProtocol: opaque
```
</details>

In separate consoles run the patched destination service and a destination client:

```bash
HOSTNAME=foobar go run -race ./controller/cmd/main.go destination -enable-h2-upgrade=true -enable-endpoint-slices=true -cluster-domain=cluster.local -identity-trust-domain=cluster.local -default-opaque-ports=25,587,3306,4444,5432,6379,9300,11211
```

```bash
go run ./controller/script/destination-client -path web-svc.emojivoto.svc.cluster.local:80
```

And run this to continuously switch the `proxyProtocol` field:

```bash
while true; do kubectl apply -f ~/src/k8s/sample_yamls/emoji-web-server.yml; kubectl apply -f ~/src/k8s/sample_yamls/emoji-web-server-opaque.yml ; done
```

You'll see the following data race report in the Destination controller logs:

<details>
  <summary>destination logs</summary>

```console
==================
WARNING: DATA RACE
Write at 0x00c0006d30e0 by goroutine 178:
  github.com/linkerd/linkerd2/controller/api/destination/watcher.(*portPublisher).updateServer()
      /home/alpeb/pr/destination-race/linkerd2/controller/api/destination/watcher/endpoints_watcher.go:1310 +0x772
  github.com/linkerd/linkerd2/controller/api/destination/watcher.(*servicePublisher).updateServer()
      /home/alpeb/pr/destination-race/linkerd2/controller/api/destination/watcher/endpoints_watcher.go:711 +0x150
  github.com/linkerd/linkerd2/controller/api/destination/watcher.(*EndpointsWatcher).addServer()
      /home/alpeb/pr/destination-race/linkerd2/controller/api/destination/watcher/endpoints_watcher.go:514 +0x173
  github.com/linkerd/linkerd2/controller/api/destination/watcher.(*EndpointsWatcher).updateServer()
      /home/alpeb/pr/destination-race/linkerd2/controller/api/destination/watcher/endpoints_watcher.go:528 +0x26f
  github.com/linkerd/linkerd2/controller/api/destination/watcher.(*EndpointsWatcher).updateServer-fm()
      <autogenerated>:1 +0x64
  k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/tools/cache/controller.go:246 +0x81
  k8s.io/client-go/tools/cache.(*ResourceEventHandlerFuncs).OnUpdate()
      <autogenerated>:1 +0x1f
  k8s.io/client-go/tools/cache.(*processorListener).run.func1()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:970 +0x1f4
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:226 +0x41
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:227 +0xbe
  k8s.io/apimachinery/pkg/util/wait.JitterUntil()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:204 +0x10a
  k8s.io/apimachinery/pkg/util/wait.Until()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:161 +0x9b
  k8s.io/client-go/tools/cache.(*processorListener).run()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:966 +0x38
  k8s.io/client-go/tools/cache.(*processorListener).run-fm()
      <autogenerated>:1 +0x33
  k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
      /home/alpeb/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:72 +0x86

Previous read at 0x00c0006d30e0 by goroutine 360:
  github.com/linkerd/linkerd2/controller/api/destination.(*endpointTranslator).add()
      /home/alpeb/pr/destination-race/linkerd2/controller/api/destination/endpoint_translator.go:200 +0x1ab
  github.com/linkerd/linkerd2/controller/api/destination.(*endpointTranslator).processUpdate()
      /home/alpeb/pr/destination-race/linkerd2/controller/api/destination/endpoint_translator.go:190 +0x166
  github.com/linkerd/linkerd2/controller/api/destination.(*endpointTranslator).Start.func1()
      /home/alpeb/pr/destination-race/linkerd2/controller/api/destination/endpoint_translator.go:174 +0x45
```
</details>

## Extras

This also removes the unused method `func (as *AddressSet) WithPort(port Port) AddressSet` in endpoints_watcher.go
adleong added a commit that referenced this issue Feb 20, 2024
This stable release back-ports bugfixes and improvements from recent edge
releases.

* Introduced support for arbitrary labels in the `podMonitors` field in the
  control plane Helm chart (thanks @jseiser!) ([#11222]; fixes [#11175])
* Added a `prometheusUrl` field for the heartbeat job in the control plane Helm
  chart (thanks @david972!) ([#11343]; fixes [#11342])
* Updated the Destination controller to return `INVALID_ARGUMENT` status codes
  properly when a `ServiceProfile` is requested for a service that does not
  exist. ([#11980])
* Reduced the load on the Destination controller by only processing Server
  updates on workloads affected by the Server ([#12017])
* Changed how updates to a `Server` selector are handled in the destination
  service. When a `Server` that marks a port as opaque no longer selects a
  resource, the resource's opaqueness will reverted to default settings
  ([#12031]; fixes [#11995])
* Fixed a race condition in the destination service that could cause panics
  under very specific conditions ([#12022]; fixes [#12010])
* Fixed an issue where inbound policy could be incorrect after certain policy
  resources are deleted ([#12088])

[#11222]: #11222
[#11175]: #11175
[#11343]: #11343
[#11342]: #11342
[#11980]: #11980
[#12017]: #12017
[#11995]: #11995
[#12031]: #12031
[#12010]: #12010
[#12022]: #12022
[#12088]: #12088

Signed-off-by: Alex Leong <[email protected]>
Signed-off-by: David ALEXANDRE <[email protected]>
Signed-off-by: Justin S <[email protected]>
Co-authored-by: Oliver Gould <[email protected]>
Co-authored-by: Alejandro Pedraza <[email protected]>
Co-authored-by: David ALEXANDRE <[email protected]>
Co-authored-by: Justin Seiser <[email protected]>
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 9, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
2 participants