Unexpected upload network traffic while tailing K8s logs #3586

Closed

mattpolzin opened this issue Jan 25, 2023 · 10 comments · Fixed by #3730

@mattpolzin (Contributor)

mattpolzin commented Jan 25, 2023

Bug

Current Behavior

While running tasks or building new Docker images, upload bandwidth spikes to 1 Mbps (the maximum for my slow internet connection) and stays maxed out for the duration of the task execution, usually for up to a minute after the task container has "completed" as seen via k9s. The Garden CLI does not report that the task has finished when it actually finishes; as noted, it will sometimes report success or even a failure a minute after the task has finished successfully in the remote container, apparently because it is bogged down by the network requests it is making. This is not the traffic that sends build context to the cluster; that traffic stays closer to 300 KB/s upload. As a side effect of bandwidth being maxed out, I also frequently see failed connections, which results in Garden reporting a failure to me even though things were working fine in the cluster; the problem was merely communication with the cluster.

I was able to confirm that for some reason this upload traffic is entirely related to tailing logs by commenting out the following code in a local build of Garden:

```diff
diff --git a/core/src/plugins/kubernetes/logs.ts b/core/src/plugins/kubernetes/logs.ts
index 34e235e17..a32ed241b 100644
--- a/core/src/plugins/kubernetes/logs.ts
+++ b/core/src/plugins/kubernetes/logs.ts
@@ -230,9 +230,9 @@ export class K8sLogFollower<T> {
   public async followLogs(opts: LogOpts) {
     await this.createConnections(opts)
 
-    this.intervalId = setInterval(async () => {
-      await this.createConnections(opts)
-    }, this.retryIntervalMs)
+//    this.intervalId = setInterval(async () => {
+//      await this.createConnections(opts)
+//    }, this.retryIntervalMs)
 
     return new Promise((resolve, _reject) => {
       this.resolve = resolve
```

When the above code is commented out, upload bandwidth stays near zero during task execution instead of constantly maxing out at 1 Mbps. The frequent connection failures also go away, because network contention is no longer an issue.
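
For illustration, here is a minimal sketch (made-up names, not Garden's actual code) of a retry loop that cannot pile up connection attempts, because the next attempt is only scheduled after the previous one has settled:

```typescript
// Illustrative sketch only, not Garden's implementation: serialize reconnect
// attempts so a slow attempt can never overlap with the next one.
async function followWithSerializedRetries(
  createConnections: () => Promise<void>,
  retryIntervalMs: number,
  shouldStop: () => boolean
): Promise<void> {
  while (!shouldStop()) {
    try {
      // Wait for the attempt to finish (or fail) before scheduling another.
      await createConnections()
    } catch {
      // Swallow the error; the loop retries after the interval.
    }
    await new Promise((resolve) => setTimeout(resolve, retryIntervalMs))
  }
}
```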

Expected behavior

I expect that running a task that is entirely contained in the cluster (no data needs to be transmitted for the operation of the task once the Docker image has been built) should result in very little upload bandwidth usage from the Garden process.

Workaround

Build a custom version of Garden with log tailing commented out. This is what I am working with currently because the experience is so drastically better.

Your environment

  • OS: macOS 13.0.1
  • How I'm running Kubernetes: Azure

garden version: 0.12.48

@to266 (Contributor)

to266 commented Jan 31, 2023

Oh, we've also observed that!

@mcsteele8

This is also a big blocker for me. I have seen this same issue where a task has completed but the garden-cli does not proceed because it is stuck waiting for a response from the remote resource. A big uptick in network consumption is usually involved and blocks other applications from working. Example: Slack, Chrome, and k9s all stop working until the garden cmd is killed.

@eysi09 (Collaborator)

eysi09 commented Feb 2, 2023

Thanks for flagging this!

We are aware of it, and addressing this is a top priority.

eysi09 added the priority:high (High priority issue or feature) label on Feb 2, 2023
thsig added a commit that referenced this issue Feb 9, 2023
Fixes #3586.

We had previously used a very short retry interval for the logs follower
when running one-off pods (e.g. for tests and tasks for our
Kubernetes-based module types).

This was fixed by:

* Lengthening the retry interval from 10 milliseconds to 4 seconds.

* Fetching and streaming the last several seconds of pod logs before
  closing the logs follower, in case any logs were missing (or in case
  the logs follower hadn't had time to connect to the runner pod before
  it finished execution).

* Replacing the old, buffer-based deduplication logic with a simpler,
  more lightweight approach based on comparing the last streamed entry's
  timestamp and message to the entries considered for deduplication.
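
For illustration, the deduplication approach described in this commit could look roughly like the following sketch (hypothetical names and types, not the actual Garden code): keep only entries that are newer than the last streamed entry, and for entries with the same timestamp, keep them only if the message differs.

```typescript
interface LogEntry {
  timestamp: Date
  message: string
}

// Hypothetical sketch of timestamp+message based deduplication: drop anything
// older than the last streamed entry, and drop same-timestamp entries whose
// message matches the last one streamed.
function dedupeEntries(candidates: LogEntry[], lastStreamed?: LogEntry): LogEntry[] {
  if (!lastStreamed) {
    return candidates
  }
  return candidates.filter((entry) => {
    if (entry.timestamp.getTime() > lastStreamed.timestamp.getTime()) {
      return true // strictly newer entries are always kept
    }
    if (entry.timestamp.getTime() < lastStreamed.timestamp.getTime()) {
      return false // older entries were already streamed
    }
    // Same timestamp: only keep the entry if its message differs.
    return entry.message !== lastStreamed.message
  })
}
```
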
stefreak added a commit that referenced this issue Feb 22, 2023
I noticed the following log message when I increased latency & packet
loss in Network Link Conditioner:

```
[silly] <Not connected to container vault in Pod vault-0. Connection status is connecting>
```

This means the connection is not established yet, but the LogFollower is
connecting yet again (which causes a vicious cycle and makes the
internet connection even worse).

This is probably the root cause for the issue described in #3586.

With this bug fixed, I am 100% certain this PR Fixes #3586
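
A rough sketch of the guard described in this commit (illustrative names, not the actual LogFollower implementation): only containers whose connection is closed are redialed, never ones that are still connecting or already connected.

```typescript
type ConnectionStatus = "connecting" | "connected" | "closed"

interface ContainerConnection {
  pod: string
  container: string
  status: ConnectionStatus
}

// Illustrative sketch: skip containers whose connection is already being
// established or is healthy, so a slow handshake never triggers extra dials.
function connectionsNeedingRetry(connections: ContainerConnection[]): ContainerConnection[] {
  return connections.filter((conn) => conn.status === "closed")
}
```
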
stefreak added a commit that referenced this issue Mar 7, 2023
* fix(k8s): more stable & performant log streaming 

Fixes an issue with very high upload bandwidth use when running Kubernetes based tests/tasks that produce a lot of log output.

co-written by @stefreak and @thsig

- Improved connection management and pod lifecycle logic, including more robust connection timeout enforcement.
- Removed keepalive logic, since it doesn't work on all operating systems.
- Improved deduplication logic to generate fewer false positives (and eliminate false-negatives).
- Use sinceTime when fetching logs on retry to make sure we don't fetch any unnecessary logs.
- When a runner pod terminates, we make sure to wait until the final logs have been fetched.
- Default to using the tail option in conjunction with a "max log lines in memory" setting instead of limitBytes to avoid clipping / incomplete log lines while also avoiding the loading of too much log data into memory.
- Only start one connection attempt at a time, to prevent multiple connections to the same container at once.
- Make sure that we only call createConnections again once the previous call has finished, so that there is only one concurrent instance of the method running per LogFollower at a time.

Fixes #3586.

co-authored-by: thsig <[email protected]>
co-authored-by: Steffen Neubauer <[email protected]>

* fix(k8s): make sure LogFollower only connects once

I noticed the following log message when I increased latency & packet
loss in Network Link Conditioner:

```
[silly] <Not connected to container vault in Pod vault-0. Connection status is connecting>
```

This means the connection is not established yet, but the LogFollower is
connecting yet again (which causes a vicious cycle and makes the
internet connection even worse).

This is probably the root cause for the issue described in #3586.

With this bug fixed, I am 100% certain this PR Fixes #3586

Co-authored-by: Thorarinn Sigurdsson <[email protected]>
co-authored-by: Steffen Neubauer <[email protected]>

* improvement(k8s): cap age of logs on retry attempt in `garden logs`

When streaming logs from the k8s api using `garden logs`, we do not want
to stream old log messages as the user might have been disconnected for
a long time (e.g. when the laptop went to sleep)

Co-authored-by: Eyþór Magnússon <[email protected]>
co-authored-by: Steffen Neubauer <[email protected]>

---------

Co-authored-by: Steffen Neubauer <[email protected]>
Co-authored-by: thsig <[email protected]>
Co-authored-by: Eyþór Magnússon <[email protected]>
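
To illustrate the `sinceTime` and tail points in the commit message above, here is a rough sketch (hypothetical helper, not Garden's actual code) of deriving the log-fetch options for a retry from the last entry already streamed: resume just after that timestamp, and bound memory by line count rather than limitBytes so log lines are never clipped.

```typescript
interface RetryLogOptions {
  sinceTime?: string // RFC 3339 timestamp understood by the Kubernetes log API
  tailLines: number
}

// Hypothetical sketch: on a retry, only request logs newer than what was
// already streamed, and bound memory by line count instead of byte count.
function retryLogOptions(lastStreamedAt: Date | undefined, maxLinesInMemory: number): RetryLogOptions {
  return {
    sinceTime: lastStreamedAt?.toISOString(),
    tailLines: maxLinesInMemory,
  }
}
```
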
@stefreak (Member)

stefreak commented Mar 7, 2023

@mattpolzin @to266 @mcsteele8 the fix has now landed in main! 🥳 Sorry that it took so long; we wanted to get this one right.

Can you run garden self-update edge and see if it works with the latest edge version?

@mattpolzin (Contributor, Author)

@stefreak in my very preliminary testing, this makes all the difference!

@stefreak (Member)

stefreak commented Mar 7, 2023

Awesome, glad to hear that!

thsig added a commit that referenced this issue Mar 8, 2023
* fix(k8s): more stable & performant log streaming

Fixes an issue with very high upload bandwidth use when running Kubernetes based tests/tasks that produce a lot of log output.

co-written by @stefreak and @thsig

- Improved connection management and pod lifecycle logic, including more robust connection timeout enforcement.
- Removed keepalive logic, since it doesn't work on all operating systems.
- Improved deduplication logic to generate fewer false positives (and eliminate false-negatives).
- Use sinceTime when fetching logs on retry to make sure we don't fetch any unnecessary logs.
- When a runner pod terminates, we make sure to wait until the final logs have been fetched.
- Default to using the tail option in conjunction with a "max log lines in memory" setting instead of limitBytes to avoid clipping / incomplete log lines while also avoiding the loading of too much log data into memory.
- Only start one connection attempt at a time, to prevent multiple connections to the same container at once.
- Make sure that we only call createConnections once it has finished, so that there is only one concurrent instance of the method running per LogFollower at a time.

Fixes #3586.

co-authored-by: thsig <[email protected]>
co-authored-by: Steffen Neubauer <[email protected]>

* fix(k8s): make sure LogFollower only connects once

I noticed the following log message when I increased latency & packet
loss in Network Link Conditioner:

```
[silly] <Not connected to container vault in Pod vault-0. Connection status is connecting>
```

This means the connection is not established yet, but the LogFollower is
connecting yet again (which causes a vicious cycle and makes the
internet connection even worse).

This is probably the root cause for the issue described in #3586.

With this bug fixed, I am 100% certain this PR Fixes #3586

Co-authored-by: Thorarinn Sigurdsson <[email protected]>
co-authored-by: Steffen Neubauer <[email protected]>

* improvement(k8s): cap age of logs on retry attempt in `garden logs`

When streaming logs from the k8s api using `garden logs`, we do not want
to stream old log messages as the user might have been disconnected for
a long time (e.g. when the laptop went to sleep)

Co-authored-by: Eyþór Magnússon <[email protected]>
co-authored-by: Steffen Neubauer <[email protected]>

---------

Co-authored-by: Steffen Neubauer <[email protected]>
Co-authored-by: thsig <[email protected]>
Co-authored-by: Eyþór Magnússon <[email protected]>
edvald pushed a commit that referenced this issue Mar 8, 2023
* fix(k8s): more stable & performant log streaming

Fixes an issue with very high upload bandwidth use when running Kubernetes based tests/tasks that produce a lot of log output.

co-written by @stefreak and @thsig

- Improved connection management and pod lifecycle logic, including more robust connection timeout enforcement.
- Removed keepalive logic, since it doesn't work on all operating systems.
- Improved deduplication logic to generate fewer false positives (and eliminate false-negatives).
- Use sinceTime when fetching logs on retry to make sure we don't fetch any unnecessary logs.
- When a runner pod terminates, we make sure to wait until the final logs have been fetched.
- Default to using the tail option in conjunction with a "max log lines in memory" setting instead of limitBytes to avoid clipping / incomplete log lines while also avoiding the loading of too much log data into memory.
- Only start one connection attempt at a time, to prevent multiple connections to the same container at once.
- Make sure that we only call createConnections once it has finished, so that there is only one concurrent instance of the method running per LogFollower at a time.

Fixes #3586.

co-authored-by: thsig <[email protected]>
co-authored-by: Steffen Neubauer <[email protected]>

* fix(k8s): make sure LogFollower only connects once

I noticed the following log message when I increased latency & packet
loss in Network Link Conditioner:

```
[silly] <Not connected to container vault in Pod vault-0. Connection status is connecting>
```

This means the connection is not established yet, but the LogFollower is
connecting yet again (which causes a vicious cycle and makes the
internet connection even worse).

This is probably the root cause for the issue described in #3586.

With this bug fixed, I am 100% certain this PR Fixes #3586

Co-authored-by: Thorarinn Sigurdsson <[email protected]>
co-authored-by: Steffen Neubauer <[email protected]>

* improvement(k8s): cap age of logs on retry attempt in `garden logs`

When streaming logs from the k8s api using `garden logs`, we do not want
to stream old log messages as the user might have been disconnected for
a long time (e.g. when the laptop went to sleep)

Co-authored-by: Eyþór Magnússon <[email protected]>
co-authored-by: Steffen Neubauer <[email protected]>

---------

Co-authored-by: Steffen Neubauer <[email protected]>
Co-authored-by: thsig <[email protected]>
Co-authored-by: Eyþór Magnússon <[email protected]>
@mcsteele8

@eysi09 Do you know when the next garden release that includes this change will be?

@stefreak (Member)

@mcsteele8 @mattpolzin @to266 hey everyone, I just wanted to let you know that this has been released as part of 0.12.53

@mattpolzin (Contributor, Author)

Thanks for putting all of that work into fixing this!

@stefreak (Member)

Honestly, it was a pleasure :) Thank you so much for the high quality report on this, and your patience.
