Unexpected upload network traffic while tailing K8s logs #3586

Closed

mattpolzin opened this issue Jan 25, 2023 · 10 comments · Fixed by #3730

@mattpolzin (Contributor)

mattpolzin commented Jan 25, 2023

Bug

Current Behavior

While running tasks or building new Docker images, upload bandwidth spikes to 1 Mbps (the maximum for my slow internet connection) and stays maxed out for the duration of the task execution, usually for up to a minute after the task container has "completed" as seen via k9s. The Garden CLI does not report that the task has finished when it actually finishes; as noted, it will sometimes report success or even a failure a minute after the task has finished successfully in the remote container, apparently because it is bogged down by the network requests it is making. This is not the traffic that sends build context to the cluster; that traffic stays closer to 300 KB/s upload. As a side effect of bandwidth being maxed out, I also frequently see failed connections, which results in Garden reporting a failure to me even though things were working fine in the cluster; the problem was merely communication with the cluster.

I was able to confirm that for some reason this upload traffic is entirely related to tailing logs by commenting out the following code in a local build of Garden:

```diff
diff --git a/core/src/plugins/kubernetes/logs.ts b/core/src/plugins/kubernetes/logs.ts
index 34e235e17..a32ed241b 100644
--- a/core/src/plugins/kubernetes/logs.ts
+++ b/core/src/plugins/kubernetes/logs.ts
@@ -230,9 +230,9 @@ export class K8sLogFollower<T> {
   public async followLogs(opts: LogOpts) {
     await this.createConnections(opts)
 
-    this.intervalId = setInterval(async () => {
-      await this.createConnections(opts)
-    }, this.retryIntervalMs)
+//    this.intervalId = setInterval(async () => {
+//      await this.createConnections(opts)
+//    }, this.retryIntervalMs)
 
     return new Promise((resolve, _reject) => {
       this.resolve = resolve
```

When the above code is commented out, upload bandwidth stays near zero during task execution instead of constantly maxing out at 1 Mbps. The frequent connection failures also go away, because network contention is no longer an issue.
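
For illustration, here is a minimal sketch (made-up names, not Garden's actual code) of a retry loop that cannot pile up connection attempts, because the next attempt is only scheduled after the previous one has settled:

```typescript
// Illustrative sketch only, not Garden's implementation: serialize reconnect
// attempts so a slow attempt can never overlap with the next one.
async function followWithSerializedRetries(
  createConnections: () => Promise<void>,
  retryIntervalMs: number,
  shouldStop: () => boolean
): Promise<void> {
  while (!shouldStop()) {
    try {
      // Wait for the attempt to finish (or fail) before scheduling another.
      await createConnections()
    } catch {
      // Swallow the error; the loop retries after the interval.
    }
    await new Promise((resolve) => setTimeout(resolve, retryIntervalMs))
  }
}
```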

Expected behavior

I expect that running a task that is entirely contained in the cluster (no data needs to be transmitted for the operation of the task once the Docker image has been built) should result in very little upload bandwidth usage from the Garden process.

Workaround

Build a custom version of Garden with log tailing commented out. This is what I am working with currently because the experience is so drastically better.

Your environment

  • OS: macOS 13.0.1
  • How I'm running Kubernetes: Azure

garden version: 0.12.48

@to266 (Contributor)

to266 commented Jan 31, 2023

Oh, we've also observed that!

@mcsteele8

This is also a big blocker for me. I have seen this same issue where a task has completed but the garden-cli does not proceed because it is stuck waiting for a response from the remote resource. A big uptick in network consumption is usually involved and blocks other applications from working. Example: Slack, Chrome, and k9s all stop working until the garden cmd is killed.

@eysi09 (Collaborator)

eysi09 commented Feb 2, 2023

Thanks for flagging this!

We are aware of it, and addressing this is a top priority.

eysi09 added the priority:high (High priority issue or feature) label on Feb 2, 2023
thsig added a commit that referenced this issue Feb 9, 2023
Fixes #3586.

We had previously used a very short retry interval for the logs follower
when running one-off pods (e.g. for tests and tasks for our
Kubernetes-based module types).

This was fixed by:

* Lengthening the retry interval from 10 milliseconds to 4 seconds.

* Fetching and streaming the last several seconds of pod logs before
  closing the logs follower, in case any logs were missing (or in case
  the logs follower hadn't had time to connect to the runner pod before
  it finished execution).

* Replacing the old, buffer-based deduplication logic with a simpler,
  more lightweight approach based on comparing the last streamed entry's
  timestamp and message to the entries considered for deduplication.
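
For illustration, the deduplication approach described in this commit could look roughly like the following sketch (hypothetical names and types, not the actual Garden code): keep only entries that are newer than the last streamed entry, and for entries with the same timestamp, keep them only if the message differs.

```typescript
interface LogEntry {
  timestamp: Date
  message: string
}

// Hypothetical sketch of timestamp+message based deduplication: drop anything
// older than the last streamed entry, and drop same-timestamp entries whose
// message matches the last one streamed.
function dedupeEntries(candidates: LogEntry[], lastStreamed?: LogEntry): LogEntry[] {
  if (!lastStreamed) {
    return candidates
  }
  return candidates.filter((entry) => {
    if (entry.timestamp.getTime() > lastStreamed.timestamp.getTime()) {
      return true // strictly newer entries are always kept
    }
    if (entry.timestamp.getTime() < lastStreamed.timestamp.getTime()) {
      return false // older entries were already streamed
    }
    // Same timestamp: only keep the entry if its message differs.
    return entry.message !== lastStreamed.message
  })
}
```
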
stefreak added a commit that referenced this issue Feb 22, 2023
I noticed the following log message when I increased latency & packet
loss in Network Link Conditioner:

```
[silly] <Not connected to container vault in Pod vault-0. Connection status is connecting>
```

This means the connection is not established yet, but the LogFollower is
connecting yet again (which causes a vicious cycle and makes the
internet connection even worse).

This is probably the root cause for the issue described in #3586.

With this bug fixed, I am 100% certain this PR Fixes #3586
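
A rough sketch of the guard described in this commit (illustrative names, not the actual LogFollower implementation): only containers whose connection is closed are redialed, never ones that are still connecting or already connected.

```typescript
type ConnectionStatus = "connecting" | "connected" | "closed"

interface ContainerConnection {
  pod: string
  container: string
  status: ConnectionStatus
}

// Illustrative sketch: skip containers whose connection is already being
// established or is healthy, so a slow handshake never triggers extra dials.
function connectionsNeedingRetry(connections: ContainerConnection[]): ContainerConnection[] {
  return connections.filter((conn) => conn.status === "closed")
}
```
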
stefreak added a commit that referenced this issue Mar 7, 2023
* fix(k8s): more stable & performant log streaming 

Fixes an issue with very high upload bandwidth use when running Kubernetes based tests/tasks that produce a lot of log output.

co-written by @stefreak and @thsig

- Improved connection management and pod lifecycle logic, including more robust connection timeout enforcement.
- Removed keepalive logic, since it doesn't work on all operating systems.
- Improved deduplication logic to generate fewer false positives (and eliminate false-negatives).
- Use sinceTime when fetching logs on retry to make sure we don't fetch any unnecessary logs.
- When a runner pod terminates, we make sure to wait until the final logs have been fetched.
- Default to using the tail option in conjunction with a "max log lines in memory" setting instead of limitBytes to avoid clipping / incomplete log lines while also avoiding the loading of too much log data into memory.
- Only start one connection attempt at a time, to prevent multiple connections to the same container at once.
- Make sure that we only call createConnections again once the previous call has finished, so that there is only one concurrent instance of the method running per LogFollower at a time.

Fixes #3586.

co-authored-by: thsig <[email protected]>
co-authored-by: Steffen Neubauer <[email protected]>

* fix(k8s): make sure LogFollower only connects once

I noticed the following log message when I increased latency & packet
loss in Network Link Conditioner:

```
[silly] <Not connected to container vault in Pod vault-0. Connection status is connecting>
```

This means the connection is not established yet, but the LogFollower is
connecting yet again (which causes a vicious cycle and makes the
internet connection even worse).

This is probably the root cause for the issue described in #3586.

With this bug fixed, I am 100% certain this PR Fixes #3586

Co-authored-by: Thorarinn Sigurdsson <[email protected]>
co-authored-by: Steffen Neubauer <[email protected]>

* improvement(k8s): cap age of logs on retry attempt in `garden logs`

When streaming logs from the k8s api using `garden logs`, we do not want
to stream old log messages as the user might have been disconnected for
a long time (e.g. when the laptop went to sleep)

Co-authored-by: Eyþór Magnússon <[email protected]>
co-authored-by: Steffen Neubauer <[email protected]>

---------

Co-authored-by: Steffen Neubauer <[email protected]>
Co-authored-by: thsig <[email protected]>
Co-authored-by: Eyþór Magnússon <[email protected]>
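
To illustrate the `sinceTime` and tail points in the commit message above, here is a rough sketch (hypothetical helper, not Garden's actual code) of deriving the log-fetch options for a retry from the last entry already streamed: resume just after that timestamp, and bound memory by line count rather than limitBytes so log lines are never clipped.

```typescript
interface RetryLogOptions {
  sinceTime?: string // RFC 3339 timestamp understood by the Kubernetes log API
  tailLines: number
}

// Hypothetical sketch: on a retry, only request logs newer than what was
// already streamed, and bound memory by line count instead of byte count.
function retryLogOptions(lastStreamedAt: Date | undefined, maxLinesInMemory: number): RetryLogOptions {
  return {
    sinceTime: lastStreamedAt?.toISOString(),
    tailLines: maxLinesInMemory,
  }
}
```
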
@stefreak (Member)

stefreak commented Mar 7, 2023

@mattpolzin @to266 @mcsteele8 the fix has now landed in main! 🥳 Sorry that it took so long; we wanted to get this one right.

Can you run garden self-update edge and see if it works with the latest edge version?

@mattpolzin (Contributor, Author)

@stefreak in my very preliminary testing, this makes all the difference!

@stefreak (Member)

stefreak commented Mar 7, 2023

Awesome, glad to hear that!

thsig added a commit that referenced this issue Mar 8, 2023
* fix(k8s): more stable & performant log streaming

Fixes an issue with very high upload bandwidth use when running Kubernetes based tests/tasks that produce a lot of log output.

co-written by @stefreak and @thsig

- Improved connection management and pod lifecycle logic, including more robust connection timeout enforcement.
- Removed keepalive logic, since it doesn't work on all operating systems.
- Improved deduplication logic to generate fewer false positives (and eliminate false-negatives).
- Use sinceTime when fetching logs on retry to make sure we don't fetch any unnecessary logs.
- When a runner pod terminates, we make sure to wait until the final logs have been fetched.
- Default to using the tail option in conjunction with a "max log lines in memory" setting instead of limitBytes to avoid clipping / incomplete log lines while also avoiding the loading of too much log data into memory.
- Only start one connection attempt at a time, to prevent multiple connections to the same container at once.
- Make sure that we only call createConnections once it has finished, so that there is only one concurrent instance of the method running per LogFollower at a time.

Fixes #3586.

co-authored-by: thsig <[email protected]>
co-authored-by: Steffen Neubauer <[email protected]>

* fix(k8s): make sure LogFollower only connects once

I noticed the following log message when I increased latency & packet
loss in Network Link Conditioner:

```
[silly] <Not connected to container vault in Pod vault-0. Connection status is connecting>
```

This means the connection is not established yet, but the LogFollower is
connecting yet again (which causes a vicious cycle and makes the
internet connection even worse).

This is probably the root cause for the issue described in #3586.

With this bug fixed, I am 100% certain this PR Fixes #3586

Co-authored-by: Thorarinn Sigurdsson <[email protected]>
co-authored-by: Steffen Neubauer <[email protected]>

* improvement(k8s): cap age of logs on retry attempt in `garden logs`

When streaming logs from the k8s api using `garden logs`, we do not want
to stream old log messages as the user might have been disconnected for
a long time (e.g. when the laptop went to sleep)

Co-authored-by: Eyþór Magnússon <[email protected]>
co-authored-by: Steffen Neubauer <[email protected]>

---------

Co-authored-by: Steffen Neubauer <[email protected]>
Co-authored-by: thsig <[email protected]>
Co-authored-by: Eyþór Magnússon <[email protected]>
edvald pushed a commit that referenced this issue Mar 8, 2023
* fix(k8s): more stable & performant log streaming

Fixes an issue with very high upload bandwidth use when running Kubernetes based tests/tasks that produce a lot of log output.

co-written by @stefreak and @thsig

- Improved connection management and pod lifecycle logic, including more robust connection timeout enforcement.
- Removed keepalive logic, since it doesn't work on all operating systems.
- Improved deduplication logic to generate fewer false positives (and eliminate false-negatives).
- Use sinceTime when fetching logs on retry to make sure we don't fetch any unnecessary logs.
- When a runner pod terminates, we make sure to wait until the final logs have been fetched.
- Default to using the tail option in conjunction with a "max log lines in memory" setting instead of limitBytes to avoid clipping / incomplete log lines while also avoiding the loading of too much log data into memory.
- Only start one connection attempt at a time, to prevent multiple connections to the same container at once.
- Make sure that we only call createConnections once it has finished, so that there is only one concurrent instance of the method running per LogFollower at a time.

Fixes #3586.

co-authored-by: thsig <[email protected]>
co-authored-by: Steffen Neubauer <[email protected]>

* fix(k8s): make sure LogFollower only connects once

I noticed the following log message when I increased latency & packet
loss in Network Link Conditioner:

```
[silly] <Not connected to container vault in Pod vault-0. Connection status is connecting>
```

This means the connection is not established yet, but the LogFollower is
connecting yet again (which causes a vicious cycle and makes the
internet connection even worse).

This is probably the root cause for the issue described in #3586.

With this bug fixed, I am 100% certain this PR Fixes #3586

Co-authored-by: Thorarinn Sigurdsson <[email protected]>
co-authored-by: Steffen Neubauer <[email protected]>

* improvement(k8s): cap age of logs on retry attempt in `garden logs`

When streaming logs from the k8s api using `garden logs`, we do not want
to stream old log messages as the user might have been disconnected for
a long time (e.g. when the laptop went to sleep)

Co-authored-by: Eyþór Magnússon <[email protected]>
co-authored-by: Steffen Neubauer <[email protected]>

---------

Co-authored-by: Steffen Neubauer <[email protected]>
Co-authored-by: thsig <[email protected]>
Co-authored-by: Eyþór Magnússon <[email protected]>
@mcsteele8

@eysi09 Do you know when the next garden release that includes this change will be?

@stefreak (Member)

@mcsteele8 @mattpolzin @to266 hey everyone, I just wanted to let you know that this has been released as part of 0.12.53

@mattpolzin (Contributor, Author)

Thanks for putting all of that work into fixing this!

@stefreak (Member)

Honestly, it was a pleasure :) Thank you so much for the high quality report on this, and your patience.
