Unexpected upload network traffic while tailing K8s logs #3586
Comments
Oh, we've also observed that!

This is also a big blocker for me. I have seen this same issue where a task has completed but the garden-cli does not proceed because it is stuck waiting for a response from the remote resource. A big uptick in network consumption is almost always involved and blocks other applications from working. Example: Slack, Chrome, and k9s all stop working until the garden task finishes.

Thanks for flagging this! We are aware of it, and addressing it is a top priority.
Fixes #3586. We had previously used a very short retry interval for the logs follower when running one-off pods (e.g. for tests and tasks for our Kubernetes-based module types). This was fixed by:

* Lengthening the retry interval from 10 milliseconds to 4 seconds.
* Fetching and streaming the last several seconds of pod logs before closing the logs follower, in case any logs were missing (or in case the logs follower hadn't had time to connect to the runner pod before it finished execution).
* Replacing the old, buffer-based deduplication logic with a simpler, more lightweight approach based on comparing the last streamed entry's timestamp and message to the entries considered for deduplication.

Co-written with @stefreak.
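As a rough illustration of that deduplication approach, here is a minimal TypeScript sketch; the type and function names are hypothetical and not Garden's actual internals:

```typescript
interface LogEntry {
  timestamp: Date
  msg: string
}

// Keep only the entries that come after the last entry we already streamed.
// Entries older than the last streamed timestamp are dropped; entries with the
// exact same timestamp are only kept if their message differs.
function dedupeAgainstLastStreamed(entries: LogEntry[], lastStreamed?: LogEntry): LogEntry[] {
  if (!lastStreamed) {
    return entries
  }
  const lastTime = lastStreamed.timestamp.getTime()
  return entries.filter((entry) => {
    const t = entry.timestamp.getTime()
    if (t > lastTime) {
      return true
    }
    if (t < lastTime) {
      return false
    }
    return entry.msg !== lastStreamed.msg
  })
}

// Example: the first incoming entry was already streamed, so only the second is kept.
const last: LogEntry = { timestamp: new Date("2023-01-01T00:00:00Z"), msg: "starting" }
const incoming: LogEntry[] = [last, { timestamp: new Date("2023-01-01T00:00:01Z"), msg: "done" }]
console.log(dedupeAgainstLastStreamed(incoming, last)) // -> only the "done" entry
```

Compared to a buffer-based approach, this only needs to remember the single last streamed entry rather than a window of previous entries.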
I noticed the following log message when I increased latency & packet loss in Network Link Conditioner:

```
[silly] <Not connected to container vault in Pod vault-0. Connection status is connecting>
```

This means the connection is not established yet, but the LogFollower is connecting yet again (which causes a vicious cycle and makes the internet connection even worse). This is probably the root cause of the issue described in #3586. With this bug fixed, I am 100% certain this PR fixes #3586.
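For illustration, a minimal sketch of the kind of guard that avoids starting a second connection attempt while one is still in progress; the class and method names are hypothetical, not Garden's actual LogFollower implementation:

```typescript
type ConnectionStatus = "disconnected" | "connecting" | "connected"

class ContainerLogFollower {
  private status: ConnectionStatus = "disconnected"

  // Called on every retry tick. The guard ensures a new connection attempt is
  // only started when no attempt is currently in flight and we are not already
  // connected; otherwise each tick would open yet another connection.
  async ensureConnected(): Promise<void> {
    if (this.status !== "disconnected") {
      return
    }
    this.status = "connecting"
    try {
      await this.openLogStream()
      this.status = "connected"
    } catch (err) {
      // Reset the status so the next retry tick may attempt a fresh connection.
      this.status = "disconnected"
      throw err
    }
  }

  private async openLogStream(): Promise<void> {
    // Placeholder for the actual call that follows the container's logs.
  }
}
```

The key design point is that "connecting" is treated the same as "connected" for the purposes of the guard, so a slow handshake on a congested link never triggers additional parallel attempts.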
* **fix(k8s): more stable & performant log streaming**

  Fixes an issue with very high upload bandwidth use when running Kubernetes-based tests/tasks that produce a lot of log output. Co-written by @stefreak and @thsig.

  - Improved connection management and pod lifecycle logic, including more robust connection timeout enforcement.
  - Removed keepalive logic, since it doesn't work on all operating systems.
  - Improved deduplication logic to generate fewer false positives (and eliminate false negatives).
  - Use `sinceTime` when fetching logs on retry, to make sure we don't fetch any unnecessary logs.
  - When a runner pod terminates, make sure to wait until the final logs have been fetched.
  - Default to using the tail option in conjunction with a "max log lines in memory" setting instead of `limitBytes`, to avoid clipped/incomplete log lines while also avoiding loading too much log data into memory.
  - Only start one connection attempt at a time, to prevent multiple connections to the same container at once.
  - Make sure `createConnections` is only called again once the previous call has finished, so that there is only one concurrent instance of the method running per LogFollower at a time.

  Fixes #3586.

  co-authored-by: thsig <[email protected]>
  co-authored-by: Steffen Neubauer <[email protected]>

* **fix(k8s): make sure LogFollower only connects once**

  I noticed the following log message when I increased latency & packet loss in Network Link Conditioner:

  ```
  [silly] <Not connected to container vault in Pod vault-0. Connection status is connecting>
  ```

  This means the connection is not established yet, but the LogFollower is connecting yet again (which causes a vicious cycle and makes the internet connection even worse). This is probably the root cause of the issue described in #3586. With this bug fixed, I am 100% certain this PR fixes #3586.

  Co-authored-by: Thorarinn Sigurdsson <[email protected]>
  co-authored-by: Steffen Neubauer <[email protected]>

* **improvement(k8s): cap age of logs on retry attempt in `garden logs`**

  When streaming logs from the k8s API using `garden logs`, we do not want to stream old log messages, as the user might have been disconnected for a long time (e.g. when their laptop went to sleep).

  Co-authored-by: Eyþór Magnússon <[email protected]>
  co-authored-by: Steffen Neubauer <[email protected]>

---------

Co-authored-by: Steffen Neubauer <[email protected]>
Co-authored-by: thsig <[email protected]>
Co-authored-by: Eyþór Magnússon <[email protected]>
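To illustrate the `sinceTime`/tail-lines idea from the commit message above, here is a hedged TypeScript sketch. The parameter names follow the Kubernetes pod-log query parameters, but the constants and helper function are illustrative, not Garden's actual code:

```typescript
interface PodLogQueryParams {
  sinceTime?: string // RFC 3339 timestamp; only return entries newer than this
  tailLines?: number // cap on how many lines the API returns
}

const maxLogLinesInMemory = 2000 // illustrative value
const maxLogAgeOnRetryMs = 30 * 1000 // illustrative value

// Compute query parameters for a retry: only ask for logs newer than the last
// streamed entry (or a fixed cutoff when nothing has been streamed yet), and cap
// the line count with tailLines rather than limitBytes so lines are never clipped.
function paramsForRetry(lastStreamedAt?: Date): PodLogQueryParams {
  const cutoff = lastStreamedAt ?? new Date(Date.now() - maxLogAgeOnRetryMs)
  return {
    sinceTime: cutoff.toISOString(),
    tailLines: maxLogLinesInMemory,
  }
}

// Example: a retry after streaming an entry at 12:00:00Z only requests newer lines.
console.log(paramsForRetry(new Date("2023-01-01T12:00:00Z")))
```

This keeps each retry from re-downloading (and re-uploading request overhead for) the pod's entire log history, which matters most on slow or lossy connections.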
@mattpolzin @to266 @mcsteele8 the fix has now landed in main! 🥳 Sorry that it took so long; we wanted to get this one right. Can you run the latest edge release and let us know whether it resolves the issue for you?
@stefreak in my very preliminary testing, this makes all the difference!

Awesome, glad to hear that!
@eysi09 Do you know when the next Garden release that includes this change will be out?

@mcsteele8 @mattpolzin @to266 hey everyone, I just wanted to let you know that this has been released as part of 0.12.53.

Thanks for putting all of that work into fixing this!

Honestly, it was a pleasure :) Thank you so much for the high-quality report on this, and for your patience.
Bug
Current Behavior
While running tasks or building new Docker images, upload bandwidth spikes to 1 Mbps (the max for my slow internet connection) and stays maxed out for the duration of the task execution (and usually for up to a minute after the task container has "completed", as seen via `k9s`). The Garden CLI does not report that the task has finished when it actually finishes (as noted, it will sometimes succeed, or even report a failure, a minute after the task has finished successfully in the remote container), because it is apparently bogged down with the network requests it is making. This is not traffic related to sending build context to the cluster; that traffic stays closer to 300 KB/s upload. As a side effect of the bandwidth being maxed out, I also frequently see failed connections, which results in Garden reporting a failure to me even though things were working fine in the cluster (the problem was merely communication with the cluster).

I was able to confirm that, for some reason, this upload traffic is entirely related to tailing logs by commenting out the following code in a local build of Garden:
With the above code commented out, upload bandwidth stays at almost 0 during task execution rather than being constantly maxed out at 1 Mbps. The failed connections also go away, because network contention is no longer an issue.
Expected behavior
I expect that running a task which is entirely contained in the cluster (i.e. no data needs to be transmitted for the task's operation once the Docker image has been built) should result in very little upload network bandwidth usage from the Garden process.
Workaround
Build a custom version of Garden with log tailing commented out. This is what I am working with currently because the experience is so drastically better.
Your environment
`garden version`: 0.12.48