-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
wasm: fix network leak #13836
wasm: fix network leak #13836
Conversation
Signed-off-by: Kuat Yessenov <[email protected]>
if (downstream_closed_) { | ||
onCloseTCP(); | ||
} | ||
onCloseTCP(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This whole change could be basically minimized to this (i.e. on downstream close, close TCP context).
I don't think we can remove upstream close event, since it's a breaking change, or that we even need to remove it in order to fix the issue at hand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
Who exactly are we breaking? I don't think anyone relies on this callback or they would definitely complain about it before me.
-
What's the difference between upstream close and downstream local close? What does upstream close mean in case of direct response?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
I don't know, we don't have an exhaustive list of everybody using Proxy-Wasm extensions. Notably, this seems to be only broken when using clusters with TLS transport socket (see: onUpstreamData(end_stream=true) is never raised when using cluster with TLS transport socket #13856).
-
I think you're confusing downstream/upstream with remote/local close events:
- downstream local close means that the downstream connection was closed by Envoy,
- downstream remote close means that the downstream connection was closed by the client,
- upstream local close means that the upstream connection was closed by Envoy,
- upstream remote close means that the upstream connection was closed by the backend,
In any case, this bug is not unique to Wasm, and I think the proper fix should emit upstream connection close event (again, see: #13856).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, #13856 is not the root cause for #13806, but it results in the same leak.
For #13806 the issue is that the upstream close event won't be triggered if the connection to upstream was never established (e.g. connection timeout, connection refused), but once #13856 is fixed, then this event is always triggered if the connection to upstream was established, so I don't see the point in removing it.
Like I said originally, this PR should be reduced to "on downstream close event, destroy context" (ideally, we should keep waiting for upstream close if we ever transmitted data in either direction), but without removing the upstream close event.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can reproduce this without TLS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how to easily debug this. The issue is that either callback is not called or response flags are empty. It's hard to detect absence.
Could you share the steps to reproduce so that I can try to debug it?
BTW I'm still not convinced about "upstream data". What does that even mean in the upstream network filter? We're using network filters on clusters in istio.
It might work by accident, but Proxy-Wasm plugins don't officially support "upstream network filters", and I suspect that some things will be broken because of flipped direction and related checks. I'm also pretty sure there are no tests for that use case. Please open an issue if you want to use them as "upstream network filters".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Add this diff:
diff --git a/extensions/stats/plugin.cc b/extensions/stats/plugin.cc
index a051b5d7..2e8dcd87 100644
--- a/extensions/stats/plugin.cc
+++ b/extensions/stats/plugin.cc
@@ -546,6 +546,7 @@ bool PluginRootContext::onDone() {
}
void PluginRootContext::onTick() {
+ LOG_INFO(absl::StrCat("connection_count: ", connection_count, " log_count: ", log_count));
if (request_queue_.empty()) {
return;
}
diff --git a/extensions/stats/plugin.h b/extensions/stats/plugin.h
index 8472d6e8..fba522c2 100644
--- a/extensions/stats/plugin.h
+++ b/extensions/stats/plugin.h
@@ -260,6 +260,9 @@ class PluginRootContext : public RootContext {
::Wasm::Common::RequestInfo* request_info);
void deleteFromRequestQueue(uint32_t context_id);
+ int64_t log_count = 0;
+ int64_t connection_count = 0;
+
protected:
const std::vector<MetricTag>& defaultTags();
const std::vector<MetricFactory>& defaultMetrics();
@@ -333,6 +336,8 @@ class PluginContext : public Context {
if (request_info_.request_protocol == ::Wasm::Common::Protocol::TCP) {
request_info_.tcp_connections_closed++;
}
+
+ rootContext()->log_count++;
rootContext()->report(request_info_, true);
};
@@ -361,6 +366,7 @@ class PluginContext : public Context {
request_info_.request_protocol = ::Wasm::Common::Protocol::TCP;
request_info_.tcp_connections_opened++;
rootContext()->addToRequestQueue(id(), &request_info_);
+ rootContext()->connection_count++;
return FilterStatus::Continue;
}
- Start server:
fortio server
. - Start envoy with above config, concurrency 1.
- Generate load:
fortio load -qps 10000 -c 100 -t 10s localhost:8001
- Terminate ^C server, restart it, and keep doing it until load finishes.
- Observe mismatched counts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed #13929 .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have high confidence that there's some other issue lurking about error propagation. If this happened under trivial config, it's very likely that more complex config can cause the same leak. We should not expect that upstream connection events are correctly emitted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kyessenov could you reduce this PR to calling onCloseTCP()
on downstream connection close without waiting for upstream, but without removing all upstream connection close code? i.e. basically only removing this line:
if (upstream_closed_ || getRequestStreamInfo()->hasAnyResponseFlag()) {
This way, we can use this as a temporary fix until proper fix (upstream connection events) is merged.
Signed-off-by: Kuat Yessenov <[email protected]>
Signed-off-by: Kuat Yessenov <[email protected]>
@PiotrSikora how do I fix the rust test:
|
You need to remove calls to |
Signed-off-by: Kuat Yessenov <[email protected]>
/retest |
Retrying Azure Pipelines: |
You need to merge master to fix the CodeQL test on the CI. |
/retest |
Retrying Azure Pipelines: |
@lizan could you take a look? Thanks! |
Signed-off-by: Kuat Yessenov <[email protected]>
* build: update rules_rust to allow Rustc in RBE (envoyproxy#13595) Signed-off-by: Lizan Zhou <[email protected]> Signed-off-by: Piotr Sikora <[email protected]> * fix macos v8 build (envoyproxy#13572) Signed-off-by: Rama Chavali <[email protected]> * wasm: update proxy-wasm-cpp-host (envoyproxy#13606) The PR updates proxy-wasm-cpp-host dependency for enhancing the capability and fixing a bug in WASM extensions. The change consists of three PRs in proxy-wasm-cpp-host: 1. proxy-wasm/proxy-wasm-cpp-host#68 @PiotrSikora 2. proxy-wasm/proxy-wasm-cpp-host#65 @mathetake (me) 3. proxy-wasm/proxy-wasm-cpp-host#64 @mathetake (me) The code change can be found at proxy-wasm/proxy-wasm-cpp-host@49ed20e...c5658d3 . 1 & 2 enhance WASM capability, and 3 fixes a bug in situations where users share vm_id for multiple filters. For details, please take a look at these original PRs. Signed-off-by: mathetake <[email protected]> Signed-off-by: Piotr Sikora <[email protected]> * wasm: re-enable tests with precompiled modules. (envoyproxy#13583) Fixes envoyproxy#12335. Signed-off-by: Piotr Sikora <[email protected]> * wasm: flip the meaning of the "repository" in envoy_wasm_cc_binary(). (envoyproxy#13621) Change the meaning of the "repository" parameter to refer to an external Bazel repository, instead of using "@envoy" in targets that are included in the Envoy repository. This aligns with other envoy_* rules. Signed-off-by: Piotr Sikora <[email protected]> * build: support ppc64le with wasm (envoyproxy#13657) The build has only been tested with gn git sha 5da62d5 as recommended by ppc64 maintainers of the v8 runtime. Signed-off-by: Christopher M. Luciano <[email protected]> * wasm: remove no longer needed Emscripten metadata. (envoyproxy#13667) Signed-off-by: Piotr Sikora <[email protected]> * fix wasm compilation (envoyproxy#13765) Signed-off-by: Asra Ali <[email protected]> * wasm: strip only Custom Sections with precompiled Wasm modules. (envoyproxy#13775) Signed-off-by: Piotr Sikora <[email protected]> * build: don't build shared libraries for zlib and zlib-ng. (envoyproxy#13652) Signed-off-by: Piotr Sikora <[email protected]> * wasm: update V8 to v8.7.220.10. (envoyproxy#13568) Signed-off-by: Piotr Sikora <[email protected]> * build: exclude wee8/out from inputs (envoyproxy#13866) If you build without sandboxing for performance, the output files from this custom build genrule contained timestamps which caused it to rebuild every single build. Signed-off-by: Keith Smiley <[email protected]> * tls: fix detection of the upstream connection close event. (envoyproxy#13858) Fixes envoyproxy#13856. Signed-off-by: Piotr Sikora <[email protected]> * wasm: Force stop iteration after local response is sent (envoyproxy#13930) Resolves envoyproxy#13857 ref: -proxy-wasm/proxy-wasm-rust-sdk#44 -proxy-wasm/proxy-wasm-cpp-host#88 -proxy-wasm/proxy-wasm-cpp-host#93 Signed-off-by: mathetake <[email protected]> Signed-off-by: Piotr Sikora <[email protected]> * wasm: fix order of callbacks for paused requests. (envoyproxy#13840) Fixes proxy-wasm/proxy-wasm-rust-sdk#43. Signed-off-by: Piotr Sikora <[email protected]> * wasm: fix network leak (envoyproxy#13836) Signed-off-by: Kuat Yessenov <[email protected]> Co-authored-by: Lizan Zhou <[email protected]> Co-authored-by: Rama Chavali <[email protected]> Co-authored-by: Takeshi Yoneda <[email protected]> Co-authored-by: cmluciano <[email protected]> Co-authored-by: asraa <[email protected]> Co-authored-by: Keith Smiley <[email protected]> Co-authored-by: Takeshi Yoneda <[email protected]> Co-authored-by: Kuat <[email protected]>
Is this going to land in Istio 1.6 / Istio 1.7 as its fixing memory leak - istio/istio#24720. |
Signed-off-by: Kuat Yessenov <[email protected]>
Signed-off-by: Kuat Yessenov <[email protected]> Signed-off-by: Qin Qin <[email protected]>
@cpakulski could you backport this? /backport |
@PiotrSikora I am starting backport to 1.16. |
@PiotrSikora I think that this change cannot be backported to 1.16. wasm was merged into master one day after 1.16 release - on Oct 9 via #12546. |
@cpakulski Oops, good point, we need to backport this to istio/envoy release branches for Istio. Thanks for checking! |
Signed-off-by: Kuat Yessenov <[email protected]> Signed-off-by: Zsolt Varga <[email protected]>
Signed-off-by: Kuat Yessenov <[email protected]> Signed-off-by: Zsolt Varga <[email protected]>
Signed-off-by: Kuat Yessenov <[email protected]> Signed-off-by: Kuat <[email protected]>
* wasm: fix network leak (envoyproxy#13836) Signed-off-by: Kuat Yessenov <[email protected]> Signed-off-by: Kuat <[email protected]> * skip tcp integration test when tsan Signed-off-by: Pengyuan Bian <[email protected]> Co-authored-by: Kuat <[email protected]>
Signed-off-by: Kuat Yessenov [email protected]
Commit Message: Remove upstream closure callback to fix the network leak #13806. Confirmed by manually testing under load. I need help fixing the rust test since I don't know rust.
Risk Level:
Testing:
Docs Changes:
Release Notes:
Platform Specific Features: