Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wasm: fix network leak #13836

Merged
merged 6 commits into from
Nov 19, 2020
Merged

Conversation

kyessenov
Copy link
Contributor

@kyessenov kyessenov commented Oct 30, 2020

Signed-off-by: Kuat Yessenov [email protected]

Commit Message: Remove upstream closure callback to fix the network leak #13806. Confirmed by manually testing under load. I need help fixing the rust test since I don't know rust.
Risk Level:
Testing:
Docs Changes:
Release Notes:
Platform Specific Features:

Signed-off-by: Kuat Yessenov <[email protected]>
if (downstream_closed_) {
onCloseTCP();
}
onCloseTCP();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole change could be basically minimized to this (i.e. on downstream close, close TCP context).

I don't think we can remove upstream close event, since it's a breaking change, or that we even need to remove it in order to fix the issue at hand.

Copy link
Contributor Author

@kyessenov kyessenov Oct 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Who exactly are we breaking? I don't think anyone relies on this callback or they would definitely complain about it before me.

  2. What's the difference between upstream close and downstream local close? What does upstream close mean in case of direct response?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I don't know, we don't have an exhaustive list of everybody using Proxy-Wasm extensions. Notably, this seems to be only broken when using clusters with TLS transport socket (see: onUpstreamData(end_stream=true) is never raised when using cluster with TLS transport socket #13856).

  2. I think you're confusing downstream/upstream with remote/local close events:

    • downstream local close means that the downstream connection was closed by Envoy,
    • downstream remote close means that the downstream connection was closed by the client,
    • upstream local close means that the upstream connection was closed by Envoy,
    • upstream remote close means that the upstream connection was closed by the backend,

In any case, this bug is not unique to Wasm, and I think the proper fix should emit upstream connection close event (again, see: #13856).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, #13856 is not the root cause for #13806, but it results in the same leak.

For #13806 the issue is that the upstream close event won't be triggered if the connection to upstream was never established (e.g. connection timeout, connection refused), but once #13856 is fixed, then this event is always triggered if the connection to upstream was established, so I don't see the point in removing it.

Like I said originally, this PR should be reduced to "on downstream close event, destroy context" (ideally, we should keep waiting for upstream close if we ever transmitted data in either direction), but without removing the upstream close event.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can reproduce this without TLS.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how to easily debug this. The issue is that either callback is not called or response flags are empty. It's hard to detect absence.

Could you share the steps to reproduce so that I can try to debug it?

BTW I'm still not convinced about "upstream data". What does that even mean in the upstream network filter? We're using network filters on clusters in istio.

It might work by accident, but Proxy-Wasm plugins don't officially support "upstream network filters", and I suspect that some things will be broken because of flipped direction and related checks. I'm also pretty sure there are no tests for that use case. Please open an issue if you want to use them as "upstream network filters".

Copy link
Contributor Author

@kyessenov kyessenov Nov 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Add this diff:
diff --git a/extensions/stats/plugin.cc b/extensions/stats/plugin.cc
index a051b5d7..2e8dcd87 100644
--- a/extensions/stats/plugin.cc
+++ b/extensions/stats/plugin.cc
@@ -546,6 +546,7 @@ bool PluginRootContext::onDone() {
 }

 void PluginRootContext::onTick() {
+  LOG_INFO(absl::StrCat("connection_count: ", connection_count, " log_count: ", log_count));
   if (request_queue_.empty()) {
     return;
   }
diff --git a/extensions/stats/plugin.h b/extensions/stats/plugin.h
index 8472d6e8..fba522c2 100644
--- a/extensions/stats/plugin.h
+++ b/extensions/stats/plugin.h
@@ -260,6 +260,9 @@ class PluginRootContext : public RootContext {
                          ::Wasm::Common::RequestInfo* request_info);
   void deleteFromRequestQueue(uint32_t context_id);

+  int64_t log_count = 0;
+  int64_t connection_count = 0;
+
  protected:
   const std::vector<MetricTag>& defaultTags();
   const std::vector<MetricFactory>& defaultMetrics();
@@ -333,6 +336,8 @@ class PluginContext : public Context {
     if (request_info_.request_protocol == ::Wasm::Common::Protocol::TCP) {
       request_info_.tcp_connections_closed++;
     }
+
+    rootContext()->log_count++;
     rootContext()->report(request_info_, true);
   };

@@ -361,6 +366,7 @@ class PluginContext : public Context {
     request_info_.request_protocol = ::Wasm::Common::Protocol::TCP;
     request_info_.tcp_connections_opened++;
     rootContext()->addToRequestQueue(id(), &request_info_);
+    rootContext()->connection_count++;
     return FilterStatus::Continue;
   }
  1. Start server: fortio server.
  2. Start envoy with above config, concurrency 1.
  3. Generate load: fortio load -qps 10000 -c 100 -t 10s localhost:8001
  4. Terminate ^C server, restart it, and keep doing it until load finishes.
  5. Observe mismatched counts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #13929 .

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the instructions, I was able to reproduce the issue. There are mid-stream I/O errors that are silently dropped in transport sockets, see: #13939. Once the core issue is fixed (my attempt in #13941), the counters always match.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have high confidence that there's some other issue lurking about error propagation. If this happened under trivial config, it's very likely that more complex config can cause the same leak. We should not expect that upstream connection events are correctly emitted.

Copy link
Contributor

@PiotrSikora PiotrSikora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kyessenov could you reduce this PR to calling onCloseTCP() on downstream connection close without waiting for upstream, but without removing all upstream connection close code? i.e. basically only removing this line:

if (upstream_closed_ || getRequestStreamInfo()->hasAnyResponseFlag()) {

This way, we can use this as a temporary fix until proper fix (upstream connection events) is merged.

@kyessenov
Copy link
Contributor Author

@PiotrSikora how do I fix the rust test:

Unexpected mock function call - returning directly.
    Function call: log_(5, "panicked at 'invalid context_id', external/raze__proxy_wasm__0_1_2/src/dispatcher.rs:255:13")
Google Mock tried the following 2 expectations, but none matched:

@PiotrSikora
Copy link
Contributor

@PiotrSikora how do I fix the rust test:

Unexpected mock function call - returning directly.
    Function call: log_(5, "panicked at 'invalid context_id', external/raze__proxy_wasm__0_1_2/src/dispatcher.rs:255:13")
Google Mock tried the following 2 expectations, but none matched:

You need to remove calls to closeStream in test/extensions/filters/network/wasm/wasm_filter_test.cc in line 182 and 183, since with the changes in this PR, the in-VM context is destroyed upon receiving RemoteClose event in line 181, so those closeStream functions are now called for a non-existing context.

Signed-off-by: Kuat Yessenov <[email protected]>
@PiotrSikora
Copy link
Contributor

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Check envoy-presubmit didn't fail.

🐱

Caused by: a #13836 (comment) was created by @PiotrSikora.

see: more, trace.

@PiotrSikora
Copy link
Contributor

You need to merge master to fix the CodeQL test on the CI.

@PiotrSikora
Copy link
Contributor

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #13836 (comment) was created by @PiotrSikora.

see: more, trace.

@PiotrSikora
Copy link
Contributor

@lizan could you take a look? Thanks!

@mattklein123 mattklein123 merged commit cc024ae into envoyproxy:master Nov 19, 2020
PiotrSikora pushed a commit to PiotrSikora/envoy that referenced this pull request Nov 20, 2020
istio-testing pushed a commit to istio/envoy that referenced this pull request Nov 20, 2020
* build: update rules_rust to allow Rustc in RBE (envoyproxy#13595)

Signed-off-by: Lizan Zhou <[email protected]>
Signed-off-by: Piotr Sikora <[email protected]>

* fix macos v8 build (envoyproxy#13572)

Signed-off-by: Rama Chavali <[email protected]>

* wasm: update proxy-wasm-cpp-host (envoyproxy#13606)

The PR updates proxy-wasm-cpp-host dependency for enhancing the capability and fixing a bug in WASM extensions.

The change consists of three PRs in proxy-wasm-cpp-host:

1. proxy-wasm/proxy-wasm-cpp-host#68 @PiotrSikora
2. proxy-wasm/proxy-wasm-cpp-host#65 @mathetake (me)
3. proxy-wasm/proxy-wasm-cpp-host#64 @mathetake (me)

The code change can be found at proxy-wasm/proxy-wasm-cpp-host@49ed20e...c5658d3 .

1 & 2 enhance WASM capability, and 3 fixes a bug in situations where users share vm_id for multiple filters. For details, please take a look at these original PRs.

Signed-off-by: mathetake <[email protected]>
Signed-off-by: Piotr Sikora <[email protected]>

* wasm: re-enable tests with precompiled modules. (envoyproxy#13583)

Fixes envoyproxy#12335.

Signed-off-by: Piotr Sikora <[email protected]>

* wasm: flip the meaning of the "repository" in envoy_wasm_cc_binary(). (envoyproxy#13621)

Change the meaning of the "repository" parameter to refer to an external
Bazel repository, instead of using "@envoy" in targets that are included
in the Envoy repository.

This aligns with other envoy_* rules.

Signed-off-by: Piotr Sikora <[email protected]>

* build: support ppc64le with wasm (envoyproxy#13657)

The build has only been tested with gn git sha 5da62d5 as
recommended by ppc64 maintainers of the v8 runtime.

Signed-off-by: Christopher M. Luciano <[email protected]>

* wasm: remove no longer needed Emscripten metadata. (envoyproxy#13667)

Signed-off-by: Piotr Sikora <[email protected]>

* fix wasm compilation (envoyproxy#13765)

Signed-off-by: Asra Ali <[email protected]>

* wasm: strip only Custom Sections with precompiled Wasm modules. (envoyproxy#13775)

Signed-off-by: Piotr Sikora <[email protected]>

* build: don't build shared libraries for zlib and zlib-ng. (envoyproxy#13652)

Signed-off-by: Piotr Sikora <[email protected]>

* wasm: update V8 to v8.7.220.10. (envoyproxy#13568)

Signed-off-by: Piotr Sikora <[email protected]>

* build: exclude wee8/out from inputs (envoyproxy#13866)

If you build without sandboxing for performance, the output files from
this custom build genrule contained timestamps which caused it to
rebuild every single build.

Signed-off-by: Keith Smiley <[email protected]>

* tls: fix detection of the upstream connection close event. (envoyproxy#13858)

Fixes envoyproxy#13856.

Signed-off-by: Piotr Sikora <[email protected]>

* wasm: Force stop iteration after local response is sent (envoyproxy#13930)

Resolves envoyproxy#13857

ref:
-proxy-wasm/proxy-wasm-rust-sdk#44
-proxy-wasm/proxy-wasm-cpp-host#88
-proxy-wasm/proxy-wasm-cpp-host#93

Signed-off-by: mathetake <[email protected]>
Signed-off-by: Piotr Sikora <[email protected]>

* wasm: fix order of callbacks for paused requests. (envoyproxy#13840)

Fixes proxy-wasm/proxy-wasm-rust-sdk#43.

Signed-off-by: Piotr Sikora <[email protected]>

* wasm: fix network leak (envoyproxy#13836)

Signed-off-by: Kuat Yessenov <[email protected]>

Co-authored-by: Lizan Zhou <[email protected]>
Co-authored-by: Rama Chavali <[email protected]>
Co-authored-by: Takeshi Yoneda <[email protected]>
Co-authored-by: cmluciano <[email protected]>
Co-authored-by: asraa <[email protected]>
Co-authored-by: Keith Smiley <[email protected]>
Co-authored-by: Takeshi Yoneda <[email protected]>
Co-authored-by: Kuat <[email protected]>
@Kyslik
Copy link

Kyslik commented Nov 23, 2020

Is this going to land in Istio 1.6 / Istio 1.7 as its fixing memory leak - istio/istio#24720.

andreyprezotto pushed a commit to andreyprezotto/envoy that referenced this pull request Nov 24, 2020
qqustc pushed a commit to qqustc/envoy that referenced this pull request Nov 24, 2020
Signed-off-by: Kuat Yessenov <[email protected]>
Signed-off-by: Qin Qin <[email protected]>
@PiotrSikora
Copy link
Contributor

@cpakulski could you backport this?

/backport

@repokitteh-read-only repokitteh-read-only bot added the backport/review Request to backport to stable releases label Nov 30, 2020
@cpakulski
Copy link
Contributor

@PiotrSikora I am starting backport to 1.16.

@cpakulski cpakulski added backport/approved Approved backports to stable releases and removed backport/review Request to backport to stable releases labels Dec 1, 2020
@cpakulski
Copy link
Contributor

@PiotrSikora I think that this change cannot be backported to 1.16.
1.16 was cut off on Oct 8 via #13438.

wasm was merged into master one day after 1.16 release - on Oct 9 via #12546.

@PiotrSikora
Copy link
Contributor

@cpakulski Oops, good point, we need to backport this to istio/envoy release branches for Istio. Thanks for checking!

waynz0r pushed a commit to banzaicloud/envoy that referenced this pull request Dec 17, 2020
Signed-off-by: Kuat Yessenov <[email protected]>
Signed-off-by: Zsolt Varga <[email protected]>
waynz0r pushed a commit to banzaicloud/envoy that referenced this pull request Dec 18, 2020
Signed-off-by: Kuat Yessenov <[email protected]>
Signed-off-by: Zsolt Varga <[email protected]>
bianpengyuan pushed a commit to bianpengyuan/envoy that referenced this pull request Jan 28, 2021
Signed-off-by: Kuat Yessenov <[email protected]>

Signed-off-by: Kuat <[email protected]>
istio-testing pushed a commit to istio/envoy that referenced this pull request Jan 28, 2021
* wasm: fix network leak (envoyproxy#13836)

Signed-off-by: Kuat Yessenov <[email protected]>

Signed-off-by: Kuat <[email protected]>

* skip tcp integration test when tsan

Signed-off-by: Pengyuan Bian <[email protected]>

Co-authored-by: Kuat <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/approved Approved backports to stable releases
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants