wasm: fix network leak #13836

Merged 6 commits on Nov 19, 2020.
6 changes: 1 addition & 5 deletions source/extensions/common/wasm/context.cc
@@ -858,11 +858,7 @@ BufferInterface* Context::getBuffer(WasmBufferType type) {
 void Context::onDownstreamConnectionClose(CloseType close_type) {
   ContextBase::onDownstreamConnectionClose(close_type);
   downstream_closed_ = true;
-  // Call close on TCP connection, if upstream connection closed or there was a failure seen in
-  // this connection.
-  if (upstream_closed_ || getRequestStreamInfo()->hasAnyResponseFlag()) {
-    onCloseTCP();
-  }
+  onCloseTCP();
PiotrSikora (Contributor):
This whole change could basically be minimized to this (i.e., on downstream close, close the TCP context).

I don't think we can remove the upstream close event, since that would be a breaking change, nor do I think we even need to remove it in order to fix the issue at hand.

kyessenov (Contributor Author), Oct 30, 2020:
  1. Who exactly are we breaking? I don't think anyone relies on this callback, or they would have complained about it before me.

  2. What's the difference between upstream close and downstream local close? What does upstream close mean in the case of a direct response?

PiotrSikora (Contributor):
  1. I don't know; we don't have an exhaustive list of everybody using Proxy-Wasm extensions. Notably, this seems to be broken only when using clusters with a TLS transport socket (see #13856: "onUpstreamData(end_stream=true) is never raised when using cluster with TLS transport socket").

  2. I think you're confusing downstream/upstream with remote/local close events:

    • downstream local close means that the downstream connection was closed by Envoy,
    • downstream remote close means that the downstream connection was closed by the client,
    • upstream local close means that the upstream connection was closed by Envoy,
    • upstream remote close means that the upstream connection was closed by the backend.

In any case, this bug is not unique to Wasm, and I think the proper fix should emit the upstream connection close event (again, see #13856).
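
A minimal sketch of the four cases above, for reference (ConnectionEvent here mirrors Envoy's Network::ConnectionEvent values, but describeClose is a hypothetical helper written for this example, not an Envoy or Proxy-Wasm API):

#include <cstdio>

// Stand-in for the two Envoy connection close events discussed above.
enum class ConnectionEvent { RemoteClose, LocalClose };

// Hypothetical helper: classify a close by which connection it occurred on
// (downstream vs. upstream) and which side initiated it.
void describeClose(bool is_upstream, ConnectionEvent event) {
  if (!is_upstream && event == ConnectionEvent::LocalClose) {
    std::puts("downstream local close: Envoy closed the client connection");
  } else if (!is_upstream && event == ConnectionEvent::RemoteClose) {
    std::puts("downstream remote close: the client closed the connection");
  } else if (is_upstream && event == ConnectionEvent::LocalClose) {
    std::puts("upstream local close: Envoy closed the backend connection");
  } else {
    std::puts("upstream remote close: the backend closed the connection");
  }
}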

PiotrSikora (Contributor):

OK, #13856 is not the root cause of #13806, but it results in the same leak.

For #13806, the issue is that the upstream close event won't be triggered if the connection to the upstream was never established (e.g. connection timeout, connection refused). But once #13856 is fixed, this event is always triggered whenever the connection to the upstream was established, so I don't see the point in removing it.

Like I said originally, this PR should be reduced to "on downstream close event, destroy context" (ideally, we should keep waiting for the upstream close if we ever transmitted data in either direction), but without removing the upstream close event.
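
A sketch of that suggested guard (illustrative only, not the merged change: downstream_closed_, upstream_closed_, and onCloseTCP() mirror the diff above, while saw_any_data_ is an invented flag standing in for "we ever transmitted data in either direction"):

// Illustrative sketch, not Envoy's actual Context class.
struct TcpContextSketch {
  bool downstream_closed_ = false;
  bool upstream_closed_ = false;
  bool saw_any_data_ = false;  // hypothetical: set from the data callbacks

  void onCloseTCP() { /* destroy the Wasm stream context */ }

  void onDownstreamConnectionClose() {
    downstream_closed_ = true;
    // If no data was ever exchanged, the upstream connection may never have
    // been established (connection timeout or refusal), so no upstream close
    // event will arrive; destroy the context now to avoid the leak.
    // Otherwise keep waiting for the upstream close event.
    if (upstream_closed_ || !saw_any_data_) {
      onCloseTCP();
    }
  }
};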

kyessenov (Contributor Author):

I can reproduce this without TLS.

PiotrSikora (Contributor):

> I'm not sure how to easily debug this. The issue is that either the callback is not called or the response flags are empty. It's hard to detect absence.

Could you share the steps to reproduce, so that I can try to debug it?

> BTW, I'm still not convinced about "upstream data". What does that even mean in an upstream network filter? We're using network filters on clusters in Istio.

It might work by accident, but Proxy-Wasm plugins don't officially support "upstream network filters", and I suspect that some things will be broken because of the flipped direction and related checks. I'm also pretty sure there are no tests for that use case. Please open an issue if you want to use them as "upstream network filters".

kyessenov (Contributor Author), Nov 6, 2020:

  1. Add this diff:
diff --git a/extensions/stats/plugin.cc b/extensions/stats/plugin.cc
index a051b5d7..2e8dcd87 100644
--- a/extensions/stats/plugin.cc
+++ b/extensions/stats/plugin.cc
@@ -546,6 +546,7 @@ bool PluginRootContext::onDone() {
 }

 void PluginRootContext::onTick() {
+  LOG_INFO(absl::StrCat("connection_count: ", connection_count, " log_count: ", log_count));
   if (request_queue_.empty()) {
     return;
   }
diff --git a/extensions/stats/plugin.h b/extensions/stats/plugin.h
index 8472d6e8..fba522c2 100644
--- a/extensions/stats/plugin.h
+++ b/extensions/stats/plugin.h
@@ -260,6 +260,9 @@ class PluginRootContext : public RootContext {
                          ::Wasm::Common::RequestInfo* request_info);
   void deleteFromRequestQueue(uint32_t context_id);

+  int64_t log_count = 0;
+  int64_t connection_count = 0;
+
  protected:
   const std::vector<MetricTag>& defaultTags();
   const std::vector<MetricFactory>& defaultMetrics();
@@ -333,6 +336,8 @@ class PluginContext : public Context {
     if (request_info_.request_protocol == ::Wasm::Common::Protocol::TCP) {
       request_info_.tcp_connections_closed++;
     }
+
+    rootContext()->log_count++;
     rootContext()->report(request_info_, true);
   };

@@ -361,6 +366,7 @@ class PluginContext : public Context {
     request_info_.request_protocol = ::Wasm::Common::Protocol::TCP;
     request_info_.tcp_connections_opened++;
     rootContext()->addToRequestQueue(id(), &request_info_);
+    rootContext()->connection_count++;
     return FilterStatus::Continue;
   }
  2. Start the server: fortio server.
  3. Start Envoy with the above config, concurrency 1.
  4. Generate load: fortio load -qps 10000 -c 100 -t 10s localhost:8001
  5. Terminate the server (^C), restart it, and keep doing so until the load finishes.
  6. Observe mismatched counts.

kyessenov (Contributor Author):

Filed #13929.

PiotrSikora (Contributor):

Thanks for the instructions; I was able to reproduce the issue. There are mid-stream I/O errors that are silently dropped in transport sockets (see #13939). Once the core issue is fixed (my attempt is in #13941), the counters always match.

kyessenov (Contributor Author):

I don't have high confidence that there isn't some other issue lurking in error propagation. If this happened under a trivial config, it's very likely that a more complex config can cause the same leak. We should not assume that upstream connection events are correctly emitted.

 }
 
 void Context::onUpstreamConnectionClose(CloseType close_type) {
4 changes: 0 additions & 4 deletions test/extensions/filters/network/wasm/wasm_filter_test.cc
@@ -163,15 +163,11 @@ TEST_P(WasmNetworkFilterTest, CloseStream) {
   // Create context.
   EXPECT_CALL(filter(), log_(spdlog::level::trace, Eq(absl::string_view("onNewConnection 2"))));
   EXPECT_EQ(Network::FilterStatus::Continue, filter().onNewConnection());
-  EXPECT_CALL(filter(),
-              log_(spdlog::level::trace, Eq(absl::string_view("onDownstreamConnectionClose 2 1"))));
-  EXPECT_CALL(filter(),
-              log_(spdlog::level::trace, Eq(absl::string_view("onDownstreamConnectionClose 2 2"))));
 
   filter().onEvent(static_cast<Network::ConnectionEvent>(9999)); // Does nothing.
   filter().onEvent(Network::ConnectionEvent::RemoteClose);
   filter().closeStream(proxy_wasm::WasmStreamType::Downstream);
   filter().closeStream(proxy_wasm::WasmStreamType::Upstream);
 }

 TEST_P(WasmNetworkFilterTest, SegvFailOpen) {