vhds: tsan failure in test/integration/vhds_integration_test #9784
Comments
@dmitri-d PTAL |
looking into it... |
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions. |
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions. |
ran into this again with a local tsan test. |
Please see my comment here: #9828 (comment) |
I've yet to see tsan fail when there wasn't a race. Looks like there might be a teardown issue? I'd suggest adding a test where we kill the request and tear down the filter when there's a route request in progress. I suspect that'll make the failure mode more clear - if not I'll agree to take a whack at it. |
So, apologies, but I think I actually noticed this issue months ago and meant to open an issue but never did. AFAICT the on demand filter does not handle |
Opened #11341, which adds a test to verify vhds updates when the client closes its connection. In an attempt to make things a bit more deterministic when the client connection is closed mid-flight, also added an implementation for

However, vhds as currently implemented already handles client connections that disappear before or during vhds update propagation: the filter chain of the active stream is paused in the on-demand update filter for the duration of the on-demand update request-response cycle. The filter chain is resumed once the on-demand vhds update has been propagated. This is done via a callback to the on-demand filter, which is stored in

If I'm reading the thread sanitizer report correctly, it's detecting a conflict between a weak_ptr and its shared_ptr: the shared_ptr (the callback stored in the on-demand filter) is reading its counter in its destructor, while the weak_ptr is updating the counter in its own destructor. |
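For readers following along, here is a minimal, self-contained sketch of the ownership pattern described in the comment above: one thread owns a callback through a shared_ptr, and another thread holds only a weak_ptr to it. It is not Envoy's code; `ResumeCallback`, `invokeIfAlive` and `raceOwnerAgainstObserver` are hypothetical names made up for illustration. The C++ standard guarantees that reference-count updates made through distinct shared_ptr and weak_ptr instances sharing one control block are safe without extra locking, so destroying the two handles concurrently, as below, should not be a data race in itself.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <memory>
#include <thread>
#include <utility>

// Hypothetical stand-in for the resume callback owned by the on-demand filter.
using ResumeCallback = std::function<void()>;

// Runs the callback only if its owner still holds it. lock() either bumps the
// strong count atomically or returns an empty pointer.
bool invokeIfAlive(const std::weak_ptr<ResumeCallback>& weak_cb) {
  if (std::shared_ptr<ResumeCallback> cb = weak_cb.lock()) {
    (*cb)();
    return true;
  }
  return false;
}

// Drops the owning shared_ptr on the calling thread while a worker thread
// uses and then drops its weak_ptr. Returns how often the callback ran.
int raceOwnerAgainstObserver() {
  std::atomic<int> calls{0};
  auto owner = std::make_shared<ResumeCallback>([&calls] { ++calls; });
  std::weak_ptr<ResumeCallback> observer = owner;

  std::thread worker([observer = std::move(observer)]() mutable {
    invokeIfAlive(observer);
    observer.reset();  // weak-count decrement on the worker thread
  });
  owner.reset();  // strong-count decrement on this thread
  worker.join();

  const int n = calls.load();
  assert(n == 0 || n == 1);  // the callback runs at most once
  return n;
}
```

Whether the callback runs depends on which thread gets there first, so `raceOwnerAgainstObserver()` may return either 0 or 1. Building a sketch like this with `-fsanitize=thread` is one way to check whether a given toolchain reports the control-block accesses.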
Slightly different race reported here (and with line-numbers). This time between the original shared_ptr and the weak_ptr destructor.
|
I would need to look in more detail but I strongly suspect what you are doing is not thread safe somehow. I highly doubt this is a false positive. I can look tomorrow. |
This is to help with #9784 Risk Level: Low (added a single test) Testing: a new unit test and integration test Docs Changes: n/a Release Notes: n/a Signed-off-by: Dmitri Dolguikh <[email protected]>
This is still failing, and is pretty easy to reproduce with |
This issue is due to the use of a libc++ that is not instrumented for thread sanitiser. I couldn't find much information about it, but there is this: https://reviews.llvm.org/D21609?id=61547. I followed the suggestion and built an instrumented version of libc++ (see here: https://github.com/google/sanitizers/wiki/MemorySanitizerLibcxxHowTo, replace |
Ping @alyssawilk |
We'll need a dedicated TSAN base image for this in https://github.com/envoyproxy/envoy-build-tools. This is basically the same issue that was blocking us having MSAN working in OSS Envoy (#918). |
I can look into building a dedicated image. |
Had a chat with @lizan, I'll add a tsan-specific build image |
We built an MSAN-instrumented libc++ in the same image: https://github.com/envoyproxy/envoy-build-tools/blob/master/build_container/build_container_common.sh#L88-L102 A TSAN-instrumented libc++ can be built in a similar way. |
@htuch we didn't enable it fully, to reduce CI pressure, though we can enable it if someone takes on the work (fixing MSAN failures in master) |
This fixes #9784 and re-enables vhds_integration_test Risk Level: Low, but will most likely increase memory usage Signed-off-by: Dmitri Dolguikh <[email protected]>
* hds: fix integration test flakes (#12214)
  Part of #12184
  Signed-off-by: Matt Klein <[email protected]>
  Signed-off-by: Antonio Vicente <[email protected]>
* Switch to a tsan-instrumented libc++ for tsan tests (#12134)
  This fixes #9784 and re-enables vhds_integration_test
  Risk Level: Low, but will most likely increase memory usage
  Signed-off-by: Dmitri Dolguikh <[email protected]>
  Signed-off-by: Antonio Vicente <[email protected]>
* test: shard hds_integration_test (#12482)
  This should avoid TSAN timeout flakes.
  Signed-off-by: Matt Klein <[email protected]>
  Signed-off-by: Antonio Vicente <[email protected]>
* test: shard http2_integration_test (#11939)
  This should mitigate TSAN timeout.
  Signed-off-by: Lizan Zhou <[email protected]>
  Signed-off-by: Antonio Vicente <[email protected]>
* test: fix http2_integration_test flake (#12450)
  Fixes #12442
  Signed-off-by: Matt Klein <[email protected]>
  Signed-off-by: Antonio Vicente <[email protected]>
* Kick CI
  Signed-off-by: Antonio Vicente <[email protected]>

Co-authored-by: Matt Klein <[email protected]>
Co-authored-by: Dmitri Dolguikh <[email protected]>
Co-authored-by: Lizan Zhou <[email protected]>
I am not too familiar with VHDS, but while testing #9774 locally I ran into this. It's possible the PR I was testing was responsible, but it seems likely to be unrelated, given the stack trace.