upstream: fix subset_lb crash when host configured with no metadata #15171
Conversation
https://oss-fuzz.com/testcase-detail/5135200453525504 The problem is that when executing a fuzz test case with empty host metadata, Envoy crashed at subset_lb.cc:rebuildSingle(), line 131. The root cause is that the current Envoy code assumes host->metadata is valid and uses it directly without a null check. If the metadata is actually null, which is a valid configuration, Envoy crashes. The fix is to add an if (metadata != nullptr) check before accessing the metadata. GTEST code is added to reproduce the issue and verify the fix. The fix is also verified by running the oss-fuzz test with that special test case input file. Signed-off-by: Yanjun Xiang <[email protected]>
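A minimal sketch of the guard described above, with paraphrased names (the exact loop structure and accessors in subset_lb.cc may differ):
// Sketch only: skip hosts that were configured without metadata instead of
// dereferencing a null pointer.
for (const auto& host : hosts) {
  const auto& metadata = host->metadata(); // may be null when no metadata is configured
  if (metadata != nullptr) {
    // ... existing per-host subset bookkeeping that reads the metadata ...
  }
}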
/assign @asraa @adisuissa
Thanks @yanjunxiang-google!
Could you please make the issue title something like "upstream: fix subset_lb crash when host configured with no metadata"?
Also, could you please change the bug link to the monorail one? https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=30705. The detailed report eventually expires after some time. You can also commit the regression corpus entry as an additional regression test.
Thanks for working on this!
I've added a couple of comments.
Please update the PR first comment fields, such as risk level, etc.
@asraa: should the fuzz test-case be pushed as well?
Signed-off-by: Yanjun Xiang <[email protected]>
Done!
/retest
Retrying Azure Pipelines:
Reopen the PR.
/retest
Retrying Azure Pipelines:
Thanks for fixing this, LGTM!
{":authority", "host"}});
response->waitForEndStream();
EXPECT_TRUE(response->complete());
EXPECT_THAT(response->headers(), Http::HttpStatusIs("404"));
Do you know why this is giving a 404 instead of 503?
It might be configurable, but the default is 503:
enum ClusterNotFoundResponseCode {
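(For context, a rough illustration of what that enum controls; the names below are paraphrased rather than copied from Envoy's route configuration and should be treated as an assumption.)
// Illustrative sketch only: the route's cluster-not-found response code roughly
// selects between these HTTP statuses, with 503 as the default.
enum class ClusterNotFoundResponseCode { ServiceUnavailable, NotFound };

inline int toHttpStatus(ClusterNotFoundResponseCode code) {
  return code == ClusterNotFoundResponseCode::NotFound ? 404 : 503;
}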
The configuration for this test is below: the cluster and endpoints are configured. My understanding is that with this configuration the service is available, but the page is not found, hence 404 is returned.
static_resources:
  clusters:
  - name: cluster_0
    connect_timeout: 5s
    lb_subset_config:
      subset_selectors:
      - keys:
        - type
        single_host_per_subset: true
    load_assignment:
      cluster_name: cluster_0
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                port_value: 46423
        - endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                port_value: 40245
        - endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                port_value: 39563
        - endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                port_value: 37041
Actually, now that I think about it, I think we're probably hitting an NR (no route) here and not actually hitting the bad code. I think the test file sets up a requirement for the header x-type to be set to either a or b in order to actually route to the upstream. Can you take a look and make sure we're actually attempting to route via the subset lb?
Thanks! Let me take a look.
- The bad code in subset_lb.cc:SubsetLoadBalancer::rebuildSingle() is hit by a configuration update, not by traffic. This means that in the newly added test code, TEST_P(HttpSubsetLbIntegrationTest, SubsetLoadBalancerSingleHostPerSubsetNoMetadata), the initialize() call, which eventually calls SubsetLoadBalancer::rebuildSingle(), can trigger the crash. We reproduced the crash with this test and also verified the fix with it.
- The original test function, TEST_P(HttpSubsetLbIntegrationTest, SubsetLoadBalancer), sets up the config with metadata as
...
- lb_endpoints:
  - endpoint:
      address:
        socket_address:
          address: 127.0.0.1
          port_value: 34033
    metadata:
      filter_metadata:
        envoy.lb:
          type: a
  - endpoint:
      address:
        socket_address:
          address: 127.0.0.1
          port_value: 35437
    metadata:
      filter_metadata:
        envoy.lb:
          type: b
...
It also calls runTest() with type_a and type_b; in that test, Envoy is expected to route via the subset lb. However, the newly added test function has neither the type a/b metadata config nor the runTest() calls with the two different request types, so it probably won't route via the subset lb.
- In the newly added test function, the code after initialize() (creating an HTTP connection, sending an HTTP request, and verifying the response) is not specifically testing the code change in SubsetLoadBalancer::rebuildSingle(); it is a general-purpose check that just sends some traffic and verifies the response is as expected (see the reassembled test below).
Please let me know whether this addresses your question.
Thanks!
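For reference, here is the new regression test, reassembled from the diff fragments quoted in this review; the TEST_P wrapper and the config-modifier plumbing are paraphrased and may differ from the committed code:
TEST_P(HttpSubsetLbIntegrationTest, SubsetLoadBalancerSingleHostPerSubsetNoMetadata) {
  config_helper_.addConfigModifier([this](envoy::config::bootstrap::v3::Bootstrap& bootstrap) {
    // Paraphrased: obtain the endpoints of cluster_0 from the bootstrap config.
    auto* endpoints = bootstrap.mutable_static_resources()
                          ->mutable_clusters(0)
                          ->mutable_load_assignment()
                          ->mutable_endpoints(0);
    // Clear the metadata on every host so rebuildSingle() sees hosts without metadata.
    for (uint32_t i = 0; i < num_hosts_; i++) {
      auto* lb_endpoint = endpoints->mutable_lb_endpoints(i);
      lb_endpoint->clear_metadata();
    }
  });

  // initialize() applies the configuration; this is where the crash used to occur,
  // since SubsetLoadBalancer::rebuildSingle() runs on the config update, not on traffic.
  initialize();
  codec_client_ = makeHttpConnection(lookupPort("http"));
  auto response = codec_client_->makeHeaderOnlyRequest(
      Http::TestRequestHeaderMapImpl{{":method", "GET"},
                                     {":path", "/test/long/url"},
                                     {":scheme", "http"},
                                     {":authority", "host"}});
  response->waitForEndStream();
  EXPECT_TRUE(response->complete());
  EXPECT_THAT(response->headers(), Http::HttpStatusIs("404"));
}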
Thanks for the fix!
source/common/upstream/subset_lb.cc
// Two hosts with the same metadata value were found. Ignore all but one of them, and
// set a metric for how many times this happened.
collision_count++;
if (metadata != nullptr) {
Maybe do
if (metadata == nullptr) {
continue;
}
to reduce the nesting?
sure, will do.
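Here is a short sketch of the flattened form in context, again with paraphrased names:
// Sketch only: early-continue instead of nesting the rest of the loop body
// under the null check.
for (const auto& host : hosts) {
  const auto& metadata = host->metadata(); // paraphrased accessor
  if (metadata == nullptr) {
    // Hosts configured without metadata cannot belong to any subset; skip them.
    continue;
  }
  // ... existing single-host-per-subset bookkeeping, including collision_count++ ...
}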
// Set single_host_per_subset to be true
auto* subset_selector = cluster->mutable_lb_subset_config()->mutable_subset_selectors(0);
subset_selector->set_single_host_per_subset(true);
It seems like this crash is limited to when this option is enabled; could we be more explicit about this in the comments or the test name? Maybe even add a test case where this is disabled, for completeness' sake.
Yes, this crash only happens when single_host_per_subset is set to true. Without it, subset_lb.cc:rebuildSingle() bails out early, so there is no crash. I will add some comments here to cover this.
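A purely illustrative sketch of the shape being described; the flag name is hypothetical:
void SubsetLoadBalancer::rebuildSingle() {
  // Hypothetical flag: when no selector enables single_host_per_subset,
  // rebuildSingle() returns before it ever looks at per-host metadata, so the
  // null-metadata case is never reached.
  if (!single_host_per_subset_enabled_) {
    return;
  }
  // ... per-host metadata handling that needed the new null check ...
}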
I was logged in with another GitHub account. I just deleted those comments and re-commented from my Google account.
Signed-off-by: Yanjun Xiang <[email protected]>
/retest
Retrying Azure Pipelines:
I think you will need to merge main once #15238 is merged.
/retest
Retrying Azure Pipelines:
Thanks!
Even if we're not having to validate the behavior via the traffic, it seems odd to have a request that doesn't hit the subset LB at all. Would it be hard to make it route via the subset LB? I think we'd just need to set a single header.
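A minimal sketch of that change, assuming the selector key type and the a/b values discussed above (the rest of the request is copied from the existing test code in this thread):
// Sketch: add the x-type header so the request is routed through the subset LB.
auto response = codec_client_->makeHeaderOnlyRequest(
    Http::TestRequestHeaderMapImpl{{":method", "GET"},
                                   {":path", "/test/long/url"},
                                   {":scheme", "http"},
                                   {":authority", "host"},
                                   {"x-type", "a"}});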
Sure, let me add this.
Working on this today.
Signed-off-by: Yanjun Xiang <[email protected]>
With the updated RequestHeader setup, it's observed that Envoy tried to route the request but failed due to "no healthy host for HTTP connection pool". I think this is expected, since no metadata is present in the host configuration. Below are the error logs. Please check the latest code change.
[2021-03-09 16:05:42.229][130][trace][connection] [source/common/network/connection_impl.cc:349] [C47] readDisable: disable=true disable_count=0 state=0 buffer_length=80
[2021-03-09 16:05:42.229][130][debug][http] [source/common/http/filter_manager.cc:774] [C47][S7508058493963056235] request end stream
Thanks!
/retest
Retrying Azure Pipelines:
Signed-off-by: Yanjun Xiang <[email protected]>
Yeah that looks right to me, thanks for updating the test!
Problem Description:
The problem is that when executing a fuzz test case with empty host metadata, Envoy crashed at subset_lb.cc:rebuildSingle(), line 131.
Root cause:
The root cause is that the current Envoy code assumes host->metadata is valid and uses it directly without a null check. If the metadata is actually null, which is a valid configuration, Envoy crashes.
Fix:
1) The fix is to add an if (metadata != nullptr) check in subset_lb.cc:rebuildSingle() before accessing the metadata.
2) GTEST code is added to reproduce the issue and verify the fix.
3) The oss-fuzz test file which exposed the issue is pushed with the change.
Testing:
1) The fix is tested by running the oss-fuzz test with the special test file which has a host configured without metadata.
2) The fix is also tested by the newly added GTEST code.
Release Notes:
N/A
Issues:
Fixes #30705: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=30705
Signed-off-by: Yanjun Xiang [email protected]