Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

upstream: fix subset_lb crash when host configured with no metadata #15171

Merged
merged 5 commits into from
Mar 10, 2021

Conversation

yanjunxiang-google
Copy link
Contributor

@yanjunxiang-google yanjunxiang-google commented Feb 24, 2021

Problem Description:
The problem is that when executing a fuzz test case with empty hostmetadata, envoy crashed at subset_lb.cc:rebuildSingle(): line 131.

Root cause:
The root cause is that current envoy code is assuming host->medata is valid, and directly using it without NULL check. In the case if metadata is actually NULL, which is a valid configuration, it crashes.

Fix:
1)The fix is to add a if (metadat != nullptr) in subset_lb.cc:rebuildSingle() before accessing metadata.
2)GTEST code is added to reproduce the issue, also verifies the fix.
3)The oss-fuzz test file which exposed the issue is pushed with the change.

Testing:
1)The fix is tested by running the oss-fuzz test with the special test file which has host configured without metadata.
2)The fix is also tested by the newly added GTEST code.

Release Notes:
N/A

Issues:

Fix #30705 : https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=30705

Signed-off-by: Yanjun Xiang [email protected]

For an explanation of how to fill out the fields, please see the relevant section
in PULL_REQUESTS.md

Commit Message:
Additional Description:
Risk Level:
Testing:
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Deprecated:]
[Optional API Considerations:]

https://oss-fuzz.com/testcase-detail/5135200453525504

The problem is that when executing a fuzz test case with empty host
metadata, envoy crashed at subset_lb.cc:rebuildSingle(): line 131.

The root cause is that current envoy code is assuming host->medata is
valid, and directly using it without NULL check. In the case if metadata
is actually NULL, which is a valid configuration, it crashes.

The fix is to add a if (metadat != nullptr) before accessing metadata.

GTEST code is added to reproduce the issue, and verified the fix.

The fix is also verified by running the oss-fuzz test with that
special testcase input file.

Signed-off-by: Yanjun Xiang <[email protected]>
@yanjunxiang-google
Copy link
Contributor Author

/assign @asraa @adisuissa

Copy link
Contributor

@asraa asraa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @yanjunxiang-google!
Could you please make the issue title something like upstream: fix subset_lb crash when host configured with no metadata?

Also, could you please change the bug link to the monorail one? https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=30705. The detailed report eventually expires after some time. You can also commit the regression corpus entry as an additional regression test.

Copy link
Contributor

@adisuissa adisuissa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this!
I've added a couple of comments.
Please update the PR first comment fields, such as risk level, etc.
@asraa: should the fuzz test-case be pushed as well?

test/integration/http_subset_lb_integration_test.cc Outdated Show resolved Hide resolved
@yanjunxiang-google yanjunxiang-google changed the title This commit is to fix oss-fuzz test issue: upstream: fix subset_lb crash when host configured with no metadata Feb 25, 2021
@yanjunxiang-google
Copy link
Contributor Author

Thanks @yanjunxiang-google!
Could you please make the issue title something like upstream: fix subset_lb crash when host configured with no metadata?

Also, could you please change the bug link to the monorail one? https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=30705. The detailed report eventually expires after some time. You can also commit the regression corpus entry as an additional regression test.

Done!

@yanjunxiang-google
Copy link
Contributor Author

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Check envoy-presubmit didn't fail.

🐱

Caused by: a #15171 (comment) was created by @yanjunxiang-google.

see: more, trace.

@yanjunxiang-google
Copy link
Contributor Author

Reopen the PR.

@yanjunxiang-google
Copy link
Contributor Author

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Check envoy-presubmit isn't fully completed, but will still attempt retrying.
Check envoy-presubmit didn't fail.
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #15171 (comment) was created by @yanjunxiang-google.

see: more, trace.

adisuissa
adisuissa previously approved these changes Feb 26, 2021
Copy link
Contributor

@adisuissa adisuissa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this, LGTM!

{":authority", "host"}});
response->waitForEndStream();
EXPECT_TRUE(response->complete());
EXPECT_THAT(response->headers(), Http::HttpStatusIs("404"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know why this is giving a 404 instead of 503?
It might be configurable, but the default is 503

enum ClusterNotFoundResponseCode {

Copy link
Contributor Author

@yanjunxiang-google yanjunxiang-google Feb 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The configuration for this test is below: cluster and endpoints are configured. My understand is that with this configuration service is available, but page is not found, hence 404 is returned.

static_resources:
clusters:
- name: cluster_0
connect_timeout: 5s
lb_subset_config:
subset_selectors:
- keys:
- type
single_host_per_subset: true
load_assignment:
cluster_name: cluster_0
endpoints:
- lb_endpoints:
- endpoint:
address:
socket_address:
address: 127.0.0.1
port_value: 46423
- endpoint:
address:
socket_address:
address: 127.0.0.1
port_value: 40245
- endpoint:
address:
socket_address:
address: 127.0.0.1
port_value: 39563
- endpoint:
address:
socket_address:
address: 127.0.0.1
port_value: 37041

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually now that I think about it I think we're probably hitting a NR here and not actually hitting the bad code. I think the test file sets up a requirement for the header x-type to be set to either a or b in order to actually route to the upstream. Can you take a look and make sure we're actually attempting to route via the subset lb?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Let me take a look.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. The bad code in subset_lb.cc:SubsetLoadBalancer::rebuildSingle() is hit by configuration update, not by traffic. Which means in the newly added test code:
    TEST_P(HttpSubsetLbIntegrationTest, SubsetLoadBalancerSingleHostPerSubsetNoMetadata)

initialize(), which eventually call SubsetLoadBalancer::rebuildSingle() can trigger crash. We reproduced with this, also verified the fix with it.

  1. The original test function: TEST_P(HttpSubsetLbIntegrationTest, SubsetLoadBalancer) setup the config with metadata as
    ...
    - lb_endpoints:
    - endpoint:
    address:
    socket_address:
    address: 127.0.0.1
    port_value: 34033
    metadata:
    filter_metadata:
    envoy.lb:
    type: a
    - endpoint:
    address:
    socket_address:
    address: 127.0.0.1
    port_value: 35437
    metadata:
    filter_metadata:
    envoy.lb:
    type: b

...
also call runTest() with type_a and type_b as in TEST_P(..., SubsetLoadBalancer), in this test, Envoy is expected to route via subset lb with this test.

However, the newly added test function doesn't have type a/b metadata config, neither call runTest with the two different request types, it probably won't route via the subset lb.

  1. In the newly added test function, the code after initialize(), i.e, creating HTTP connection and sending a HTTP request and verify the response is not specially testing the code change in SubsetLoadBalancer::rebuildSingle(), but a general purpose test, i.e, just to send some traffic, and verify response as expected.

Please let me know whether this addressed your question.

Thanks!

Copy link
Contributor

@snowp snowp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix!

// Two hosts with the same metadata value were found. Ignore all but one of them, and
// set a metric for how many times this happened.
collision_count++;
if (metadata != nullptr) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe do

if (metadata == nullptr) {
  continue;
}

to reduce the nesting?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, will do.

Comment on lines 225 to 227
// Set single_host_per_subset to be true
auto* subset_selector = cluster->mutable_lb_subset_config()->mutable_subset_selectors(0);
subset_selector->set_single_host_per_subset(true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this crash is limited to when this option is enabled, could we be more explicit about this in the comments or test name? Maybe even add a test case where this is disabled for completeness sake.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this crash only happens when single_host_per_subset is set to be true. Without it subset_lb.cc:rebuildSingle() will bailout early hence no crash. I will add some comments here to cover this.

@yanjunxiang-google
Copy link
Contributor Author

yanjunxiang-google commented Feb 26, 2021 via email

@yanjunxiang-google
Copy link
Contributor Author

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #15171 (comment) was created by @yanjunxiang-google.

see: more, trace.

@asraa
Copy link
Contributor

asraa commented Mar 1, 2021

I think you will need to merge main when this is merged #15238

@yanjunxiang-google
Copy link
Contributor Author

I think you will need to merge main when this is merged #15238
Thanks for the info!

@yanjunxiang-google
Copy link
Contributor Author

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #15171 (comment) was created by @yanjunxiang-google.

see: more, trace.

Copy link
Contributor

@snowp snowp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

{":authority", "host"}});
response->waitForEndStream();
EXPECT_TRUE(response->complete());
EXPECT_THAT(response->headers(), Http::HttpStatusIs("404"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually now that I think about it I think we're probably hitting a NR here and not actually hitting the bad code. I think the test file sets up a requirement for the header x-type to be set to either a or b in order to actually route to the upstream. Can you take a look and make sure we're actually attempting to route via the subset lb?

@snowp snowp added the waiting label Mar 3, 2021
@yanjunxiang-google
Copy link
Contributor Author

@snowp

I checked the code, and also the test logs. Here is some comments:

The bad code in subset_lb.cc:SubsetLoadBalancer::rebuildSingle() is hit by configuration update, not by traffic. Which means in the newly added test code:
TEST_P(HttpSubsetLbIntegrationTest, SubsetLoadBalancerSingleHostPerSubsetNoMetadata)
initialize(), which eventually call SubsetLoadBalancer::rebuildSingle() can trigger crash. We reproduced with this, also verified the fix with it.

The original test function: TEST_P(HttpSubsetLbIntegrationTest, SubsetLoadBalancer) setup the config with metadata as
...

  • lb_endpoints:
  • endpoint:
    address:
    socket_address:
    address: 127.0.0.1
    port_value: 34033
    metadata:
    filter_metadata:
    envoy.lb:
    type: a
  • endpoint:
    address:
    socket_address:
    address: 127.0.0.1
    port_value: 35437
    metadata:
    filter_metadata:
    envoy.lb:
    type: b
    ...
    also call runTest() with type_a_request and type_b_request as in TEST_P(..., SubsetLoadBalancer), in this test, Envoy is routing via subset lb, and verified.

However, the newly added test function doesn't have type a/b metadata config, neither call runTest with the two different request types, it probably won't route via the subset lb.

In the newly added test function, the code after initialize(), i.e, creating HTTP connection and sending a HTTP request and verify the response is not specially testing the code change in SubsetLoadBalancer::rebuildSingle(), but a general purpose test, i.e, just to send some traffic, and verify response as expected.

Please let me know whether this addressed your question.

Thanks!

@snowp
Copy link
Contributor

snowp commented Mar 4, 2021

Even if we're not having to validate the behavior via the traffic it seems odd to have a request that doesn't hit the subset LB at all? Would it be hard to make it route via the subset LB? I think we'd just need to set a single header

@yanjunxiang-google
Copy link
Contributor Author

Even if we're not having to validate the behavior via the traffic it seems odd to have a request that doesn't hit the subset LB at all? Would it be hard to make it route via the subset LB? I think we'd just need to set a single header

Sure, let me add this.

@yanjunxiang-google
Copy link
Contributor Author

working on this today

@yanjunxiang-google
Copy link
Contributor Author

Even if we're not having to validate the behavior via the traffic it seems odd to have a request that doesn't hit the subset LB at all? Would it be hard to make it route via the subset LB? I think we'd just need to set a single header

Sure, let me add this.

With below RequestHeader setup,
Http::TestRequestHeaderMapImpl{{":method", "GET"},
{":path", "/test"},
{":scheme", "http"},
{":authority", "host"},
{"x-type", "a"},
{"x-hash", "hash-a"}});

It's observed Envoy tried to route the request, but failed due to "no healthy host for HTTP connection pool". I think this is expected due to no metadata present in host configuration. Below is the complete error logs. Please check the latest code change.

2021-03-09 16:05:42.229][130][trace][connection] [source/common/network/connection_impl.cc:349] [C47] readDisable: disable=true disable_count=0 state=0 buffer_length=80
[2021-03-09 16:05:42.229][130][debug][http] [source/common/http/conn_manager_impl.cc:883] [C47][S7508058493963056235] request headers complete (end_stream=true):
':authority', 'host'
':path', '/test'
':method', 'GET'
'x-type', 'a'
'x-hash', 'hash-a'
'content-length', '0'

[2021-03-09 16:05:42.229][130][debug][http] [source/common/http/filter_manager.cc:774] [C47][S7508058493963056235] request end stream
[2021-03-09 16:05:42.229][130][debug][router] [source/common/router/router.cc:428] [C47][S7508058493963056235] cluster 'cluster_0' match for URL '/test'
[2021-03-09 16:05:42.229][130][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1367] no healthy host for HTTP connection pool
[2021-03-09 16:05:42.230][130][debug][http] [source/common/http/filter_manager.cc:858] [C47][S7508058493963056235] Sending local reply with details no_healthy_upstream
[2021-03-09 16:05:42.230][130][trace][misc] [source/common/event/scaled_range_timer_manager_impl.cc:60] enableTimer called on 0x6543f616080 for 300000ms, min is 300000ms
[2021-03-09 16:05:42.230][130][debug][http] [source/common/http/conn_manager_impl.cc:1454] [C47][S7508058493963056235] encoding headers via codec (end_stream=false):
':status', '503'
'content-length', '19'
'content-type', 'text/plain'
'date', 'Tue, 09 Mar 2021 16:05:42 GMT'
'server', 'envoy'

Thanks!

@yanjunxiang-google
Copy link
Contributor Author

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #15171 (comment) was created by @yanjunxiang-google.

see: more, trace.

@snowp
Copy link
Contributor

snowp commented Mar 9, 2021

Yeah that looks right to me, thanks for updating the test!

@snowp snowp merged commit e8cd93d into envoyproxy:main Mar 10, 2021
auni53 pushed a commit to auni53/envoy that referenced this pull request Mar 19, 2021
…nvoyproxy#15171)

This commit is to fix oss-fuzz test issue:
https://oss-fuzz.com/testcase-detail/5135200453525504

The problem is that when executing a fuzz test case with empty host
metadata, envoy crashed at subset_lb.cc:rebuildSingle(): line 131.

The root cause is that current envoy code is assuming host->medata is
valid, and directly using it without NULL check. In the case if metadata
is actually NULL, which is a valid configuration, it crashes.

The fix is to add a if (metadat != nullptr) before accessing metadata.

GTEST code is added to reproduce the issue, and verified the fix.

The fix is also verified by running the oss-fuzz test with that
special testcase input file.

Signed-off-by: Yanjun Xiang <[email protected]>
Signed-off-by: Auni Ahsan <[email protected]>
@yanjunxiang-google yanjunxiang-google deleted the yanjun_oss_fuzz_crash branch July 19, 2021 21:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants