Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: use warn log for get healthcheck target status failure #10156

Merged
merged 1 commit into from
Oct 10, 2023

Conversation

shreemaan-abhishek
Copy link
Contributor

@shreemaan-abhishek shreemaan-abhishek commented Sep 4, 2023

Description

core.log.error("failed to get health check target status, addr: ",
node.host, ":", port or node.port, ", host: ", host, ", err: ", err)

The above error log reports an error in some cases even if the request is valid and successful. In the initial requests, get_target_status() will always fail as the target status is not yet finalized. It does not make sense to report an error for positive outcomes.

Note that the get_target_status() failure just signifies that the system failed to get the health status only. We do not know if the target is actually unhealthy. And even if the node is unavailable/unhealthy get_target_status() does not report the error, but some warn logs like this:

2023/09/04 22:13:09 [warn] 14709#16724107: *2442 [lua] healthcheck.lua:650: [healthcheck] (upstream#/apisix/upstreams/1) unhealthy TCP increment (2/2) for '(127.0.0.1:8765)', context: ngx.timer, client: 127.0.0.1, server: 0.0.0.0:9080

Background information:

I encountered this problem while writing a test case for https based healthcheck.

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

@monkeyDluffy6017
Copy link
Contributor

The error log occurs when the upstream is rebuilt as we add or delete an upstream node. The state is not correct, it will lead to unbalanced load. It's a problem, we should fix this, so the error log level is appropriate here.

@shreemaan-abhishek
Copy link
Contributor Author

The state is not correct.

Do you mean not correct == unhealthy?

Also, in the beginning, there will be no state as the healthcheck won't be triggered unless the upstream has been accessed.
image

@monkeyDluffy6017 try running the following test case and check the servroot error logs and there you will find error logs even for positive outcomes.

t/stream-node/upstream-tls.t

use t::APISIX 'no_plan';

no_root_location();

add_block_preprocessor(sub {
    my ($block) = @_;

    if (!$block->http_config) {
        my $http_config = <<'_EOC_';
server {
    listen 8765 ssl;
    ssl_certificate ../../certs/mtls_server.crt;
    ssl_certificate_key ../../certs/mtls_server.key;
    ssl_client_certificate ../../certs/mtls_ca.crt;
    ssl_verify_client on;

    location /index {
        return 200 'gangnam style!';
    }
}
server {
    listen 8766 ssl;
    ssl_certificate ../../certs/mtls_server.crt;
    ssl_certificate_key ../../certs/mtls_server.key;
    ssl_client_certificate ../../certs/mtls_ca.crt;
    ssl_verify_client on;

    location /index {
        return 200 'gangnam style!';
    }
}
_EOC_
        $block->set_value("http_config", $http_config);
    }

    if (!$block->request) {
        $block->set_value("request", "GET /t");
    }

});

run_tests;

__DATA__

=== TEST 7: https health check
--- log_level: info
--- config
    location /t {
        content_by_lua_block {
            local t = require("lib.test_admin")
            local core = require("apisix.core")
            local cert = t.read_file("t/certs/mtls_client.crt")
            local key =  t.read_file("t/certs/mtls_client.key")
            local data = {
                retries = 2,
                checks = {
                    active = {
                    https_verify_certificate = false,
                    healthy = {
                        interval = 2,
                        successes = 1
                    },
                    unhealthy = {
                        interval = 1,
                        http_failures = 2
                    },
                    type = "https",
                    timeout = 5
                    },
                },
                type = "roundrobin",
                nodes = {
                    ["127.0.0.1:8766"] = 1,
                    ["127.0.0.1:8765"] = 1
                },
                scheme = "https",
                tls = {
                    client_cert = cert,
                    client_key = key
                }
            }
            local code, body = t.test('/apisix/admin/upstreams/1',
                 ngx.HTTP_PUT,
                 core.json.encode(data)
            )

            if code >= 300 then
                ngx.status = code
            end

            local code, body = t.test('/apisix/admin/routes/1',
                 ngx.HTTP_PUT,
                 [[{
                        "upstream_id": 1,
                        "uri": "/index"
                }]]
            )

            if code >= 300 then
                ngx.status = code
            end
        }
    }
--- error_code: 200
--- timeout: 15



=== TEST 8: ACCESS
--- pipelined_requests eval
["GET /index", "GET /index"]
--- response_body eval
["gangnam style!", "gangnam style!"]
--- error_code eval
[200, 200]

P.S: I am unable to assert the error_log using the test case so you will have to check the log manually, if you change the error log to warn log, the test will pass signifying the successful working of the feature.

@leslie-tsang
Copy link
Member

leslie-tsang commented Sep 7, 2023

APISIX use pre-defined preprocessor to handle the default no_error_log, maybe pass an ignore_error_log will feed you need. @shreemaan-abhishek

@monkeyDluffy6017
Copy link
Contributor

@shreemaan-abhishek If @leslie-tsang's suggestion satisfy your requirement, the pr could be closed?

@shreemaan-abhishek
Copy link
Contributor Author

@monkeyDluffy6017, although the background for this PR was writing a test case, I am still convinced that using an error level log leads to a false negative.

@monkeyDluffy6017
Copy link
Contributor

@shreemaan-abhishek I have fix this by the following pr, so the error log won't happen again:
api7/lua-resty-healthcheck#35
api7/lua-resty-healthcheck#36

@monkeyDluffy6017
Copy link
Contributor

By the way, the log level error is as expected. The get_target_status should not return an error

@monkeyDluffy6017
Copy link
Contributor

I will close this pr, feel free to reopen it

@ZhangShangyu
Copy link
Contributor

The error log occurs when the upstream is rebuilt as we add or delete an upstream node. The state is not correct, it will lead to unbalanced load. It's a problem, we should fix this, so the error log level is appropriate here.

we also meet this error log, #9415, finally we decide to ignore this target not find error, treat the node as health

if err == "target not found" then
    core.log.warn("health check target not found, ignore health check, addr: ",
            node.host, ":", port or node.port, ", host: ", host, ", err: ", err)
    up_nodes = transform_node(up_nodes, node)
else
    core.log.error("failed to get health check target status, addr: ",
            node.host, ":", port or node.port, ", host: ", host, ", err: ", err)
end

btw it may not lead to unbalanced load, because it add all node to the up_nodes as default.

@monkeyDluffy6017
Copy link
Contributor

monkeyDluffy6017 commented Oct 10, 2023

I reopened this pull request because many people have encountered the same issue. This error should only occur when multiple requests are made concurrently while creating or modifying an upstream. While the first request is in the process of creating a health check object, the second request encounters this error because it cannot access the object that hasn't been created yet. However, this error does not actually impact request forwarding. Lowering the error log level from 'error' to 'warn' is reasonable to prevent any misunderstanding.

@monkeyDluffy6017 monkeyDluffy6017 merged commit 7279a34 into apache:master Oct 10, 2023
@shreemaan-abhishek shreemaan-abhishek deleted the fix/health-error branch October 10, 2023 08:56
Revolyssup pushed a commit to Revolyssup/apisix that referenced this pull request Oct 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: the exception of health check in special scene
5 participants