Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate stale connection to fix the bug: failed to respond #760

Merged
merged 5 commits into from
Dec 1, 2021

Conversation

ryan-tu
Copy link
Contributor

@ryan-tu ryan-tu commented Nov 24, 2021

Problem Statement:

A lot of work has been done to fix the "failed to respond" problem.

#290, #462, #531 report the problem,
#515 use the server side Keep-Alive timeout to replace the client side parameter, to reduce the possibility of "stale connection",
#505 fix the wrong connection reuse strategy, wrong Keep-Alive duration to make sure connection is closed immediately when a request fails,
#540 Add retry logic for idempotent SQL such as SELECT, and use a client side parser to decide whether an sql is idempotent or not. But, for non-idempotent SQL, such as INSERT, this problem still exists. And parsing each SQL in JDBC is a heavy cost, and it's hard to keep consistent with server side SQL syntax.

Proposal:

The root cause to "failed to respond" is half-closed tcp connections(closed by server-side when idle timeout reaches) are still kept in underlying connection pool of HttpClient.
The problem is a common issue to all connection pools like Druid, C3P0, DBCP, HikariCP, and each connection pool MUST provide a validation mechanism to evict stale connection.

The validation mechanism provided by Apache HttpClient before v4.4 requires check per request, thus is too heavy to use.
Since v4.4, a new validation mechanism is imported by: PoolingHttpClientConnectionManager.setValidateAfterInactivity(), which is a middle ground between check per request and no check.

This PR add a new connection setting validateAfterInactivityMillis to enable it.

The best solution should be that the client honors server-side Keep-Alive timeout to do best effort to reuse connections, and set validateAfterInactivityMillis to a slightly shorter value than server-side Keep-Alive timeout to evict stale ones.

@zhicwu
Copy link
Contributor

zhicwu commented Nov 24, 2021

Thanks @ryan-tu - excellent work and great summary!

The default validateAfterInactivity is 2000ms, which is short enough right? See https://github.com/apache/httpcomponents-client/blob/a0b85d44c26430ff8354a0af5a1b7d58e2e5bce5/httpclient/src/main/java/org/apache/http/impl/conn/PoolingHttpClientConnectionManager.java#L183. Also, you may want to add a test for this - I was able to reproduce the issue by forcibly closing the connection from server-side(using Wiremock).

FYI, I rewrote JDBC driver in PR #747. Apache http client is now replaced by HttpURLConnection(similar as clickhouse4j), and it works just fine. JDK9+ HttpClient is apparently a better choice, but my point is that we probably don't need a 3rd party library like Apache http client and OKhttp anymore.

@ryan-tu
Copy link
Contributor Author

ryan-tu commented Nov 24, 2021

@zhicwu

UT test cases are added. It takes some time to reproduce the bug and prove the bug has been fixed.

FYI, I rewrote JDBC driver in PR #747. Apache http client is now replaced by HttpURLConnection(similar as clickhouse4j), and it works just fine. JDK9+ HttpClient is apparently a better choice, but my point is that we probably don't need a 3rd party library like Apache http client and OKhttp anymore.

I'm no expert of HTTP, but i am aware that Apache HttpClient provides a rich advanced feature set, like proxy, pool configuration, async API, HTTP2 etc.. I'm not sure if HttpURLConnection can fulfill all the requirements.
I do agree JDK9+ HttpClient is a good choice.

@zhicwu
Copy link
Contributor

zhicwu commented Nov 26, 2021

Thanks again @ryan-tu for spending extra time adding the tests. We probably need CountDownLatch to avoid flaky test. I will definitely play with the tests during weekend and update accordingly.

Did you notice validateAfterInactivity is set to 2000ms by default in PoolingHttpClientConnectionManager? I think it's short enough as default keep-alive timeout from server is 3000ms. This means we're not supposed to run into failed-to-response issue without any change in code, so it might also be a timing issue(e.g. thread pool Apache http client uses for eviction, and how long the clean up thread wait before execution etc.). Anyway, I have no problem of adding the new option, since it gives people more flexibility in terms of customization.

public PoolingHttpClientConnectionManager(
    final HttpClientConnectionOperator httpClientConnectionOperator,
    final HttpConnectionFactory<HttpRoute, ManagedHttpClientConnection> connFactory,
    final long timeToLive, final TimeUnit timeUnit) {
    super();
    this.configData = new ConfigData();
    this.pool = new CPool(new InternalConnectionFactory(
            this.configData, connFactory), 2, 20, timeToLive, timeUnit);
    this.pool.setValidateAfterInactivity(2000);
    this.connectionOperator = Args.notNull(httpClientConnectionOperator, "HttpClientConnectionOperator");
    this.isShutDown = new AtomicBoolean(false);
}

Regarding http client, although HttpURLConnection does not support advanced feature like http trailer, it supports basic ones including chunking, making it suitable for a simple JDBC driver. Apache http client on the other hand is too heavy and it contains many features that we don't need. Having said that, because we now have JDBC driver and protocol implementation separated in different modules(e.g. clickhouse-http-client for handling http/https), we can easily add Apache http client back when needed for sure.

@zhicwu zhicwu changed the base branch from master to develop November 28, 2021 12:36
@github-actions
Copy link

github-actions bot commented Dec 1, 2021

Benchmark                                (client)  (connection)  (statement)   (type)   Mode  Cnt     Score     Error  Units
Basic.insertOneRandomNumber  clickhouse-http-jdbc         reuse       normal  default  thrpt   20   246.491 ±  39.784  ops/s
Basic.insertOneRandomNumber  clickhouse-http-jdbc         reuse     prepared  default  thrpt   20   238.071 ±  32.913  ops/s
Basic.insertOneRandomNumber  clickhouse-http-jdbc           new       normal  default  thrpt   20   232.754 ±  38.235  ops/s
Basic.insertOneRandomNumber  clickhouse-http-jdbc           new     prepared  default  thrpt   20   222.125 ±  30.067  ops/s
Basic.insertOneRandomNumber  clickhouse-grpc-jdbc         reuse       normal  default  thrpt   20   246.951 ±  35.261  ops/s
Basic.insertOneRandomNumber  clickhouse-grpc-jdbc         reuse     prepared  default  thrpt   20   239.639 ±  35.207  ops/s
Basic.insertOneRandomNumber  clickhouse-grpc-jdbc           new       normal  default  thrpt   20   246.562 ±  34.394  ops/s
Basic.insertOneRandomNumber  clickhouse-grpc-jdbc           new     prepared  default  thrpt   20   244.980 ±  33.047  ops/s
Basic.selectOneRandomNumber  clickhouse-http-jdbc         reuse       normal  default  thrpt   20  1263.437 ± 123.689  ops/s
Basic.selectOneRandomNumber  clickhouse-http-jdbc         reuse     prepared  default  thrpt   20  1264.292 ± 100.631  ops/s
Basic.selectOneRandomNumber  clickhouse-http-jdbc           new       normal  default  thrpt   20  1318.051 ±  86.613  ops/s
Basic.selectOneRandomNumber  clickhouse-http-jdbc           new     prepared  default  thrpt   20  1320.501 ±  90.106  ops/s
Basic.selectOneRandomNumber  clickhouse-grpc-jdbc         reuse       normal  default  thrpt   20   875.779 ±  58.616  ops/s
Basic.selectOneRandomNumber  clickhouse-grpc-jdbc         reuse     prepared  default  thrpt   20   863.397 ±  57.946  ops/s
Basic.selectOneRandomNumber  clickhouse-grpc-jdbc           new       normal  default  thrpt   20   808.740 ±  74.804  ops/s
Basic.selectOneRandomNumber  clickhouse-grpc-jdbc           new     prepared  default  thrpt   20   840.260 ±  63.649  ops/s

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ClickHouse exception, code: 1002, host: xxxx, port: xxx; xxxx:xxx failed to respond
2 participants