
[FEATURE] Replace blocking httpclient with async httpclient in remote inference #1839

Closed · zane-neo opened this issue Jan 5, 2024 · 7 comments
Assignees: zane-neo
Labels: enhancement (New feature or request), v2.14.0

@zane-neo (Collaborator) commented Jan 5, 2024

Is your feature request related to a problem?
A community user brought up a performance issue here, which revealed a bottleneck in the HttpClient used for remote inference. The prediction flow is illustrated in the diagram below:
[Diagram: blocking-httpclient prediction flow]

There are two major issues here:

  1. The connection pool size of HttpClient is 20 by default, which can cause timeouts while waiting for a connection, as described in [FEATURE] Performance issue in CloseableHttpClient #1537.
  2. The blocking HttpClient is a bottleneck because the predict thread pool size defaults to 2 * number of vCPUs. That is reasonable for local model prediction, which is CPU bound, but remote inference is IO bound, so this pool is relatively small.

For issue 1, we can let users tune a max_connections setting so the client can handle more parallel predict requests (see the sketch below).
For issue 2, we could increase the predict thread pool size to raise parallelism, but that is not optimal: more threads mean more context switching, which degrades overall system performance.
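A minimal sketch of what a configurable connection limit on the blocking Apache CloseableHttpClient could look like. The helper and the way max_connections is passed in are illustrative, not the actual ml-commons code; 20 connections in total and 2 per route are Apache HttpClient's stock defaults.

import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class SyncHttpClientConfig {

    // Illustrative only: build a blocking client whose pool size comes from a
    // user-configurable max_connections value instead of the stock defaults
    // (20 connections in total, 2 per route).
    public static CloseableHttpClient build(int maxConnections) {
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setMaxTotal(maxConnections);
        cm.setDefaultMaxPerRoute(maxConnections);
        return HttpClients.custom().setConnectionManager(cm).build();
    }
}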

What solution would you like?
Replace the blocking HttpClient with an async HttpClient. With the async HttpClient, both issues above can be handled: there is no connection pool in the async HttpClient, and we don't need to change the default predict thread pool size, since the async HttpClient performs well with only a few threads.
AWS async HttpClient: https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/http-configuration-crt.html
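For reference, a minimal sketch of building the CRT-based async client described in the linked guide. The concurrency cap and timeout below are example values, not proposed ml-commons defaults.

import java.time.Duration;
import software.amazon.awssdk.http.async.SdkAsyncHttpClient;
import software.amazon.awssdk.http.crt.AwsCrtAsyncHttpClient;

public class AsyncHttpClientConfig {

    // Illustrative only: the AWS CRT-based async HTTP client.
    public static SdkAsyncHttpClient build() {
        return AwsCrtAsyncHttpClient.builder()
            .maxConcurrency(100)                      // example cap on concurrent requests
            .connectionTimeout(Duration.ofSeconds(3)) // example connect timeout
            .build();
    }
}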

What alternatives have you considered?
Increase the predict thread pool size and expose it as a system setting configurable by users.

Do you have any additional context?
NA

@zane-neo zane-neo added enhancement New feature or request untriaged labels Jan 5, 2024
@zane-neo zane-neo self-assigned this Jan 5, 2024
@zane-neo (Collaborator, Author) commented Jan 5, 2024

@model-collapse @ylwu-amzn @dhrubo-os @austintlee Please chime in.

@dhrubo-os (Collaborator) commented

"there's no connection pool in async HttpClient"

Could you please explain why? In the link you provided, I can see maxConcurrency(100) being set for both the async and the sync client.

@zane-neo (Collaborator, Author) commented

That was a mistake on my part: the async httpclient also has a connection pool; the 100 is just an example value.
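For context, both flavors of the AWS SDK v2 HTTP client expose a pool-style cap, just under different names. The linked page shows the CRT clients; the sketch below uses the Apache (sync) and Netty (async) implementations with the same example value of 100.

import software.amazon.awssdk.http.SdkHttpClient;
import software.amazon.awssdk.http.apache.ApacheHttpClient;
import software.amazon.awssdk.http.async.SdkAsyncHttpClient;
import software.amazon.awssdk.http.nio.netty.NettyNioAsyncHttpClient;

public class ConnectionLimits {

    // Blocking client: the cap is literally the size of its connection pool,
    // and a calling thread is parked for each in-flight request.
    static SdkHttpClient syncClient() {
        return ApacheHttpClient.builder()
            .maxConnections(100)
            .build();
    }

    // Async client: the cap bounds in-flight requests over pooled connections,
    // but no calling thread is parked while a request is outstanding.
    static SdkAsyncHttpClient asyncClient() {
        return NettyNioAsyncHttpClient.builder()
            .maxConcurrency(100)
            .build();
    }
}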

@zane-neo (Collaborator, Author) commented Jan 29, 2024

Benchmark results of replacing the sync HTTP client with the async HTTP client

Test settings

Common settings

  • Benchmark doc count: 100k
  • One SageMaker endpoint with node type ml.r5.4xlarge. This node has 16 vCPUs, so full CPU utilization corresponds to 1600%.

Sync/async httpclient cluster

  • One data node of type m5.xlarge

Results

Sync httpclient benchmark result

bulk size: 200

Profile result

{
    "models": {
        "DjD5U40BcDj4M4xaapQ-": {
            "target_worker_nodes": [
                "47xFKefyT_yT4ruLRNVysQ"
            ],
            "worker_nodes": [
                "47xFKefyT_yT4ruLRNVysQ"
            ],
            "nodes": {
                "47xFKefyT_yT4ruLRNVysQ": {
                    "model_state": "DEPLOYED",
                    "predictor": "org.opensearch.ml.engine.algorithms.remote.RemoteModel@267e2248",
                    "model_inference_stats": {
                        "count": 100001,
                        "max": 110.231529,
                        "min": 13.642701,
                        "average": 31.23418597099029,
                        "p50": 29.00999,
                        "p90": 41.798205,
                        "p99": 58.968204
                    },
                    "predict_request_stats": {
                        "count": 100001,
                        "max": 6783.721085,
                        "min": 69.897231,
                        "average": 5924.886211584014,
                        "p50": 5986.277747,
                        "p90": 6370.810034,
                        "p99": 6612.146908
                    }
                }
            }
        }
    }
}

Benchmark result

|                                                  Segment count |        |         162 |        |
|                                                 Min Throughput |   bulk |       69.28 | docs/s |
|                                                Mean Throughput |   bulk |      258.52 | docs/s |
|                                              Median Throughput |   bulk |      255.26 | docs/s |
|                                                 Max Throughput |   bulk |      353.37 | docs/s |
|                                        50th percentile latency |   bulk |     6462.37 |     ms |
|                                        90th percentile latency |   bulk |      6686.9 |     ms |
|                                        99th percentile latency |   bulk |     6815.03 |     ms |
|                                       100th percentile latency |   bulk |      6845.9 |     ms |
|                                   50th percentile service time |   bulk |     6462.37 |     ms |
|                                   90th percentile service time |   bulk |      6686.9 |     ms |
|                                   99th percentile service time |   bulk |     6815.03 |     ms |
|                                  100th percentile service time |   bulk |      6845.9 |     ms |

bulk size: 800

Profile result

{
    "models": {
        "DjD5U40BcDj4M4xaapQ-": {
            "target_worker_nodes": [
                "47xFKefyT_yT4ruLRNVysQ"
            ],
            "worker_nodes": [
                "47xFKefyT_yT4ruLRNVysQ"
            ],
            "nodes": {
                "47xFKefyT_yT4ruLRNVysQ": {
                    "model_state": "DEPLOYED",
                    "predictor": "org.opensearch.ml.engine.algorithms.remote.RemoteModel@63aef70",
                    "model_inference_stats": {
                        "count": 100000,
                        "max": 120.595407,
                        "min": 13.983231,
                        "average": 31.37104606054,
                        "p50": 29.143390500000002,
                        "p90": 41.9421519,
                        "p99": 58.93499283
                    },
                    "predict_request_stats": {
                        "count": 100000,
                        "max": 26867.224684,
                        "min": 72.765313,
                        "average": 23015.738118801088,
                        "p50": 23926.2854905,
                        "p90": 25538.6665085,
                        "p99": 26391.06232507
                    }
                }
            }
        }
    }
}

Benchmark result

|                                                  Segment count |        |         162 |        |
|                                                 Min Throughput |   bulk |        58.4 | docs/s |
|                                                Mean Throughput |   bulk |      267.75 | docs/s |
|                                              Median Throughput |   bulk |      263.69 | docs/s |
|                                                 Max Throughput |   bulk |      320.74 | docs/s |
|                                        50th percentile latency |   bulk |     25708.2 |     ms |
|                                        90th percentile latency |   bulk |     26744.4 |     ms |
|                                        99th percentile latency |   bulk |     27030.1 |     ms |
|                                       100th percentile latency |   bulk |     27085.8 |     ms |
|                                   50th percentile service time |   bulk |     25708.2 |     ms |
|                                   90th percentile service time |   bulk |     26744.4 |     ms |
|                                   99th percentile service time |   bulk |     27030.1 |     ms |
|                                  100th percentile service time |   bulk |     27085.8 |     ms |

Takeaways

Even with a higher bulk size, prediction throughput does not change, but latency increases, which means more requests are queuing. This is confirmed by the p90/p99 of predict_request_stats in the profile results.
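A toy sketch of that queuing effect, with purely illustrative numbers (not the plugin's real pool size or latencies): a fixed-size pool caps throughput at roughly poolSize / callMs, so a larger bulk only lengthens the time requests spend waiting in the queue.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class QueueingDemo {
    public static void main(String[] args) throws InterruptedException {
        int poolSize = 8;       // stands in for the fixed predict thread pool
        int bulkSize = 800;     // requests submitted in one "bulk"
        long callMs = 30;       // stands in for one remote inference round trip

        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        AtomicLong maxQueueWaitMs = new AtomicLong();
        long start = System.nanoTime();

        for (int i = 0; i < bulkSize; i++) {
            long enqueuedAt = System.nanoTime();
            pool.execute(() -> {
                long waited = (System.nanoTime() - enqueuedAt) / 1_000_000;
                maxQueueWaitMs.accumulateAndGet(waited, Math::max);
                try {
                    Thread.sleep(callMs);   // blocking I/O pins a worker thread
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);

        // Throughput stays ~poolSize/callMs regardless of bulkSize; only queue wait grows.
        System.out.printf("total %d ms, worst queue wait %d ms%n",
                (System.nanoTime() - start) / 1_000_000, maxQueueWaitMs.get());
    }
}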

Async httpclient benchmark result

bulk size: 200

Profile result

{
    "models": {
        "z7cLVI0BDnAEuuAYMD8k": {
            "target_worker_nodes": [
                "66XjHiy0TluuW0WSu3EsPg"
            ],
            "worker_nodes": [
                "66XjHiy0TluuW0WSu3EsPg"
            ],
            "nodes": {
                "66XjHiy0TluuW0WSu3EsPg": {
                    "model_state": "DEPLOYED",
                    "predictor": "org.opensearch.ml.engine.algorithms.remote.RemoteModel@72193f1d",
                    "model_inference_stats": {
                        "count": 100000,
                        "max": 4298.986295,
                        "min": 927.558941,
                        "average": 3686.23973879843,
                        "p50": 3732.266947,
                        "p90": 3942.6228318999997,
                        "p99": 4051.5752754
                    },
                    "predict_request_stats": {
                        "count": 100000,
                        "max": 4349.452111,
                        "min": 1238.687734,
                        "average": 3689.50676293382,
                        "p50": 3733.923404,
                        "p90": 3943.9578763,
                        "p99": 4053.8236341399997
                    }
                }
            }
        }
    }
}

Benchmark result

|                                                  Segment count |        |         177 |        |
|                                                 Min Throughput |   bulk |          44 | docs/s |
|                                                Mean Throughput |   bulk |      372.27 | docs/s |
|                                              Median Throughput |   bulk |      380.18 | docs/s |
|                                                 Max Throughput |   bulk |      383.27 | docs/s |
|                                        50th percentile latency |   bulk |     4123.04 |     ms |
|                                        90th percentile latency |   bulk |     4213.35 |     ms |
|                                        99th percentile latency |   bulk |     4642.52 |     ms |
|                                       100th percentile latency |   bulk |     6369.56 |     ms |
|                                   50th percentile service time |   bulk |     4123.04 |     ms |
|                                   90th percentile service time |   bulk |     4213.35 |     ms |
|                                   99th percentile service time |   bulk |     4642.52 |     ms |
|                                  100th percentile service time |   bulk |     6369.56 |     ms |

SageMaker CPU usage

[Screenshot: SageMaker CPU utilization, 2024-01-29 16:14]

With the async httpclient and a bulk size of 200, SageMaker CPU utilization reaches 1600%, which means the SageMaker endpoint's CPUs are fully utilized.

Latency comparison

E2E latency also dropped by 37% at the same bulk size with the async httpclient.
The reason is that predict tasks no longer wait in the ml-commons predict thread pool queue, so that waiting time is eliminated (see the sketch after the comparison below).

Sync httpclient

  • bulk size 200 has 90%ile e2e latency: 6686.9 ms

Async httpclient

  • bulk size 200 has 90%ile e2e latency: 4213.35 ms
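A minimal sketch of the execution-model difference behind those numbers. The JDK HttpClient is used here only to illustrate the pattern and is not the client ml-commons uses; the endpoint and method names are placeholders.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.CompletableFuture;

public class BlockingVsAsync {
    private static final HttpClient CLIENT = HttpClient.newHttpClient();
    // Placeholder endpoint, not a real SageMaker URL.
    private static final HttpRequest REQUEST =
            HttpRequest.newBuilder(URI.create("https://example.com/invocations")).build();

    // Blocking style: the calling (predict-pool) thread is parked for the whole
    // round trip, so concurrency is capped by the pool size and later requests
    // wait in its queue.
    static String predictBlocking() throws Exception {
        return CLIENT.send(REQUEST, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Async style: the thread returns immediately and the response is handled in
    // a callback, so a handful of I/O threads can keep many remote calls in flight.
    static CompletableFuture<String> predictAsync() {
        return CLIENT.sendAsync(REQUEST, HttpResponse.BodyHandlers.ofString())
                .thenApply(HttpResponse::body);
    }
}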

@juntezhang commented Mar 7, 2024

Looking forward to this improvement! The expected gains are very promising.

@ylwu-amzn (Collaborator) commented

@zane-neo Can you help test whether fine-tuning the thread pool size helps the sync client?

@zane-neo (Collaborator, Author) commented

@ylwu-amzn, fine-tuning the thread pool size can definitely improve the sync HTTP client's performance, but it is not optimal: threads consume system resources, and more threads increase context-switching overhead, so we would eventually hit a new performance bottleneck. The async httpclient delivers high throughput with only a few threads and little extra system resource usage, so I think we should go that way.
