
[FEATURE] Replace blocking httpclient with async httpclient in remote inference #1839

Closed · zane-neo opened this issue Jan 5, 2024 · 7 comments
Assignees: zane-neo
Labels: enhancement (New feature or request), v2.14.0

@zane-neo (Collaborator) commented Jan 5, 2024

Is your feature request related to a problem?
A community user brought up a performance issue here, which revealed a bottleneck in the HttpClient used for remote inference. The prediction flow is illustrated in the diagram below:
[Diagram: blocking-httpclient prediction flow]

There are two major issues here:

  1. The connection pool size of HttpClient is 20 by default, which can cause timeouts while waiting for a connection, as described in [FEATURE] Performance issue in CloseableHttpClient #1537.
  2. The blocking HttpClient is a bottleneck because the predict thread pool size defaults to 2 * number of vCPUs. That is reasonable for local model prediction, which is CPU bound, but remote inference is IO bound, so this pool is relatively small.

For issue 1, we can let users tune a max_connections setting so the client can handle more parallel predict requests (see the sketch below).
For issue 2, we could increase the predict thread pool size to raise parallelism, but that is not optimal: more threads mean more context switching, which degrades overall system performance.
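A minimal sketch of what a configurable connection limit on the blocking Apache CloseableHttpClient could look like. The helper and the way max_connections is passed in are illustrative, not the actual ml-commons code; 20 connections in total and 2 per route are Apache HttpClient's stock defaults.

import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class SyncHttpClientConfig {

    // Illustrative only: build a blocking client whose pool size comes from a
    // user-configurable max_connections value instead of the stock defaults
    // (20 connections in total, 2 per route).
    public static CloseableHttpClient build(int maxConnections) {
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setMaxTotal(maxConnections);
        cm.setDefaultMaxPerRoute(maxConnections);
        return HttpClients.custom().setConnectionManager(cm).build();
    }
}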

What solution would you like?
Replace the blocking HttpClient with an async HttpClient. With the async HttpClient, both issues above can be handled: there is no connection pool in the async HttpClient, and we don't need to change the default predict thread pool size, since the async HttpClient performs well with only a few threads.
AWS async HttpClient: https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/http-configuration-crt.html
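For reference, a minimal sketch of building the CRT-based async client described in the linked guide. The concurrency cap and timeout below are example values, not proposed ml-commons defaults.

import java.time.Duration;
import software.amazon.awssdk.http.async.SdkAsyncHttpClient;
import software.amazon.awssdk.http.crt.AwsCrtAsyncHttpClient;

public class AsyncHttpClientConfig {

    // Illustrative only: the AWS CRT-based async HTTP client.
    public static SdkAsyncHttpClient build() {
        return AwsCrtAsyncHttpClient.builder()
            .maxConcurrency(100)                      // example cap on concurrent requests
            .connectionTimeout(Duration.ofSeconds(3)) // example connect timeout
            .build();
    }
}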

What alternatives have you considered?
Increase the predict thread pool size and expose it as a system setting configurable by users.

Do you have any additional context?
NA

@zane-neo zane-neo added enhancement New feature or request untriaged labels Jan 5, 2024
@zane-neo zane-neo self-assigned this Jan 5, 2024
@zane-neo (Collaborator, Author) commented Jan 5, 2024

@model-collapse @ylwu-amzn @dhrubo-os @austintlee Please chime in.

@dhrubo-os (Collaborator) commented

"there's no connection pool in async HttpClient"

Could you please explain why? In the link you provided, I can see maxConcurrency(100) being set for both the async and the sync client.

@zane-neo (Collaborator, Author) commented

That was a mistake on my part: the async httpclient also has a connection pool; the 100 is just an example value.
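For context, both flavors of the AWS SDK v2 HTTP client expose a pool-style cap, just under different names. The linked page shows the CRT clients; the sketch below uses the Apache (sync) and Netty (async) implementations with the same example value of 100.

import software.amazon.awssdk.http.SdkHttpClient;
import software.amazon.awssdk.http.apache.ApacheHttpClient;
import software.amazon.awssdk.http.async.SdkAsyncHttpClient;
import software.amazon.awssdk.http.nio.netty.NettyNioAsyncHttpClient;

public class ConnectionLimits {

    // Blocking client: the cap is literally the size of its connection pool,
    // and a calling thread is parked for each in-flight request.
    static SdkHttpClient syncClient() {
        return ApacheHttpClient.builder()
            .maxConnections(100)
            .build();
    }

    // Async client: the cap bounds in-flight requests over pooled connections,
    // but no calling thread is parked while a request is outstanding.
    static SdkAsyncHttpClient asyncClient() {
        return NettyNioAsyncHttpClient.builder()
            .maxConcurrency(100)
            .build();
    }
}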

@zane-neo (Collaborator, Author) commented Jan 29, 2024

Benchmark results of replacing the sync HTTP client with the async HTTP client

Test settings

Common settings

  • Benchmark doc count: 100k
  • One SageMaker endpoint with node type ml.r5.4xlarge. This node has 16 vCPUs, so full CPU utilization corresponds to 1600%.

Sync/async httpclient cluster

  • One data node of type m5.xlarge

Results

Sync httpclient benchmark result

bulk size: 200

Profile result

{
    "models": {
        "DjD5U40BcDj4M4xaapQ-": {
            "target_worker_nodes": [
                "47xFKefyT_yT4ruLRNVysQ"
            ],
            "worker_nodes": [
                "47xFKefyT_yT4ruLRNVysQ"
            ],
            "nodes": {
                "47xFKefyT_yT4ruLRNVysQ": {
                    "model_state": "DEPLOYED",
                    "predictor": "org.opensearch.ml.engine.algorithms.remote.RemoteModel@267e2248",
                    "model_inference_stats": {
                        "count": 100001,
                        "max": 110.231529,
                        "min": 13.642701,
                        "average": 31.23418597099029,
                        "p50": 29.00999,
                        "p90": 41.798205,
                        "p99": 58.968204
                    },
                    "predict_request_stats": {
                        "count": 100001,
                        "max": 6783.721085,
                        "min": 69.897231,
                        "average": 5924.886211584014,
                        "p50": 5986.277747,
                        "p90": 6370.810034,
                        "p99": 6612.146908
                    }
                }
            }
        }
    }
}

Benchmark result

|                                                  Segment count |        |         162 |        |
|                                                 Min Throughput |   bulk |       69.28 | docs/s |
|                                                Mean Throughput |   bulk |      258.52 | docs/s |
|                                              Median Throughput |   bulk |      255.26 | docs/s |
|                                                 Max Throughput |   bulk |      353.37 | docs/s |
|                                        50th percentile latency |   bulk |     6462.37 |     ms |
|                                        90th percentile latency |   bulk |      6686.9 |     ms |
|                                        99th percentile latency |   bulk |     6815.03 |     ms |
|                                       100th percentile latency |   bulk |      6845.9 |     ms |
|                                   50th percentile service time |   bulk |     6462.37 |     ms |
|                                   90th percentile service time |   bulk |      6686.9 |     ms |
|                                   99th percentile service time |   bulk |     6815.03 |     ms |
|                                  100th percentile service time |   bulk |      6845.9 |     ms |

bulk size: 800

Profile result

{
    "models": {
        "DjD5U40BcDj4M4xaapQ-": {
            "target_worker_nodes": [
                "47xFKefyT_yT4ruLRNVysQ"
            ],
            "worker_nodes": [
                "47xFKefyT_yT4ruLRNVysQ"
            ],
            "nodes": {
                "47xFKefyT_yT4ruLRNVysQ": {
                    "model_state": "DEPLOYED",
                    "predictor": "org.opensearch.ml.engine.algorithms.remote.RemoteModel@63aef70",
                    "model_inference_stats": {
                        "count": 100000,
                        "max": 120.595407,
                        "min": 13.983231,
                        "average": 31.37104606054,
                        "p50": 29.143390500000002,
                        "p90": 41.9421519,
                        "p99": 58.93499283
                    },
                    "predict_request_stats": {
                        "count": 100000,
                        "max": 26867.224684,
                        "min": 72.765313,
                        "average": 23015.738118801088,
                        "p50": 23926.2854905,
                        "p90": 25538.6665085,
                        "p99": 26391.06232507
                    }
                }
            }
        }
    }
}

Benchmark result

|                                                  Segment count |        |         162 |        |
|                                                 Min Throughput |   bulk |        58.4 | docs/s |
|                                                Mean Throughput |   bulk |      267.75 | docs/s |
|                                              Median Throughput |   bulk |      263.69 | docs/s |
|                                                 Max Throughput |   bulk |      320.74 | docs/s |
|                                        50th percentile latency |   bulk |     25708.2 |     ms |
|                                        90th percentile latency |   bulk |     26744.4 |     ms |
|                                        99th percentile latency |   bulk |     27030.1 |     ms |
|                                       100th percentile latency |   bulk |     27085.8 |     ms |
|                                   50th percentile service time |   bulk |     25708.2 |     ms |
|                                   90th percentile service time |   bulk |     26744.4 |     ms |
|                                   99th percentile service time |   bulk |     27030.1 |     ms |
|                                  100th percentile service time |   bulk |     27085.8 |     ms |

Takeaways

Even with a higher bulk size, prediction throughput does not change, but latency increases, which means more requests are queuing. This is confirmed by the p90/p99 of predict_request_stats in the profile results.
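A toy sketch of that queuing effect, with purely illustrative numbers (not the plugin's real pool size or latencies): a fixed-size pool caps throughput at roughly poolSize / callMs, so a larger bulk only lengthens the time requests spend waiting in the queue.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class QueueingDemo {
    public static void main(String[] args) throws InterruptedException {
        int poolSize = 8;       // stands in for the fixed predict thread pool
        int bulkSize = 800;     // requests submitted in one "bulk"
        long callMs = 30;       // stands in for one remote inference round trip

        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        AtomicLong maxQueueWaitMs = new AtomicLong();
        long start = System.nanoTime();

        for (int i = 0; i < bulkSize; i++) {
            long enqueuedAt = System.nanoTime();
            pool.execute(() -> {
                long waited = (System.nanoTime() - enqueuedAt) / 1_000_000;
                maxQueueWaitMs.accumulateAndGet(waited, Math::max);
                try {
                    Thread.sleep(callMs);   // blocking I/O pins a worker thread
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);

        // Throughput stays ~poolSize/callMs regardless of bulkSize; only queue wait grows.
        System.out.printf("total %d ms, worst queue wait %d ms%n",
                (System.nanoTime() - start) / 1_000_000, maxQueueWaitMs.get());
    }
}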

Async httpclient benchmark result

bulk size: 200

Profile result

{
    "models": {
        "z7cLVI0BDnAEuuAYMD8k": {
            "target_worker_nodes": [
                "66XjHiy0TluuW0WSu3EsPg"
            ],
            "worker_nodes": [
                "66XjHiy0TluuW0WSu3EsPg"
            ],
            "nodes": {
                "66XjHiy0TluuW0WSu3EsPg": {
                    "model_state": "DEPLOYED",
                    "predictor": "org.opensearch.ml.engine.algorithms.remote.RemoteModel@72193f1d",
                    "model_inference_stats": {
                        "count": 100000,
                        "max": 4298.986295,
                        "min": 927.558941,
                        "average": 3686.23973879843,
                        "p50": 3732.266947,
                        "p90": 3942.6228318999997,
                        "p99": 4051.5752754
                    },
                    "predict_request_stats": {
                        "count": 100000,
                        "max": 4349.452111,
                        "min": 1238.687734,
                        "average": 3689.50676293382,
                        "p50": 3733.923404,
                        "p90": 3943.9578763,
                        "p99": 4053.8236341399997
                    }
                }
            }
        }
    }
}

Benchmark result

|                                                  Segment count |        |         177 |        |
|                                                 Min Throughput |   bulk |          44 | docs/s |
|                                                Mean Throughput |   bulk |      372.27 | docs/s |
|                                              Median Throughput |   bulk |      380.18 | docs/s |
|                                                 Max Throughput |   bulk |      383.27 | docs/s |
|                                        50th percentile latency |   bulk |     4123.04 |     ms |
|                                        90th percentile latency |   bulk |     4213.35 |     ms |
|                                        99th percentile latency |   bulk |     4642.52 |     ms |
|                                       100th percentile latency |   bulk |     6369.56 |     ms |
|                                   50th percentile service time |   bulk |     4123.04 |     ms |
|                                   90th percentile service time |   bulk |     4213.35 |     ms |
|                                   99th percentile service time |   bulk |     4642.52 |     ms |
|                                  100th percentile service time |   bulk |     6369.56 |     ms |

SageMaker CPU usage

[Screenshot: SageMaker CPU utilization, 2024-01-29 16:14]

With the async httpclient and a bulk size of 200, SageMaker CPU utilization reaches 1600%, which means the SageMaker endpoint's CPUs are fully utilized.

Latency comparison

E2E latency also dropped by 37% at the same bulk size with the async httpclient.
The reason is that predict tasks no longer wait in the ml-commons predict thread pool queue, so that waiting time is eliminated (see the sketch after the comparison below).

Sync httpclient

  • bulk size 200 has 90%ile e2e latency: 6686.9 ms

Async httpclient

  • bulk size 200 has 90%ile e2e latency: 4213.35 ms
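A minimal sketch of the execution-model difference behind those numbers. The JDK HttpClient is used here only to illustrate the pattern and is not the client ml-commons uses; the endpoint and method names are placeholders.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.CompletableFuture;

public class BlockingVsAsync {
    private static final HttpClient CLIENT = HttpClient.newHttpClient();
    // Placeholder endpoint, not a real SageMaker URL.
    private static final HttpRequest REQUEST =
            HttpRequest.newBuilder(URI.create("https://example.com/invocations")).build();

    // Blocking style: the calling (predict-pool) thread is parked for the whole
    // round trip, so concurrency is capped by the pool size and later requests
    // wait in its queue.
    static String predictBlocking() throws Exception {
        return CLIENT.send(REQUEST, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Async style: the thread returns immediately and the response is handled in
    // a callback, so a handful of I/O threads can keep many remote calls in flight.
    static CompletableFuture<String> predictAsync() {
        return CLIENT.sendAsync(REQUEST, HttpResponse.BodyHandlers.ofString())
                .thenApply(HttpResponse::body);
    }
}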

@juntezhang commented Mar 7, 2024

Looking forward to this improvement! The expected gains are very promising.

@ylwu-amzn (Collaborator) commented

@zane-neo Can you help test whether fine-tuning the thread pool size helps the sync client?

@zane-neo (Collaborator, Author) commented

@ylwu-amzn, fine-tuning the thread pool size can definitely improve the sync HTTP client's performance, but it is not optimal: threads consume system resources, and more threads increase context-switching overhead, so we would eventually hit a new performance bottleneck. The async httpclient delivers high throughput with only a few threads and little extra system resource usage, so I think we should go that way.
