Exclude yield/reply time from first token latency metric #973
Conversation
While metrics are OK for a small number of requests, when the megaservice is handling many (hundreds of) _parallel_ requests, it was reporting a clearly (~10%) larger first token latency than what the client receiving the tokens from the megaservice measured.

Taking the time before the token is yielded means that the reported first token latency can be slightly shorter than it actually is. However, testing with ChatQnA shows the latencies to be clearly closer to the ones seen by the client (within a couple of percent) and typically smaller (i.e. logically consistent).

PS. Doing the metrics timing after yielding the token meant that the time for sending the reply to the client, and waiting for that to complete, was also included in the token time. I suspect that with a lot of parallel requests, processing had often switched to other megaservice request-processing threads, and getting control back to the yielding thread for timing could be delayed much longer than sending the response to the client took.

Signed-off-by: Eero Tamminen <[email protected]>
Codecov Report: All modified and coverable lines are covered by tests ✅
There's something wrong with CI. This PR does not do anything that could cause CI failures, and it uses Docker compose for testing, but CI is setting up Kubernetes package repos, which fail:
It's installing obsolete JavaScript packages, although the Comps repo does not have any JS code:
It's testing the JS UI, which is timing out, although this PR did not change anything that could impact the UI:
@Spycsh Could you review this (2-line change) PR? @lvliang-intel Any idea when CI will be fixed? Thanks!
Description
While metrics are OK for a small number of requests, when the megaservice is handling many (hundreds of) parallel requests, it was reporting a clearly (~10%) larger first token latency than what the client receiving the tokens from the megaservice measured.
Changing the timing to be taken before the token is yielded means that the reported first token latency can be slightly shorter than it actually is. However, testing with ChatQnA shows the latencies to be clearly closer to the ones seen by the client (within a couple of percent) and typically smaller (i.e. logically consistent).
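For illustration, here is a minimal sketch of the timing placement this PR describes, with hypothetical names (`stream_tokens`, `metrics.record_first_token`) rather than the actual orchestrator code:

```python
import time
from typing import AsyncIterator

async def stream_tokens(llm_stream: AsyncIterator[str], metrics) -> AsyncIterator[str]:
    """Hypothetical streaming wrapper; `metrics` is assumed to record latencies."""
    start = time.monotonic()
    first = True
    async for token in llm_stream:
        if first:
            # Record first token latency *before* yielding, so the time spent
            # sending the chunk to the client (and waiting for the event loop
            # to resume this generator) is not counted.
            metrics.record_first_token(time.monotonic() - start)
            first = False
        yield token
```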
Issues
First token latency inaccuracy. A number larger than what the client sees is obviously incorrect, which also throws doubt on the other metrics.
Type of change
Dependencies
n/a
Tests
Tested manually with HPA-scaled ChatQnA, with a benchmark constantly sending (up to 1000) parallel requests.
Notes
Doing the metrics timing after yielding the token meant that the time for sending the reply to the client, and waiting for that to complete, was also included in the token time.
I suspect that with a lot of parallel requests, processing often switched to other megaservice request-processing threads, and getting control back to the yielding thread, or function context, could be delayed much longer than sending the response to the client actually takes.
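To make the mechanism concrete, here is a hedged sketch of the previous (post-yield) placement, again with hypothetical names rather than the real orchestrator code, showing why the measurement picks up reply-send and scheduling time:

```python
import time
from typing import AsyncIterator

async def stream_tokens_old(llm_stream: AsyncIterator[str], metrics) -> AsyncIterator[str]:
    """Hypothetical sketch of the previous (post-yield) timing placement."""
    start = time.monotonic()
    first = True
    async for token in llm_stream:
        # Yielding hands control to the HTTP framework, which sends the chunk
        # to the client; under heavy concurrency the event loop may also run
        # many other request tasks before this generator is resumed.
        yield token
        if first:
            # Measured here, the "first token" time already includes reply
            # transmission and scheduling delay, inflating the metric.
            metrics.record_first_token(time.monotonic() - start)
            first = False
```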