
Exclude yield/reply time from first token latency metric #973

Merged — 1 commit merged into opea-project:main on Dec 6, 2024

Conversation

@eero-t (Contributor) commented Dec 4, 2024

Description

While the metrics are OK for a small number of requests, when the megaservice was handling many (hundreds of) parallel requests, it reported clearly (~10%) larger first token latency than what the client receiving the tokens from the megaservice measured.

Taking the timing before the token is yielded means that the reported first token latency can be slightly shorter than it actually is. However, testing with ChatQnA shows the latencies to be clearly closer to the ones seen by the client (within a couple of percent) and typically smaller (i.e. logical).
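A minimal sketch of the difference (my own illustration, not the actual orchestrator.py code — the `token_stream` generator and `metrics` dict are hypothetical): recording the first-token timestamp before the `yield` excludes the time the consumer spends receiving and forwarding the token.

```python
import time

def token_stream(tokens, metrics):
    """Hypothetical sketch of the fix: record the first-token
    timestamp *before* yielding, so time spent sending the reply
    to the client is excluded from the metric."""
    first = True
    for tok in tokens:
        if first:
            # Before the fix this ran *after* the yield, so it also
            # counted the consumer's receive/send time.
            metrics["first_token_ts"] = time.monotonic()
            first = False
        yield tok

metrics = {}
start = time.monotonic()
for tok in token_stream(["a", "b", "c"], metrics):
    time.sleep(0.01)  # stand-in for reply-send / client-side delay
first_token_latency = metrics["first_token_ts"] - start
```

With the timestamp taken before the yield, `first_token_latency` stays well below the 10 ms of per-token consumer delay; taken after the yield, it would include that delay.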

Issues

First token latency inaccuracy. The number being larger than what the client sees is obviously incorrect, which throws doubt on the other metrics as well.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Dependencies

n/a

Tests

Tested manually with HPA-scaled ChatQnA, with a benchmark constantly sending (up to 1000) parallel requests.

Notes

Doing the metrics timing after yielding the token meant that the time for sending the reply to the client, and waiting for that to complete, was also included in the token time.

I suspect that with a lot of parallel requests, processing had often switched to other megaservice request-processing threads, and getting control back to the yielding thread or function context could be delayed much longer than sending the response to the client takes.
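The suspected scheduling effect can be illustrated with a small asyncio sketch (again my own illustration, not the megaservice code): a task that cedes control to the event loop, analogous to a generator suspending at `yield`, is only resumed after the other ready tasks have run, so its resume delay grows with the number of competing request handlers.

```python
import asyncio
import time

async def streamer(delays):
    # Cede control to the event loop, as a generator does at `yield`,
    # then measure how long it takes to get scheduled again.
    t0 = time.monotonic()
    await asyncio.sleep(0)
    delays.append(time.monotonic() - t0)

async def other_request():
    # Competing request handler doing a bit of blocking work
    # per event-loop turn.
    time.sleep(0.001)        # 1 ms of "work" hogging the loop
    await asyncio.sleep(0)

async def main(n_requests=5):
    delays = []
    await asyncio.gather(streamer(delays),
                         *(other_request() for _ in range(n_requests)))
    return delays[0]

resume_delay = asyncio.run(main())
# With 5 competing handlers the streamer only resumes after each of
# them has had a turn, i.e. roughly 5 x 1 ms later -- far longer than
# the yield itself costs.
```

This is why timing after the yield inflates the metric under load: the measured interval includes not just the reply send, but however long the event loop takes to get back to the yielding context.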

While the metrics are OK for a small number of requests, when the
megaservice is handling many (hundreds of) _parallel_ requests, it
reported clearly (~10%) larger first token latency than what the
client receiving the tokens from the megaservice measured.

Getting the time before the token is yielded means that the reported
first token latency can be slightly shorter than it actually is.
However, testing with ChatQnA shows the latencies to be clearly closer
to the ones seen by the client (within a couple of percent) and
typically smaller (i.e. logical).

PS. Doing the metrics timing after yielding the token meant that the
time for sending the reply to the client, and waiting for that to
complete, was also included in the token time. I suspect that with a
lot of parallel requests, processing had often switched to other
megaservice request-processing threads, and getting control back to
the yielding thread for timing could be delayed much longer than
sending the response to the client took.

Signed-off-by: Eero Tamminen <[email protected]>
@eero-t eero-t requested a review from lvliang-intel as a code owner December 4, 2024 16:39

codecov bot commented Dec 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Files with missing lines Coverage Δ
comps/cores/mega/orchestrator.py 92.46% <100.00%> (ø)

@eero-t (Contributor, Author) commented Dec 4, 2024

There's something wrong with CI. This PR does not do anything that could cause CI failures, and example-test is doing some really weird things...

It uses Docker compose for testing, but is setting up kubernetes package repos, which fail:

Hit:4 https://baltocdn.com/helm/stable/debian all InRelease
...
Err:7 https://prod-cdn.packages.k8s.io/repositories/isv:/kubernetes:/core:/stable:/v1.29/deb  InRelease
...
 W: An error occurred during the signature verification. The repository is not updated and the previous index files will be used. GPG error: https://prod-cdn.packages.k8s.io/repositories/isv:/kubernetes:/core:/stable:/v1.29/deb  InRelease: The following signatures were invalid: EXPKEYSIG 234654DA9A296436 isv:kubernetes OBS Project <isv:[email protected]>

It's installing obsolete JavaScript packages, although the Comps repo does not have any JS code:

npm warn deprecated [email protected]: This module is not supported, and leaks memory. Do not use it. Check out lru-cache if you want a good and tested way to coalesce async requests by a key value, which is much more comprehensive and powerful.
npm warn deprecated [email protected]: Rimraf versions prior to v4 are no longer supported
npm warn deprecated @humanwhocodes/[email protected]: Use @eslint/config-array instead
npm warn deprecated [email protected]: Rimraf versions prior to v4 are no longer supported
npm warn deprecated [email protected]: Glob versions prior to v9 are no longer supported
npm warn deprecated @humanwhocodes/[email protected]: Use @eslint/object-schema instead
npm warn deprecated [email protected]: This version is no longer supported. Please see https://eslint.org/version-support for other options.
added 394 packages, and audited 395 packages in 59s
133 packages are looking for funding
  run `npm fund` for details
3 low severity vulnerabilities

It's testing the JS UI, which is timing out, although this PR did not change anything that could impact the UI:

Test timeout of 30000ms exceeded.
    Error: page.waitForSelector: Test timeout of 30000ms exceeded.
    Call log:
      - waiting for locator('.notification') to be visible
      25 | // Helper function: Check notification text
      26 | async function checkNotificationText(page: Page, expectedText: string) {
    > 27 | 	const notification = await page.waitForSelector(".notification");
         | 	                                ^
      28 | 	const notificationText = await notification.textContent();
      29 | 	expect(notificationText).toContain(expectedText);
      30 | }
        at checkNotificationText (/home/sdp/GenAIComps-actions-runner/_work/GenAIComps/GenAIExamples/ChatQnA/ui/svelte/tests/chatQnA.spec.ts:27:34)
        at pasteLink (/home/sdp/GenAIComps-actions-runner/_work/GenAIComps/GenAIExamples/ChatQnA/ui/svelte/tests/chatQnA.spec.ts:53:8)
        at /home/sdp/GenAIComps-actions-runner/_work/GenAIComps/GenAIExamples/ChatQnA/ui/svelte/tests/chatQnA.spec.ts:80:3
  1 failed
    [webkit] › chatQnA.spec.ts:75:2 › Upload file › should paste link ──────────────────────────────
  2 passed (2.1m)
[TEST INFO]: ---------frontend test failed---------
Error: Process completed with exit code 1.

@eero-t (Contributor, Author) commented Dec 5, 2024

@Spycsh Could you review this (2-line change) PR?

@lvliang-intel Any idea when CI would be fixed?

Thanks!

@lvliang-intel lvliang-intel merged commit 5663e16 into opea-project:main Dec 6, 2024
15 checks passed
@eero-t eero-t deleted the token-metrics branch December 9, 2024 12:09