Exclude yield/reply time from first token latency metric #973
Conversation
While metrics are OK for a small number of requests, when the megaservice is handling many (hundreds of) _parallel_ requests, it was reporting a clearly (~10%) larger first token latency than what the client receiving the tokens from the megaservice measured.

Taking the time before the token is yielded means that the reported first token latency can be slightly shorter than it actually is. However, testing with ChatQnA shows the latencies to be clearly closer to the ones seen by the client (within a couple of percent) and typically smaller (i.e. logically consistent).

PS. Doing the metrics timing after yielding the token meant that the time for sending the reply to the client, and waiting for that to complete, was also included in the token time. I suspect that with a lot of parallel requests, processing had often switched to other megaservice request-processing threads, and getting control back to the yielding thread for timing could be delayed much longer than sending the response to the client took.

Signed-off-by: Eero Tamminen <[email protected]>
Codecov Report: All modified and coverable lines are covered by tests ✅
There's something wrong with CI. This PR does not do anything that could cause CI failures, and it uses Docker compose for testing, but CI is setting up Kubernetes package repos, which fail:
It's installing obsolete JavaScript packages, although the Comps repo does not have any JS code:
It's testing the JS UI, which is timing out, although this PR did not change anything that could impact the UI:
@Spycsh Could you review this (2-line change) PR? @lvliang-intel Any idea when CI will be fixed? Thanks!
Description
While metrics are OK for a small number of requests, when the megaservice is handling many (hundreds of) parallel requests, it was reporting a clearly (~10%) larger first token latency than what the client receiving the tokens from the megaservice measured.
Changing the timing to be taken before the token is yielded means that the reported first token latency can be slightly shorter than it actually is. However, testing with ChatQnA shows the latencies to be clearly closer to the ones seen by the client (within a couple of percent) and typically smaller (i.e. logically consistent).
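For illustration, here is a minimal sketch of the timing placement this PR describes, with hypothetical names (`stream_tokens`, `metrics.record_first_token`) rather than the actual orchestrator code:

```python
import time
from typing import AsyncIterator

async def stream_tokens(llm_stream: AsyncIterator[str], metrics) -> AsyncIterator[str]:
    """Hypothetical streaming wrapper; `metrics` is assumed to record latencies."""
    start = time.monotonic()
    first = True
    async for token in llm_stream:
        if first:
            # Record first token latency *before* yielding, so the time spent
            # sending the chunk to the client (and waiting for the event loop
            # to resume this generator) is not counted.
            metrics.record_first_token(time.monotonic() - start)
            first = False
        yield token
```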
Issues
First token latency inaccuracy. A number larger than what the client sees is obviously incorrect, which also throws doubt on the other metrics.
Type of change
Dependencies
n/a
Tests
Tested manually with HPA-scaled ChatQnA, with a benchmark constantly sending (up to 1000) parallel requests.
Notes
Doing the metrics timing after yielding the token meant that the time for sending the reply to the client, and waiting for that to complete, was also included in the token time.
I suspect that with a lot of parallel requests, processing often switched to other megaservice request-processing threads, and getting control back to the yielding thread, or function context, could be delayed much longer than sending the response to the client actually takes.
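To make the mechanism concrete, here is a hedged sketch of the previous (post-yield) placement, again with hypothetical names rather than the real orchestrator code, showing why the measurement picks up reply-send and scheduling time:

```python
import time
from typing import AsyncIterator

async def stream_tokens_old(llm_stream: AsyncIterator[str], metrics) -> AsyncIterator[str]:
    """Hypothetical sketch of the previous (post-yield) timing placement."""
    start = time.monotonic()
    first = True
    async for token in llm_stream:
        # Yielding hands control to the HTTP framework, which sends the chunk
        # to the client; under heavy concurrency the event loop may also run
        # many other request tasks before this generator is resumed.
        yield token
        if first:
            # Measured here, the "first token" time already includes reply
            # transmission and scheduling delay, inflating the metric.
            metrics.record_first_token(time.monotonic() - start)
            first = False
```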