[V1][Metrics] Add request_success_total counter, labelled with finish reason #12579
Conversation
So that we have type-checkable constants, and to make the representation more compact. Resolves a couple of TODOs. Ideally CompletionOutput would use this too, but that would involve updates to v0. Signed-off-by: Mark McLoughlin <[email protected]>
… reason Signed-off-by: Mark McLoughlin <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
```python
@@ -13,6 +13,23 @@
from vllm.sampling_params import SamplingParams


class RequestFinishedReason(enum.IntEnum):
```
WDYT about a shorter name `FinishReason`, since it's used in quite a few places?
Sure, sounds good!
```python
    ABORT = 2

    def __str__(self):
        return self.name.lower()
```
This will create a new string every time it's accessed (and convert to lower). We could instead have a global lookup like:
```python
FINISH_REASON_STRINGS = ("stop", "length", "abort")

class RequestFinishedReason(enum.IntEnum):
    # ...
    def __str__(self):
        return FINISH_REASON_STRINGS[self.value]
```
I think we should also mention here that the specific string names form part of the API and so shouldn't be changed (i.e. not just arbitrary identifiers).
Yep, great catch on the optimization 👍
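Putting both review suggestions together (the shorter `FinishReason` name and the tuple lookup in `__str__`), a minimal self-contained sketch might look like the following. Note this is only an illustration: the `STOP` and `LENGTH` values are assumed from context, since only `ABORT = 2` is visible in the diff above.

```python
import enum

# These strings are part of the metrics API surface and must not be changed.
FINISH_REASON_STRINGS = ("stop", "length", "abort")


class FinishReason(enum.IntEnum):
    """Reason a request finished, exposed as a metric label value."""
    STOP = 0      # assumed value; not shown in the diff above
    LENGTH = 1    # assumed value; not shown in the diff above
    ABORT = 2

    def __str__(self):
        # A constant-tuple lookup avoids building a new lowercased
        # string on every access.
        return FINISH_REASON_STRINGS[self.value]


# Usage: str(FinishReason.STOP) == "stop"
```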
Small follow-on to vllm-project#12579 Signed-off-by: Nick Hill <[email protected]>
… reason (vllm-project#12579) Signed-off-by: Mark McLoughlin <[email protected]> Signed-off-by: Felix Marty <[email protected]>
Follow on from vllm-project#12579, part of vllm-project#10582. Add the following:

- vllm:e2e_request_latency_seconds
- vllm:request_queue_time_seconds
- vllm:request_inference_time_seconds
- vllm:request_prefill_time_seconds
- vllm:request_decode_time_seconds

e2e_request_latency is calculated relative to the arrival_time timestamp recorded by the frontend. For the rest ... we want to capture (in histograms) precise per-request timing intervals between certain events in the engine core:

```
<< queued timestamp >>
  [ queue interval ]
<< scheduled timestamp >>
  [ prefill interval ]
<< new token timestamp (FIRST) >>
  [ inter-token interval ]
<< new token timestamp >>
  [ decode interval (relative to first token time) ]
  [ inference interval (relative to scheduled time) ]
<< new token timestamp (FINISHED) >>
```

We want to collect these metrics in the frontend process, to keep the engine core freed up as much as possible. We need to calculate these intervals based on timestamps recorded by the engine core. Engine core will include these timestamps in EngineCoreOutput (per request) as a sequence of timestamped events, and the frontend will calculate intervals and log them.

Where we record these timestamped events:

- QUEUED: scheduler add_request()
- SCHEDULED: scheduler schedule()

There is an implicit NEW_TOKENS timestamp based on an initialization timestamp recorded on EngineCoreOutputs.

Signed-off-by: Mark McLoughlin <[email protected]>
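As an illustration only, here is a rough sketch of how a frontend could derive those intervals from engine-core timestamps. The `RequestEvent` type and the event names below are hypothetical stand-ins, not the actual vLLM `EngineCoreOutput` structures.

```python
from dataclasses import dataclass


@dataclass
class RequestEvent:
    # Illustrative stand-in for the timestamped events carried per request;
    # the real vLLM types may differ.
    name: str        # e.g. "QUEUED", "SCHEDULED", "NEW_TOKENS"
    timestamp: float


def compute_intervals(arrival_time: float,
                      events: list[RequestEvent]) -> dict[str, float]:
    """Derive the per-request intervals described above from engine-core timestamps."""
    ts = {}
    first_token_time = None
    last_token_time = None
    for ev in events:
        if ev.name == "NEW_TOKENS":
            if first_token_time is None:
                first_token_time = ev.timestamp
            last_token_time = ev.timestamp
        else:
            ts[ev.name] = ev.timestamp

    return {
        "queue_time": ts["SCHEDULED"] - ts["QUEUED"],
        "prefill_time": first_token_time - ts["SCHEDULED"],
        "decode_time": last_token_time - first_token_time,
        "inference_time": last_token_time - ts["SCHEDULED"],
        "e2e_latency": last_token_time - arrival_time,
    }
```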
… reason (vllm-project#12579) Signed-off-by: Mark McLoughlin <[email protected]>
Follow on from #12561, part of #10582.

Also, add a `RequestFinishedReason` enum so that we have type-checkable constants, and to make the representation more compact. Resolves a couple of TODOs. Ideally `CompletionOutput` would use this too, but that would involve updates to v0.

Example:
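The original example output is not reproduced here. As a stand-in, the following hedged Python sketch shows how such a counter could be defined and incremented with prometheus_client; the metric and label names are assumptions based on the PR title, and may not match what vLLM actually uses.

```python
from prometheus_client import Counter, generate_latest

# Hypothetical sketch: metric and label names follow the PR title, but the
# exact names used in vLLM may differ.
request_success_total = Counter(
    "vllm:request_success_total",
    "Count of successfully processed requests.",
    labelnames=["finished_reason"],
)

# Increment once per finished request, labelled with the lower-cased
# finish reason string ("stop", "length" or "abort").
request_success_total.labels(finished_reason="stop").inc()

# Print the Prometheus exposition text, including the labelled counter.
print(generate_latest().decode())
```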