[V1][Metrics] Add request_success_total counter, labelled with finish reason #12579
Conversation
So that we have type-checkable constants, and to make the representation more compact. Resolves a couple of TODOs. Ideally CompletionOutput would use this too, but that would involve updates to v0. Signed-off-by: Mark McLoughlin <[email protected]>
… reason Signed-off-by: Mark McLoughlin <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
```python
@@ -13,6 +13,23 @@
from vllm.sampling_params import SamplingParams


class RequestFinishedReason(enum.IntEnum):
```
WDYT about a shorter name `FinishReason`, since it's used in quite a few places?
Sure, sounds good!
```python
    ABORT = 2

    def __str__(self):
        return self.name.lower()
```
This will create a new string every time it's accessed (and convert to lower). We could instead have a global lookup like:
```python
FINISH_REASON_STRINGS = ("stop", "length", "abort")

class RequestFinishedReason(enum.IntEnum):
    # ...
    def __str__(self):
        return FINISH_REASON_STRINGS[self.value]
```
I think we should also mention here that the specific string names form part of the API and so shouldn't be changed (i.e. not just arbitrary identifiers).
Yep, great catch on the optimization 👍
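Putting both review suggestions together (the shorter `FinishReason` name and the tuple lookup in `__str__`), a minimal self-contained sketch might look like the following. Note this is only an illustration: the `STOP` and `LENGTH` values are assumed from context, since only `ABORT = 2` is visible in the diff above.

```python
import enum

# These strings are part of the metrics API surface and must not be changed.
FINISH_REASON_STRINGS = ("stop", "length", "abort")


class FinishReason(enum.IntEnum):
    """Reason a request finished, exposed as a metric label value."""
    STOP = 0      # assumed value; not shown in the diff above
    LENGTH = 1    # assumed value; not shown in the diff above
    ABORT = 2

    def __str__(self):
        # A constant-tuple lookup avoids building a new lowercased
        # string on every access.
        return FINISH_REASON_STRINGS[self.value]


# Usage: str(FinishReason.STOP) == "stop"
```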
Small follow-on to vllm-project#12579 Signed-off-by: Nick Hill <[email protected]>
… reason (vllm-project#12579) Signed-off-by: Mark McLoughlin <[email protected]> Signed-off-by: Felix Marty <[email protected]>
Follow on from vllm-project#12579, part of vllm-project#10582. Add the following:

- vllm:e2e_request_latency_seconds
- vllm:request_queue_time_seconds
- vllm:request_inference_time_seconds
- vllm:request_prefill_time_seconds
- vllm:request_decode_time_seconds

e2e_request_latency is calculated relative to the arrival_time timestamp recorded by the frontend. For the rest ... we want to capture (in histograms) precise per-request timing intervals between certain events in the engine core:

```
<< queued timestamp >>
  [ queue interval ]
<< scheduled timestamp >>
  [ prefill interval ]
<< new token timestamp (FIRST) >>
  [ inter-token interval ]
<< new token timestamp >>
  [ decode interval (relative to first token time) ]
  [ inference interval (relative to scheduled time) ]
<< new token timestamp (FINISHED) >>
```

We want to collect these metrics in the frontend process, to keep the engine core freed up as much as possible. We need to calculate these intervals based on timestamps recorded by the engine core. Engine core will include these timestamps in EngineCoreOutput (per request) as a sequence of timestamped events, and the frontend will calculate intervals and log them.

Where we record these timestamped events:

- QUEUED: scheduler add_request()
- SCHEDULED: scheduler schedule()

There is an implicit NEW_TOKENS timestamp based on an initialization timestamp recorded on EngineCoreOutputs.

Signed-off-by: Mark McLoughlin <[email protected]>
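As an illustration only, here is a rough sketch of how a frontend could derive those intervals from engine-core timestamps. The `RequestEvent` type and the event names below are hypothetical stand-ins, not the actual vLLM `EngineCoreOutput` structures.

```python
from dataclasses import dataclass


@dataclass
class RequestEvent:
    # Illustrative stand-in for the timestamped events carried per request;
    # the real vLLM types may differ.
    name: str        # e.g. "QUEUED", "SCHEDULED", "NEW_TOKENS"
    timestamp: float


def compute_intervals(arrival_time: float,
                      events: list[RequestEvent]) -> dict[str, float]:
    """Derive the per-request intervals described above from engine-core timestamps."""
    ts = {}
    first_token_time = None
    last_token_time = None
    for ev in events:
        if ev.name == "NEW_TOKENS":
            if first_token_time is None:
                first_token_time = ev.timestamp
            last_token_time = ev.timestamp
        else:
            ts[ev.name] = ev.timestamp

    return {
        "queue_time": ts["SCHEDULED"] - ts["QUEUED"],
        "prefill_time": first_token_time - ts["SCHEDULED"],
        "decode_time": last_token_time - first_token_time,
        "inference_time": last_token_time - ts["SCHEDULED"],
        "e2e_latency": last_token_time - arrival_time,
    }
```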
… reason (vllm-project#12579) Signed-off-by: Mark McLoughlin <[email protected]>
Follow on from #12561, part of #10582.

Also, add a `RequestFinishedReason` enum so that we have type-checkable constants, and to make the representation more compact. Resolves a couple of TODOs. Ideally `CompletionOutput` would use this too, but that would involve updates to v0.

Example:
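The original example output is not reproduced here. As a stand-in, the following hedged Python sketch shows how such a counter could be defined and incremented with prometheus_client; the metric and label names are assumptions based on the PR title, and may not match what vLLM actually uses.

```python
from prometheus_client import Counter, generate_latest

# Hypothetical sketch: metric and label names follow the PR title, but the
# exact names used in vLLM may differ.
request_success_total = Counter(
    "vllm:request_success_total",
    "Count of successfully processed requests.",
    labelnames=["finished_reason"],
)

# Increment once per finished request, labelled with the lower-cased
# finish reason string ("stop", "length" or "abort").
request_success_total.labels(finished_reason="stop").inc()

# Print the Prometheus exposition text, including the labelled counter.
print(generate_latest().decode())
```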