
[V1][Metrics] Add request_success_total counter, labelled with finish reason #12579

Conversation

@markmc (Contributor) commented Jan 30, 2025

Follow on from #12561, part of #10582

Also, add a RequestFinishedReason enum so that we have type-checkable constants, and to make the representation more compact. Resolves a couple of TODOs. Ideally CompletionOutput would use this too, but that would involve updates to v0.

Example:

vllm:request_success_total{finished_reason="stop",model_name="meta-llama/Llama-3.1-8B-Instruct"} 2.0
vllm:request_success_total{finished_reason="length",model_name="meta-llama/Llama-3.1-8B-Instruct"} 199.0
vllm:request_success_total{finished_reason="abort",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, a small, essential subset of tests meant to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@robertgshaw2-redhat added the ready label (ONLY add when PR is ready to merge / full CI is needed) Feb 1, 2025
@mergify (bot) added the v1 label Feb 1, 2025
@robertgshaw2-redhat merged commit 233df6f into vllm-project:main Feb 5, 2025
60 checks passed

@njhill (Member) left a comment


Thanks @markmc, sorry I didn't look at this sooner (I just saw it now due to conflicts with another PR I'm prepping to merge!).

I have a couple of comments, have opened a small follow-on PR #12760 for your consideration!

Review thread on the hunk adding the new enum:

```diff
@@ -13,6 +13,23 @@
 from vllm.sampling_params import SamplingParams


 class RequestFinishedReason(enum.IntEnum):
```
njhill (Member) commented:

WDYT about a shorter name FinishReason since it's used in quite a few places?

markmc (Contributor, Author) replied:

Sure, sounds good!

On the __str__ implementation later in the same hunk:

```python
    ABORT = 2

    def __str__(self):
        return self.name.lower()
```
njhill (Member) commented:

This will create a new string every time it's accessed (and convert to lower). We could instead have a global lookup like:

FINISH_REASON_STRINGS = ("stop", "length", "abort")

class RequestFinishedReason(enum.IntEnum):
    # ...

    def __str__(self):
        return FINISH_REASON_STRINGS[self.value]

I think we should also mention here that the specific string names form part of the API and so shouldn't be changed (i.e. not just arbitrary identifiers).

markmc (Contributor, Author) replied:

Yep, great catch on the optimization 👍
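
For reference, a self-contained sketch of the lookup pattern discussed above (the STOP and LENGTH values are inferred from the tuple ordering and the ABORT = 2 hunk; the merged code may differ, especially after the FinishReason rename suggested earlier):

```python
import enum

# These strings form part of the metrics API surface (they appear as
# the finished_reason label value), so they must not be renamed casually.
FINISH_REASON_STRINGS = ("stop", "length", "abort")


class RequestFinishedReason(enum.IntEnum):
    """Reason a request finished; values index FINISH_REASON_STRINGS."""
    STOP = 0
    LENGTH = 1
    ABORT = 2

    def __str__(self):
        # Index into the module-level tuple instead of building a new
        # lowered string from self.name on every call.
        return FINISH_REASON_STRINGS[self.value]


assert str(RequestFinishedReason.LENGTH) == "length"
```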

njhill added a commit to njhill/vllm that referenced this pull request Feb 5, 2025
@markmc deleted the markmc/metrics-v1-prometheus-logger-6 branch February 5, 2025 11:52
fxmarty-amd pushed a commit to fxmarty-amd/vllm that referenced this pull request Feb 7, 2025
markmc added a commit to markmc/vllm that referenced this pull request Feb 7, 2025
Follow on from vllm-project#12579, part of vllm-project#10582.

Add the following:

- vllm:e2e_request_latency_seconds
- vllm:request_queue_time_seconds
- vllm:request_inference_time_seconds
- vllm:request_prefill_time_seconds
- vllm:request_decode_time_seconds

e2e_request_latency is calculated relative to the arrival_time
timestamp recorded by the frontend.

For the rest ... we want to capture (in histograms) precise
per-request timing intervals between certain events in the engine
core:

```
  << queued timestamp >>
    [ queue interval ]
  << scheduled timestamp >>
    [ prefill interval ]
  << new token timestamp (FIRST) >>
    [ inter-token interval ]
  << new token timestamp >>
    [ decode interval (relative to first token time) ]
    [ inference interval (relative to scheduled time) ]
  << new token timestamp (FINISHED) >>
```

We want to collect these metrics in the frontend process, to keep the
engine core freed up as much as possible. We need to calculate these
intervals based on timestamps recorded by the engine core.

Engine core will include these timestamps in EngineCoreOutput (per
request) as a sequence of timestamped events, and the frontend will
calculate intervals and log them. Where we record these timestamped
events:

- QUEUED: scheduler add_request()
- SCHEDULED: scheduler schedule()

There is an implicit NEW_TOKENS timestamp based on an initialization
timestamp recorded on EngineCoreOutputs.

Signed-off-by: Mark McLoughlin <[email protected]>
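
A rough sketch of the frontend-side bookkeeping described in that commit message; the metric declaration and function below are illustrative and do not mirror vLLM's actual EngineCoreOutput handling:

```python
from prometheus_client import Histogram

# Illustrative histogram for one of the new interval metrics; bucket
# boundaries and the label set in vLLM may differ.
histogram_queue_time = Histogram(
    name="vllm:request_queue_time_seconds",
    documentation="Time spent waiting in the scheduler queue per request.",
    labelnames=["model_name"],
)


def compute_intervals(queued: float, scheduled: float,
                      first_token: float, last_token: float) -> dict[str, float]:
    """Derive per-request intervals from engine-core timestamps.

    Argument names are hypothetical, not vLLM's actual field names.
    """
    return {
        "queue": scheduled - queued,          # QUEUED -> SCHEDULED
        "prefill": first_token - scheduled,   # SCHEDULED -> first token
        "decode": last_token - first_token,   # first token -> finished
        "inference": last_token - scheduled,  # SCHEDULED -> finished
    }


# Example: observe the queue interval for a finished request.
intervals = compute_intervals(queued=0.0, scheduled=0.5,
                              first_token=0.9, last_token=2.1)
histogram_queue_time.labels(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
).observe(intervals["queue"])
```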
markmc added a commit to markmc/vllm that referenced this pull request Feb 10, 2025
ShangmingCai pushed a commit to ShangmingCai/vllm that referenced this pull request Feb 10, 2025
markmc added a commit to markmc/vllm that referenced this pull request Feb 11, 2025
panf2333 pushed a commit to yottalabsai/vllm that referenced this pull request Feb 18, 2025