Upstream sync 2024 05 19 #249

robertgshaw2-redhat · 2024-05-19T15:32:09Z

Upstream sync 2024 05 25 (#249)

SUMMARY:
Merge commits from vllm-project@c7f2cf2 to vllm-project@f68470e

Note that vllm-project@c7f2cf2 is NOT included in this merge.

PR Checklist

Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:

[Bugfix] for bug fixes.
[CI/Build] for build or continuous integration improvements.
[Doc] for documentation fixes and improvements.
[Model] for adding a new model or improving an existing model. Model name should appear in the title.
[Frontend] for changes on the vLLM frontend (e.g., OpenAI API server, LLM class, etc.)
[Kernel] for changes affecting CUDA kernels or other compute kernels.
[Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.)
[Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
[Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

We adhere to Google Python style guide and Google C++ style guide.
Pass all linter checks. Please use format.sh to format your code.
The code needs to be well-documented to ensure future contributors can easily understand it.
Include sufficient tests to ensure the project stays correct and robust. This includes both unit tests and integration tests.
Please add documentation to docs/source/ if the PR modifies the user-facing behaviors of vLLM. It helps vLLM users understand and utilize the new features or changes.

Notes for Large Changes

Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag the PR with rfc-required and might not review it.

What to Expect for the Reviews

The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:

After the PR is submitted, the PR will be assigned to a reviewer. Every reviewer will pick up the PRs based on their expertise and availability.
After the PR is assigned, the reviewer will provide a status update every 2-3 days. If the PR is not reviewed within 7 days, please feel free to ping the reviewer or the vLLM team.
After the review, the reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.
Please respond to all comments within a reasonable time frame. If a comment isn't clear or you disagree with a suggestion, feel free to ask for clarification or discuss the suggestion.

Thank You

Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!

Previously, FP8 static scaling worked if the scales overestimated the maxima of all activation tensors during computation. However, this will not always be the case, even if the scales were calibrated very carefully. For example, with the activations in my checkpoint https://huggingface.co/pcmoritz/Mixtral-8x7B-v0.1-fp8-act-scale (which was calibrated on https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k), I'm getting the following mostly random performance on MMLU:

| Groups            |Version|Filter|n-shot|Metric|Value |   |Stderr|
|-------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu               |N/A    |none  |     0|acc   |0.2295|±  |0.0035|
| - humanities      |N/A    |none  |     5|acc   |0.2421|±  |0.0062|
| - other           |N/A    |none  |     5|acc   |0.2398|±  |0.0076|
| - social_sciences |N/A    |none  |     5|acc   |0.2171|±  |0.0074|
| - stem            |N/A    |none  |     5|acc   |0.2125|±  |0.0073|

With the fix in this PR, where the scaled activations are clamped to [-std::numeric_limits<c10::Float8_e4m3fn>::max(), std::numeric_limits<c10::Float8_e4m3fn>::max()] to make sure there are no NaNs, the performance is:

| Groups            |Version|Filter|n-shot|Metric|Value |   |Stderr|
|-------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu               |N/A    |none  |     0|acc   |0.7008|±  |0.0036|
| - humanities      |N/A    |none  |     5|acc   |0.6453|±  |0.0065|
| - other           |N/A    |none  |     5|acc   |0.7692|±  |0.0072|
| - social_sciences |N/A    |none  |     5|acc   |0.8083|±  |0.0070|
| - stem            |N/A    |none  |     5|acc   |0.6115|±  |0.0083|

This is not perfect yet but is getting very close to the FP16 / dynamic activation scale performance.
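The clamping step described in this commit message can be illustrated with a minimal plain-Python sketch. It is not the actual change, which lives in a compute kernel. The helper name `scale_and_clamp`, the example scale, and the convention that an activation is divided by its scale are assumptions made for illustration. The constant 448.0 is the largest finite value of the FP8 E4M3FN format; the sketch only bounds the values and does not perform FP8 rounding.

```python
# Largest finite magnitude representable in FP8 E4M3FN.
FP8_E4M3FN_MAX = 448.0


def scale_and_clamp(values, scale):
    """Divide activations by a static scale, then clamp to the FP8 E4M3FN range.

    Hypothetical helper for illustration only; it assumes the scale divides
    the activation and skips the final rounding to 8 bits.
    """
    clamped = []
    for v in values:
        scaled = v / scale
        # Values beyond the representable range are pinned to the boundary
        # instead of being passed on to the FP8 conversion unbounded.
        clamped.append(max(-FP8_E4M3FN_MAX, min(FP8_E4M3FN_MAX, scaled)))
    return clamped


# Assumed example: a scale of 0.5 covers activations up to 224.0 in magnitude
# (224.0 / 0.5 == 448.0). The last two inputs exceed that calibrated maximum.
calibrated_scale = 0.5
activations = [1.0, -224.0, 300.0, -1000.0]
print(scale_and_clamp(activations, calibrated_scale))
# prints [2.0, -448.0, 448.0, -448.0]
```

Without the clamp, the last two inputs would scale to 600.0 and -2000.0, which lie outside the representable range; the commit message attributes the NaNs to such values.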

andy-neuma

cool.

andy-neuma · 2024-05-31T14:27:01Z

looks good, but can you update "remote push" workflow for python 3.10? it is still set to use "tmp", around line 55

test_skip_list: neuralmagic/tests/skip-for-remote-push-tmp.txt

derekk-nm

I could only look at a fraction of the file changes. there are a few known issues, but I'm approving.

zhaoyang-star and others added 30 commits May 19, 2024 15:00

Disable cuda version check in vllm-openai image (vllm-project#4530)

1337ced

[Bugfix] Fix asyncio.Task not being subscriptable (vllm-project#4623)

7d1afa9

[CI] use ccache actions properly in release workflow (vllm-project#4629)

76d1c0a

[CI] Add retry for agent lost (vllm-project#4633)

8c3136e

Update lm-format-enforcer to 0.10.1 (vllm-project#4631)

5749888

[Core][Optimization] change python dict to pytorch tensor (vllm-project#4607)

a542de1

[Build/CI] Fixing 'docker run' to re-enable AMD CI tests. (vllm-project#4642)

a3ff2ae

[Bugfix] Fixed error in slice_lora_b for MergedQKVParallelLinearWithLora (vllm-project#4609)

e4ab5c6

[Core][Optimization] change copy-on-write from dict[int, list] to list (vllm-project#4648)

fd69572

[Bug fix][Core] fixup ngram not setup correctly (vllm-project#4551)

8673ad0

Co-authored-by: Lei Wen
Co-authored-by: Cade Daniel
Co-authored-by: Cody Yu

[Core][Distributed] support cpu&device in broadcast tensor dict (vllm-project#4660)

3fc0fa0

[Core][Distributed] support both cpu and device tensor in broadcast tensor dict (vllm-project#4660)

[Core] Optimize sampler get_logprobs (vllm-project#4594)

43bc7e9

[Bugfix][Kernel] allow non-power-of-2 for prefix prefill with alibi (vllm-project#4573)

f64e4e4

[Misc] Add get_name method to attention backends (vllm-project#4685)

e06c2d6

[Core] Faster startup for LoRA enabled models (vllm-project#4634)

01d4ceb

[Core][Optimization] change python dict to pytorch tensor for blocks to swap (vllm-project#4659)

8afd8f7

[CI/Test] fix swap test for multi gpu (vllm-project#4689)

1fe8d9c

[Misc] Use vllm-flash-attn instead of flash-attn (vllm-project#4686)

b5967c4

[Dynamic Spec Decoding] Auto-disable by the running queue size (vllm-project#4592)

4a85263

Co-authored-by: Cade Daniel

[Speculative decoding] [Bugfix] Fix overallocation in ngram + spec logprobs (vllm-project#4672)

edd9e90

[Bugfix] Fine-tune gptq_marlin configs to be more similar to marlin (vllm-project#4626)

32314e5

[Frontend] add tok/s speed metric to llm class when using tqdm (vllm-project#4400)

b0d3937

Co-authored-by: Michael Goin

[Frontend] Move async logic outside of constructor (vllm-project#4674)

294e480

[Misc] Remove unnecessary ModelRunner imports (vllm-project#4703)

04a0387

[Misc] Set block size at initialization & Fix test_model_runner (vllm-project#4705)

fff9c2c

[ROCm] Add support for Punica kernels on AMD GPUs (vllm-project#3140)

396a546

Co-authored-by: miloice

[Bugfix] Fix CLI arguments in OpenAI server docs (vllm-project#4709)

0c85c21

[Bugfix] Update grafana.json (vllm-project#4711)

631605d

robertgshaw2-redhat and others added 18 commits May 27, 2024 22:13

skip shared state loader

2059e61

updated build test to use 4 nvcc threads by default. We previously, we had 4 nvcc threads and N max_jobs for N VCPUs. All notes in vllm upstream suggest that this will overload the cpu. Seeing build times > 1hr at current, so trying this

9642aef

tweaked to fix benchmark

2dad479

updated workflow to run longer

3bdfeb4

Merge branch 'main' into upstream-sync-2024-05-19

3800a1c

updated skip lists to skip sharded state loader

f1199dc

verified that test multiproc workers is passing locally

ee7e65a

fixed the sampling params issue

b73a142

fixed other sampling_params issue

8225ddd

Merge branch 'main' into upstream-sync-2024-05-19

c386e32

format

098e08a

confirmed basic correctness test working

7d32b8a

updated score for marlin 2:4

748d0e1

Merge branch 'main' into upstream-sync-2024-05-19

cd648c6

Disable flaky marlin model

9785c41

Increase benchmark server timeout to 15 minutes

3507552

Merge branch 'main' into upstream-sync-2024-05-19

1d6af5a

Merge branch 'main' into upstream-sync-2024-05-19

96fbf17

andy-neuma approved these changes May 31, 2024

derekk-nm approved these changes May 31, 2024

robertgshaw2-redhat and others added 7 commits June 1, 2024 20:55

reduce number of prompts and models in basic server correctness

db69b5c

Merge branch 'nm-vllm-main' into upstream-sync-2024-05-19

0654a43

fixed workflows

3ba575c

removed basic server correctness from release

43c0adc

Update test_compressed.py

50ac573

Update test_compressed.py (#277)

1802833

avoid failure in automation. not sure why this is failing. passes locally including when i setup the env in the exact same way

nit in setup.py

2c52fee

robertgshaw2-redhat merged commit fec3563 into main Jun 3, 2024
12 checks passed

robertgshaw2-redhat deleted the upstream-sync-2024-05-19 branch June 3, 2024 15:22
