This repository has been archived by the owner on Oct 11, 2024. It is now read-only.

Sparsity #1

Merged: 12 commits, Feb 1, 2024


robertgshaw2-redhat (Collaborator) commented Jan 27, 2024

Refactored @alexm-nm's work in the old private repo.

In addition to Alex's original work, this version avoids materializing the dense matrices on CPU during weight loading (the previous iteration materialized the entire model in dense form). We now create the dense matrices needed for loading on the fly.

Possible improvements and known issues:

  • Currently, when we load the QKV matrix, we unpack and repack it 3 times. Loading could be faster if we unpacked during the first shard load and repacked only once, after all shards have been loaded.
  • CUDAGraphs are not working (fixed by https://github.com/neuralmagic/nm_gpu/pull/15)
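
The first improvement above can be sketched roughly as follows. This is a toy illustration and not the code in this PR: `PackedMatrix`, `pack`, `unpack`, and `DeferredQKVLoader` are hypothetical names, and the "packed" format here is a plain dictionary of non-zero entries rather than the real sparse layout. The idea is to keep one dense scratch buffer while the Q, K, and V shards arrive, and pack only once after the last shard has been written.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Dense = List[List[float]]


@dataclass
class PackedMatrix:
    """Toy sparse format (an assumption): shape plus non-zeros keyed by (row, col)."""
    rows: int
    cols: int
    nonzeros: Dict[Tuple[int, int], float]


def pack(dense: Dense) -> PackedMatrix:
    """Compress a dense matrix by keeping only its non-zero entries."""
    rows = len(dense)
    cols = len(dense[0]) if rows else 0
    nonzeros = {
        (r, c): value
        for r, row in enumerate(dense)
        for c, value in enumerate(row)
        if value != 0.0
    }
    return PackedMatrix(rows, cols, nonzeros)


def unpack(packed: PackedMatrix) -> Dense:
    """Expand a packed matrix back into a dense list of rows."""
    dense = [[0.0] * packed.cols for _ in range(packed.rows)]
    for (r, c), value in packed.nonzeros.items():
        dense[r][c] = value
    return dense


class DeferredQKVLoader:
    """Hypothetical helper: write each shard into one dense buffer, pack once at the end."""

    def __init__(self, shard_rows: Dict[str, int], cols: int) -> None:
        # Row offset of each shard inside the fused QKV matrix.
        self.offsets: Dict[str, Tuple[int, int]] = {}
        offset = 0
        for name, n_rows in shard_rows.items():
            self.offsets[name] = (offset, n_rows)
            offset += n_rows
        self.dense: Dense = [[0.0] * cols for _ in range(offset)]
        self.pending = set(shard_rows)
        self.pack_calls = 0

    def load_shard(self, name: str, shard: Dense) -> Optional[PackedMatrix]:
        """Copy one shard in; return the packed matrix only after the last shard."""
        start, n_rows = self.offsets[name]
        if len(shard) != n_rows:
            raise ValueError(f"shard {name!r} has {len(shard)} rows, expected {n_rows}")
        for i, row in enumerate(shard):
            self.dense[start + i] = list(row)
        self.pending.discard(name)
        if self.pending:
            return None  # more shards to come: no repacking yet
        self.pack_calls += 1
        packed = pack(self.dense)
        self.dense = []  # release the dense scratch buffer
        return packed
```

With three shards, `load_shard` returns `None` for the first two calls and the packed matrix on the third, so `pack` runs once instead of three times. The trade-off is that the dense buffer for the fused matrix stays alive until the final shard arrives.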

alexm-redhat (Collaborator) left a comment

LGTM! Thanks for the port

Review threads:

  • vllm/config.py (resolved)
  • vllm/config.py (resolved)
  • vllm/engine/arg_utils.py (outdated, resolved)
  • vllm/model_executor/layers/linear.py (outdated, resolved)
  • vllm/model_executor/layers/linear.py (resolved)