HFQuantizer implementation for compressed-tensors library #31704

Merged
42 commits merged on Sep 25, 2024

bfineran
Contributor

This PR adds an HFQuantizer for the compressed-tensors library.

Supported quantization features include:

  • FP8, INT4, INT8 (in Q/DQ mode, arbitrary integer precision is allowed)
  • Activation quantization (static)
  • Dynamic per-token activation quantization
  • Supports quantization of arbitrary layer types
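
To make the difference between static and dynamic per-token activation quantization concrete, here is a minimal pure-Python sketch. It is not the compressed-tensors implementation; the helper names `symmetric_scale` and `quantize`, the symmetric INT8 scheme, and the sample numbers are all assumptions chosen for illustration.

```python
from typing import List


def symmetric_scale(values: List[float], num_bits: int = 8) -> float:
    """Scale that maps the largest magnitude in `values` to the largest signed integer."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values: List[float], scale: float, num_bits: int = 8) -> List[int]:
    """Round to the integer grid defined by `scale` and clamp to the signed range."""
    qmax = 2 ** (num_bits - 1) - 1
    qmin = -qmax - 1
    return [max(qmin, min(qmax, round(v / scale))) for v in values]


# Hypothetical activations: one row per token.
activations = [[0.5, -1.0, 0.25], [8.0, -4.0, 2.0]]

# Static: a single scale is fixed ahead of time (here from made-up calibration data),
# so runtime values outside the calibrated range get clamped.
calibration = [[0.4, -1.2, 0.3], [6.0, -7.5, 1.0]]
static_scale = symmetric_scale([v for row in calibration for v in row])
static_q = [quantize(row, static_scale) for row in activations]

# Dynamic per-token: a fresh scale is computed for every token (row) at runtime,
# so each row uses the full integer range.
dynamic_scales = [symmetric_scale(row) for row in activations]
dynamic_q = [quantize(row, s) for row, s in zip(activations, dynamic_scales)]

print("static :", static_q)
print("dynamic:", dynamic_q)
```

In this sketch the static scale needs no work at inference time but clips the value 8.0, which lies outside the calibrated range, while the per-token scales adapt to each row at the cost of an extra reduction per token.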

compressed-tensors supports running models in a Q/DQ (quantize/dequantize) format with transformers and running them compressed within vLLM; running compressed with transformers is on the roadmap.
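
As a rough illustration of what running in a Q/DQ format means, the sketch below quantizes values to an integer grid and immediately dequantizes them back to floats, so the computation stays in floating point but carries the quantization error. The function name `quantize_dequantize`, the symmetric per-tensor scheme, and the sample weights are assumptions for this example, not the library's code.

```python
from typing import List


def quantize_dequantize(weights: List[float], num_bits: int) -> List[float]:
    """Symmetric per-tensor Q/DQ round trip for an arbitrary integer bit width."""
    qmax = 2 ** (num_bits - 1) - 1
    qmin = -(2 ** (num_bits - 1))
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Quantize: snap each value to the integer grid and clamp to the signed range.
    quantized = [max(qmin, min(qmax, round(w / scale))) for w in weights]
    # Dequantize: map the integers back to floats on the same grid.
    return [q * scale for q in quantized]


weights = [0.7, -0.35, 0.1, 0.0]
for bits in (8, 4, 3):
    restored = quantize_dequantize(weights, bits)
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{bits}-bit max abs error: {max_err:.4f}")
```

Because the bit width is just a parameter here, the same round trip works for widths such as 3 bits, which is consistent with the note above that Q/DQ allows arbitrary integer precision.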

This initial PR includes HFQuantizer and QuantizationConfig implementations as well as a simple test. Documentation is being added.

To run with this branch, compressed-tensors currently needs to be installed from a development branch (it can be released to PyPI before this PR lands, pending support from the transformers team): pip install git+https://github.com/neuralmagic/compressed-tensors.git@rename_config

Happy to provide any further information needed.

Sample model load:

from transformers import AutoModelForCausalLM
compressed_tensors_model = AutoModelForCausalLM.from_pretrained("nm-testing/tinyllama-oneshot-w4a16-group128-v3")
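
For a checkpoint like the one above, the quantizer is chosen from the `quantization_config` entry in the model's config, keyed by `quant_method` (both names appear in this PR's commit messages). The following sketch checks a config for that entry using only the standard library; the helper name `uses_compressed_tensors` and the example config contents are hypothetical and are not taken from the actual checkpoint.

```python
import json


def uses_compressed_tensors(config_json: str) -> bool:
    """Return True if the config declares the compressed-tensors quantization method."""
    config = json.loads(config_json)
    quant_cfg = config.get("quantization_config") or {}
    return quant_cfg.get("quant_method") == "compressed-tensors"


# Illustrative config contents only; a real config.json has many more fields.
example_config = json.dumps(
    {
        "model_type": "llama",
        "quantization_config": {
            "quant_method": "compressed-tensors",
            "format": "pack-quantized",
        },
    }
)

print(uses_compressed_tensors(example_config))  # True
print(uses_compressed_tensors('{"model_type": "llama"}'))  # False
```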

Suggested Reviewers:
@SunMarc @younesbelkada

@SunMarc
Member

SunMarc commented Jul 10, 2024

Hi @bfineran, thanks for contributing and sorry for the delay! From the PR, I see that we are decompressing the model after loading the quantized model in order to run it. What would it take to run a compressed model in transformers? I don't mind merging this first, but I want to make sure that we enable users to quantize models using the compressed-tensors library in the end. Happy to discuss more on how to collaborate over Slack if you want!

@bfineran
Contributor Author

Hi @SunMarc right now running compressed is WIP - we've prioritized a very flexible Q/DQ environment to enable a wide range of quantization settings and will likely roll out running compressed scenario by scenario. Will reach out over slack to discuss more about the project and will also update with additional documentation soon.

Member

@SunMarc SunMarc left a comment

Thanks for your great work @bfineran ! The PR is mostly good on my side.
This new quantizer relies heavily on your own functions (e.g. apply_quantization_config), so you will need to be careful not to introduce breaking changes with transformers in the future. However, I guess it makes sense, since compressed-tensors deals with many compressed formats and there is no need to commit to transformers.
There is just a blocker on the loading part of the quantizer that needs to be fixed !

src/transformers/quantizers/auto.py
src/transformers/quantizers/auto.py
src/transformers/utils/quantization_config.py (outdated)
@SunMarc
Member

SunMarc commented Sep 5, 2024

Can you add the following to CompressedTensorsConfig ? This way we can print the quantization config

    def to_diff_dict(self) -> Dict[str, Any]:
        """
        Removes all attributes from config which correspond to the default config attributes for better readability and
        serializes to a Python dictionary.
        Returns:
            `Dict[str, Any]`: Dictionary of the attributes of this configuration instance that differ from the defaults.
        """
        config_dict = self.to_dict()

        # get the default config dict
        default_config_dict = CompressedTensorsConfig().to_dict()

        serializable_config_dict = {}

        # only serialize values that differ from the default config
        for key, value in config_dict.items():
            if value != default_config_dict[key]:
                serializable_config_dict[key] = value

        return serializable_config_dict
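
To see what the suggested method does in isolation, here is a self-contained sketch of the same diff-against-defaults pattern. `ToyQuantConfig` and its fields are hypothetical stand-ins, not the real `CompressedTensorsConfig`.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass
class ToyQuantConfig:
    """Hypothetical stand-in for a quantization config; field names are illustrative."""

    format: str = "dense"
    quant_method: str = "compressed-tensors"
    num_bits: int = 8

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)

    def to_diff_dict(self) -> Dict[str, Any]:
        # Compare against a default-constructed instance and keep only what differs.
        config_dict = self.to_dict()
        default_dict = ToyQuantConfig().to_dict()
        return {k: v for k, v in config_dict.items() if v != default_dict[k]}


cfg = ToyQuantConfig(format="pack-quantized")
print(cfg.to_dict())
print(cfg.to_diff_dict())  # {'format': 'pack-quantized'}
```

The pattern relies on the config class being constructible with no arguments, which is why the suggestion above calls `CompressedTensorsConfig()` to obtain the defaults.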

Member

@SunMarc SunMarc left a comment

Thanks for your patience and for iterating @bfineran @Satrat! This looks good! I left a few minor comments. It would be nice on your side to add tests to check that everything works as expected also. We delegate a lot of stuff to your side, as we are using the apply_quantization_config function!

@SunMarc
Member

SunMarc commented Sep 13, 2024

Gentle ping @ArthurZucker

@SunMarc
Member

SunMarc commented Sep 13, 2024

Can you rebase on main @Satrat? The failing tests are probably due to that.

@jvlinsta

Eagerly awaiting this! Great work @neuralmagic team ;)

@Satrat
Contributor

Satrat commented Sep 17, 2024

Can you rebase on main @Satrat? The failing tests are probably due to that.

@SunMarc done! Looks like the tests are passing now after rebasing

@hyaticua

Very eagerly awaiting this merge. Thanks to everyone involved!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Collaborator

@ArthurZucker ArthurZucker left a comment

LGTM, I think it would be cool to improve the doc a bit and mention:

src/transformers/utils/quantization_config.py
Comment on lines +1062 to +1063
format (`str`, *optional*, defaults to `"dense"`):
format the model is represented as
Collaborator

What are the available formats?

Contributor

@ArthurZucker these are the different compression formats, depending on how the model is quantized and saved on disk. They include:

  1. dense
  2. int-quantized
  3. float-quantized
  4. pack-quantized
  5. marlin-24
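
The thread does not describe how these formats lay out data on disk. Purely as an illustration of the general idea a name like `pack-quantized` suggests, the sketch below packs eight signed 4-bit values into one 32-bit word and unpacks them again. The function names and the nibble ordering are assumptions, and the actual compressed-tensors layout may differ.

```python
from typing import List


def pack_int4(values: List[int]) -> int:
    """Pack eight signed 4-bit integers (-8..7) into one 32-bit word, lowest nibble first."""
    if len(values) != 8:
        raise ValueError("expected exactly eight values")
    word = 0
    for i, v in enumerate(values):
        if not -8 <= v <= 7:
            raise ValueError(f"value {v} does not fit in 4 signed bits")
        # Keep the two's-complement low nibble and shift it into position i.
        word |= (v & 0xF) << (4 * i)
    return word


def unpack_int4(word: int) -> List[int]:
    """Inverse of pack_int4: recover the eight signed 4-bit integers."""
    values = []
    for i in range(8):
        nibble = (word >> (4 * i)) & 0xF
        # Nibbles 8..15 represent the negative values -8..-1.
        values.append(nibble - 16 if nibble >= 8 else nibble)
    return values


original = [7, -8, 0, 3, -1, 5, -4, 2]
packed = pack_int4(original)
print(hex(packed))
print(unpack_int4(packed) == original)  # True
```

Packing this way stores eight 4-bit weights in the space of a single 32-bit integer, which is the kind of saving a packed format aims for.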

```


## More Coming Soon!
Collaborator

If you have example use-cases for the config's different parameters and why they should be used, that would be awesome here as well!

@dsikka
Contributor

dsikka commented Sep 24, 2024

@ArthurZucker Thank you for your feedback! We've updated the compressed_tensors.md addressing the aforementioned points.

Member

@SunMarc SunMarc left a comment

Thanks for iterating on the docs! LGTM! Since I see that you released a new version of compressed_tensors, I'll merge this PR! After this PR is merged @Satrat, could you modify the quantization_config of the Neural Magic models to use compressed_tensors, as we discussed before?

@SunMarc SunMarc merged commit 574a9e1 into huggingface:main Sep 25, 2024
24 checks passed
avishaiElmakies pushed a commit to avishaiElmakies/transformers that referenced this pull request Sep 25, 2024
…e#31704)

* Add compressed-tensors HFQuantizer implementation

* flag serializable as False

* run

* revive lines deleted by ruff

* fixes to load+save from sparseml, edit config to quantization_config, and load back

* address satrat comment

* compressed_tensors to compressed-tensors and revert back is_serializable

* rename quant_method from sparseml to compressed-tensors

* tests

* edit tests

* clean up tests

* make style

* cleanup

* cleanup

* add test skip for when compressed tensors is not installed

* remove pydantic import + style

* delay torch import in test

* initial docs

* update main init for compressed tensors config

* make fix-copies

* docstring

* remove fill_docstring

* Apply suggestions from code review

Co-authored-by: Marc Sun <[email protected]>

* review comments

* review comments

* comments - suppress warnings on state dict load, tests, fixes

* bug-fix - remove unnecessary call to apply quant lifecycle

* run_compressed compatability

* revert changes not needed for compression

* no longer need unexpected keys fn

* unexpected keys not needed either

* Apply suggestions from code review

Co-authored-by: Marc Sun <[email protected]>

* add to_diff_dict

* update docs and expand testing

* Update _toctree.yml with compressed-tensors

* Update src/transformers/utils/quantization_config.py

Co-authored-by: Arthur <[email protected]>

* update doc

* add note about saving a loaded model

---------

Co-authored-by: George Ohashi <[email protected]>
Co-authored-by: Marc Sun <[email protected]>
Co-authored-by: Sara Adkins <[email protected]>
Co-authored-by: Sara Adkins <[email protected]>
Co-authored-by: Arthur <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
Co-authored-by: Dipika <[email protected]>
amyeroberts pushed a commit to amyeroberts/transformers that referenced this pull request Oct 2, 2024
…e#31704)

ArthurZucker added a commit that referenced this pull request Oct 10, 2024
* add sdpa to OPT

* chore: remove redundant whitespace in OPTDecoder class

* fixup

* bug fix

* add sdpa and attention generate test

* fixup

* Refactor OPTAttention forward method for improved readability and maintainability

* undo refactor for _shape and key,val states

* add OPT to doc, fixup didn't find it for some reason

* change order

* change default attn_implemntation in testing to eager

* [run-slow] opt

* change test_eager_matches_sdpa_generate to the one llama

* Update default attention implementation in testing common

* [run-slow] opt

* remove uneeded print

* [run-slow] opt

* refactor model testers to have attn_implementation="eager"

* [run-slow] opt

* convert test_eager_matches_sdpa_generate to opt-350M

* bug fix when creating mask for opt

* [run-slow] opt

* if layer head mask default to eager

* if head mask is not none fall to eager

* [run-slow] opt

* Update src/transformers/models/opt/modeling_opt.py

Co-authored-by: amyeroberts <[email protected]>

* Clean up Unpack imports (#33631)

clean up Unpack imports

* Fix DPT /Dinov2 sdpa regression on main (#33660)

* fallback to eager if output attentions.

* fix copies

* handle dependency errors in check_imports (#33622)

* handle dependency errors in check_imports

* change log level to warning

* add back self.max_position_embeddings = config.max_position_embeddings (#33550)

* add back self.max_position_embeddings = config.max_position_embeddings

* fix-copies

* Fix Llava conversion for LlavaQwen2ForCausalLM with Clip vision tower (#33613)

fix llavaqwen2 model conversion

* Uniformize kwargs for Udop processor and update docs (#33628)

* Add optional kwargs and uniformize udop

* cleanup Unpack

* nit Udop

* Generation: deprecate `PreTrainedModel` inheriting from `GenerationMixin`  (#33203)

* Enable BNB multi-backend support (#31098)

* enable cpu bnb path

* fix style

* fix code style

* fix 4 bit path

* Update src/transformers/utils/import_utils.py

Co-authored-by: Aarni Koskela <[email protected]>

* add multi backend refactor tests

* fix style

* tweak 4bit quantizer + fix corresponding tests

* tweak 8bit quantizer + *try* fixing corresponding tests

* fix dequant bnb 8bit

* account for Intel CPU in variability of expected outputs

* enable cpu and xpu device map

* further tweaks to account for Intel CPU

* fix autocast to work with both cpu + cuda

* fix comments

* fix comments

* switch to testing_utils.torch_device

* allow for xpu in multi-gpu tests

* fix tests 4bit for CPU NF4

* fix bug with is_torch_xpu_available needing to be called as func

* avoid issue where test reports attr err due to other failure

* fix formatting

* fix typo from resolving of merge conflict

* polish based on last PR review

Co-authored-by: Marc Sun <[email protected]>

* fix CI

* Update src/transformers/integrations/integration_utils.py

Co-authored-by: Arthur <[email protected]>

* Update src/transformers/integrations/integration_utils.py

Co-authored-by: Arthur <[email protected]>

* fix error log

* fix error msg

* add \n in error log

* make quality

* rm bnb cuda restriction in doc

* cpu model don't need dispatch

* fix doc

* fix style

* check cuda avaliable in testing

* fix tests

* Update docs/source/en/model_doc/chameleon.md

Co-authored-by: Marc Sun <[email protected]>

* Update docs/source/en/model_doc/llava_next.md

Co-authored-by: Aarni Koskela <[email protected]>

* Update tests/quantization/bnb/test_4bit.py

Co-authored-by: Aarni Koskela <[email protected]>

* Update tests/quantization/bnb/test_4bit.py

Co-authored-by: Aarni Koskela <[email protected]>

* fix doc

* fix check multibackends

* fix import sort

* remove check torch in bnb

* docs: update bitsandbytes references with multi-backend info

* docs: fix small mistakes in bnb paragraph

* run formatting

* reveret bnb check

* move bnb multi-backend check to import_utils

* Update src/transformers/utils/import_utils.py

Co-authored-by: Aarni Koskela <[email protected]>

* fix bnb check

* minor fix for bnb

* check lib first

* fix code style

* Revert "run formatting"

This reverts commit ac108c6.

* fix format

* give warning when bnb version is low and no cuda found]

* fix device assignment check to be multi-device capable

* address akx feedback on get_avlbl_dev fn

* revert partially, as we don't want the function that public, as docs would be too much (enforced)

---------

Co-authored-by: Aarni Koskela <[email protected]>
Co-authored-by: Titus von Koeller <[email protected]>
Co-authored-by: Marc Sun <[email protected]>
Co-authored-by: Arthur <[email protected]>

* Fix error string after refactoring into get_chat_template (#33652)

* Fix error string after refactoring into get_chat_template

* Take suggestion from CR

Co-authored-by: Matt <[email protected]>

---------

Co-authored-by: Matt <[email protected]>

* uniformize git processor (#33668)

* uniformize git processor

* update doctring

* Modular `transformers`: modularity and inheritance for new model additions (#33248)

* update exampel

* update

* push the converted diff files for testing and ci

* correct one example

* fix class attributes and docstring

* nits

* oups

* fixed config!

* update

* nitd

* class attributes are not matched against the other, this is missing

* fixed overwriting self.xxx now onto the attributes I think

* partial fix, now order with docstring

* fix docstring order?

* more fixes

* update

* fix missing docstrings!

* examples don't all work yet

* fixup

* nit

* updated

* hick

* update

* delete

* update

* update

* update

* fix

* all default

* no local import

* fix more diff

* some fix related to "safe imports"

* push fixed

* add helper!

* style

* add a check

* all by default

* add the

* update

* FINALLY!

* nit

* fix config dependencies

* man that is it

* fix fix

* update diffs

* fix the last issue

* re-default to all

* alll the fixes

* nice

* fix properties vs setter

* fixup

* updates

* update dependencies

* make sure to install what needs to be installed

* fixup

* quick fix for now

* fix!

* fixup

* update

* update

* updates

* whitespaces

* nit

* fix

* simplify everything, and make it file agnostic (should work for image processors)

* style

* finish fixing all import issues

* fixup

* empty modeling should not be written!

* Add logic to find who depends on what

* update

* cleanup

* update

* update gemma to support positions

* some small nits

* this is the correct docstring for gemma2

* fix merging of docstrings

* update

* fixup

* update

* take doc into account

* styling

* update

* fix hidden activation

* more fixes

* final fixes!

* fixup

* fixup instruct  blip video

* update

* fix bugs

* align gemma2 with the rest as well

* updats

* revert

* update

* more reversiom

* grind

* more

* arf

* update

* order will matter

* finish del stuff

* update

* rename to modular

* fixup

* nits

* update makefile

* fixup

* update order of the checks!

* fix

* fix docstring that has a call inside

* fiix conversion check

* style

* add some initial documentation

* update

* update doc

* some fixup

* updates

* yups

* Mostly todo gimme a minut

* update

* fixup

* revert some stuff

* Review docs for the modular transformers (#33472)

Docs

* good update

* fixup

* mmm current updates lead to this code

* okay, this fixes it

* cool

* fixes

* update

* nit

* updates

* nits

* fix doc

* update

* revert bad changes

* update

* updates

* proper update

* update

* update?

* up

* update

* cool

* nits

* nits

* bon bon

* fix

* ?

* minimise changes

* update

* update

* update

* updates?

* fixed gemma2

* kind of a hack

* nits

* update

* remove `diffs` in favor of `modular`

* fix make fix copies

---------

Co-authored-by: Lysandre Debut <[email protected]>

* Fix CIs post merging modular transformers (#33681)

update

* Fixed docstring for cohere model regarding unavailability of prune_he… (#33253)

* Fixed docstring for cohere model regarding unavailability of prune_head() methods

The docstring mentions that cohere model supports prune_heads() methods. I have fixed the docstring by explicitly mentioning that it doesn't support that functionality.

* Update src/transformers/models/cohere/modeling_cohere.py

---------

Co-authored-by: Lysandre Debut <[email protected]>

* Generation tests: update imagegpt input name, remove unused functions (#33663)

* Improve Error Messaging for Flash Attention 2 on CPU (#33655)

Update flash-attn error message on CPU

Rebased to latest branch

* Gemma2: fix config initialization (`cache_implementation`) (#33684)

* Fix ByteLevel alphabet missing when Sequence pretokenizer is used (#33556)

* Fix ByteLevel alphabet missing when Sequence pretokenizer is used

* Fixed formatting with `ruff`.

* Uniformize kwargs for image-text-to-text processors (#32544)

* uniformize FUYU processor kwargs

* Uniformize instructblip processor kwargs

* Fix processor kwargs and tests Fuyu, InstructBlip, Kosmos2

* Uniformize llava_next processor

* Fix save_load test for processor with chat_template only as extra init args

* Fix import Unpack

* Fix Fuyu Processor import

* Fix FuyuProcessor import

* Fix FuyuProcessor

* Add defaults for specific kwargs kosmos2

* Fix Udop to return BatchFeature instead of BatchEncoding and uniformize kwargs

* Add tests processor Udop

* remove Copied from in processing Udop as change of input orders caused by BatchEncoding -> BatchFeature

* Fix overwrite tests kwargs processors

* Add warnings and BC for changes in processor inputs order, change docs, add BC for text_pair as arg for Udop

* Fix processing test fuyu

* remove unnecessary pad_token check in instructblip ProcessorTest

* Fix BC tests and cleanup

* FIx imports fuyu

* Uniformize Pix2Struct

* Fix wrong name for FuyuProcessorKwargs

* Fix slow tests reversed inputs align fuyu llava-next, change udop warning

* Fix wrong logging import udop

* Add check images text input order

* Fix copies

* change text pair handling when positional arg

* rebase on main, fix imports in test_processing_common

* remove optional args and udop uniformization from this PR

* fix failing tests

* remove unnecessary test, fix processing utils and test processing common

* cleanup Unpack

* cleanup

* fix conflict grounding dino

* 🚨🚨 Setting default behavior of assisted decoding (#33657)

* tests: fix pytorch tensor placement errors (#33485)

This commit fixes the following errors:
* Fix "expected all tensors to be on the same device" error
* Fix "can't convert device type tensor to numpy"

According to pytorch documentation torch.Tensor.numpy(force=False)
performs conversion only if tensor is on CPU (plus few other restrictions)
which is not the case. For our case we need force=True since we just
need a data and don't care about tensors coherency.

Fixes: #33517
See: https://pytorch.org/docs/2.4/generated/torch.Tensor.numpy.html

Signed-off-by: Dmitry Rogozhkin <[email protected]>

* bump tokenizers, fix added tokens fast (#32535)

* update based on tokenizers release

* update

* nits

* update

* revert re addition

* don't break that yet

* fmt

* revert unwanted

* update tokenizers version

* update dep table

* update

* update in conversion script as well

* some fix

* revert

* fully revert

* fix training

* remove set trace

* fixup

* update

* update

* [Pixtral] Improve docs, rename model (#33491)

* Improve docs, rename model

* Fix style

* Update repo id

* fix code quality after merge

* HFQuantizer implementation for compressed-tensors library (#31704)

* update model card for opt

* add batch size to inference table

* [slow-run] opt

* [run-slow] opt

---------

Signed-off-by: Dmitry Rogozhkin <[email protected]>
Co-authored-by: Avishai Elmakies <[email protected]>
Co-authored-by: amyeroberts <[email protected]>
Co-authored-by: Pablo Montalvo <[email protected]>
Co-authored-by: chengchengpei <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Yoni Gozlan <[email protected]>
Co-authored-by: Joao Gante <[email protected]>
Co-authored-by: jiqing-feng <[email protected]>
Co-authored-by: Aarni Koskela <[email protected]>
Co-authored-by: Titus von Koeller <[email protected]>
Co-authored-by: Marc Sun <[email protected]>
Co-authored-by: Arthur <[email protected]>
Co-authored-by: Tibor Reiss <[email protected]>
Co-authored-by: Matt <[email protected]>
Co-authored-by: Lysandre Debut <[email protected]>
Co-authored-by: Muhammad Naufil <[email protected]>
Co-authored-by: sizhky <[email protected]>
Co-authored-by: Umar Butler <[email protected]>
Co-authored-by: Jonathan Mamou <[email protected]>
Co-authored-by: Dmitry Rogozhkin <[email protected]>
Co-authored-by: NielsRogge <[email protected]>
Co-authored-by: Arthur Zucker <[email protected]>
Co-authored-by: Benjamin Fineran <[email protected]>
Co-authored-by: George Ohashi <[email protected]>
Co-authored-by: Sara Adkins <[email protected]>
Co-authored-by: Sara Adkins <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
Co-authored-by: Dipika <[email protected]>
NielsRogge added a commit to NielsRogge/transformers that referenced this pull request Oct 21, 2024
* add sdpa to OPT

* chore: remove redundant whitespace in OPTDecoder class

* fixup

* bug fix

* add sdpa and attention generate test

* fixup

* Refactor OPTAttention forward method for improved readability and maintainability

* undo refactor for _shape and key,val states

* add OPT to doc, fixup didn't find it for some reason

* change order

* change default attn_implemntation in testing to eager

* [run-slow] opt

* change test_eager_matches_sdpa_generate to the one llama

* Update default attention implementation in testing common

* [run-slow] opt

* remove uneeded print

* [run-slow] opt

* refactor model testers to have attn_implementation="eager"

* [run-slow] opt

* convert test_eager_matches_sdpa_generate to opt-350M

* bug fix when creating mask for opt

* [run-slow] opt

* if layer head mask default to eager

* if head mask is not none fall to eager

* [run-slow] opt

* Update src/transformers/models/opt/modeling_opt.py

Co-authored-by: amyeroberts <[email protected]>

* Clean up Unpack imports (huggingface#33631)

clean up Unpack imports

* Fix DPT /Dinov2 sdpa regression on main (huggingface#33660)

* fallback to eager if output attentions.

* fix copies

* handle dependency errors in check_imports (huggingface#33622)

* handle dependency errors in check_imports

* change log level to warning

* add back self.max_position_embeddings = config.max_position_embeddings (huggingface#33550)

* add back self.max_position_embeddings = config.max_position_embeddings

* fix-copies

* Fix Llava conversion for LlavaQwen2ForCausalLM with Clip vision tower (huggingface#33613)

fix llavaqwen2 model conversion

* Uniformize kwargs for Udop processor and update docs (huggingface#33628)

* Add optional kwargs and uniformize udop

* cleanup Unpack

* nit Udop

* Generation: deprecate `PreTrainedModel` inheriting from `GenerationMixin`  (huggingface#33203)

* Enable BNB multi-backend support (huggingface#31098)

* enable cpu bnb path

* fix style

* fix code style

* fix 4 bit path

* Update src/transformers/utils/import_utils.py

Co-authored-by: Aarni Koskela <[email protected]>

* add multi backend refactor tests

* fix style

* tweak 4bit quantizer + fix corresponding tests

* tweak 8bit quantizer + *try* fixing corresponding tests

* fix dequant bnb 8bit

* account for Intel CPU in variability of expected outputs

* enable cpu and xpu device map

* further tweaks to account for Intel CPU

* fix autocast to work with both cpu + cuda

* fix comments

* fix comments

* switch to testing_utils.torch_device

* allow for xpu in multi-gpu tests

* fix tests 4bit for CPU NF4

* fix bug with is_torch_xpu_available needing to be called as func

* avoid issue where test reports attr err due to other failure

* fix formatting

* fix typo from resolving of merge conflict

* polish based on last PR review

Co-authored-by: Marc Sun <[email protected]>

* fix CI

* Update src/transformers/integrations/integration_utils.py

Co-authored-by: Arthur <[email protected]>

* Update src/transformers/integrations/integration_utils.py

Co-authored-by: Arthur <[email protected]>

* fix error log

* fix error msg

* add \n in error log

* make quality

* rm bnb cuda restriction in doc

* cpu model don't need dispatch

* fix doc

* fix style

* check cuda avaliable in testing

* fix tests

* Update docs/source/en/model_doc/chameleon.md

Co-authored-by: Marc Sun <[email protected]>

* Update docs/source/en/model_doc/llava_next.md

Co-authored-by: Aarni Koskela <[email protected]>

* Update tests/quantization/bnb/test_4bit.py

Co-authored-by: Aarni Koskela <[email protected]>

* Update tests/quantization/bnb/test_4bit.py

Co-authored-by: Aarni Koskela <[email protected]>

* fix doc

* fix check multibackends

* fix import sort

* remove check torch in bnb

* docs: update bitsandbytes references with multi-backend info

* docs: fix small mistakes in bnb paragraph

* run formatting

* reveret bnb check

* move bnb multi-backend check to import_utils

* Update src/transformers/utils/import_utils.py

Co-authored-by: Aarni Koskela <[email protected]>

* fix bnb check

* minor fix for bnb

* check lib first

* fix code style

* Revert "run formatting"

This reverts commit ac108c6.

* fix format

* give warning when bnb version is low and no cuda found]

* fix device assignment check to be multi-device capable

* address akx feedback on get_avlbl_dev fn

* revert partially, as we don't want the function that public, as docs would be too much (enforced)

---------

Co-authored-by: Aarni Koskela <[email protected]>
Co-authored-by: Titus von Koeller <[email protected]>
Co-authored-by: Marc Sun <[email protected]>
Co-authored-by: Arthur <[email protected]>

* Fix error string after refactoring into get_chat_template (huggingface#33652)

* Fix error string after refactoring into get_chat_template

* Take suggestion from CR

Co-authored-by: Matt <[email protected]>

---------

Co-authored-by: Matt <[email protected]>

* uniformize git processor (huggingface#33668)

* uniformize git processor

* update doctring

* Modular `transformers`: modularity and inheritance for new model additions (huggingface#33248)

* update exampel

* update

* push the converted diff files for testing and ci

* correct one example

* fix class attributes and docstring

* nits

* oups

* fixed config!

* update

* nitd

* class attributes are not matched against the other, this is missing

* fixed overwriting self.xxx now onto the attributes I think

* partial fix, now order with docstring

* fix docstring order?

* more fixes

* update

* fix missing docstrings!

* examples don't all work yet

* fixup

* nit

* updated

* hick

* update

* delete

* update

* update

* update

* fix

* all default

* no local import

* fix more diff

* some fix related to "safe imports"

* push fixed

* add helper!

* style

* add a check

* all by default

* add the

* update

* FINALLY!

* nit

* fix config dependencies

* man that is it

* fix fix

* update diffs

* fix the last issue

* re-default to all

* all the fixes

* nice

* fix properties vs setter

* fixup

* updates

* update dependencies

* make sure to install what needs to be installed

* fixup

* quick fix for now

* fix!

* fixup

* update

* update

* updates

* whitespaces

* nit

* fix

* simplify everything, and make it file agnostic (should work for image processors)

* style

* finish fixing all import issues

* fixup

* empty modeling should not be written!

* Add logic to find who depends on what

* update

* cleanup

* update

* update gemma to support positions

* some small nits

* this is the correct docstring for gemma2

* fix merging of docstrings

* update

* fixup

* update

* take doc into account

* styling

* update

* fix hidden activation

* more fixes

* final fixes!

* fixup

* fixup instruct  blip video

* update

* fix bugs

* align gemma2 with the rest as well

* updates

* revert

* update

* more reversion

* grind

* more

* arf

* update

* order will matter

* finish del stuff

* update

* rename to modular

* fixup

* nits

* update makefile

* fixup

* update order of the checks!

* fix

* fix docstring that has a call inside

* fix conversion check

* style

* add some initial documentation

* update

* update doc

* some fixup

* updates

* yups

* Mostly todo, gimme a minute

* update

* fixup

* revert some stuff

* Review docs for the modular transformers (huggingface#33472)

Docs

* good update

* fixup

* mmm current updates lead to this code

* okay, this fixes it

* cool

* fixes

* update

* nit

* updates

* nits

* fix doc

* update

* revert bad changes

* update

* updates

* proper update

* update

* update?

* up

* update

* cool

* nits

* nits

* bon bon

* fix

* ?

* minimise changes

* update

* update

* update

* updates?

* fixed gemma2

* kind of a hack

* nits

* update

* remove `diffs` in favor of `modular`

* fix make fix copies

---------

Co-authored-by: Lysandre Debut <[email protected]>

* Fix CIs post merging modular transformers (huggingface#33681)

update

* Fixed docstring for cohere model regarding unavailability of prune_he… (huggingface#33253)

* Fixed docstring for cohere model regarding unavailability of prune_head() methods

The docstring stated that the Cohere model supports the prune_heads() method. This change fixes the docstring to state explicitly that the model does not support that functionality.

* Update src/transformers/models/cohere/modeling_cohere.py

---------

Co-authored-by: Lysandre Debut <[email protected]>

* Generation tests: update imagegpt input name, remove unused functions (huggingface#33663)

* Improve Error Messaging for Flash Attention 2 on CPU (huggingface#33655)

Update flash-attn error message on CPU

Rebased to latest branch

* Gemma2: fix config initialization (`cache_implementation`) (huggingface#33684)

* Fix ByteLevel alphabet missing when Sequence pretokenizer is used (huggingface#33556)

* Fix ByteLevel alphabet missing when Sequence pretokenizer is used

* Fixed formatting with `ruff`.

* Uniformize kwargs for image-text-to-text processors (huggingface#32544)

* uniformize FUYU processor kwargs

* Uniformize instructblip processor kwargs

* Fix processor kwargs and tests Fuyu, InstructBlip, Kosmos2

* Uniformize llava_next processor

* Fix save_load test for processor with chat_template only as extra init args

* Fix import Unpack

* Fix Fuyu Processor import

* Fix FuyuProcessor import

* Fix FuyuProcessor

* Add defaults for specific kwargs kosmos2

* Fix Udop to return BatchFeature instead of BatchEncoding and uniformize kwargs

* Add tests processor Udop

* remove Copied from in processing Udop as change of input orders caused by BatchEncoding -> BatchFeature

* Fix overwrite tests kwargs processors

* Add warnings and BC for changes in processor inputs order, change docs, add BC for text_pair as arg for Udop

* Fix processing test fuyu

* remove unnecessary pad_token check in instructblip ProcessorTest

* Fix BC tests and cleanup

* Fix imports fuyu

* Uniformize Pix2Struct

* Fix wrong name for FuyuProcessorKwargs

* Fix slow tests reversed inputs align fuyu llava-next, change udop warning

* Fix wrong logging import udop

* Add check images text input order

* Fix copies

* change text pair handling when positional arg

* rebase on main, fix imports in test_processing_common

* remove optional args and udop uniformization from this PR

* fix failing tests

* remove unnecessary test, fix processing utils and test processing common

* cleanup Unpack

* cleanup

* fix conflict grounding dino

* 🚨🚨 Setting default behavior of assisted decoding (huggingface#33657)

* tests: fix pytorch tensor placement errors (huggingface#33485)

This commit fixes the following errors:
* Fix "expected all tensors to be on the same device" error
* Fix "can't convert device type tensor to numpy"

According to the PyTorch documentation, torch.Tensor.numpy(force=False)
performs the conversion only if the tensor is on the CPU (plus a few other
restrictions), which is not the case here. We need force=True since we
only need the data and don't care about tensor coherency.

Fixes: huggingface#33517
See: https://pytorch.org/docs/2.4/generated/torch.Tensor.numpy.html

Signed-off-by: Dmitry Rogozhkin <[email protected]>
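
The behaviour described in the commit message above can be illustrated with a minimal sketch. It is not the code from the test suite; the tensor, its values, and the variable names are assumptions chosen for illustration. It relies on the documented fact that `Tensor.numpy()` with the default `force=False` refuses to convert a tensor that lives on a non-CPU device or that requires grad, while `force=True` detaches the tensor and moves it to the CPU first.

```python
import numpy as np
import torch

# Use an accelerator when one is present; the sketch also runs on CPU-only machines.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Illustrative tensor (an assumption, not taken from the tests). requires_grad=True
# makes the default conversion fail even on CPU, so the effect is visible everywhere.
t = torch.arange(4, dtype=torch.float32, device=device, requires_grad=True)

try:
    t.numpy()  # force=False is the default
    default_failed = False
except (RuntimeError, TypeError) as err:
    # RuntimeError for tensors that require grad, TypeError for non-CPU tensors.
    default_failed = True
    print(f"default conversion failed: {err}")

# force=True detaches the tensor and copies it to the CPU if needed.
# The resulting array may not share memory with the tensor.
arr = t.numpy(force=True)
print(type(arr), arr.dtype, arr.shape)
```

The trade-off is that the returned array is only a snapshot of the data, which is acceptable when a test only compares values.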

* bump tokenizers, fix added tokens fast (huggingface#32535)

* update based on tokenizers release

* update

* nits

* update

* revert re addition

* don't break that yet

* fmt

* revert unwanted

* update tokenizers version

* update dep table

* update

* update in conversion script as well

* some fix

* revert

* fully revert

* fix training

* remove set trace

* fixup

* update

* update

* [Pixtral] Improve docs, rename model (huggingface#33491)

* Improve docs, rename model

* Fix style

* Update repo id

* fix code quality after merge

* HFQuantizer implementation for compressed-tensors library (huggingface#31704)

* Add compressed-tensors HFQuantizer implementation

* flag serializable as False

* run

* revive lines deleted by ruff

* fixes to load+save from sparseml, edit config to quantization_config, and load back

* address satrat comment

* compressed_tensors to compressed-tensors and revert back is_serializable

* rename quant_method from sparseml to compressed-tensors

* tests

* edit tests

* clean up tests

* make style

* cleanup

* cleanup

* add test skip for when compressed tensors is not installed

* remove pydantic import + style

* delay torch import in test

* initial docs

* update main init for compressed tensors config

* make fix-copies

* docstring

* remove fill_docstring

* Apply suggestions from code review

Co-authored-by: Marc Sun <[email protected]>

* review comments

* review comments

* comments - suppress warnings on state dict load, tests, fixes

* bug-fix - remove unnecessary call to apply quant lifecycle

* run_compressed compatibility

* revert changes not needed for compression

* no longer need unexpected keys fn

* unexpected keys not needed either

* Apply suggestions from code review

Co-authored-by: Marc Sun <[email protected]>

* add to_diff_dict

* update docs and expand testing

* Update _toctree.yml with compressed-tensors

* Update src/transformers/utils/quantization_config.py

Co-authored-by: Arthur <[email protected]>

* update doc

* add note about saving a loaded model

---------

Co-authored-by: George Ohashi <[email protected]>
Co-authored-by: Marc Sun <[email protected]>
Co-authored-by: Sara Adkins <[email protected]>
Co-authored-by: Arthur <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
Co-authored-by: Dipika <[email protected]>
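
The "add test skip for when compressed tensors is not installed" step listed in the commit above can be sketched with the standard library alone. This is a hedged illustration, not the helper used in the transformers test suite: `is_package_available`, `require_package`, and `CompressedTensorsSmokeTest` are hypothetical names, and the import name `compressed_tensors` is assumed to match the installed package.

```python
import importlib.util
import unittest


def is_package_available(name: str) -> bool:
    """Hypothetical helper: True when the top-level module `name` can be imported."""
    return importlib.util.find_spec(name) is not None


def require_package(name: str):
    """Hypothetical decorator: skip a test or test class unless `name` is importable."""
    return unittest.skipUnless(is_package_available(name), f"test requires {name}")


@require_package("compressed_tensors")  # assumed import name of the library
class CompressedTensorsSmokeTest(unittest.TestCase):
    def test_import(self):
        # Import inside the test so that collecting the test never needs the package.
        import compressed_tensors

        self.assertEqual(compressed_tensors.__name__, "compressed_tensors")


# Run the case: it is reported as skipped when the package is missing.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(CompressedTensorsSmokeTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print("skipped:", len(result.skipped), "failures:", len(result.failures))
```

Deferring the import into the test body pairs naturally with the "delay torch import in test" step: the module can be collected on machines that lack the optional dependency.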

* update model card for opt

* add batch size to inference table

* [slow-run] opt

* [run-slow] opt

---------

Signed-off-by: Dmitry Rogozhkin <[email protected]>
Co-authored-by: Avishai Elmakies <[email protected]>
Co-authored-by: amyeroberts <[email protected]>
Co-authored-by: Pablo Montalvo <[email protected]>
Co-authored-by: chengchengpei <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Yoni Gozlan <[email protected]>
Co-authored-by: Joao Gante <[email protected]>
Co-authored-by: jiqing-feng <[email protected]>
Co-authored-by: Aarni Koskela <[email protected]>
Co-authored-by: Titus von Koeller <[email protected]>
Co-authored-by: Marc Sun <[email protected]>
Co-authored-by: Arthur <[email protected]>
Co-authored-by: Tibor Reiss <[email protected]>
Co-authored-by: Matt <[email protected]>
Co-authored-by: Lysandre Debut <[email protected]>
Co-authored-by: Muhammad Naufil <[email protected]>
Co-authored-by: sizhky <[email protected]>
Co-authored-by: Umar Butler <[email protected]>
Co-authored-by: Jonathan Mamou <[email protected]>
Co-authored-by: Dmitry Rogozhkin <[email protected]>
Co-authored-by: NielsRogge <[email protected]>
Co-authored-by: Arthur Zucker <[email protected]>
Co-authored-by: Benjamin Fineran <[email protected]>
Co-authored-by: George Ohashi <[email protected]>
Co-authored-by: Sara Adkins <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
Co-authored-by: Dipika <[email protected]>