From 71ee4f4f76a1835fc695508e7e8f2cace72e45a5 Mon Sep 17 00:00:00 2001
From: Yao Lu
Date: Sun, 7 Jul 2024 01:45:54 -0700
Subject: [PATCH] Rest API to download lora adapter on router
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Init
fix: cleanup
Add load testing
Refactored gRPC interface
Added validation logic
ValidationError was not correctly handled
Use axum
feat: Docker image
feat: Add AML deployment
Update aml deployment
feat: Improve error handling
feat: Add arguments to CLI
v0.1.0
fix(validation): Fix error messages
feat(router): Add max_waiting_tokens
Create LICENSE (#2)
feat(server): Use safetensors Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
feat(client): Simplify sharded logic
feat(server): Support bitsandbytes
feat(server): Support all AutoModelForCausalLM on a best effort basis
feat: Use json formatter by default in docker image
fix(models): Revert buggy support for AutoModel
feat(server): Support generic AutoModelForCausalLM
feat(server): Support AutoModelForSeq2SeqLM
feat(launcher): Pass CUDA_VISIBLE_DEVICES to the shard
feat(server): Improved doc
fix(server): Fix Transformers fork version
feat(server): Clarify CausalLMBatch concatenate method
feat(rust): Update to 1.65
fix(router): Fix HTTP status codes
fix(readme): Typo
fix(router): Handle tokenizer errors
feat(server): Support Galactica (#4)
fix(batching): Avoid theoretical hang in batcher loop (#5)
- Avoid theoretical hang in batcher loop
- Avoid a couple of clones in the router generate method
- Keep attention mask tensors as integers
- Remove num_heads attribute
Co-authored-by: OlivierDehaene
feat(server): Add model tests (#6)
fix(server): Only pad to multiple of 8 on GPUs
feat: Support stop sequences (#7)
feat: Return logprobs (#8)
feat(launcher): Add integration tests (#9)
fix(server): Fix stop sequences (#11)
fix(server): Check for device type correctly when determining initial padding (#16) AFAIK there is no torch device type called "gpu".
fix(router): Include special tokens when tokenizing (#14) There's currently a discrepancy in the tokenization between the router and the python server code. The latter includes special tokens but the former does not. This results in a token count mismatch for seq2seq models such as mt0, where the tokenizer emits an EOS token at the end. That in turn produces some unexpected/incorrect output, in particular when batch concatenation is involved, because the python code uses the input length passed from the router for each row. As far as I can tell, it is better to include this token in the encoder `input_ids`, so I guess it's best to just adjust on the router side.
feat(router): Add const parameters to validation logic (#15) I noticed some opportunity to collapse some of the logic, in case you are interested.
fix(server): Use cleanup_tokenization_spaces=False for lossless decoding (#13) Fixes #12 in the easiest way I could think of.
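A minimal illustration of the lossless-decoding point in #13 above, assuming a standard `transformers` tokenizer (the actual keyword argument is `clean_up_tokenization_spaces`); the model id and input string are placeholders:
```python
from transformers import AutoTokenizer

# Placeholder model id; any tokenizer whose cleanup touches spaces around punctuation works.
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

ids = tokenizer.encode("Hello , world !")
# Default cleanup may rewrite " ," as "," etc., so decode(encode(x)) != x.
print(tokenizer.decode(ids, clean_up_tokenization_spaces=True))
# Disabling cleanup keeps decoding lossless with respect to the tokenized text.
print(tokenizer.decode(ids, clean_up_tokenization_spaces=False))
```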
feat(launcher): Log server stdout (#19) Co-authored-by: Nick Hill
fix(server): Minor refactorization using new_zeros (#24)
- Fix some type hints, in particular base tokenizer class
- Make use of `tensor.new_zero/empty` methods
- Simplify env var string parsing in launcher
fix(router): Obey max batch size (#23)
feat(server): Support SantaCoder (#26)
fix(server): Fix position ids (#28)
feat(docker): Make the image compatible with api-inference (#29)
fix(docker): fix api-inference deployment (#30)
fix(router): fix api-inference deployment (#31)
fix(dockerfile): fix docker build (#32)
feat(bloom): use torch.nn.Linear and torch.nn.GELU (#33)
feat(router): Remove second lock from batcher hot path (#27) @njhill
feat: Support sampling seeding (#37) Co-authored-by: Yannic Kilcher
feat: Add token streaming using ServerSideEvents support (#36) Add token streaming using ServerSideEvents (SSE). The signature of the SSE events is:
```rust
struct Details {
    finish_reason: String,
    generated_tokens: u32,
    seed: Option<u64>,
}

struct StreamResponse {
    token: Token,
    generated_text: Option<String>,
    details: Option<Details>,
}

struct ErrorResponse {
    error: String,
}
```
Revert "feat: Add token streaming using ServerSideEvents support" (#40) Reverts huggingface/text-generation-inference#36
fix(server): fix seeding on gpu (#42)
fix(server): fix seeding with multiple shards (#44)
feat: Add token streaming using ServerSideEvents support (#41)
fix(server): fix quantization for sharded models (#45)
feat(server): Support GPT-Neox (#39)
feat(ci): Docker build and push (#46)
feat(server): allow gpt-neox models with odd vocab sizes to be sharded (#48)
feat(server): support repetition penalty (#47)
feat(server): allow the server to use a local weight cache (#49)
fix(server): allow greedy repetition penalty (#51)
feat(router): use background task to manage request queue (#52) Co-authored-by: Nick Hill
breaking(router): modify /generate API to only return generated text (#50) @njhill, @yk FYI: generated_text was concatenated to the user prompt for legacy reasons. We want to remove this behaviour as we don't think it is useful and it is even detrimental to usability. We also remove the unused Vec.
feat(router): refactor API and add openAPI schemas (#53)
feat(docs): Clarify installation steps (#54) Adds some bits for first-time users (like me 😄)
feat(ci): push to AML registry (#56)
fix(server): better handling of inference mode (#57)
V0.2.1 (#58)
feat(server): support t5 (#59)
fix(docker): increase shm size (#60)
fixed SSE naming (#61) https://en.wikipedia.org/wiki/Server-sent_events
feat: add distributed tracing (#62)
feat: add safetensors conversion (#63)
feat(server): improve download logging (#66)
feat(launcher): add disable_custom_kernels arg (#67)
feat(router): add max_total_tokens and empty_input validation (#68) closes #65
fix(launcher): copy current env vars to subprocesses (#70) closes #69
feat(router): add prometheus metrics scrape endpoint (#71)
v0.3.0 (#72)
feat(router): add cors allow origin options (#73)
feat(server): enable hf-transfer (#76)
fix(server): remove position_ids from galactica forward (#82) closes #80
feat(server): pre-allocate max attention mask (#75)
v0.3.1 (#84)
feat(server): add special token bool (#85)
fix(docs): fix openapi schema (#86)
fix(server): fix token_is_special (#87)
feat(router): add legacy route for api-inference support (#88)
feat(router): ask hf.co for pipelinetag to decide on compat_return_full_text (#89)
feat(router): add api-inference headers (#91)
feat(server): add logits watermark (#90)
feat(server): update to hf_transfer==0.1.2 (#93)
feat(ci): improve CI speed (#94)
fix(launcher): add router parameters to launcher (#95)
feat(server): fix transformers commit (#96)
v0.3.2 (#97)
fix(server): fix generate_stream by forcing tokens to be decoded correctly (#100)
feat: allow local models (#101) closes #99
feat: add supported models (#102)
feat(clients): Python client (#103)
fix(server): fix galactica batch (#106) closes #105
feat(launcher): allow parsing num_shard from CUDA_VISIBLE_DEVICES (#107)
feat(launcher): default num_shard to CUDA_VISIBLE_DEVICES if possible (#108)
fix(python-client): stream not set on the sync client (#109)
fix(server): fix index out of range for watermarking (#110)
feat: support typical sampling (#114) closes #112
fix(server): do not warp prefill logits (#116)
feat(router): support left truncation (#115) closes #111
feat(router): add best_of parameter (#117)
feat(python-client): add new parameters (#118)
v0.4.0 (#119)
feat: add OpenAssistant/oasst-sft-1-pythia-12b to the list of supported models (#122)
fix(server): revert gpt-neox optims (#123)
fix(server): add position ids to neox (#126)
fix(server): use server tokenizer as gt (#128)
fix(python-client): relax dependencies (#129)
feat(python-client): add cookies to Client constructors and requests (#132) I have a use case where we need to pass cookies (for auth reasons) to an internally hosted server. Note: I couldn't get the client tests to pass - do you need to have an HF token?
```python
FAILED tests/test_client.py::test_generate - text_generation.errors.BadRequestError: Authorization header is correct, but the token seems invalid
```
feat(ci): add ci paths (#134)
feat: Add note about NVIDIA drivers (#64) Co-authored-by: OlivierDehaene
feat(python-client): release v0.4.0 (#135)
feat(python-client): add CI (#136)
feat(server): flash neoX (#133)
fix(server): fix flash-neox scores warping (#137)
feat(server): cleanup flash neox loading (#139)
v0.4.1 (#140)
fix(server): Avoid using try/except to determine kind of AutoModel (#142)
feat(server): Add mypy-protobuf (#141) Generates .pyi files for protobuf stubs which provide strong typing information. Very helpful for IDE auto-completion, etc.
feat(server): clear cache on error (#143)
feat(server): reduce mlp and attn in one op for flash neox (#145)
feat: aws sagemaker compatible image (#147) The only difference is that now it pushes to registry.internal.huggingface.tech/api-inference/community/text-generation-inference/sagemaker:... instead of registry.internal.huggingface.tech/api-inference/community/text-generation-inference:sagemaker-... Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>
fix(ci): fix sagemaker action (#148)
feat(benchmark): tui based benchmarking tool (#149)
fix(server): fix flash neox rotary embeddings (#150)
v0.4.2 (#151)
v0.4.3 (#152)
feat(server): flash santacoder (#153)
docs(readme): provide link to Logits Warper README (#154)
fix(server): fix escape characters in stop sequence (#155)
feat(docker): improve flash_attention caching (#160)
feat(launcher): allow disabling hf_transfer (#161)
fix(rust-client): use join_all instead of select_all to hopefully fix nccl issues (#162)
fix(router): use buckets for metrics histograms (#163)
feat(router): make router input validation optional (#164)
feat(server): add flash attention llama (#144)
feat(server): support OPT models (#55) OPT models do not all have a `tokenizer.json` file on the hub at the moment. Can't merge for now.
v0.5.0 (#168)
feat(server): optimize decode for sane tokenizers (#170)
feat(server): support sharded santacoder (#167)
fix(launcher): revert change on shard errors (#173)
fix(ci): fix CVE in github-slug-action (#174)
feat(ci): add image signing with cosign (#175)
feat(ci): add Trivy and scan docker image (#178)
feat(ci): use large runners (#179)
feat(ci): faster scanning (#180)
fix(ci): fix ci permissions (#181)
fea(dockerfile): better layer caching (#159)
fix(ci): fix cosign error (#183)
fix(docker): fix docker image (#184)
fix(docker): fix image (#185)
fix(docker): revert dockerfile changes (#186)
fix(docker): fix docker image dependencies (#187)
fix(router): fix truncation (#190) closes #189
feat(python-client): get list of currently deployed tgi models using the inference API (#191)
feat(router): add info route (#196) close #125
feat(server): support quantization for flash models (#200) closes #197
feat(server): check cuda capability when importing flash models (#201) close #198
fix(server): fix hf_transfer issue with private repos (#203)
fix(docker): remove unused dependencies (#205)
fix(router): add auth token to get model info (#207)
feat(router): add git sha to info route (#208)
feat(router): drop requests when client closes the channel (#202)
fix(ci): fix sha in docker image (#212)
feat(server): flash attention past key value optimizations (#213)
feat(router): add device and dtype info (#215)
fix(server): fix past key values logic (#216) @njhill fyi
fix(server): cleanup new flash past_key_values logic (#217)
fix(server): fix flash causal (#218)
fix(server): fix flash causal (#219)
fix(server): fix flash batch filtering (#220)
misc: update to rust 1.69 (#221)
v0.6.0 (#222)
feat(server): reduce memory requirement (#214)
chore(server): update huggingface-hub (#227)
feat(router): use number of tokens in batch as input for dynamic batching (#226) Co-authored-by: Nick Hill
feat(router): add endpoint info to /info route (#228)
chore(server): update safetensors version (#235)
fix(python-client): add auth headers to is supported requests (#234)
Starting some routing tests. (#233)
fix(benchmarking): fix benchmarking tool
chore(launcher): refactor logic (#242) Hopefully it's cleaner
feat(router): add tests to validation (#237)
feat(router): new healthcheck that skips the queue (#244) Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com> Co-authored-by: OlivierDehaene
fix(server): fix reshaping of bloom past_key_values in concatenate() (#252) Introduced in #214. Fixes #249
fix(server): Small tidy of code from recent changes (#251) remaining_decode_tokens was calculated twice in Seq2SeqLMBatch.filter()
chore(server): update transformers (#250)
feat(server): add watermarking tests (#248)
feat(docker): add nvidia env vars (#255)
doc(launcher): add more docs to the `launcher` itself and link in the README (#257)
feat(benchmark): add support for private tokenizers (#262)
Adding docs on how dynamic batching works. (#258) This PR starts with the minimal possible amount of explanation I could think of. It tries to explain how dynamic batching occurs and the interactions with past key values, and ignores the padding problem. Maybe some drawings could help too, but I kept it to text for now.
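To make the dynamic-batching idea above concrete, here is a deliberately simplified, hypothetical sketch of token-budget-driven admission (not the actual router code; names and the budget rule are illustrative):
```python
from collections import deque

# Hypothetical illustration of dynamic batching driven by a token budget: new
# requests are admitted into the running batch whenever the budget allows,
# instead of waiting for a fixed-size batch to fill up.

class Request:
    def __init__(self, prompt_tokens: int, max_new_tokens: int):
        self.prompt_tokens = prompt_tokens
        self.max_new_tokens = max_new_tokens

    def budget(self) -> int:
        # Worst-case number of tokens this request may occupy in the batch.
        return self.prompt_tokens + self.max_new_tokens


def admit(queue: deque, running: list, max_batch_total_tokens: int) -> list:
    """Move requests from the waiting queue into the running batch while the
    sum of their worst-case token counts stays under the budget."""
    used = sum(r.budget() for r in running)
    admitted = []
    while queue and used + queue[0].budget() <= max_batch_total_tokens:
        req = queue.popleft()
        used += req.budget()
        running.append(req)
        admitted.append(req)
    return admitted
```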
chore(github): add templates (#264)
fix(server): fix typo in tokenizers decode (#269) closes #268
feat(server): support hf endpoint weight layout (#266)
fix(launcher): pass weights cache override to the download process (#274) closes #273
fix(launcher): handle hub branches (#278)
fix(server): Removes the parallelism in file convertion (during download) (#275)
feat(launcher): Improve error message when download process fails. (#276)
fix(server): fix convert (#284)
chore: add `flash-attention` to docker ignore (#287) It was getting included when building docker locally (where the local dirs might have the flash-attention folder).
fea(server): decrease convert RAM requirements (#286)
fix(dockerfile): fix nvidia env vars (#297) Fixes #291
feat(router): Adding response schema for compat_generate (#292)
feat(docker): add benchmarking tool to docker image (#298)
fix(docker): fix docker build (#299)
feat(server): optim flash causal lm decode_token (#285)
fix(docker): fix nvidia env vars (#305)
fix(docker): remove nvidia require cuda env (#310)
feat(server): shard token decode (#303)
feat(server): use float16 (#304)
fix(docker): remove CUDA_VERSION
feat(server): use cuda graph in logits warping (#302)
fix(server): fix multinomial implem in Sampling
feat(server): GPTQ quantization (step1) (#277) Changes only the type from `bool` to `Option` pretty much everywhere.
- Use `Optional[str]` in Python (easier to manage than importing the type everywhere), except for the cli, to get proper validation.
- Updated all models to handle the new values gracefully (error out on an unknown value, or on gptq since it is not implemented yet).
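A minimal sketch of the `bool` to optional-string change described in #277 above; the function, enum values, and error messages are illustrative assumptions, not the server's actual code:
```python
from enum import Enum
from typing import Optional

# Hypothetical illustration of moving a quantize flag from bool to an optional
# string/enum so new methods (e.g. gptq) can be added without breaking the CLI.
class Quantization(str, Enum):
    bitsandbytes = "bitsandbytes"
    gptq = "gptq"


def get_model(model_id: str, quantize: Optional[str] = None) -> str:
    if quantize is None:
        return f"loading {model_id} unquantized"
    if quantize == Quantization.bitsandbytes:
        return f"loading {model_id} with bitsandbytes"
    if quantize == Quantization.gptq:
        raise NotImplementedError("gptq is not implemented yet")
    raise ValueError(f"unknown quantization {quantize!r}")
```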
chore(docker): use nvidia base image (#318)
fix(docker): remove quantize default
fix(docker): use ubuntu20.04
Hotfixes for santacoder/bigcode. (#294) Hotfixes:
- Uses `model_type`=`gpt_bigcode` for more general usage.
- Hotfixes linked lm_head vs wte_embedding (the safetensors file does not contain the key, correctly, when the file is sharded, whereas pytorch copies the tensor).
Co-authored-by: Ubuntu
Co-authored-by: OlivierDehaene
Lifting check_unitialized. (#325)
Removing dead variables. (#327)
feat(ci): custom gpu runners (#328)
Single place for TP layers + Dropout Layer Norm + FastLinear (#329)
feat: add snapshot testing (#282)
feat(integration-tests): improve comparison and health checks (#336)
fix(server): fix decode token (#334) Fixes #333 Co-authored-by: Nicolas Patry
fix: set MODEL_ID in sagemaker-entrypoint script (#343)
feat(server): Support BLOOMChat-176B (#348) (#351) @njhill, temporary workaround to be able to run our CI as secrets are not available to runners run by external contributors. I will ask around to see if there is a better way. Co-authored-by: Nick Hill
fix(server): fix init for flash causal lm (#352) Fixes #347
fix(server): t5 cannot run in f16 (#356) Fix #349
fix(ci): fix security group (#359) Switch security group used for ci (open outbound rules) Signed-off-by: Raphael Co-authored-by: Raphael
feat: add nightly load testing (#358)
chore(sever): update requirements (#357) Fixes #338
feat(server): support fp16 for t5 (#360) Fixes #349
feat(server): do not use device_map auto on single GPU (#362)
feat(server): support trust_remote_code (#363)
feat(router): log input/ouput at debug level (#364) @njhill FYI
v0.7.0 (#353)
feat: decrease IPC proto size (#367) Closes #307 #308
feat(benchmarker): add summary tables (#368)
feat(server): support vectorized warpers in flash causal lm (#317) Co-authored-by: Joel Lamy-Poirier
Fix issue when load AutoModelForSeq2SeqLM model (#370)
fix(launcher): parse num cuda devices from CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES
fix(server): fix quantization
feat(server): support RefinedWeb models (#379)
v0.8.0
increase health checks
feat(server): add retry on download (#384)
fix(server): fix bnb quantization for CausalLM models (#385)
v0.8.1
fix(server): fix has_position_ids (#395) Fix #389
feat(server): remove trust_remote_code requirement for falcon models (#396)
feat(server): load santacoder/starcoder models with safetensors (#393) Fix #366
v0.8.2
feat(sagemaker): add trust remote code to entrypoint (#394)
feat(launcher): parse oom signal (#404)
feat(server): only compute prefill logprobs when asked (#406) Close #288
feat(server): batch tokenization for flash causal lm (#411)
chore: update openapi schema
feat(server): Rework model loading (#344) Reworked the loading logic. Idea is to use cleaner loading code:
- Remove need for `no_init_weights`
- Remove all weird `bnb_linear` and `load_weights` and `post_load_weights`.
New code layout:
- New class `Weights` in charge of handling loading the weights from multiple files into appropriate tensors (potentially sharded)
- TP layers now are "shells": they contain the code to know what kind of sharding we need + eventual `all_reduce`. They do not inherit from Linear, but they contain some kind of Linear instead; the contained linear can be either FastLinear, BnbLinear or GPTQ Linear next.
- All modeling code is explicitly made for sharding; the process group is just no-ops for non sharded code (removes a lot of test cases)
![Screenshot from 2023-05-19 23-19-59](https://github.com/huggingface/text-generation-inference/assets/204321/9a802654-74a3-488c-87a8-073743a6143f)
Co-authored-by: Ubuntu
Co-authored-by: Ubuntu
Co-authored-by: OlivierDehaene
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
feat(server): optimize dist ops (#434)
docs(launcher): fix CUDA_VISIBLE_DEVICES helper comment (#441) It fixes a typo in the comment sections referencing the environment variable `CUDA_VISIBLE_DEVICES`. No misspelled references to this variable were found in code logic, so no undefined behaviour or bugs are expected. This PR is not expected to perform any code logic modification.
fix(makefile): Fix typo and use POSIX comparison in the makefile (#443) This PR fixes:
- The usage of a non POSIX comparison which may fail depending on the shell used (`=` will always work, `==` only with bash)
- Typo in the env variable name displayed in the error message: `BUILD_EXTENSION` instead of `BUILD_EXTENSIONS`
Fixes #422
feat(server): pre-allocate past key values for flash causal LM (#412)
feat(router): add ngrok integration (#453)
feat(server): improve flash attention import errors (#465) @lewtun, is this enough? Closes #458 Closes #456
fix(server): fix warpers on CPU (#472) Closes #471
fix(server): Fixing T5 in case the names are mixed up. (#475)
feat(server): Update convert logic. (#483) Should be more robust to shared tensors (ok when using `from_pretrained`). But forcing us to add new checks in our loading code (since the chosen key to keep might be different from `transformers`). Co-authored-by: Ubuntu
feat(server): Adding new ignore_rule for conversion. (#485)
fix(router): add timeout on flume sends (#488)
feat(server): Add inference support for GPTQ (llama + falcon tested) + Quantization script (#438) Let's start discussing implementation.
- Need to expose the quantization scripts (either included here or add doc on how to use https://github.com/qwopqwop200/GPTQ-for-LLaMa)
- Make sure GPTQ works for multiple models (priority to Falcon).
Currently it means that every place we use `get_{tensor|sharded}` we need to check for quantization. My idea is to reintegrate as much as possible into `utils/layer.py` by expanding `load_multi` to be a bit more generic. This might require some thinking, but ultimately the `qweight,qzeros,scales,g_idx` should be in a single place, and independent of bias presence.
Co-authored-by: Ubuntu
Co-authored-by: OlivierDehaene
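A rough sketch of the idea in #438 above that the four GPTQ tensors should travel together, independent of bias presence; the container and the `get_tensor` accessor here are hypothetical stand-ins, not the actual `Weights` API:
```python
import torch
from dataclasses import dataclass

# Hypothetical illustration: keep the four GPTQ tensors in one place so layers
# do not have to fetch qweight/qzeros/scales/g_idx individually.
@dataclass
class GPTQWeight:
    qweight: torch.Tensor   # packed quantized weights
    qzeros: torch.Tensor    # packed zero points
    scales: torch.Tensor    # per-group scales
    g_idx: torch.Tensor     # group index per input channel
    bits: int
    groupsize: int


def load_gptq(get_tensor, prefix: str, bits: int, groupsize: int) -> GPTQWeight:
    """`get_tensor` stands in for a Weights-style accessor (possibly sharded)."""
    return GPTQWeight(
        qweight=get_tensor(f"{prefix}.qweight"),
        qzeros=get_tensor(f"{prefix}.qzeros"),
        scales=get_tensor(f"{prefix}.scales"),
        g_idx=get_tensor(f"{prefix}.g_idx"),
        bits=bits,
        groupsize=groupsize,
    )
```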
fix(server): Do not init process group if already initialized (#388)
feat(router): add header option to disable buffering for the generate_stream response (#498) Problem: if a model is run behind a proxy server such as nginx that has buffering enabled, then the response stream from generate_stream gets aggregated into a single response, which basically disables streaming. Instead of getting a chunked response where each token is presented over time, the response presents everything all at once. Solution: this change adds the `X-Accel-Buffering` http header, which disables buffering for the generate_stream response, allowing the response to stream properly.
feat(server): add paged attention to flash models (#516) Closes #478
feat(router): arg validation (#519)
feat: Add the option to force another dtype than `f16`. (#513)
fix(launcher): fix issue where launcher does not properly report shard failures (#522)
v0.9.0 (#525)
feat(server): Add Non flash MPT. (#514) This adds a non flash version of MPT. Flash is harder because we need to create a bias-ready cuda kernel of flash attention. Fixes https://github.com/huggingface/text-generation-inference/issues/361 Fixes https://github.com/huggingface/text-generation-inference/issues/491 Fixes https://github.com/huggingface/text-generation-inference/issues/290
fix: Update server/Makefile to include Makefile-vllm (#520) For consistency and ease of use (you can just run `make` to install vllm without any extra steps).
docs(benchmarker): Adding some help for the options in `text-generation-benchmark`. (#462)
fix(server): Handle loading from local files for MPT (#534) This PR allows the MPT model to be loaded from local files. Without this change, an exception will be thrown by the `hf_hub_download` function if `model_id` is a local path.
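A minimal sketch of the kind of local-path guard the MPT fix above implies; the wrapper function is a made-up illustration, while `hf_hub_download` is the real `huggingface_hub` API:
```python
import os
from huggingface_hub import hf_hub_download

# Hypothetical helper: resolve a file either from a local directory or from the
# Hub, so a local `model_id` never reaches hf_hub_download (which would raise).
def resolve_file(model_id: str, filename: str = "config.json") -> str:
    if os.path.isdir(model_id):
        return os.path.join(model_id, filename)
    return hf_hub_download(repo_id=model_id, filename=filename)
```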
fix(server): avoid errors for very small top_p values (#544) See https://github.com/huggingface/transformers/pull/24111 I didn't add validation to the `__init__` method since it's not done for other values/warpers.
feat(server): use latest flash attention commit (#543) @njhill FYI
feat(router): add argument for hostname in router (#545) (#550) Adds argument `--hostname` in router to support something like `--hostname ::`. Tested with
```commandline
cargo run -- --port 8080 --hostname ::
curl -I -X GET 'http://[::1]:8080/health' # failed before this commit
```
Trigger CI Co-authored-by: Phil Chen
fix(server): decrease memory fragmentation (#557)
v0.9.1 (#558)
fix(server): harden the weights choice to save on disk. (#561)
- Look at the `transformers` base class to check for `_key_to_ignore_on_load_missing` or `_tied_weights`, which are the standard attributes used to select the keys NOT to save on disk (since they are ignored).
- Modified safetensors code (to be reflected in safetensors even if it's an internal function).
- Will not work for trust_remote_code=True repos (like santacoder).
Should help with https://github.com/huggingface/text-generation-inference/issues/555 and https://github.com/huggingface/text-generation-inference/pull/501 and https://github.com/huggingface/text-generation-inference/issues/556 and https://github.com/huggingface/text-generation-inference/issues/482#issuecomment-1623713593
feat: better errors for warmup and TP (#575) Close #571
fix(server): Fixing RW code (it's remote code so the Arch checking doesn't work to see which weights to keep). (#579) Fixes #555
feat(server): Support for env value for GPTQ_BITS and GPTQ_GROUPSIZE. (#580) Some models are already converted and do not have those values in the file; this enables users to use them with less friction. Went for a purely env-based approach because adding flags would end up (imo) very tedious to maintain: there's a lot of sanitization to do, those flags would be errors if not used in conjunction with `--quantize gptq`, and the flags would need to exist in the launcher and the server and be passed throughout all function calls. This PR is intended as an easy escape hatch, not the de facto method to use gptq in TGI. Fixes #500
chore: migrate ci region for more availability. (#581)
fix(server): T5 weights names. (#582) Fixes #541
fix(server): Adding logger import to t5_modeling.py (#585) Logger is referenced during the apex importing but is not imported, causing a NameError.
fix(server): Bug fixes for GPTQ_BITS environment variable passthrough (#590) This fixes a typo and extends the GPTQ_BITS environment variables through to the second method, which requires the same logic. Please let me know if there's anything I've misunderstood in this change. Thanks @Narsil for the original fix.
feat(server): Implements sharding for non divisible `vocab_size`. (#583) The code is relatively easy (just disable the checks on Embedding and Head). This cannot be done in the same easy fashion for hidden_dim/head_dim: it's relatively easy on some models (classic MHA) but it would make the other models (MQA) much more complex, and GPTQ quantization is another quite hairy piece of code.
feat(server): empty cache on errors
GPTQ Env vars: catch correct type of error (#596) When passing in environment variables like gptq_bits, we still get errors thrown from TGI because the try/catch block is catching the wrong type of error. This PR aims to fix that. @Narsil - let me know if this is how you want this formatted. My Python is a little shaky, so I hope this syntax is correct.
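A minimal sketch of env-driven GPTQ parameters with the exception types handled explicitly, in the spirit of #580 and #596 above; the function name and fallback behaviour are illustrative assumptions, not the server's actual code:
```python
import os
from typing import Optional, Tuple

# Hypothetical illustration of reading GPTQ parameters from the environment and
# catching the right exceptions when they are absent or malformed.
def gptq_params_from_env() -> Optional[Tuple[int, int]]:
    try:
        bits = int(os.environ["GPTQ_BITS"])
        groupsize = int(os.environ["GPTQ_GROUPSIZE"])
    except KeyError:
        # Variables not set: fall back to whatever the checkpoint provides.
        return None
    except ValueError as err:
        raise ValueError("GPTQ_BITS and GPTQ_GROUPSIZE must be integers") from err
    return bits, groupsize
```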
feat(launcher): add arg validation and drop subprocess (#595)
feat(router): explicit warning if revision is not set (#608)
docs: README: Add logo + baseline (#611) ![image](https://github.com/huggingface/text-generation-inference/assets/3841370/58177321-479f-4ad1-b3bc-cec027423984)
fix(server): blacklist local files (#609) Close #589 #602
v0.9.2 (#616)
fix(server): empty_cache when stopped
fix(launcher): Rename `b-float16` to `bfloat16` in the launcher arg (#621)
fea(launcher): debug logs (#623)
feat(server): Reworking the quantization script so it's still universal (not llama specific) (#587) It should work on more configurations (no need for 2 GPUs, less RAM usage). Still need to investigate the potential differences in quantization results.
feat(server): flash attention v2 (#624)
feat(server): add support for llamav2 (#633)
v0.9.3 (#634)
fix(server): fix llamav2 config (#635)
feat(server): auto max_batch_total_tokens for flash att models (#630)
feat(router): ngrok edge (#642)
docs: Update README.md (#639)
docs: Update README.md (#643)
Add trust_remote_code to quantize script (#647) Fixes a bug that appeared with MR #587 fixing issue #552. See the discussion in #552. With MR #587 the trust_remote_code variable is not passed to AutoModelForCausalLM, but is found in the function signature. This prevents models like falcon from being quantized, because trust_remote_code is required. This MR fixes the issue. @Narsil
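A minimal sketch of the passthrough #647 describes; the wrapper function is hypothetical, while `AutoModelForCausalLM.from_pretrained` and its `trust_remote_code` argument are the real `transformers` API:
```python
from transformers import AutoModelForCausalLM

# Hypothetical quantize-script helper: the point of #647 is simply that an
# argument accepted in the signature must actually be forwarded to from_pretrained.
def load_model_for_quantization(model_id: str, trust_remote_code: bool = False):
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        trust_remote_code=trust_remote_code,  # previously not passed through
    )
```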
fix(server): llama v2 GPTQ (#648) As per title & reported in https://github.com/huggingface/text-generation-inference/issues/601#issuecomment-1641435956 https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ/discussions/5 Test it:
```
GPTQ_BITS=4 GPTQ_GROUPSIZE=1 text-generation-launcher --model-id TheBloke/Llama-2-70B-chat-GPTQ --port 8080 --num-shard 4 --quantize gptq
```
&
```
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"hey llama","parameters":{"max_new_tokens":256}}' \
    -H 'Content-Type: application/json'
```
fix(server): Fixing non parameters in quantize script `bigcode/starcoder` was an example. (#661)
fix(server): use mem_get_info to get kv cache size (#664) Close https://github.com/huggingface/text-generation-inference/issues/649 Close https://github.com/huggingface/text-generation-inference/issues/651 Close https://github.com/huggingface/text-generation-inference/issues/653 Close #636
feat(server): Add exllama GPTQ CUDA kernel support #553 (#666) Just trying to get the integration tests to pass. Co-authored-by: Felix Marty <9808326+fxmarty@users.noreply.github.com>
Directly load GPTBigCode to specified device (#618) This PR directly loads GPTBigCode to the specified device, avoiding moving the model between devices.
@OlivierDehaene OR @Narsil
feat(server): add local prom and health routes if running w/ ngrok
feat: add cuda memory fraction (#659) Close #673
fix(server): fix exllama buffers (#689) Close #683
feat(server): Using `quantize_config.json` instead of GPTQ_BITS env variables. (#671) (See the sketch below.)
- The current PR is not great because we're side-stepping the `Weights.__init__`, but `Weights` shouldn't require anything related to the config or the model_id, as it aims to be a simple wrapper over multi file loading.
- The ideal solution would be to use something like a Rust enum
```
enum Quantize {
    Bitsandbytes(Bitsandbytes),
    GPTQ { bits: usize, groupsize: usize },
}
```
and pass that around during load. Unfortunately we don't have access to this, so for now side-stepping seems easier.
- Re-enabling groupsize<0 with exllama (confirmed it works).
Helps #601. In next steps we should make sure our quantization script uses that format and make it standard.
docs(README): update readme
fix(server): fix quantization python requirements (#708)
fix(server): fix missing datasets in quantize
feat(server): support new falcon config (#712)
v0.9.4 (#713)
Add section about TGI on other AI hardware accelerators in README (#715) As per title.
docs: Add hardware section to TOC in README (#721)
feat(server): update vllm version (#723)
chore: update license to HFOIL (#725)
v1.0.0 (#727)
Local gptq support. (#738) Redoes #719
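A minimal sketch of reading bits/groupsize from a local `quantize_config.json`, in the spirit of #671 and the local GPTQ support in #738; the helper and the exact field names (`bits`, `group_size`, following the common AutoGPTQ layout) are assumptions, not the server's actual code:
```python
import json
import os
from typing import Tuple

# Hypothetical helper: prefer quantize_config.json over GPTQ_* env variables.
def read_quantize_config(model_path: str) -> Tuple[int, int]:
    config_path = os.path.join(model_path, "quantize_config.json")
    with open(config_path) as f:
        cfg = json.load(f)
    # Field names follow the common AutoGPTQ layout; adjust if the checkpoint differs.
    return int(cfg["bits"]), int(cfg["group_size"])
```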
Fix typing in `Model.generate_token` (#733) This PR fixes a minor type annotation issue in the signature of `Model.generate_token`. All existing overrides of `Model.generate_token` return `Tuple[List[Generation], Optional[B]]`:
https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/causal_lm.py#L535-L537
https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/flash_causal_lm.py#L802-L804
https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/seq2seq_lm.py#L589-L591
I suspect that back in 017a2a8c, when `GeneratedText` and `Generation` were separated, the function signature was not updated. CC @OlivierDehaene
Adding Rope scaling. (#741)
- Adds Rope NTK scaling. Done because https://github.com/huggingface/text-generation-inference/pull/529 was closed. Took some code from https://github.com/huggingface/transformers/pull/24653
- `--rope-scaling` and `--rope-factor` are added separately. I considered having a single one and parsing something like ("linear:4.0", or "dynamic") but decided against it because it would push more parsing+validation a bit everywhere (both in the launcher and the server).
Fixes #512
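A condensed sketch of the two rope-scaling modes referenced in #741 above (linear and dynamic NTK), loosely following the approach in the linked transformers PR; simplified for illustration rather than taken from the server's modeling code:
```python
import torch

# Condensed illustration of linear vs dynamic NTK rope scaling.
def rope_angles(dim: int, seq_len: int, max_position: int,
                base: float = 10000.0, scaling: str = "none", factor: float = 1.0):
    if scaling == "dynamic" and seq_len > max_position:
        # Dynamic NTK: grow the base when the sequence exceeds the trained length.
        base = base * ((factor * seq_len / max_position) - (factor - 1)) ** (dim / (dim - 2))
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(seq_len).float()
    if scaling == "linear":
        # Linear scaling: stretch positions by the factor.
        t = t / factor
    return torch.outer(t, inv_freq)  # angles used to build the cos/sin caches
```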
chore: fix typo in mpt_modeling.py (#737) Fixed typo: implemetation -> implementation
fix(server): Failing quantize config after local read. (#743)
Typo fix. (#746)
add FastLinear import (#750) Fixes #749 Co-authored-by: p_spozzhang
feat(server): Add native support for PEFT Lora models (#762) (See the sketch below.)
- Will detect a `peft` model by finding `adapter_config.json`.
- This triggers a totally dedicated `download-weights` path.
- This path loads the adapter config, finds the base model_id, loads the base model, then the peft model, then calls `merge_and_unload()`, then `save_pretrained(.., safe_serialization=True)`, and adds back the config + tokenizer.
- The chosen location is a **local folder with the name of the user chosen model id**.
PROs:
- Easier than expecting the user to merge manually.
- Barely any change outside of the `download-weights` command.
- This means everything will work in a single load.
- Should enable out of the box SM + HFE.
CONs:
- Creates a local merged model in an unusual location, potentially not saved across docker reloads, or overwriting some files if the PEFT itself was local and contained other files in addition to the lora.
Alternatives considered:
- Add `local_files_only=True` everywhere (discarded because of a massive code change for not a good enough reason).
- Return something to `launcher` about the new model-id (a cleaner location for this new model), but it would introduce new communication somewhere where we didn't need it before.
- Using the HF cache folder and *stopping* the flow after `download-weights`, then asking the user to restart with the actual local model location.
Fix #482
This should prevent the PyTorch overriding. (#767)
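A minimal sketch of the merge-and-export flow described in #762 above, using the public `peft` and `transformers` APIs; the adapter id and output path are placeholders:
```python
from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

adapter_id = "some-user/some-lora-adapter"  # placeholder PEFT adapter repo
output_dir = "./merged-model"               # placeholder local output folder

# adapter_config.json points at the base model the LoRA was trained on.
peft_config = PeftConfig.from_pretrained(adapter_id)
base = AutoModelForCausalLM.from_pretrained(peft_config.base_model_name_or_path)

# Apply the adapter, merge it into the base weights, and export as safetensors.
model = PeftModel.from_pretrained(base, adapter_id)
merged = model.merge_and_unload()
merged.save_pretrained(output_dir, safe_serialization=True)

# Add back the tokenizer (and config) so the folder is loadable on its own.
AutoTokenizer.from_pretrained(peft_config.base_model_name_or_path).save_pretrained(output_dir)
```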
fix build tokenizer in quantize and remove duplicate import (#768) Fixes #732. Also removes a duplicate AutoTokenizer import.
Merge BNB 4bit. (#770) See #626 Co-authored-by: krzim
Fix dynamic rope. (#783) Typo
Fixing non 4bits quantization. (#785) Fixes #784
- [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Update __init__.py (#794) Fixes #787 Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Llama change. (#793) Reflecting https://github.com/huggingface/transformers/pull/24998 Current status wants to make sure integration tests *are* broken with this. Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Setup for doc-builder and docs for TGI (#740) I added ToC for docs v1 & started setting up for doc-builder. cc @Narsil @osanseviero --------- Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> Co-authored-by: osanseviero Co-authored-by: Mishig Use destructuring in router arguments to avoid '.0' (#798) This is purely code style - not anything important. 
Instead of writing `req.0` all over we can use [descructuring](https://doc.rust-lang.org/rust-by-example/flow_control/match/destructuring/destructure_structures.html) to access the contained value that we actually want. (Destructuring in function parameters [here](https://doc.rust-lang.org/reference/items/functions.html#function-parameters)) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. @OlivierDehaene Fix gated docs (#805) Minor docs style fixes (#806) Added CLI docs (#799) Added docs for CLI [docs] Build docs only when doc files change (#812) Build docs only when change happens in `docs/source` See for example https://github.com/huggingface/api-inference/blob/main/.github/workflows/build_documentation.yml#L3-L8 Added ChatUI Screenshot to Docs (#823) cc @osanseviero Upgrade transformers (fix protobuf==3.20 issue) (#795) Fixes #531 Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Added streaming for InferenceClient (#821) Co-authored-by: Lucain Co-authored-by: Omar Sanseviero Version 1.0.1 (#836) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? 
Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Have snippets in Python/JavaScript in quicktour (#809) ![Screenshot from 2023-08-10 14-20-25](https://github.com/huggingface/text-generation-inference/assets/7246357/e16d0d41-be63-4d06-8093-30540df91419) --------- Co-authored-by: Merve Noyan Added two more features in readme.md file (#831) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Fix rope dynamic + factor (#822) Fixes #816 - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. fix: LlamaTokenizerFast to AutoTokenizer at flash_llama.py (#619) A few tokenizer_config in huggingface use LlamaTokenizer, so I think I would have selected `LlamaTokenizer` before. For a few cases where you're using a llama structure but not a llama tokenizer, why not make it to call the AutoTokenizer in exception handling. In the case of `decapoda-research/llama-7b-hf`, LLamaTokenizer is still being used in config.json, so it should be called through` LlamaTokenizer`. Also, if an exception is thrown by LlamaTokenizer, it will cause `LlamaTokenzierFast` to be called from AutoTokenizer. Fixes # 560 - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). 
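A rough sketch of the tokenizer fallback described in #619 above; the function name is made up and the exact exception handling in `flash_llama.py` may differ:

```python
from transformers import AutoTokenizer, LlamaTokenizer

def load_llama_tokenizer(model_id, revision=None):
    # Try the dedicated Llama tokenizer first; repos whose tokenizer_config
    # points at a different class will raise, and we fall back to
    # AutoTokenizer (which can also resolve to LlamaTokenizerFast).
    try:
        return LlamaTokenizer.from_pretrained(model_id, revision=revision)
    except Exception:
        return AutoTokenizer.from_pretrained(model_id, revision=revision)
```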
- [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [x] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. @Narsil README edit -- running the service with no GPU or CUDA support (#773) One-line addition to the README to show how to run the service on a machine without GPUs or CUDA support (e.g., for local prototyping) --------- Co-authored-by: Nicolas Patry Fix `tokenizers==0.13.4` . (#838) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Update README.md (#848) @Narsil --------- Co-authored-by: Nicolas Patry Fixing watermark. (#851) Fixes #843 Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Misc improvements for InferenceClient docs (#852) List of changes - No need to specify `model` in `text_generation` if it's already specified in `InferenceClient` - I separated the explanation of `stream=True` and `details=True` - I found the details explanation a bit repetitive (it says two times what it returns), so removed a sentence - Add mention of async client "Fix" for rw-1b. 
(#860) - New "falcon" layout on this repo - No alibi - `transformers` already modifying cache layout in our stead (same modifications). - Output is garbage. Not sure why. Does not fix #826 but it's a step. Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Upgrading versions of python client. (#862) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Adding Idefics multi modal model. (#842) Co-Authored-By: Victor Sanh Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. --------- Co-authored-by: Victor Sanh Add streaming guide (#858) Co-authored-by: Lucain Co-authored-by: Merve Noyan Adding small benchmark script. (#881) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). 
- [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Upgrade version number in docs. (#910) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Added gradio example to docs (#867) cc @osanseviero --------- Co-authored-by: Omar Sanseviero Supporting code llama. (#918) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Fixing the lora adaptation on docker. (#935) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? 
Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Rebased #617 (#868) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. --------- Co-authored-by: Vincent Brouwers New release. (#941) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Fix f180 (#951) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. 
Update version in docs (#957) Fix Falcon weight mapping for H2O.ai checkpoints (#953) During the safetensor conversion, duplicate weights are removed. However, which of the duplicates gets removed differs per checkpoint. In some, like `h2oai/h2ogpt-oig-oasst1-falcon-40b`, the weight `transformer.word_embeddings.weight` gets removed. In others, `lm_head.weight` gets removed. Long story long, we need to support both. Originally, f018143 mapped `lm_head` to `word_embeddings`. Then ac736fd switched this around. This commit merges them and allows for both (a short sketch of the resulting alias lookup appears below). - [x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? @Narsil, you wrote both commits I referenced in this PR. I think you'll understand this change :) Fixing top_k tokens when k ends up < 0 (#966) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. small fix on idefics (#954) transposing the fixes from https://github.com/huggingface/transformers/pull/25787 Backport https://github.com/vllm-project/vllm/pull/936 (#977) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests?
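The dual mapping described in #953 above amounts to treating `lm_head.weight` and `transformer.word_embeddings.weight` as aliases when loading the checkpoint. A hedged sketch of that idea; the helper name is invented and the real loading code is more involved:

```python
def get_first_available(tensors: dict, *names: str):
    # Safetensors conversion deduplicates tied weights, and which alias
    # survives differs per checkpoint, so accept whichever name is present.
    for name in names:
        if name in tensors:
            return tensors[name]
    raise KeyError(f"none of {names} found in checkpoint")

# lm_head and the input embeddings are tied in Falcon checkpoints, e.g.:
# weight = get_first_available(tensors, "lm_head.weight", "transformer.word_embeddings.weight")
```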
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. chore(client): Support Pydantic 2 (#900) This should allow users to use either Pydantic 2 or Pydantic 1. I couldn't run all tests locally because I reran them too often and got rate limited, but I believe this is sufficient. docs: typo in streaming.js (#971) Looks like an error Disabling exllama on old compute. (#986) Disabling exllama on old compute. Exllama + T4 don't play nice together, this will disable it right away to avoid issues at runtime. Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. chore: sync text-generation version from 0.3.0 to 0.6.0 with pyproject.toml (#950) sync the version for text-generation. docs: Flash Attention Conceptual Guide (#892) PR for conceptual guide on flash attention. I will add more info unless I'm told otherwise. --------- Co-authored-by: Nicolas Patry Co-authored-by: Omar Sanseviero docs: Remove redundant content from stream guide (#884) Co-authored-by: OlivierDehaene Fix exllama wronfully loading (#990) The [changes](https://github.com/huggingface/text-generation-inference/pull/986/files#diff-b72e45030214e50c8ff6e3be837057b3f3368b9779fd942ca680f949fe069eafR176) disabling exllama on old compute had unintended consequences of not setting `use_exllama` to `False` if `HAS_EXLLAMA` equals `False` **and** `CAN_EXLLAMA` equals `False`. This fixes this. - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [X] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? @OlivierDehaene @Narsil Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. 
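One common pattern for supporting both Pydantic majors, as #900 above sets out to do, is a small import-time compatibility shim. This is only a sketch of the general approach, not necessarily what the client implements:

```python
try:
    # Pydantic 2.x exposes field_validator
    from pydantic import field_validator as compat_validator
except ImportError:
    # Pydantic 1.x only has validator
    from pydantic import validator as compat_validator
```

Code can then decorate simple validators with `compat_validator("field_name")` and keep a single code path under either installed major.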
add transformers gptq support (#963) Proposal to fix https://github.com/huggingface/text-generation-inference/issues/962 Safetensors conceptual guide (#905) IDK what else to add in this guide, I looked for relevant code in TGI codebase and saw that it's used in quantization as well (maybe I could add that?) Fix __call__ vs forward. (#993) Fix __call__ vs forward. To reproduce error just launch: TheBloke/WizardLM-Uncensored-Falcon-7B-GPTQ with gptq (it fails because falcon code uses `__call__` instead for `forward` calls) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Paged Attention Conceptual Guide (#901) fit for baichuan models (#981) As more and more people begin to use Baichuan's open-source models, the influence of Baichuan models is growing, especially in China. Many community members are interested in adding support for Baichuan models to TGI. Meanwhile, Baichuan is a very open company, and in the future, it plans to open-source more and more models, taking all this into consideration, we would like to add support for the Baichuan model to TGI. To do this, we need to make some changes, which we hope can be merged into the main branch of TGI. In the future, we would be happy to help maintain support for Baichuan models in TGI. We sincerely hope that our pull request can be accepted. Thank you. By the way, the changes of this time mainly for supporting Baichuan-7B. --------- Co-authored-by: xiaoyuze Co-authored-by: Nicolas Patry Tensor Parallelism conceptual guide (#886) Co-authored-by: Nicolas Patry Co-authored-by: Omar Sanseviero Co-authored-by: Pedro Cuenca Quantization docs (#911) Co-authored-by: Nicolas Patry Co-authored-by: Pedro Cuenca Unsupported model serving docs (#906) Co-authored-by: Omar Sanseviero Co-authored-by: Mishig Co-authored-by: Pedro Cuenca Co-authored-by: OlivierDehaene enable bfloat16 for cpu (#1034) if there's no cuda. disable custom kernels Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? 
Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Signed-off-by: Wang, Yi A Fix missing arguments in Galactica's from_pb (#1022) Fixes #1004 Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Fixing t5 loading. (#1042) Fixes #1038 Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Add AWQ quantization inference support (#1019) (#1054) Fixes https://github.com/huggingface/text-generation-inference/issues/781 This PR (partially) adds support for AWQ quantization for inference. More information on AWQ [here](https://arxiv.org/abs/2306.00978). In general, AWQ is faster and more accurate than GPTQ, which is currently supported by TGI. This PR installs 4-bit GEMM custom CUDA kernels released by AWQ authors (in `requirements.txt`, just one line change). Quick way to test this PR would be bring up TGI as follows: ``` text-generation-server download-weights abhinavkulkarni/codellama-CodeLlama-7b-Python-hf-w4-g128-awq text-generation-launcher \ --huggingface-hub-cache ~/.cache/huggingface/hub/ \ --model-id abhinavkulkarni/codellama-CodeLlama-7b-Python-hf-w4-g128-awq \ --trust-remote-code --port 8080 \ --max-input-length 2048 --max-total-tokens 4096 --max-batch-prefill-tokens 4096 \ --quantize awq ``` Please note: * This PR was tested with FlashAttention v2 and vLLM. 
* This PR adds support for AWQ inference, not quantizing the models. That needs to be done outside of TGI, instructions [here](https://github.com/mit-han-lab/llm-awq/tree/f084f40bd996f3cf3a0633c1ad7d9d476c318aaa). * This PR only adds support for `FlashLlama` models for now. * Multi-GPU setup has not been tested. * No integration tests have been added so far, will add later if maintainers are interested in this change. * This PR can be tested on any of the models released [here](https://huggingface.co/abhinavkulkarni?sort_models=downloads#models). Please refer to the linked issue for benchmarks for [abhinavkulkarni/meta-llama-Llama-2-7b-chat-hf-w4-g128-awq](https://huggingface.co/abhinavkulkarni/meta-llama-Llama-2-7b-chat-hf-w4-g128-awq) vs [TheBloke/Llama-2-7b-Chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GPTQ). Please note, AWQ has released faster (and in case of Llama, fused) kernels for 4-bit GEMM, currently at the top of the `main` branch at https://github.com/mit-han-lab/llm-awq, but this PR uses an older commit that has been tested to work. We can switch to latest commit later on. @OlivierDehaene OR @Narsil --------- Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. --------- Co-authored-by: Abhinav M Kulkarni Co-authored-by: Abhinav Kulkarni Fix GQA llama + AWQ (#1061) Fixes #1056 Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. support local model config file (#1058) Support local config file to avoid unexpected `discard_names`, which causes #1057. In the case of launching local mode without `model.safetensors` file, the original code will result `discard_names = []` when `hf_hub_download` throws an connection error. 
```python
try:
    import transformers
    import json

    config_filename = hf_hub_download(model_id, revision=revision, filename="config.json")
    with open(config_filename, "r") as f:
        config = json.load(f)
    architecture = config["architectures"][0]

    class_ = getattr(transformers, architecture)

    # Name for this variable depends on transformers version.
    discard_names = getattr(class_, "_tied_weights_keys", [])
    discard_names.extend(getattr(class_, "_keys_to_ignore_on_load_missing", []))
except Exception as e:
    discard_names = []
```
The expected `_tied_weights_keys` of OPT-1.3b is `["lm_head.weight"]`, and its tied weight `"model.decoder.embed_tokens.weight"` will be kept in the safetensors conversion. But the above empty `discard_names` will lead to `"lm_head.weight"` being kept and `"model.decoder.embed_tokens.weight"` being discarded in the subsequent method `_remove_duplicate_names`, which causes error #1057. So we add a local-mode branch to get the expected `discard_names`, as follows. This modification also applies to other models.
```python
if is_local_model:
    config_filename = os.path.join(model_id, "config.json")
else:
    config_filename = hf_hub_download(model_id, revision=revision, filename="config.json")
```
In addition, when `_tied_weights_keys` or `_keys_to_ignore_on_load_missing` is `None`, the above code will also throw an error unexpectedly. This is fixed in PR #1052 - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [x] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). N/A - [ ] Did you write any new necessary tests? N/A Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. @Narsil fix discard_names bug in safetensors conversion (#1052) Model Class attributes `_tied_weights_keys`, `_keys_to_ignore_on_load_missing` can only be `None` or a List. `getattr(class_, "_keys_to_ignore_on_load_missing", [])` will return `None` if `_keys_to_ignore_on_load_missing` is None, and `discard_names.extend(None)` will trigger an exception, even though `_tied_weights_keys` exists. @OlivierDehaene @Narsil --------- Co-authored-by: Nicolas Patry Install curl to be able to perform more advanced healthchecks (#1033) Install curl within the base image; it is negligible regarding the image volume and will allow to easily perform a better health check. Not sure about the failing github actions though. Should I fix something? Signed-off-by: Raphael Co-authored-by: Raphael Fix position ids logic instantiation of idefics vision part (#1064) Problem and fix are described here: https://huggingface.co/HuggingFaceM4/idefics-9b/discussions/9 --------- Co-authored-by: Nicolas Patry Fix top_n_tokens returning non-log probs for some models (#1023) I made an embarrassing mistake where I accidentally passed normal softmax probabilities into `batch_top_tokens` for `CausalLM` and `Seq2SeqLM`.
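The fix implied by #1023 above is to hand `batch_top_tokens` log-probabilities rather than raw softmax probabilities; a minimal illustration with a dummy tensor:

```python
import torch

logits = torch.randn(4, 32000)                # (batch, vocab), dummy values
logprobs = torch.log_softmax(logits, dim=-1)  # what batch_top_tokens expects
# probs = torch.softmax(logits, dim=-1)       # the values mistakenly passed before the fix
```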
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. @Narsil Preping 1.1.0 (#1066) Upgrade all relevant versions and dependencies. Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Added note on weight-cache-override (#994) Added note on serving supported models from a different folder without re-downloading them. --------- Co-authored-by: Nicolas Patry Support eetq weight only quantization (#1068) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. --------- Co-authored-by: zhaosida Remove the stripping of the prefix space (and any other mangling that tokenizers might do). (#1065) Superseed #1024 Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). 
- [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. --------- Co-authored-by: bangoz feat: format code (#1070) Complete FastLinear.load parameters in OPTDecoder initialization (#1060) `FastLinear.load` requires 4 parameters, but in the following only 3 are given. This PR fixes this.
```python
if config.word_embed_proj_dim != config.hidden_size:
    self.project_out = FastLinear.load(
        config, prefix="model.decoder.project_out", bias=False
    )
else:
    self.project_out = None
if config.word_embed_proj_dim != config.hidden_size:
    self.project_in = FastLinear.load(
        config, prefix="model.decoder.project_in", bias=False
    )
```
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Automatic docs for TGI (#1045) I had to open this PR since I initially worked from my fork, and it requires a handful of work to trigger a new github action on my fork's specific branch (couldn't find a way, at least, despite trying all of them). --------- Co-authored-by: Nicolas Patry feat: add mistral model (#1071) update readme Fix launcher.md (#1075) Adding a new line to escape between heading and codeblock. However, it is a hotfix and I will work on a permanent solution on https://github.com/huggingface/doc-builder Update launcher.md to wrap code blocks (#1076) Wrap code blocks in `launcher` doc page using https://github.com/huggingface/doc-builder/pull/420 https://moon-ci-docs.huggingface.co/docs/text-generation-inference/pr_1076/en/basic_tutorials/launcher Fixing eetq dockerfile. (#1081) Fixes #1079 Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Fix window_size_left for flash attention v1 (#1089) This fixes flash attention v1, which always raised `NotImplementedError("window_size_left is only available with flash attn v2")`.
Currently flash_llama_modeling.py doesn't override the default value of window_size_left when calling attention(..) (line 282). This means that window_size_left will always be the default of -1, but flash attention v1 throws an exception if `window_size_left != 0`. To fix this, we should be checking `window_size_left != -1` before throwing the NotImplementedError. Fixes #1084 - [x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. @OlivierDehaene OR @Narsil raise exception on invalid images (#999) This PR is meant to handle cases in which the images provided are invalid. Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. @Narsil --------- Co-authored-by: Nicolas Patry [Doc page] Fix launcher page highlighting (#1080) Screenshot 2023-09-28 at 22 38 15 image Handling bloom prefix. (#1090) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Update idefics_image_processing.py (#1091) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? 
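A sketch of the corrected guard described in #1089 above; the signature is simplified and only the sentinel check is the point:

```python
def attention(q, k, v, window_size_left: int = -1):
    # Flash attention v1 has no sliding-window support. Only reject calls that
    # actually request a window; -1 is the "no window" default and must pass.
    if window_size_left != -1:  # the old check compared against 0, which rejected the default
        raise NotImplementedError(
            "window_size_left is only available with flash attn v2"
        )
    ...  # the flash attention v1 kernel call would go here
```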
Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. fixed command line arguments in docs (#1092) Just removed `--` from the arguments. With `--` bitsandbytes and bitsandbytes-nf4 are considered an option which they are not - [x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Adding titles to CLI doc. (#1094) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Receive base64 encoded images for idefics. (#1096) Fix #1095 Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. 
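For the base64 image support in #1096 above, the idea is presumably to accept data URIs alongside regular image URLs; a hedged sketch with a hypothetical helper, the real parsing in TGI may differ:

```python
import base64
import io

from PIL import Image

def load_image(src: str) -> Image.Image:
    # Accept "data:image/...;base64,<payload>" inputs in addition to plain URLs.
    if src.startswith("data:"):
        _, payload = src.split(",", 1)
        return Image.open(io.BytesIO(base64.b64decode(payload)))
    # Regular URLs would be fetched over HTTP here instead.
    raise NotImplementedError
```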
Modify the default for `max_new_tokens`. (#1097) Clients which do not specify a max_length now imply `max_new_tokens = max_total_tokens - input_length`. This is a serious change, but it seems more in line with what users expect from a standing server. --------- Co-authored-by: OlivierDehaene

fix: type hint typo in tokens.py (#1102) Fixing a list type hint definition (I believe this was a typo). Allows backward compatibility with Python 3.8 (relevant for JetPack-enabled systems).

Fixing GPTQ exllama kernel usage. (#1101) Fixes #1098

Adding yarn support.
(#1099) Fixes #1017. Not sure if there's a mistake here, but:
- NousResearch/Yarn-Llama-2-7b-128k seems to be working fine
- TheBloke/Yarn-Llama-2-13B-128K-GPTQ outputs garbage

Hotfixing idefics base64 parsing. (#1103)

Prepare for v1.1.1 (#1100)

Remove some content from the README in favour of the documentation (#958)

fix: force one of max_new_tokens or truncate with slow tokenizer

Fix link in preparing_model.md (#1140) Fixes a link in the docs.

Fix calling cuda() on load_in_8bit (#1153) This PR addresses an issue where calling `model = model.cuda()` would throw a ValueError when `quantize` is set to "bitsandbytes":
```
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 147, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 295, in get_model
    return CausalLM(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/causal_lm.py", line 515, in __init__
    model = model.cuda()
  File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 1998, in cuda
    raise ValueError(
ValueError: Calling `cuda()` is not supported for `4-bit` or `8-bit` quantized models. Please use the model as it is, since the model has already been set to the correct devices and casted to the correct `dtype`.
```
(A minimal illustrative guard for this case is sketched below, after this group of entries.) Co-authored-by: mmnga

Fix: Replace view() with reshape() in neox_modeling.py to resolve RuntimeError (#1155)

fix: EETQLinear with bias in layers.py (#1176)

fix: remove useless token (#1179) This token is not used by your action; the secret has been removed from the repository. See #1049 --------- Signed-off-by: Wang, Yi A Co-authored-by: Wang, Yi

Fix link to quantization page in preparing_model.md (#1187)

feat: paged attention v2 (#1183)

feat: remove flume (#1184)

fix: better warmup error

Adding the video -> moving the architecture picture lower (#1239)

Narsil patch 1 (#1241)

Update README.md (#1242)
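Relating back to #1153 above (the `cuda()` call on a bitsandbytes-quantized model), here is a minimal, hypothetical sketch of the kind of guard the traceback calls for; the `quantize` values mirror the CLI option names, but the helper itself is an illustration, not the server's actual code:

```python
from typing import Optional

import torch


def maybe_move_to_cuda(model: torch.nn.Module, quantize: Optional[str]) -> torch.nn.Module:
    # bitsandbytes already places quantized weights on the right device and dtype,
    # so calling .cuda() on them raises the ValueError shown above; skip the move.
    if quantize in ("bitsandbytes", "bitsandbytes-nf4"):
        return model
    return model.cuda() if torch.cuda.is_available() else model
```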
Fix link in quantization guide (#1246)

v1.1.1

hotfix 1.1.1

fix: do not leak inputs on error (#1228) Close #1225

Update README.md (#1272)

Fix missing `trust_remote_code` flag for AutoTokenizer in utils.peft (#1270) The PEFT loading function was missing the `trust_remote_code=trust_remote_code` argument, so custom tokenizer code could not be found. @Narsil

Load PEFT weights from local directory (#1260) Enables PEFT weights to be loaded from a local directory, as opposed to a hf hub repository. It is a continuation of the work in PR https://github.com/huggingface/text-generation-inference/pull/762 Fixes #1259
- [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? **Yes but I don't know how to run the tests for this repo, and it doesn't look like this code is covered anyway**
- [x] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. **Yes, @Narsil asked for a PR in [this comment](https://github.com/huggingface/text-generation-inference/pull/762#issuecomment-1728089505)**
- [x] Did you make sure to update the documentation with your changes? **I didn't see any documentation added to the [original PR](https://github.com/huggingface/text-generation-inference/pull/762), and am not sure where this belongs. Let me know and I can add some**
- [x] Did you write any new necessary tests?
**I didn't see any existing test coverage for this python module** @Narsil --------- Co-authored-by: Nicolas Patry

chore: update to torch 2.1.0 (#1182) Close #1142

Fix IDEFICS dtype (#1214) This forces the use of `bfloat16` for IDEFICS. The issue is that with `float16` the 80b model gives garbage output. Let me know if this solution is not appropriate and I'll adjust accordingly. For the details see below. The current behaviour:

```sh
$ curl 127.0.0.1:8080/generate -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' -H 'Content-Type: application/json'
{"generated_text":""}
```

On closer inspection with:

```python
import requests

headers = {"Content-Type": "application/json"}
query = "What is Deep Learning?"
data = {
    "inputs": query,
    "parameters": {
        "max_new_tokens": 10,
        "return_full_text": True,
        "decoder_input_details": True,
        "do_sample": False,
    },
}
api_url = "http://127.0.0.1:8080"
response = requests.post(api_url + "/generate", headers=headers, json=data).json()
for i in ['prefill', 'tokens']:
    print(f'### {i}')
    print(repr(''.join([t['text'] for t in response['details'][i]])))
```

Prints:

```
'WhatisDeepLearning?'
''
```

With the change in this PR it prints:

```
'WhatisDeepLearning?'
'\n\nDeep Learning is a subset of machine'
```

Note, using the Transformers implementation (with `IdeficsForVisionText2Text.from_pretrained`) produces the latter (correct) output as well. This only happens with the 80b model; the 9b model is not as sensitive to the dtype (as also mentioned in the code). The reason for "forcing" this in the IDEFICS init method is that if quantization is used, the dtype cannot be set explicitly, and since it is left as `None` it is set to `float16` by default [here](https://github.com/huggingface/text-generation-inference/blob/96a982ad8fc232479384476b1596a880697cc1d0/server/text_generation_server/models/__init__.py#L90). I.e. there's no other way to manually change the dtype if someone is using quantization:

```sh
$ docker run .... ghcr.io/huggingface/text-generation-inference:latest --model-id HuggingFaceM4/idefics-80b-instruct --dtype bfloat16 --quantize bitsandbytes-nf4
.....
2023-10-31T12:42:26.710401Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-10-31T12:42:30.315734Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 80, in serve
    raise RuntimeError(
RuntimeError: Only 1 can be set between `dtype` and `quantize`, as they both decide how goes the final model. rank=0
Error: ShardCannotStart
2023-10-31T12:42:30.414010Z ERROR text_generation_launcher: Shard 0 failed to start
2023-10-31T12:42:30.414044Z  INFO text_generation_launcher: Shutting down shards
```
@Narsil what do you think? --------- Co-authored-by: Nicolas Patry

Reduce race condition on file system for test

Exllama v2 (#1211) See #1165 --------- Co-authored-by: Florian Zimmermeister Co-authored-by: Ubuntu

Add RoCm support (#1243) This PR adds support for AMD Instinct MI210 & MI250 GPUs, with paged attention and FAv2 support. Remaining items to discuss, on top of possible others:
* Should we have a `ghcr.io/huggingface/text-generation-inference:1.1.0+rocm` hosted image, or is it too early?
* Should we set up a CI on MI210/MI250? I don't have access to the runners of TGI though.
* Are we comfortable with those changes being directly in TGI, or do we need a fork?
--------- Co-authored-by: Felix Marty Co-authored-by: OlivierDehaene Co-authored-by: Your Name

`make install-flash-attn-v2-cuda` should work like `make install-flash-attn-v2` used to work. (#1294)

Let each model resolve their own default dtype. (#1287)
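A tiny, hypothetical sketch of what "let each model resolve their own default dtype" can look like, tying #1287 back to the IDEFICS case above; the dispatch table and function name are illustrative assumptions, not the repository's code:

```python
from typing import Optional

import torch

# Illustrative per-model defaults instead of a single global float16 fallback.
MODEL_DEFAULT_DTYPES = {
    "idefics": torch.bfloat16,  # float16 degrades the 80b checkpoint, per #1214
}


def resolve_dtype(model_type: str, requested: Optional[torch.dtype]) -> torch.dtype:
    # An explicit --dtype always wins; otherwise fall back to the model's own default.
    if requested is not None:
        return requested
    return MODEL_DEFAULT_DTYPES.get(model_type, torch.float16)


print(resolve_dtype("idefics", None))  # torch.bfloat16
print(resolve_dtype("llama", None))    # torch.float16
```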
Make GPTQ test less flaky (#1295)
v1.2.0
Fix AMD documentation (#1307) As per title
Add a stale bot. (#1313)
Speculative (#1308)
feat: mixtral (#1328)
chore: formatting
v1.3.0
v1.3.1
feat: add quant to mixtral (#1337)
v1.3.2
fix: default max_new_tokens to 100
fix: fix gpt-q params loading
feat: add more latency metrics in forward (#1346)
fix: fix triton OutOfResources import
fix: fix quant linear autotune
fix: slice stopping criteria buffer
fix: only keep stop sequence buffer if we have some
fix: max_past default value must be -1, not 0 (#1348)
v1.3.3
feat: relax mistral requirements (#1351) Close #1253 Close #1279
fix: fix logic if sliding window key is not present in config (#1352)
fix: fix offline (#1341) (#1347) @oOraph --------- Signed-off-by: Raphael Glon Co-authored-by: Raphael Glon
fix: fix gpt-q with groupsize = -1 (#1358)

Peft safetensors. (#1364) Works by removing adapter_model.safetensors from being detected as the core model file (which skips the real peft detection).
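A small, hypothetical sketch of the detection tweak described for #1364 above; the helper and file names are illustrative, the point being simply that PEFT adapter weights should not be mistaken for the core model weights:

```python
# Hypothetical filter: adapter files must not count as the model's own weights,
# otherwise the real PEFT loading path is skipped entirely.
PEFT_ADAPTER_FILES = {"adapter_model.safetensors", "adapter_model.bin"}


def core_weight_files(filenames):
    return [
        name
        for name in filenames
        if name.endswith((".safetensors", ".bin")) and name not in PEFT_ADAPTER_FILES
    ]


print(core_weight_files(["adapter_model.safetensors", "adapter_config.json"]))  # []
print(core_weight_files(["model.safetensors"]))  # ['model.safetensors']
```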
docs: Change URL for Habana Gaudi support in doc (#1343)
feat: update exllamav2 kernels (#1370) Co-authored-by: Nicolas Patry
Fix local load for peft (#1373) The local-directory overload still needs the directory in order to locate the weights files correctly.
v1.3.4
docs: update required CUDA version to 12.2
fix: fix local loading for .bin models (#1419)
Fix missing make target platform for local install: 'install-flash-attention-v2' (#1414)
fix: follow base model for tokenizer in router (#1424) Close #1422
Fix local load for Medusa (#1420) Close #1418 Close #1415

Return prompt vs generated tokens. (#1436) Fixes #637 - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes?
Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. feat: supports openai chat completions API (#1427) This PR adds support to make TGI a drop in replacement for OpenAI clients by exposing the same HTTP interface. Notes - TGI inits a single model at startup so the `model` field is unused in HTTP requests. - `max_tokens` and `stream` should work as expected but other params may be (unimplemented or not supported) General approach - fetch the `tokenizer_config` at startup from the hub - pass `tokenizer_config` into `Infer` so we have it at request time - use the `chat_template` on the config to format chat request - parse jinja template and render chat string - pass inputs into existing generate function - wrap generation output in expected structure before returning ```bash curl localhost:3000/v1/chat/completions \ -X POST \ -d '{ "model": "tgi", "messages": [ { "role": "system", "content": "You are a helpful assistant." }, { "role": "user", "content": "What is deep learning?" } ], "stream": true, "max_tokens": 20 }' \ -H 'Content-Type: application/json' ``` It is also possible to use the `openai` python library and change the base url ```python from openai import OpenAI client = OpenAI( base_url="http://localhost:3000/v1", api_key="not needed for a local LLM" ) chat_completion = client.chat.completions.create( model="tgi", messages=[ {"role": "system", "content": "You are a helpful assistant." }, {"role": "user", "content": "What is deep learning?"} ], stream=True ) for message in chat_completion: print(message) ``` ```python from openai import OpenAI client = OpenAI( base_url="http://localhost:3000/v1", api_key="not needed for a local LLM" ) chat_completion = client.chat.completions.create( model="tgi", messages=[ {"role": "system", "content": "You are a helpful assistant." }, {"role": "user", "content": "What is deep learning?"} ], stream=False ) print(chat_completion) ``` ```bash cd text-generation-inference/server MASTER_ADDR=127.0.0.1 MASTER_PORT=5555 text-generation-server serve --trust-remote-code gpt2 ``` ***note many of the existing `chat_templates` use non standard `jinja` (ie. adding a `raise` to the template) which will throw an error when parsing; hence using `upstage/SOLAR-10.7B-Instruct-v1.0` since it has a valid template ```bash cd text-generation-inference/router cargo run -- --tokenizer-name upstage/SOLAR-10.7B-Instruct-v1.0 ``` trigger ```bash curl localhost:3000/v1/chat/completions \ -X POST \ -d '{ "model": "gpt-3.5-turbo", "messages": [ { "role": "system", "content": "You are a helpful assistant." }, { "role": "user", "content": "What is the IP address of the Google DNS servers?" } ], "stream": true, "max_tokens": 20, "logprobs": true }' \ -H 'Content-Type: application/json' ``` ^ supports `stream: true` and `stream: false` requests feat: support raise_exception, bos and eos tokens (#1450) This PR adds support to handle the custom jinja function `raise_exception` and passes the `bos` and `eos` tokens into the template Additionally this PR adds 3 tests to validate and show examples of what can and cannot be parsed currently. 
```bash cargo test --package text-generation-router --lib -- infer::tests --nocapture ``` chore: bump rust version and annotate/fix all clippy warnings (#1455) This PR just bumps the latest rust version and makes clippy happy ```bash cargo clippy --all -- -D warnings ``` feat: conditionally toggle chat on invocations route (#1454) This PR adds support for reading the `OAI_ENABLED` env var which will changes the function called when the `/invocations` is called. If `OAI_ENABLED=true` the `chat_completions` method is used otherwise it defaults to `compat_generate`. example running the router ```bash OAI_ENABLED=true \ cargo run -- \ --tokenizer-name mistralai/Mistral-7B-Instruct-v0.2 ``` example request ```bash curl localhost:3000/invocations \ -X POST \ -d '{ "model": "tgi", "messages": [ { "role": "user", "content": "What is the IP address of the Google DNS servers?" } ], "stream": false, "max_tokens": 20, "logprobs": true, "seed": 0 }' \ -H 'Content-Type: application/json' | jq ``` **please let me know if any naming changes are needed or if any other routes need similar functionality. Disable `decoder_input_details` on OpenAI-compatible chat streaming, pass temp and top-k from API (#1470) This PR makes some minor tweaks to the new OpenAI-compatible chat endpoint #1427 in `GenerateParameters`: - Disables `decoder_input_details` when streaming is enabled. This was causing all streaming chat requests to fail before, since [`decoder_input_details`==true is not enabled when streaming tokens](https://github.com/huggingface/text-generation-inference/blob/98e5faff9daec6170cc2b0f963f2d73cf846b341/router/src/validation.rs#L406). - Passes through `temperature` and `top_p` hyperparameters from the API request to `GenerateParameters` ```bash curl localhost:8080/v1/chat/completions \ -X POST \ -d '{ "model": "", "messages": [ { "role": "system", "content": "You are a helpful assistant." }, { "role": "user", "content": "What is deep learning?" } ], "stream": true, "max_tokens": 20 }' \ -H 'Content-Type: application/json' ``` Should work correctly. Currently, most recent release from `main` returns error: ``` data:{"error":"Input validation error: `decoder_input_details` == true is not supported when streaming tokens","error_type":"validation"} ``` It's my first time contributing to this project, so I could be missing something. Would especially appreciate @drbh's eyes on this one Fixing non divisible embeddings. (#1476) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Add messages api compatibility docs (#1478) This PR adds a new page to the docs that describes the Messages API and how to use it. 
Additionally this page will contain cloud provider specific information for enabling and using this feature. This PR includes a SageMaker example/information.

Add a new `/tokenize` route to get the tokenized input (#1471) Ideally this is done client side, but this is a recurring request, therefore we implemented it.
- Runs only if the Rust tokenizer is present (not encumbering the main inference pipeline is important).
- Returns simple results: ID, text (recovered with offsets from the original string) and offsets (so users can do things like highlighting text).
(A hedged usage sketch follows below, after this group of entries.)

feat: adds phi model (#1442) This PR adds basic modeling for phi-2.

run
```bash
text-generation-server \
    serve \
    microsoft/phi-2 \
    --revision 834565c23f9b28b96ccbeabe614dd906b6db551a
```

test
```bash
curl -s localhost:3000/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json' | jq .
```

notes
- recently (~1 day ago) the Phi weights and model were updated to accommodate adding [GQA/MQA attention to the model.](https://github.com/huggingface/transformers/pull/28163) This impl expects the original model format, so a fixed revision is required at the moment.
- this PR only includes a basic implementation of the model and can later be extended to support Flash and Sharded versions, as well as make use of better optimization.

fix: read stderr in download (#1486)

Update the docs

fix: show warning with tokenizer config parsing error (#1488) This tiny PR just prints the parsing error when a tokenizer config fails to load. This is helpful when a chat_template won't load due to formatting issues https://github.com/huggingface/text-generation-inference/pull/1427#issuecomment-1909226388

fix: launcher doc typos (#1473)
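Referring back to the `/tokenize` route from #1471 above, a minimal usage sketch; the request body mirrors the `inputs` field used by `/generate` elsewhere in this log, while the port and the exact response field layout are assumptions here:

```python
import requests

# Hypothetical call against a locally running server (port as in the other examples).
resp = requests.post(
    "http://localhost:3000/tokenize",
    json={"inputs": "What is Deep Learning?"},
    headers={"Content-Type": "application/json"},
)
# Assumed shape: a list of tokens, each carrying an id, its text, and offsets into
# the original string, which is what makes client-side highlighting possible.
for token in resp.json():
    print(token)
```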
--------- Co-authored-by: Andres Restrepo
Reinstate exl2 with tp (#1490)
Add sealion mpt support (#1477) --------- Co-authored-by: Choon Meng Tan Co-authored-by: David Ong Tat-Wee <13075447+ongtw@users.noreply.github.com>
Trying to fix that flaky test. (#1491)
Update the docs to include newer models.
(#1492)

GPTQ support on ROCm (#1489) Tested with
```
CUDA_VISIBLE_DEVICES=0 text-generation-launcher --model-id TheBloke/Llama-2-7B-Chat-GPTQ --quantize gptq
EXLLAMA_VERSION=1 CUDA_VISIBLE_DEVICES=0 text-generation-launcher --model-id TheBloke/Llama-2-7B-Chat-GPTQ --quantize gptq
CUDA_VISIBLE_DEVICES="0,1" text-generation-launcher --model-id TheBloke/Llama-2-7B-Chat-GPTQ --quantize gptq
```
all with good and identical results on MI210. --------- Co-authored-by: Felix Marty Co-authored-by: OlivierDehaene Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>

feat: add tokenizer-config-path to launcher args (#1495) This PR adds the `tokenizer-config-path` argument to the launcher and passes it to the router. Fixes: https://github.com/huggingface/text-generation-inference/pull/1427

v1.4.0 (#1494)

Fixing top_n_tokens. (#1497) Supersedes #1459. The fix works as follows: next_token_chooser was updated to return all logprobs, and batch_top_n_tokens now also gets accepted_ids + speculated_length (so it knows how to interpret the flat logprobs). The code was then updated to return the lists of `Tokens` that it expects.

Sending compute type from the environment instead of hardcoded string (#1504) Using env is slow, therefore getting it from global state instead.

Create the compute type at launch time (if not provided in the env). (#1505) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Modify default for max_new_tokens in python client (#1336) Since ([#1097](https://github.com/huggingface/text-generation-inference/pull/1097)) the clients do not need to specify a max_length anymore. However, the python client in this repo had not yet been adapted to these changes. This PR makes it possible to use the python client and not provide max_new_tokens. - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [x] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. feat: eetq gemv optimization when batch_size <= 4 (#1502) Add TensorRT-LLM weight-only GEMV kernel support. We extract GEMV kernel from [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv) to accelerate the decode speed of EETQ when batch_size is smaller or equal to 4. - Features 1. There is almost no loss of quantization accuracy. 2. The speed of decoding is 13% - 27% faster than original EETQ which utilizes GEMM kernel. - Test Below is our test on 3090. Environment: torch=2.0.1, cuda=11.8, nvidia driver: 525.78.01 prompt=1024, max_new_tokens=50 ![image](https://github.com/huggingface/text-generation-inference/assets/139844877/98e63b23-23cd-452f-91bd-55ccdc9b7021) ![image](https://github.com/huggingface/text-generation-inference/assets/139844877/5c3132ff-fc1c-4b20-a83f-59b3d5f586b7) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? 
Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. fix: improve messages api docs content and formatting (#1506) This PR simply updates the messages api docs to address content changes and make format consistent GPTNeoX: Use static rotary embedding (#1498) `transformers` 4.35 removed rotary embeddings from GPTNeoX's weights ([link to line diff](https://github.com/huggingface/transformers/commit/253f9a3f9716d08a81fb305fe71f983122eb608b#diff-0e2a05d86c82e96f516db8c14070ceb36f53ca44c6bc21a9cd92ad2e777b9cf1R298)). This applies the same fix as https://github.com/huggingface/text-generation-inference/pull/793 which generates them on-the-fly using the appropriate value from the config file Fixes https://github.com/huggingface/text-generation-inference/issues/1460 - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [x] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? @OlivierDehaene OR @Narsil Freshen up the README. Hotfix the / health - route. (#1515) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Revert "Modify default for max_new_tokens in python client (#1336)" This reverts commit 2d56f106a60c7b698705494e7539f8a7e4c85dd9. It causes a breaking in our integrations-tests. fix: tokenizer config should use local model path when possible (#1518) This PR fixes the issue with loading a local tokenizer config. Previously the default functionality would look in the current working directory. Now if a local model path is specified we will check that directory for the tokenizer_config. 
uses tokenizer_config from hub
```
text-generation-launcher --model-id HuggingFaceH4/zephyr-7b-beta
```
use tokenizer_config from local model path
```
text-generation-launcher \
    --model-id ~/.cache/huggingface/hub/models--HuggingFaceH4--zephyr-7b-beta/snapshots/dc24cabd13eacd3ae3a5fe574bd645483a335a4a/
```
use specific tokenizer_config file
```
text-generation-launcher \
    --model-id ~/.cache/huggingface/hub/models--HuggingFaceH4--zephyr-7b-beta/snapshots/dc24cabd13eacd3ae3a5fe574bd645483a335a4a/ \
    --tokenizer-config-path ~/.cache/huggingface/hub/models--HuggingFaceH4--zephyr-7b-beta/snapshots/dc24cabd13eacd3ae3a5fe574bd645483a335a4a/tokenizer_config.json
```
--------- Co-authored-by: Nicolas Patry

Updating tokenizers. (#1517)

[docs] Fix link to Install CLI (#1526) Attempts to fix a link from Using TGI CLI to Installation.

feat: add ie update to message docs (#1523) update messages api docs and add Hugging Face Inference Endpoints integrations section/instructions --------- Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>

feat: use existing add_generation_prompt variable from config in temp… (#1533) This PR adds support to read the `add_generation_prompt` from the config and use it in the chat template. If `add_generation_prompt` does not exist we default to false.

Impl simple mamba model (#1480) This draft PR is a work in progress implementation of the mamba model. This PR currently loads weights, and produces correct logits after a single pass. This PR still needs to correctly integrate this model so it produces tokens as expected, and apply optimization to avoid all copies during runtime/unnecessary operations.
[Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Albert Gu and Tri Dao)](https://arxiv.org/abs/2312.00752) https://github.com/johnma2006/mamba-minimal https://github.com/huggingface/candle/blob/main/candle-examples/examples/mamba-minimal/model.rs https://github.com/huggingface/transformers/pull/28094 Notes: this dev work is currently targeting `state-spaces/mamba-130m`, so if you want to test please use that model. Additionally when starting the router the prefill needs to be limited: `cargo run -- --max-batch-prefill-tokens 768 --max-input-length 768` Integration tests have been added and basic functionality such as model loading is supported. ```bash cd integration-tests pytest -vv models/test_fused_kernel_mamba.py ``` - [x] add tests - [x] load model - [x] make simple request - [ ] resolve warmup issue - [ ] resolve output issues fetching models tested during dev ```bash text-generation-server download-weights state-spaces/mamba-130m text-generation-server download-weights state-spaces/mamba-1.4b text-generation-server download-weights state-spaces/mamba-2.8b ``` The server can be run ```bash cd server MASTER_ADDR=127.0.0.1 MASTER_PORT=5555 python text_generation_server/cli.py serve state-spaces/mamba-2.8b ``` router ```bash cargo run ``` make a request ```bash curl -s localhost:3000/generate \ -X POST \ -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \ -H 'Content-Type: application/json' | jq ``` response ```json { "generated_text": "\n\nDeep learning is a machine learning technique that uses a deep neural network to learn from data." } ``` --------- Co-authored-by: Nicolas Patry Update to peft 0.8.2 (#1537) - [x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [x] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [x] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [x] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. 
@OlivierDehaene OR @Narsil feat(server): add frequency penalty (#1541) chore: bump ci rust version (#1543) This PR bumps the rust toolchain in CI to resolve the CI build issue ```bash Downloaded crossbeam-utils v0.8.19 Downloaded crc32fast v1.3.2 error: failed to compile `text-generation-router v1.4.0 (/home/runner/work/text-generation-inference/text-generation-inference/router)`, intermediate artifacts can be found at `/home/runner/work/text-generation-inference/text-generation-inference/target` Caused by: package `clap_lex v0.7.0` cannot be built because it requires rustc 1.74 or newer, while the currently active rustc version is 1.71.0 Either upgrade to rustc 1.74 or newer, or use cargo update -p clap_lex@0.7.0 --precise ver where `ver` is the latest version of `clap_lex` supporting rustc 1.71.0 make: *** [Makefile:12: install-router] Error 101 ``` ROCm AWQ support (#1514) This PR adds the possibility to run AWQ models with Exllama/GPTQ kernels, specifically for ROCm devices that support Exllama kernels but not AWQ's GEMM. This is done by : - un-packing, reordering and re-packing AWQ weights when `--quantize gptq` but the model's `quant_method=awq`. - avoiding overflows when adding 1 to zeros in exllama and triton. Ref: https://github.com/casper-hansen/AutoAWQ/pull/313 - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. --------- Co-authored-by: Nicolas Patry feat(router): add max_batch_size (#1542) Some hardware require a maximum batch size. feat: experimental support for cuda graphs (#1428) Co-authored-by: Nicolas Patry feat: add deserialize_with that handles strings or objects with content (#1550) This PR adds a simple custom `deserialize_with` function that parses a string or an object with a content property. This should help support more token configuration files stored on the hub Fixing glibc version in the runtime. (#1556) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). 
- [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Upgrade intermediary layer for nvidia too. (#1557) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Improving mamba runtime by using updates (#1552) - Move float16 to bfloat16, which has fewer precision issues (load tests were failing with the update kernels + f16; everything works under bf16). Another note: we are not respecting the layer norm in f32 defined in the configuration (this is OK in my book, but it could impact the f16 precision). - Moved to update kernels. Triton overhead is super high; removing it by switching to cuda graphs works great (an update cuda graph is available in TRT-LLM if needed, and it seems *exactly* like the regular ssm kernel). - Moved the inference_params struct so that there are only 2 tensors, to reduce the overhead of copying back and forth to the cuda graphs. - Leftover overhead seems to be entirely in the tokenization bit (4 copies are still paid before launching the graph). Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Small cleanup. (#1560) Using a single `os.getenv` statement instead of multiple. Should make truthful values easier to catch. In the end we didn't move towards a full CLI because modifying globals in Python is error prone (it depends on code import order). Added an error when mamba is launched with TP. Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Outlines guided generation (#1539) This WIP PR starts to add grammar support via outlines, currently this PR supports very simple regex grammars and does not optimize for precompiling or caching grammar fsm's. todo: - [X] add simple outlines guidance to `NextTokenChooser` - [X] update protos for grammar - [X] update generation params API - [X] constrain simple grammar - [ ] support parsing more complex grammar into fsm - [ ] support all outline support grammar types - [ ] explore optimizations to avoid recompiling grammars guided request ```bash curl -s 'http://localhost:3000/generate' \ --header 'Content-Type: application/json' \ --data-raw '{ "inputs": "make an email for david: \n", "parameters": { "max_new_tokens": 6, "grammar": "[\\w-]+@([\\w-]+\\.)+[\\w-]+" } }' | jq ``` response ```json { "generated_text": "david@example.com" } ``` unguided request ```bash curl -s 'http://localhost:3000/generate' \ --header 'Content-Type: application/json' \ --data '{ "inputs": "make an email for david: \n", "parameters": { "max_new_tokens": 6 } }' | jq ``` response ```json { "generated_text": " email = 'david" } ``` Added `name` field to OpenAI compatible API Messages (#1563) Literally just adds the name field to the Message class. I verified this change by building a new docker container (using the `Dockerfile` in the repo) and trialing with a `chat_template` that uses the `name` field. Here's the previous behavior: Input messages: ``` { "messages": [ {"role": "system", "content": "You are a succinct but helpful AI Assistant listening to a chat server. Address everyone by @"}, {"role": "user", "name": "Aaron", "content": "Hello There!"}, {"role": "assistant", "content": " Hello @Aaron! How can I assist you today?"}, {"role": "user", "name": "Sally", "content": "Hiya everyone. Is @Aaron is this room?"} ], "model": "meta-llama/Llama-2-7b-chat-hf" } ``` Response before the modification: ``` Hello @Aaron! Yes, you are in the chat room. How can I assist you today? šŸ˜Š Hiya everyone! *waves* It's great to see you all here. Is there something on your mind that you'd like to talk about or ask? I'm here to listen and help in any way I can. šŸ¤– ``` Response after my modification: ``` Hello @Sally! Yes, @Aaron is currently in the chat room. How may I assist you today? ``` Fixes #1558 - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? 
Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? @Narsil --------- Co-authored-by: Aaron Mihalik Co-authored-by: drbh Bugfix: eos and bos tokens positions are inconsistent (#1567) chore: add pre-commit (#1569) feat: add chat template struct to avoid tuple ordering errors (#1570) v1.4.1 (#1568) Fix mistral with length > window_size for long prefills (rotary doesn't create long enough cos, sin). (#1571) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. improve endpoint support (#1577) small PR to add a new interface endpoint behind a feature fix: refactor syntax to correctly include structs (#1580) This PR fixes a compilation bug related to conditionally adding docs behind a feature flag fix(router): fix openapi and add jsonschema validation (#1578) feat: add support for Gemma (#1583) v1.4.2 (#1585) fix: fix openapi schema (#1586) fix: avoid default message (#1579) This PR avoids setting a default message in order to avoid unexpected generations Revamp medusa implementation so that every model can benefit. (#1588) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Support tools (#1587) This work in progress PR begins to add support for tools. Tools relies on grammar support and still has some unsolved challenges. Opening the PR for visibility and feedback Fixing x-compute-time. 
(#1606) It was meant to be in seconds float Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Fixing guidance docs. (#1607) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. feat: starcoder2 (#1605) feat: Qwen2 (#1608) See #1584 --------- Co-authored-by: Cheng Kuan Yong Jason v1.4.3 (#1609) fix: Handle concurrent grammar requests (#1610) This PR fixes parallel grammar requests, currently grammar states are not concatenated correctly when a new request is added to the batch and this results in incorrect generation. This PR updates the `concatenate` function to correctly include the previous states. fixes: #1601 Fix idefics default. (#1614) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. 
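As context for the concurrent-grammar fix above (#1610), the failure mode only appears when several grammar-constrained requests end up in the same batch. A minimal way to exercise that path, assuming a local TGI instance on `localhost:3000` and reusing the regex grammar from the guided-generation example earlier, is to fire the requests in parallel:

```bash
# Send several grammar-constrained requests at once so they get batched together;
# before #1610 the grammar states were concatenated incorrectly in this case.
for i in 1 2 3; do
  curl -s 'http://localhost:3000/generate' \
    --header 'Content-Type: application/json' \
    --data-raw '{
      "inputs": "make an email for david: \n",
      "parameters": {
        "max_new_tokens": 6,
        "grammar": "[\\w-]+@([\\w-]+\\.)+[\\w-]+"
      }
    }' | jq &
done
wait
```

Each response should still individually match the regex; before the fix, adding a request to an in-flight batch could leave the grammar states mis-concatenated and produce incorrect generations.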
Fix async client timeout (#1617) Fixes #1616 According to the [aiohttp.ClientTimeout docs](https://docs.aiohttp.org/en/stable/client_reference.html#aiohttp.ClientTimeout), the arguments should be in seconds. This PR removes the multiplication by 60. @OlivierDehaene OR @Narsil feat: accept legacy request format and response (#1527) This WIP PR (will) add support for legacy OpenAI `v1/completions` API. This should allow TGI to be a drop in replacement for OpenAI when using tools that rely on the completions api Should fix: https://github.com/huggingface/text-generation-inference/issues/1468 fix: add missing stop parameter for chat request (#1619) This PR adds the missing `stop` parameter to the `ChatRequest` struct which allows calls to specify a list of stop sequences fix: correctly index into mask when applying grammar (#1618) This PR fixes how the grammar mask is index when generating text and adds a new test to ensure the grammars work with non flash models Use a better model for the quick tour (#1639) Falcon models are long superseded by better models like Zephyr and OpenHermes. This PR updates the docs accordingly Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Upgrade nix version from 0.27.1 to 0.28.0 (#1638) Fix the following carsh when build the docker on Ubuntu22.04 ``` error[E0432]: unresolved import `nix::sys::signal::Signal` --> launcher/src/main.rs:2:30 | 2 | use nix::sys::signal::{self, Signal}; | ^^^^^^ no `Signal` in `sys::signal` | = help: consider importing this type alias instead: ctrlc::Signal error[E0432]: unresolved import `nix::unistd::Pid` --> launcher/src/main.rs:3:5 | 3 | use nix::unistd::Pid; | ^^^^^^^^^^^^^^^^ no `Pid` in `unistd` | note: found an item that was configured out --> /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/nix-0.27.1/src/unistd.rs:183:12 | 183 | pub struct Pid(pid_t); | ^^^ = note: the item is gated behind the `process` feature error[E0425]: cannot find function `kill` in module `signal` ``` Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? 
Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. --------- Signed-off-by: yuanwu Co-authored-by: drbh Update peft + transformers + accelerate + bnb + safetensors (#1646) Fix index in ChatCompletionChunk (#1648) Fix a small inconsistency compared the OpenAI's chat-completion behavior (introduced in https://github.com/huggingface/text-generation-inference/pull/1427 cc @drbh). When using `stream=True`, each chunk has an `index` value in `ChatCompletionChoice`. This index is not meant to be the index of the generated token but the index of the choice, which is always 0 (since TGI always return a single choice). See https://platform.openai.com/docs/api-reference/chat/object: > index _integer_ > The index of the choice in the list of choices. --- So instead of ```js data:{"id":"","object":"text_completion","created":1710508199,"model":"HuggingFaceH4/zephyr-7b-beta","system_fingerprint":"1.4.3-sha-e6bb3ff","choices":[{"index":1,"delta":{"role":"assistant","content":"I"},"logprobs":null,"finish_reason":null}]} data:{"id":"","object":"text_completion","created":1710508199,"model":"HuggingFaceH4/zephyr-7b-beta","system_fingerprint":"1.4.3-sha-e6bb3ff","choices":[{"index":2,"delta":{"role":"assistant","content":"'"},"logprobs":null,"finish_reason":null}]} data:{"id":"","object":"text_completion","created":1710508199,"model":"HuggingFaceH4/zephyr-7b-beta","system_fingerprint":"1.4.3-sha-e6bb3ff","choices":[{"index":3,"delta":{"role":"assistant","content":"m"},"logprobs":null,"finish_reason":"length"}]} ``` if should return ```js data:{"id":"","object":"text_completion","created":1710508199,"model":"HuggingFaceH4/zephyr-7b-beta","system_fingerprint":"1.4.3-sha-e6bb3ff","choices":[{"index":0,"delta":{"role":"assistant","content":"I"},"logprobs":null,"finish_reason":null}]} data:{"id":"","object":"text_completion","created":1710508199,"model":"HuggingFaceH4/zephyr-7b-beta","system_fingerprint":"1.4.3-sha-e6bb3ff","choices":[{"index":0,"delta":{"role":"assistant","content":"'"},"logprobs":null,"finish_reason":null}]} data:{"id":"","object":"text_completion","created":1710508199,"model":"HuggingFaceH4/zephyr-7b-beta","system_fingerprint":"1.4.3-sha-e6bb3ff","choices":[{"index":0,"delta":{"role":"assistant","content":"m"},"logprobs":null,"finish_reason":"length"}]} ``` **EDIT:** I also edited ToolCall.index to be always `0` (instead of the generated token index) but for this one I'm actually unsure. It might be the index of the tool in the array of tools? OpenAI's documentation doesn't provide any information about it: > index _integer_ --- I also noticed that in OpenAI's example, the last chunk doesn't have a delta and is the only one that has a `finish_reason` returning. TGI is slightly different since the last chunk has both the last delta (i.e. the last generated token) + the finish reason. I don't think this is worth fixing since it is not a requirement according to the docs/specs (at least not that I know of). Fixing minor typo in documentation: supported hardware section (#1632) - [x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). 
- [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Signed-off-by: Sachin Varghese feat: bump minijinja and add test for core templates (#1626) This PR bumps `minijinja` and adds tests for all core models as identified by @xenova 🙏 Inspiration: https://github.com/huggingface/huggingface.js/blob/main/packages/jinja/test/e2e.test.js TODO: - [X] add new test to iterate over known templates - [X] add default templates - [x] add custom templates feat: support force downcast after FastRMSNorm multiply for Gemma (#1658) This PR adds `force_downcast_after` to `FastRMSNorm.forward` which is used in the Gemma model. References https://github.com/huggingface/transformers/pull/29402 and https://github.com/huggingface/transformers/pull/29729 Setting `force_downcast_after=True` will perform the `hidden_states * weight` multiplication in f32 and then downcast to half. This differs slightly from the current implementation which first casts the `hidden_states` to a half and then multiplies. fix: prefer spaces url over temp url (#1662) This PR fixes the broken urls in the idefics tests that were causing CI to fail. fix: improve tool type, bump pydantic and outlines (#1650) This PR resolves a couple of issues: - [X] adjusts the tool response to align with openai's tools response type - [X] bumps pydantic to `2.6.4` in all apps (resolves dependency issue when running tests) - [X] bumps `outlines` version and fixes the import for the new name Remove unnecessary cuda graph. (#1664) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Repair idefics integration tests. (#1663) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. feat: update client to 0.7 (#1667) Close #1652 fix: LlamaTokenizerFast to AutoTokenizer at flash_mistral.py (#1637) A few cases where you're using a mistral structure or mixtral structure but not a llama tokenizer, why not make it to call the AutoTokenizer in exception handling. Similar PR #619 @Narsil Inline images for multimodal models. (#1666) feat: cohere (#1660) v1.4.4 (#1668) fix: adjust logprob response logic (#1682) This PR fixes a bug with `ChatCompletionLogprobs` where if `top_tokens.len() == 0` empty results were returned. ```bash curl http://localhost:3000/v1/chat/completions \ -X POST \ -H 'Content-Type: application/json' \ -d '{ "model": "tgi", "logprobs": true, "messages": [ { "role": "user", "content": "What is deep learning?" } ], "stream": false, "max_tokens": 20 }' ``` response ```json {"id":"","object":"text_completion","created":1711588522,"model":"google/gemma-2b-it","system_fingerprint":"1.4.4-native","choices":[{"index":0,"message":{"role":"assistant","content":"**Deep learning** is a subset of machine learning (ML) that emphasizes the creation of **artificial"},"logprobs":{"content":[{"token":"**","logprob":-0.22558594,"top_logprobs":[]},{"token":"Deep","logprob":-0.0014877319,"top_logprobs":[]},{"token":" learning","logprob":-0.12695312,"top_logprobs":[]},{"token":"**","logprob":-0.055664062,"top_logprobs":[]},{"token":" is","logprob":-0.00090026855,"top_logprobs":[]},{"token":" a","logprob":-0.006072998,"top_logprobs":[]},{"token":" subset","logprob":-2.25,"top_logprobs":[]},{"token":" of","logprob":-0.00031089783,"top_logprobs":[]},{"token":" machine","logprob":-0.091308594,"top_logprobs":[]},{"token":" learning","logprob":-0.00002348423,"top_logprobs":[]},{"token":" (","logprob":-1.671875,"top_logprobs":[]},{"token":"ML","logprob":-0.00040626526,"top_logprobs":[]},{"token":")","logprob":-0.00016212463,"top_logprobs":[]},{"token":" that","logprob":-0.13769531,"top_logprobs":[]},{"token":" emphasizes","logprob":-4.03125,"top_logprobs":[]},{"token":" the","logprob":-0.2890625,"top_logprobs":[]},{"token":" creation","logprob":-3.109375,"top_logprobs":[]},{"token":" of","logprob":-0.00024032593,"top_logprobs":[]},{"token":" **","logprob":-1.2265625,"top_logprobs":[]},{"token":"artificial","logprob":-0.10546875,"top_logprobs":[]}]},"finish_reason":"length"}],"usage":{"prompt_tokens":15,"completion_tokens":20,"total_tokens":35}} ``` fix: handle batches with and without grammars (#1676) This PR correctly handles batches with a mixture of constrained and non constrained generations. 
Currently, if a batch contains mixed generations, the generation will throw an error because it will incorrectly attempt to constrain a request with an empty grammar. We now handle `None` grammars and only apply the mask if needed. Fixes: https://github.com/huggingface/text-generation-inference/issues/1643 feat: Add dbrx support (#1685) Close #1679 v1.4.5 (#1686) Add cuda graphs sizes and make it default. (#1703)

```
text-generation-launcher --model-id XXX # Uses cuda graphs by default
text-generation-launcher --model-id XXX --cuda-graphs "1,2" # Restrict the number of cuda graphs, which saves VRAM
text-generation-launcher --model-id XXX --cuda-graphs "0" # Disabling it entirely
```

Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Pickle conversion now requires `--trust-remote-code`. (#1704) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Push users to streaming in the readme. (#1698) Fixing cohere tokenizer. (#1697) Force weights_only (before fully breaking pickle files anyway). (#1710) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes?
Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Regenerate ld.so.cache (#1708) fixes https://github.com/huggingface/text-generation-inference/issues/1711 Signed-off-by: Raphael Glon Co-authored-by: Raphael Glon Revert license to Apache 2.0 (#1714) Reverts huggingface/text-generation-inference#725 --------- Co-authored-by: Julien Chaumond Automatic quantization config. (#1719) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Adding Llava-Next (Llava 1.6) with full support. (#1709) - Changed all models to extract `embed_tokens` in order to enable llava to separately call the embeddings and the core model layers. - Added VlmCausalLM to inherit from FlashMistral in order to be maximally supported. The only added logics sits on top and parses images into pixel values, preallocates input_ids space for the image embeddings, and passes them for the model. - Added Clip for the vision tower. - Didn't add flash for the vision tower since there's no padding anyway. - Added heuristic (potentially incomplete) to calculate number of features *before* calculating the clip patches (allows for easier logic reuse of the LLM under the hood). Still needs to be done: - [x] Implement the image parsing in the controller side, to avoid downloading n times per TP shard and also refusing requests too large early and avoid issues where the truncation actually truncates the image. - [ ] Make sure it works with quantization properly. - [x] Make sure it works with TP>1 Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). 
- [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. fix: fix CohereForAI/c4ai-command-r-plus (#1707) @Narsil @drbh this will update flash attention v2 and vllm. You will need to re-install them. hotfix: mixtral Update libraries (#1713) Co-authored-by: Nicolas Patry Easier defaults for models stemmed from configs. Revert "Easier defaults for models stemmed from configs." This reverts commit b83aab9bb390c85d6675f00883c29a5aa205a1ca. Dev/mask ldconfig output v2 (#1716) wrap text-generation-launcher in docker image mask ldconfig failures to user (no need in most cases anyway) --------- Signed-off-by: Raphael Glon Co-authored-by: Raphael Glon Fp8 Support (#1726) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. --------- Co-authored-by: Dong Shin Upgrade EETQ (Fixes the cuda graphs). (#1729) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. fix(router): fix a possible deadlock in next_batch (#1731) chore(cargo-toml): apply lto fat and codegen-units of one (#1651) I have suggested similar changes over at https://github.com/huggingface/text-embeddings-inference/pull/201. Here being my additional question, why `debug` is enabled during release building? (hence I didn't add the flag to script things) Applying the following optimizations: - `lto` (link time optimizations) over all code (including dependencies) - Using a single `codegen-unit` to apply optimizations within 1 code unit at build time - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). 
- [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. @OlivierDehaene OR @Narsil Improve the defaults for the launcher (#1727) - Renamed `max_input_length` into `max_input_tokens` for consistency (backward compatible change, will yell if both are set.) - Will now use the config for `max_input_tokens` `max_total_token` and `max_batch_total_tokens`. - Capping the values to 16k in order to save VRAM on behalf of users (overriddable by simply setting the values). Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. feat: medusa v2 (#1734) Fix typo in guidance.md (#1735) compliation -> compilation - [x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. v2.0.0 (#1736) Fixing CI. (#1748) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). 
- [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. feat: improve tools to include name and add tests (#1693) This PR makes tool calling aware of the name of the function selected. Fixes: https://github.com/huggingface/text-generation-inference/issues/1657 Thank you @puppetm4st3r for the helpful snippets, large parts of this PR are simply refactors of the code shared 🙏 **Note:** opening a draft PR because small tweaks are needed before merging. Update response type for `/v1/chat/completions` and `/v1/completions` (#1747) `/v1/chat/completions` and `/v1/completions` have different output types depending on the `stream` parameter. This PR aims at fixing the inconsistency in the auto-generated [openapi.json](https://huggingface.github.io/text-generation-inference/openapi.json) specs. cc @OlivierDehaene @drbh I reused what had been done for the `/` endpoint but haven't tested anything myself. Could you confirm this is the correct way of handling things? Also, should I update the openapi.json file manually? If yes, how can I do it? fix: bump clients test base url to llama (#1751) This PR bumps the client tests from `google/flan-t5-xxl` to `meta-llama/Llama-2-7b-chat-hf` to resolve issues when calling the endpoint while `google/flan-t5-xxl` is not available. Run with

```bash
make python-client-tests
clients/python/tests/test_client.py .............. [ 43%]
clients/python/tests/test_errors.py .......... [ 75%]
clients/python/tests/test_inference_api.py ...... [ 93%]
clients/python/tests/test_types.py .. [100%]
```

**Note:** the `google/flan-t5-xxl` function is currently unused but still included in `conftest.py`. feat: accept list as prompt and use first string (#1702) This PR allows the `CompletionRequest.prompt` to be sent as a string or array of strings. When an array is sent, the first value will be used if it's a string; otherwise the corresponding error will be thrown. Fixes: https://github.com/huggingface/text-generation-inference/issues/1690 Similar to: https://github.com/vllm-project/vllm/pull/323/files Upgrading all versions. (#1759) v2.0.1 Make `--cuda-graphs` work as expected (bis) (#1768) This was ignored up to now, even with `--cuda-graphs 0`. With this fix, `--cuda-graphs` is obeyed. fix typos in docs and add small clarifications (#1790) Fix some small typos in the docs; add minor clarifications; add guidance to features on landing page. - [x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? @OlivierDehaene Add attribute descriptions for `GenerateParameters` (#1798) Once https://github.com/huggingface/huggingface.js/pull/629 gets merged, we will rely on TGI's specs to generate jsonschema types for `text_generation` and `chat_completion`. This PR adds some documentation for `GenerationParameters`'s properties so that they get documented in the downstream tools (TGI docs but also `huggingface.js`/`huggingface_hub` inference clients). I mostly took inspiration from [the python client](https://github.com/huggingface/text-generation-inference/blob/main/clients/python/text_generation/types.py) for the descriptions. feat: allow null eos and bos tokens in config (#1791) This PR resolves an issue loading in tokenizer_configs where the eos or bos token is null as in: [Qwen/Qwen1.5-72B-Chat](https://huggingface.co/Qwen/Qwen1.5-72B-Chat/blob/main/tokenizer_config.json) resolves: https://github.com/huggingface/text-generation-inference/issues/1545 and related to https://github.com/QwenLM/Qwen1.5/issues/162 Phi3 support (#1797) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Idefics2. (#1756) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. 
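To make the `GenerateParameters` descriptions above (#1798) concrete, here is a sketch of a `/generate` request that exercises several of the documented parameters against a local instance on `localhost:3000`; the parameter names follow the existing API, and the values are arbitrary:

```bash
# Illustrative request combining several documented generation parameters.
curl -s 'http://localhost:3000/generate' \
  --header 'Content-Type: application/json' \
  --data '{
    "inputs": "What is Deep Learning?",
    "parameters": {
      "max_new_tokens": 20,
      "do_sample": true,
      "temperature": 0.7,
      "top_p": 0.95,
      "repetition_penalty": 1.1,
      "frequency_penalty": 0.5,
      "stop": ["\n\n"],
      "seed": 42
    }
  }' | jq
```

These are the descriptions that downstream clients (TGI docs, `huggingface.js`, `huggingface_hub`) will surface once they are generated from TGI's specs.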
fix: avoid frequency and repetition penalty on padding tokens (#1765) This PR resolves an issue with the penalty processors during batched generation where extra padding tokens incorrectly impact the penalty scores. Generation is impacted when at least one item in the batch includes a `frequency_penalty`. Reproduction script below:

```python
import requests
from concurrent import futures
import time

headers = {
    "Content-Type": "application/json",
}
json_data = {
    "inputs": "[INST] Whats the capitol of France? [/INST]",
    "parameters": {
        "max_new_tokens": 100,
        "seed": 20,
        "do_sample": False,
    },
}
json_data2 = {
    "inputs": "[INST]Write a mind bending story: I saw a puppy a cat a rat and a raccoon during my bike ride in the park[/INST]",
    "parameters": {
        "max_new_tokens": 100,
        "seed": 2,
        "do_sample": False,
        # OFFENDING LINE
        "frequency_penalty": 1.05,
    },
}
base_url = "http://localhost:3000/generate"


def req():
    response = requests.post(base_url, headers=headers, json=json_data)
    print("[req ]", response.json())


def req2():
    response = requests.post(base_url, headers=headers, json=json_data2)
    print("[req2]", response.json())


n = 1
for i in range(0, 3):
    print(f"- {n} threads -")
    with futures.ThreadPoolExecutor(max_workers=n) as executor:
        executor.submit(req)
        for i in range(3):
            executor.submit(req2)
    n += 1
```

**Note:** divergence from expected generation is easier to reproduce with batched grammar requests, as they are more sensitive to unexpected outputs. This PR resolves the issue by setting the penalty score to 0 where input ids are padding tokens (0). --------- Co-authored-by: OlivierDehaene Adding `HF_HUB_OFFLINE` support in the router. (#1789) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. feat: improve temperature logic in chat (#1749) This PR adds support for `do_sample` to chat to enable greedy sampling. --------- Co-authored-by: Nicolas Patry Updating the benchmarks so everyone uses openai compat layer. (#1800) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes?
Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Update guidance docs to reflect grammar support in API (#1775) Update guidance docs to reflect grammar support in API. The previous wording was vague and made it sound like openai API supported the grammar parameter. https://github.com/huggingface/text-generation-inference/blob/main/router/src/server.rs#L654 confirms that support for grammar is TGI only at this time. - [ x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [x ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ x] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ x] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ x] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. Use the generation config. (#1808) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. 2nd round of benchmark modifications (tiny adjustements to avoid overloading the host). (#1816) Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? 
Adding new env variables for TPU backends. (#1755)

On TPU (and probably Inferentia), the model needs to know right off the bat about BATCH_SIZE and MAX_TOTAL_TOKENS (since the entire cache will be determined by both). This PR sends that information to the shards so they can allocate accordingly. Should be a no-op for other backends.

add intel xpu support for TGI (#1475)

---------

Signed-off-by: Wang, Yi A
Co-authored-by: Morgan Funtowicz
Co-authored-by: Nicolas Patry

Blunder (#1815)
Fixing qwen2. (#1818)

Dummy CI run. (#1817)

Changing the waiting_served_ratio default (stack more aggressively by default). (#1820)

This should enable more aggressive stacking by default, meaning better throughput (in throughput-constrained environments).
Better graceful shutdown. (#1827)

Prepare release.

Add the missing `tool_prompt` parameter to Python client (#1825)

This PR adds the missing `tool_prompt` parameter in the Python client. @Narsil

Small CI cleanup. (#1801)

Just unifying some branches and making intentions clearer (no cuda graph when 0 all the way in the launcher).
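A hedged sketch of the `tool_prompt` parameter added to the Python client in #1825 above; the method signature, tool format, and response shape are assumptions and may differ from the released client:

```python
from text_generation import Client  # assumed import path of the TGI Python client

client = Client("http://localhost:3000")
tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
response = client.chat(
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=tools,
    tool_prompt="You may call one of the following tools:",  # the parameter this PR adds
)
print(response)
```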
Add reference to TPU support (#1760)

This PR makes a small addition to the readme that references new TGI support for TPUs via Optimum TPU (https://huggingface.co/docs/optimum-tpu/howto/serving).

fix: use get_speculate to the number of layers (#1737)

feat: add how it works section (#1773)

This PR adds a short "how it works" section to guidance and includes a mention of the outlines library that enables grammars/tools, and a small formatting change.

---------

Co-authored-by: Mishig

Fixing frequency penalty (#1811)

Thank you so much for the work you are doing; this is my little contribution to this great thing you have built. I hope it is useful and helpful, please don't hesitate to discuss any matters that are not clear!

I am basing my implementation of frequency penalty on OpenAI's implementation: https://platform.openai.com/docs/guides/text-generation/parameter-details

The problem I see with TGI's current implementation is that it does not take into account the frequency of tokens which have already been sampled in the current generation stream. Also, the scaling of the adjusted token logits is done differently for positive and negative logits, while in OpenAI's implementation token frequency is taken into account and the scaling is always done with a subtraction (if the penalty is positive) or an addition (if the penalty is negative). This leads to corrupt generations, as I mentioned in issue #1810. Moreover, after my tests, other issues are also gone, like the one about some requests with ``frequency_penalty = 1.0`` overruling other requests (with ``frequency_penalty = 0.0``) in the same batch and therefore corrupting all generations in the batch. Basically, padding does not affect this implementation, so I believe this ``score *= input_ids.ne(0)`` is not needed anymore.

| Frequency penalty | -1.0 | 0.0 | 1.0 |
| -- | -- | -- | -- |
| Before my change | https://paste.mozilla.org/JxqGJkWY | https://paste.mozilla.org/hrztJ56h | https://paste.mozilla.org/pBSEH2zw |
| After my change | https://paste.mozilla.org/7gXCi7zo | https://paste.mozilla.org/ZR9rJ92g | https://paste.mozilla.org/gHaD2YnC |

---------

Co-authored-by: martini

feat: add vlm docs and simple examples (#1812)

This PR starts to add documentation for visual language models.

Handle images in chat api (#1828)

This PR allows messages to be formatted as simple strings, or as an array of objects including image urls. This is done by formatting content arrays into a simple string.

Example using `llava-hf/llava-v1.6-mistral-7b-hf`:

```bash
curl localhost:3000/v1/chat/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
  "model": "tgi",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Whats in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png"
          }
        }
      ]
    }
  ],
  "stream": false,
  "max_tokens": 20,
  "seed": 42
}'
```

is equivalent to this simpler request

```bash
curl localhost:3000/v1/chat/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
  "model": "tgi",
  "messages": [
    {
      "role": "user",
      "content": "Whats in this image?\n![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png)"
    }
  ],
  "stream": false,
  "max_tokens": 20,
  "seed": 42
}'
```

output

```
```

---------

Co-authored-by: Nicolas Patry

chore: update torch (#1730)

Co-authored-by: Nicolas Patry

(chore): torch 2.3.0 (#1833)

fix: split docs and start conceptual page (#1836)

This PR improves the guidance docs and adds a section that explains how grammars are applied on a technical level.

Fix: "Fixing" double BOS for mistral too. (#1843)

Adding scripts to prepare load data. (#1841)
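To make the equivalence shown in #1828 above concrete, a small sketch of flattening an OpenAI-style content array into the plain-string form used in the second curl request; this is only an illustration, not the router's actual implementation (which lives in Rust):

```python
def flatten_content(content) -> str:
    """Turn a chat `content` field (string or list of parts) into a plain string."""
    if isinstance(content, str):
        return content
    parts = []
    for part in content:
        if part.get("type") == "text":
            parts.append(part["text"])
        elif part.get("type") == "image_url":
            # Images are rendered as markdown image tags, as in the example above.
            parts.append(f"![]({part['image_url']['url']})")
    return "\n".join(parts)


content = [
    {"type": "text", "text": "Whats in this image?"},
    {"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png"}},
]
print(flatten_content(content))
```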
Remove misleading warning (not that important nowadays anyway). (#1848)

feat: prefer huggingface_hub in docs and show image api (#1844)

This PR prefers the `huggingface_hub` library, refactors the grammar docs and adds the new image_url api to the vlm docs.

Updating Phi3 (long context). (#1849)

Add router name to /info endpoint (#1854)

Add `router` key in `/info` endpoint and set it to `env!("CARGO_PKG_NAME")` => so always set to `"text-generation-router"` in TGI. Happy to change the naming if you think of a better one (framework? package_name?). The goal is to use this information in `InferenceClient` to know the model is served with TGI. At the moment we can use https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.2/info to infer it is TGI-served because it returns information, but having a proper key would be better. For context, a transformers-served model only outputs `{"ok": "ok"}` (see [here](https://api-inference.huggingface.co/models/microsoft/DialoGPT-large/info)).

Upgrading to rust 1.78. (#1851)
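A quick illustration of the `/info` addition in #1854 above; a minimal sketch assuming the router is reachable on its default port:

```python
import requests

info = requests.get("http://localhost:3000/info", timeout=10).json()
# A TGI-served deployment is expected to report the router package name.
if info.get("router") == "text-generation-router":
    print("Model is served with TGI")
else:
    print("Unknown or non-TGI server:", info)
```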
update xpu docker image and use public ipex wheel (#1860)

Signed-off-by: Wang, Yi A

Refactor layers. (#1866)

Granite support? (#1882)
Add: Support for the Falcon2 11B architecture (#1886)

Adds support for the Falcon2 11B model architecture.

---------

Signed-off-by: Raphael Glon
Signed-off-by: Wang, Yi A
Co-authored-by: OlivierDehaene
Co-authored-by: Nicolas Patry
Co-authored-by: oOraph <13552058+oOraph@users.noreply.github.com>
Co-authored-by: Raphael Glon
Co-authored-by: Julien Chaumond
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
Co-authored-by: abhishek thakur <1183441+abhishekkrthakur@users.noreply.github.com>
Co-authored-by: Dong Shin
Co-authored-by: Christof Weickhardt
Co-authored-by: Ikko Eltociear Ashimine
Co-authored-by: drbh
Co-authored-by: Lucain
Co-authored-by: fxmarty <9808326+fxmarty@users.noreply.github.com>
Co-authored-by: Moritz Laurer <41862082+MoritzLaurer@users.noreply.github.com>
Co-authored-by: dr3s
Co-authored-by: Wang, Yi
Co-authored-by: Morgan Funtowicz
Co-authored-by: Maziyar Panahi
Co-authored-by: Brandon Royal <2762697+brandonroyal@users.noreply.github.com>
Co-authored-by: Mishig
Co-authored-by: Martin Iglesias Goyanes
Co-authored-by: martini

MLPSpeculator. (#1865)

---------

Co-authored-by: Joshua Rosenkranz

Fixing truncation. (#1890)
Correct 'using guidance' link (#1892)

Fix typo in link to 'using guidance' article.

Add GPT-2 with flash attention (#1889)

This change adds `FlashGPT2ForCausalLM` and wires it up. The model itself is pretty straightforward; the main difference from other models is that it uses trained position embeddings and that all weight matrices are transposed compared to other models (due to the use of Conv1D in the upstream model). @Narsil

Removing accepted ids in the regular info logs, downgrade to debug. (#1898)
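To illustrate the weight-transposition point in #1889 above, a standalone sketch of converting an upstream GPT-2 `Conv1D` weight into the `nn.Linear` layout; this uses plain Hugging Face `transformers`, not TGI's sharded loader:

```python
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
conv1d = model.transformer.h[0].attn.c_attn       # transformers Conv1D, weight: (in, out)

in_features, out_features = conv1d.weight.shape
linear = torch.nn.Linear(in_features, out_features)
with torch.no_grad():
    linear.weight.copy_(conv1d.weight.t())        # Linear stores (out, in), hence the transpose
    linear.bias.copy_(conv1d.bias)

x = torch.randn(1, 3, in_features)
torch.testing.assert_close(linear(x), conv1d(x))  # same result once transposed
```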
feat: add deprecation warning to clients (#1855)

This PR adds a deprecation warning to the clients and points users to https://github.com/huggingface/huggingface_hub

[Bug Fix] Update torch import reference in bnb quantization (#1902)

Fixes `ImportError` occurring from mismatch of usage between torch.nn.Module and nn.Module.

Pali gemma modeling (#1895)

This PR adds paligemma modeling code.

Blog post: https://huggingface.co/blog/paligemma
Transformers PR: https://github.com/huggingface/transformers/pull/30814

Install the latest changes and run with

```bash
text-generation-launcher --model-id gv-hf/PaliGemma-base-224px-hf
```

Basic example sending various requests:

```python
from huggingface_hub import InferenceClient

client = InferenceClient("http://127.0.0.1:3000")

images = [
    "https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/cow_beach_1.png",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png",
]

prompts = [
    "What animal is in this image?",
    "Name three colors in this image.",
    "What are 10 colors in this image?",
    "Where is the cow standing?",
    "answer en Where is the cow standing?",
    "Is there a bird in the image?",
    "Is there a cow in the image?",
    "Is there a rabbit in the image?",
    "how many birds are in the image?",
    "how many rabbits are in the image?",
]

for img in images:
    print(f"\nImage: {img.split('/')[-1]}")
    for prompt in prompts:
        # Prepend the image markdown so the model actually receives the image.
        inputs = f"![]({img}){prompt}\n"
        generated_output = client.text_generation(inputs, max_new_tokens=30, stream=False)
        print([f"{prompt}\n{generated_output}"])
```

---------

Co-authored-by: Nicolas Patry

OpenAI function calling compatible support (#1888)

Fixes https://github.com/huggingface/text-generation-inference/issues/1887 @Narsil

---------

Co-authored-by: Bao Phan

Fixing types. (#1906)
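A hedged sketch of what the OpenAI-compatible function calling in #1888 above enables; the endpoint path and payload shape follow the chat-completions examples earlier in this log, while the tool definition and model name are placeholders:

```python
import requests

payload = {
    "model": "tgi",
    "messages": [{"role": "user", "content": "What is the weather like in Paris?"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "description": "Get the current weather in a given city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    "tool_choice": "auto",
    "max_tokens": 100,
}
resp = requests.post("http://localhost:3000/v1/chat/completions", json=payload, timeout=60)
print(resp.json()["choices"][0]["message"])
```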
Types. (#1909)

Fixing signals. (#1910)

Taking the signal handlers later, so that during loads regular signal handling is done; we only need to handle SIGINT and SIGTERM during real workloads to get more graceful shutdowns when queries are in flight. Fixes #1842

Removing some unused code. (#1915)
MI300 compatibility (#1764)

Adds support for AMD Instinct MI300 in TGI. Most changes are:

* Support PyTorch TunableOp to pick the GEMM/GEMV kernels for decoding https://github.com/pytorch/pytorch/tree/main/aten/src/ATen/cuda/tunable. TunableOp is disabled by default, and can be enabled with `PYTORCH_TUNABLEOP_ENABLED=1`.
* Update ROCm dockerfile to PyTorch 2.3 (actually patched with changes from https://github.com/pytorch/pytorch/pull/124362)
* Support SILU & Linear custom kernels contributed by AMD
* Update vLLM paged attention to https://github.com/fxmarty/rocm-vllm/, branching out of a much more recent commit https://github.com/ROCm/vllm/commit/3489ce7936c5de588916ae3047c44c23c0b0c308
* Support FA2 Triton kernel as recommended by AMD. Can be used by specifying `ROCM_USE_FLASH_ATTN_V2_TRITON=1`.
* Update dockerfile to ROCm 6.1

By default, TunableOp tuning results are saved in `/data` (e.g. `/data/tunableop_meta-llama-Llama-2-70b-chat-hf_tp1_rank0.csv`) to avoid having to rerun the tuning at each `docker run`. Example:

```
Validator,PT_VERSION,2.3.0
Validator,ROCM_VERSION,6.1.0.0-82-5fabb4c
Validator,HIPBLASLT_VERSION,0.7.0-1549b021
Validator,GCN_ARCH_NAME,gfx942:sramecc+:xnack-
Validator,ROCBLAS_VERSION,4.1.0-cefa4a9b-dirty
GemmTunableOp_Half_TN,tn_8192_7_28672,Gemm_Rocblas_45475,0.132098
GemmTunableOp_Half_TN,tn_10240_4_8192,Gemm_Rocblas_45546,0.0484431
GemmTunableOp_Half_TN,tn_32000_6_8192,Default,0.149546
GemmTunableOp_Half_TN,tn_32000_3_8192,Gemm_Rocblas_45520,0.147119
GemmTunableOp_Half_TN,tn_8192_3_28672,Gemm_Rocblas_45475,0.132645
GemmTunableOp_Half_TN,tn_10240_3_8192,Gemm_Rocblas_45546,0.0482971
GemmTunableOp_Half_TN,tn_57344_5_8192,Gemm_Rocblas_45520,0.255694
GemmTunableOp_Half_TN,tn_10240_7_8192,Gemm_Rocblas_45517,0.0482522
GemmTunableOp_Half_TN,tn_8192_3_8192,Gemm_Rocblas_45546,0.0444671
GemmTunableOp_Half_TN,tn_8192_5_8192,Gemm_Rocblas_45546,0.0445834
GemmTunableOp_Half_TN,tn_57344_7_8192,Gemm_Rocblas_45520,0.25622
GemmTunableOp_Half_TN,tn_8192_2_28672,Gemm_Rocblas_45475,0.132122
GemmTunableOp_Half_TN,tn_8192_4_8192,Gemm_Rocblas_45517,0.0453191
GemmTunableOp_Half_TN,tn_10240_5_8192,Gemm_Rocblas_45517,0.0482514
GemmTunableOp_Half_TN,tn_8192_5_28672,Gemm_Rocblas_45542,0.133914
GemmTunableOp_Half_TN,tn_8192_2_8192,Gemm_Rocblas_45517,0.0446516
GemmTunableOp_Half_TN,tn_8192_1_28672,Gemm_Hipblaslt_TN_10814,0.131953
GemmTunableOp_Half_TN,tn_10240_2_8192,Gemm_Rocblas_45546,0.0481043
GemmTunableOp_Half_TN,tn_32000_4_8192,Gemm_Rocblas_45520,0.147497
GemmTunableOp_Half_TN,tn_8192_6_28672,Gemm_Rocblas_45529,0.134895
GemmTunableOp_Half_TN,tn_57344_2_8192,Gemm_Rocblas_45520,0.254716
GemmTunableOp_Half_TN,tn_57344_4_8192,Gemm_Rocblas_45520,0.255731
GemmTunableOp_Half_TN,tn_10240_6_8192,Gemm_Rocblas_45517,0.0484816
GemmTunableOp_Half_TN,tn_57344_3_8192,Gemm_Rocblas_45520,0.254701
GemmTunableOp_Half_TN,tn_8192_4_28672,Gemm_Rocblas_45475,0.132159
GemmTunableOp_Half_TN,tn_32000_2_8192,Default,0.147524
GemmTunableOp_Half_TN,tn_32000_5_8192,Default,0.147074
GemmTunableOp_Half_TN,tn_8192_6_8192,Gemm_Rocblas_45546,0.0454045
GemmTunableOp_Half_TN,tn_57344_6_8192,Gemm_Rocblas_45520,0.255582
GemmTunableOp_Half_TN,tn_32000_7_8192,Default,0.146705
GemmTunableOp_Half_TN,tn_8192_7_8192,Gemm_Rocblas_45546,0.0445489
```

---------

Co-authored-by: Mohit Sharma

Add TGI monitoring guide through Grafana and Prometheus (#1908)

As per title. It is very useful.
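As a tiny companion to the monitoring guide in #1908 above, a sketch of checking the Prometheus endpoint that Grafana scrapes; the default router port and the `tgi_` metric-name prefix are assumptions:

```python
import requests

metrics = requests.get("http://localhost:3000/metrics", timeout=10).text
# Print only the (assumed) TGI-specific series, skipping Prometheus comment lines.
for line in metrics.splitlines():
    if line.startswith("tgi_"):
        print(line)
```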
Update grafana template (#1918)

As per title, there was a mistake; credit to @Narsil. Updated https://huggingface.co/docs/text-generation-inference/basic_tutorials/monitoring as well.

Co-authored-by: Nicolas Patry

Fix TunableOp bug (#1920)

cc @Narsil

Fix TGI issues with ROCm (#1921)

Not all models were tested in https://github.com/huggingface/text-generation-inference/pull/1764. Fixing some more issues (notably starcoder2) here; the full CI will come shortly once we split `build.yml` in two.

Fixing the download strategy for ibm-fms (#1917)

ROCm: make CK FA2 default instead of Triton (#1924)

As per title. Triton autotune overhead is prohibitive, as it needs to be done for each different prompt length.

docs: Fix grafana dashboard url (#1925)

Fixes an incorrect url in the monitoring doc.

feat: include token in client test like server tests (#1932)

This PR simply includes the HF token in the client tests, similar to how it's included in the server tests. This helps avoid CI failures due to rate limiting.

Creating doc automatically for supported models. (#1929)
fix: use path inside of speculator config (#1935)

This PR accesses the path on the speculator similar to `MLPSpeculatorHead.load` and `MedusaHeadV1.load`. These changes resolve this error locally when loading a `MedusaHeadV2`:

```
TypeError: expected str, bytes or os.PathLike object, not dict
```

feat: add train medusa head tutorial (#1934)

This PR adds a tutorial to self-distill and train medusa heads for a specific model.

---------

Co-authored-by: Nicolas Patry

reenable xpu for tgi (#1939)

Signed-off-by: Wang, Yi A

Fixing some legacy behavior (big swapout of serverless on legacy stuff). (#1937)

---------

Co-authored-by: Daniël de Kok

Add completion route to client and add stop parameter where it's missing (#1869)

- Add the stop parameter to the completion route
- Add the completion method to the python client
- Add the stop parameter to the python client's chat method

@Narsil

---------

Co-authored-by: Thomas SCHILLACI
Co-authored-by: Thomas Schillaci
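A hedged sketch of the completion route and `stop` parameter from #1869 above, using an OpenAI-style `/v1/completions` path; the path and payload shape are assumptions, and the Python client's `completion` method could be used instead:

```python
import requests

payload = {
    "model": "tgi",
    "prompt": "def fibonacci(n):",
    "max_tokens": 64,
    # Generation halts as soon as one of these strings is produced.
    "stop": ["\n\n", "def "],
}
resp = requests.post("http://localhost:3000/v1/completions", json=payload, timeout=60)
print(resp.json()["choices"][0]["text"])
```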
Improving the logging system. (#1938)

- Added a debug log for speculated ids (helps seeing the quality of a speculator in the logs).
- Remove newlines from child process logs when re-emitting in non-JSON mode.
- Made standard level be closer to what's expected (only our binaries' level).
- Propagate that level correctly to the shard (was forced into INFO).

Fixing codellama loads by using purely `AutoTokenizer`. (#1947)

- The need for the slow tokenizer default stems from back when llama 1 was introduced and all the flags were not supported in `tokenizers`.
- Fixes #1891

Fix seeded output. (#1949)
Fix (flash) Gemma prefix and enable tests

Fix GPTQ for models which do not have float16 at the default dtype (simpler) (#1953)

Fix GPTQ for models which do not have float16 as the default dtype. Before this change, GPTQ models would not work if the model's default data type is not `float16`. For example, Gemma GPTQ models would fail because the default dtype of Gemma is `bfloat16`. There are two issues:

- If the default `dtype` is not `float16`, the quantizer's `float16` parameters get converted to that dtype; the kernels cannot deal with non-`float16` types.
- The same applies to inputs of quantized ops.

This is resolved by setting the dtype of gptq/awq-quantized models to `float16`. Simpler version of #1951.

**Draft:** just testing...

Processor config chat template (#1954)

This PR loads the `processor_config` similar to the `tokenizer_config` and uses the processor_config's chat_template if the tokenizer_config does not include one. These changes enable chat with idefics2.

fix small typo and broken link (#1958)

Fix a typo; fix a broken link; add one sentence in the guidance docs to make the word "grammar" less abstract. @drbh
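A small sketch of the dtype rule described in #1953 above; an illustrative helper, not TGI's actual code path:

```python
from typing import Optional

import torch

def resolve_dtype(default_dtype: torch.dtype, quantize: Optional[str]) -> torch.dtype:
    # gptq/awq kernels only handle float16 parameters and inputs, so force it
    # even when the checkpoint's default dtype is bfloat16 (e.g. Gemma).
    if quantize in ("gptq", "awq"):
        return torch.float16
    return default_dtype

assert resolve_dtype(torch.bfloat16, "gptq") is torch.float16
assert resolve_dtype(torch.bfloat16, None) is torch.bfloat16
```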
Upgrade to Axum 0.7 and Hyper 1.0 (Breaking change: disabled ngrok tunneling). (#1959)

- Axum upgraded to hyper 1.0 and most of the ecosystem switched, so it's our time now.
- [ngrok-rust](https://github.com/ngrok/ngrok-rust/pull/137/files) hasn't yet, and hasn't for several months now, so let's disable the feature for the time being.

Fix (non-container) pytest stdout buffering-related lock-up

Two issues:

1. When one of the stdout/stderr pipe buffers of a process started with `subprocess.Popen` is full, the process can get blocked until the buffer is drained.
2. Calling `Popen.wait` can deadlock when called before draining the pipe buffers (if they are full).

This avoids the issue altogether by giving the child process a temporary file to write to.
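The pytest fix above boils down to the following pattern; a minimal sketch (the launched command is only a placeholder):

```python
import subprocess
import tempfile

# Writing to a real file instead of a pipe: a full pipe buffer can no longer
# block the child, and Popen.wait() cannot deadlock on undrained output.
with tempfile.TemporaryFile("w+") as log:
    proc = subprocess.Popen(["text-generation-launcher", "--help"], stdout=log, stderr=subprocess.STDOUT)
    proc.wait()
    log.seek(0)
    print(log.read())
```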
Fixing the text part from tokenizer endpoint. (#1967)

feat: adjust attn weight loading logic (#1975)

This PR updates `load_attention` to prefer loading specific attention weights based on the model type. Additionally, there were two cases where `TensorParallelColumnLinear.load_multi` was called; this reduces them to a single path.

Add support for exl2 quantization

Mostly straightforward; changes to existing code:

- Wrap quantizer parameters in a small wrapper to avoid passing around untyped tuples and needing to repack them as a dict (see the sketch below).
- Move scratch space computation to warmup, because we need the maximum input sequence length to avoid allocating huge scratch buffers that OOM.
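A minimal sketch of the first change, under assumed, hypothetical names (the actual exl2 parameter set and wrapper in TGI differ):

```python
# Illustrative sketch only: bundle the quantizer tensors in a typed wrapper
# instead of passing an untyped tuple that each call site repacks into a dict.
from dataclasses import dataclass
import torch


@dataclass
class QuantizerParams:
    qweight: torch.Tensor
    qzeros: torch.Tensor
    scales: torch.Tensor
    g_idx: torch.Tensor | None = None
    bits: int = 4
    groupsize: int = -1

    def as_dict(self) -> dict:
        # single place that defines the dict layout expected downstream
        return {
            "qweight": self.qweight,
            "qzeros": self.qzeros,
            "scales": self.scales,
            "g_idx": self.g_idx,
            "bits": self.bits,
            "groupsize": self.groupsize,
        }
```

Call sites then pass one object around and convert to the expected dict layout in exactly one place.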
Gemma GPTQ checks: skip logprob checks

This test fails somewhat regularly due to non-determinism, and it is primarily meant to verify that we correctly load a model whose default dtype is not `float16`.

Update documentation version to 2.0.4 (#1980)

As per title. cc @Narsil

Purely refactors paged/attention into `layers/attention` and makes hardware differences more obvious with one file per hardware backend. (#1986)

Fixing exl2 scratch buffer. (#1990)

single char ` addition for docs (#1989)

I think this will keep the docs from being weirdly formatted: all the sections after MAX_TOP_N_TOKENS currently don't show up in the bar on the right (https://huggingface.co/docs/text-generation-inference/basic_tutorials/launcher#maxtopntokens). @merveenoyan

Co-authored-by: Nicolas Patry

Fixing Phi3.
fix the lora-id parameter in the benchmark
Update README.md
add placeholder for flashinfer phi modeling (#24)
integrate lora into mistral
Update Makefile to include punica kernels
Integrate qwen2
Fix minor typos
testing llama-3-70b-gptq
add lora functions to python client; test llama-3-70b AWQ
Add qwen2 1.8b and 72b base inference
Support Flashinfer based Phi2 and Phi3 models (#26)
* add phi model
* fix phi integration errors
* padding for phi
* fix modeling for phi
* workarounds for phi
* use flash attn's position rotary embedding
* support phi3 and baichuan
* fix position encoding
* clean up
Refactor the Flashinfer models (#27)
* refactor the flashinfer models
* fixes
Introduce the flashinfer attention wrapper abstraction and use it for Llama and Gemma models (#28)
* abstract the attention layer
* fix the bugs
Compliant with pre-commit configs
kv.run test workflow
kv.run test workflows (#29)
* python 3.10
* python 3.10.14
* update doc
* dispatch
Kv.run test workflows (#30)
* python 3.10
* python 3.10.14
* update doc
* dispatch
* update python workflow
Llama rewrite (#31)
* write llama in tgi style
* fixes
* fix the runtime issues
reformat the llama files (#32)
Decouple flashinfer code paths from flash attention library dependencies (#33)
* decouple flash attn dependency from flashinfer code paths
* follow up
critical output bug (#25)
* output debug
* update minor
minor typo bug fix in layers/__init__.py
fix dtype bugs in flashinfer model def
minor fixes and rename tests.xml
test docker (#34)
docker build workflow; remove submodules (#35)
* test docker
* docker
* remove submodule
* updates
remove tgi build workflow
docker workflow
build workflow update
fix in workflow
dependency and rust toolchain fix
fix warm up issue
finalize docker build workflow
minor router-server fix
minor fixes
update to rust 1.79
minor fix in output example
flash attn rotary fix
rotary bug fixes
fix the flashinfer adapter
fix phi2 and phi3 modeling
fix lint
revert test file
adjust the flashinfer llama model to accommodate baichuan
decouple flashinfer files from flash attention (#41)
Fix the server CLI issue with use_flashinfer flag (#42)
* fix refactor
* empty
* fix lint
update FlashinferAttentionWrapper to flashinfer 0.0.6
---
 Makefile | 2 +
 server/examples/test_local_api.py | 56 +--
 server/examples/test_local_grpc.py | 43 +-
 server/text_generation_server/cache.py | 3 -
 server/text_generation_server/cli.py | 36 +-
 .../layers/flashinfer_attention.py | 16 +-
 .../flashinfer_llama_modeling.py | 19 +-
 .../flashinfer_mistral_modeling.py | 199 ++++++--
 .../flashinfer_qwen2_modeling.py | 294 +++++++++---
 .../models_flashinfer/flashinfer_causal_lm.py | 445
+++++++++--------- .../models_flashinfer/flashinfer_qwen2.py | 4 +- .../server_flashinfer.py | 66 ++- .../utils/cache_manager_flashinfer.py | 113 +++-- .../utils/lora_utils.py | 4 +- 14 files changed, 822 insertions(+), 478 deletions(-) diff --git a/Makefile b/Makefile index ccf99e0a..ca5d752e 100644 --- a/Makefile +++ b/Makefile @@ -1,5 +1,7 @@ install-punica-kernel: pip install wheel setuptools --upgrade + git submodule sync + git submodule update --init cd server/punica_kernels && pip install -v --no-build-isolation . install-server: diff --git a/server/examples/test_local_api.py b/server/examples/test_local_api.py index 87606efc..87fad7c7 100644 --- a/server/examples/test_local_api.py +++ b/server/examples/test_local_api.py @@ -2,10 +2,6 @@ import torch from text_generation_server.models_flashinfer.flashinfer_llama import FlashinferLlama from text_generation_server.models_flashinfer.flashinfer_gemma import FlashinferGemma -from text_generation_server.models_flashinfer.flashinfer_qwen2 import FlashinferQwen2 -from text_generation_server.models_flashinfer.flashinfer_chatglm import ( - FlashinferChatGLM, -) import sys try: @@ -31,13 +27,11 @@ # test = "gemma" # test = "llama-3" # test = 'llama-3-70' - test = "gemma" + test = "llama-2" # test = 'mistral' - # test = 'qwen1.5-7' - # test = 'qwen1.5-1.8' - # test = 'qwen1.5-70' - # test = 'qwen2-7' - # test = "chatglm4" + # test = 'qwen2' + # test = 'qwen2-1.8' + # test = 'qwen2-70' print("Testing " + test) # Load demo inputs @@ -167,7 +161,7 @@ def make_input(lora_id, lora_or_base, id=0, promptOverride=None): ), ] service = FlashinferMistral(model_id="mistralai/Mistral-7B-v0.3") -elif test == "qwen1.5-7": +elif test == "qwen2": requests = [ make_input( "REILX/Qwen1.5-7B-Chat-750Mb-lora", @@ -186,7 +180,7 @@ def make_input(lora_id, lora_or_base, id=0, promptOverride=None): service = FlashinferQwen2( model_id="Qwen/Qwen1.5-7B-Chat", lora_ids=["REILX/Qwen1.5-7B-Chat-750Mb-lora"] ) -elif test == "qwen1.5-1.8": +elif test == "qwen2-1.8": # Todo: Add qwen1.5 1.8b chat lora adapter / Output Repetition Problem requests = [ make_input( @@ -200,7 +194,7 @@ def make_input(lora_id, lora_or_base, id=0, promptOverride=None): service = FlashinferQwen2( model_id="Qwen/Qwen1.5-1.8B-Chat", lora_ids=["REILX/Qwen1.5-7B-Chat-750Mb-lora"] ) -elif test == "qwen1.5-70": +elif test == "qwen2-70": # Todo: Add qwen1.5 72b chat lora adapter requests = [ make_input( @@ -266,45 +260,23 @@ def make_input(lora_id, lora_or_base, id=0, promptOverride=None): service = FlashinferLlama( model_id="baichuan-inc/Baichuan2-7B-Chat", trust_remote_code=True ) -elif test == "qwen2-7": - # Todo: qwen2-7b instruct lora adapter - requests = [ - make_input( - "abcdabcd987/gsm8k-llama2-7b-lora-16", - "base", - id=0, - promptOverride="ē»™ęˆ‘č®²äøŖꕅäŗ‹", - ), - ] - service = FlashinferQwen2(model_id="Qwen/Qwen2-7B-Instruct", trust_remote_code=True) - -elif test == "chatglm4": - # Todo: chatglm4-9b lora adapter - requests = [ - make_input( - "abcdabcd987/gsm8k-llama2-7b-lora-16", - "base", - id=0, - promptOverride="ē»™ęˆ‘č®²äøŖꕅäŗ‹", - ), - ] - service = FlashinferChatGLM(model_id="THUDM/glm-4-9b-chat", trust_remote_code=True) print(service.get_lora_adapters()) tokenizer = service.tokenizer batch = generate_pb2.Batch(id=0, requests=requests, size=len(requests)) +pb_batch = FlashinferBatch.from_pb( + batch, tokenizer, torch.float16, torch.device("cuda") +) + +# Add input batch to model service +ids = service.add_request(pb_batch) display_results = {} # Iterative generation: each step generates 
a token for each input in the batch isPrefill = True while True: - if isPrefill: - generations, next_batch, _ = service.prefill_batch(batch) - isPrefill = False - else: - generations, next_batch, _, _ = service.decode_batch([next_batch.to_pb()]) - + generations, _, _ = service.generate_token(FlashinferBatch.Empty(batch.id)) for gen in generations: if gen.prefill_tokens: display_results[gen.request_id] = [ diff --git a/server/examples/test_local_grpc.py b/server/examples/test_local_grpc.py index 8b92865c..10ac1e59 100644 --- a/server/examples/test_local_grpc.py +++ b/server/examples/test_local_grpc.py @@ -46,8 +46,18 @@ def make_input(lora_id, lora_or_base, id=0, promptOverride=None): requests = [ - make_input("tjluyao/gemma-2b-it-math", "base", id=0), - make_input("tjluyao/gemma-2b-it-math", "base", id=1), + make_input( + "abcdabcd987/gsm8k-llama2-7b-lora-16", + "base", + id=0, + promptOverride="Give me a breif introduction to Byznatine Fault Tolerance and why it is important?", + ), + make_input( + "abcdabcd987/gsm8k-llama2-7b-lora-16", + "lora", + id=1, + promptOverride="Which network interface card is more suitable for distributed systems, Meallanox or Broadcom?", + ), ] # Assemble input batch @@ -68,26 +78,11 @@ def make_input(lora_id, lora_or_base, id=0, promptOverride=None): ) stub.Warmup(wr) # Prefill - pr = generate_pb2.PrefillRequest(batch=pb_batch_with_inputs) + pr = generate_pb2.PrefillRequest(batch=pb_batch_empty) resp = stub.Prefill(pr) - generations, cbatch = resp.generations, resp.batch - for gen in generations: - print(gen.tokens.texts) - - print("finished prefill tokens") - - while True: - dr = generate_pb2.DecodeRequest(batches=[cbatch]) - resp = stub.Decode(dr) - generations, cbatch = resp.generations, resp.batch - toExit = False - for gen in generations: - if gen.generated_text.text: - print("finished") - res = gen.generated_text.text - toExit = True - - if toExit: - break - - print(res) + gen, cbatch = resp.generations, resp.batch + # Decode + dr = generate_pb2.DecodeRequest(batches=[cbatch]) + resp = stub.Decode(dr) + gen, cbatch = resp.generations, resp.batch + print("done") diff --git a/server/text_generation_server/cache.py b/server/text_generation_server/cache.py index 117f8499..4504733e 100644 --- a/server/text_generation_server/cache.py +++ b/server/text_generation_server/cache.py @@ -11,9 +11,6 @@ class Cache: def __init__(self): self.cache: Dict[int, B] = {} - def get_all_values(self): - return self.cache.values() - def pop(self, batch_id: int) -> Optional[B]: return self.cache.pop(batch_id, None) diff --git a/server/text_generation_server/cli.py b/server/text_generation_server/cli.py index 30bf479f..8406946b 100644 --- a/server/text_generation_server/cli.py +++ b/server/text_generation_server/cli.py @@ -93,20 +93,30 @@ def serve( if use_flashinfer: from text_generation_server import server_flashinfer - serv = server_flashinfer + server_flashinfer.serve( + model_id, + revision, + sharded, + quantize, + speculate, + dtype, + trust_remote_code, + uds_path, + lora_ids, + ) else: - serv = server - serv.serve( - model_id, - revision, - sharded, - quantize, - speculate, - dtype, - trust_remote_code, - uds_path, - lora_ids, - ) + server.serve( + model_id, + revision, + sharded, + quantize, + speculate, + dtype, + trust_remote_code, + uds_path, + lora_ids, + ) + @app.command() diff --git a/server/text_generation_server/layers/flashinfer_attention.py b/server/text_generation_server/layers/flashinfer_attention.py index 73245680..a13ddfdd 100644 --- 
a/server/text_generation_server/layers/flashinfer_attention.py +++ b/server/text_generation_server/layers/flashinfer_attention.py @@ -43,8 +43,6 @@ def __init__( ) self.page_size = 16 - self.group_size = self.num_attention_heads // self.num_key_value_heads - def computeAttention( self, q: torch.Tensor, @@ -184,17 +182,9 @@ def _batchDecode( decodeBatchPosition.kv_last_page_len, ) - if self.group_size in [7, 16]: - decode_wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper( - workspace_buffer=self._workspace_buffer, - kv_layout="NHD", - use_tensor_cores=True, - ) - else: - decode_wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper( - workspace_buffer=self._workspace_buffer, kv_layout="NHD" - ) - + decode_wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper( + workspace_buffer=self._workspace_buffer, kv_layout="NHD" + ) decode_wrapper.begin_forward( decodeBatchPosition.kv_page_indptr, decodeBatchPosition.kv_page_indices, diff --git a/server/text_generation_server/models_flashinfer/custom_modeling/flashinfer_llama_modeling.py b/server/text_generation_server/models_flashinfer/custom_modeling/flashinfer_llama_modeling.py index bbd3e6d5..270f4b46 100644 --- a/server/text_generation_server/models_flashinfer/custom_modeling/flashinfer_llama_modeling.py +++ b/server/text_generation_server/models_flashinfer/custom_modeling/flashinfer_llama_modeling.py @@ -123,8 +123,7 @@ def forward( q = q_proj.contiguous() k = k_proj.contiguous() v = v_proj.contiguous() - if loraWeight: - loraWeight.apply_lora_weight_kvq(q, k, v, hidden_states, self.layer_idx) + loraWeight.apply_lora_weight_kvq(q, k, v, hidden_states, self.layer_idx) self.rotary_emb( q.view( @@ -152,10 +151,9 @@ def forward( self.rotaryParams, ) attn_outputs = self.o_proj(attn_outputs_raw) - if loraWeight: - loraWeight.apply_lora_weight_attn( - attn_outputs, attn_outputs_raw, self.layer_idx - ) + loraWeight.apply_lora_weight_attn( + attn_outputs, attn_outputs_raw, self.layer_idx + ) return attn_outputs @@ -208,16 +206,13 @@ def forward( gate_up_states = self.gate_up_proj(hidden_states) gate_up_states = gate_up_states.view(-1, 2, self.intermediate_size) gate = gate_up_states[:, 0].contiguous() - if loraWeight: - loraWeight.apply_lora_weight_gate(gate, hidden_states, self.layer_idx) + loraWeight.apply_lora_weight_gate(gate, hidden_states, self.layer_idx) gate = self.act(gate) up = gate_up_states[:, 1].contiguous() - if loraWeight: - loraWeight.apply_lora_weight_up(up, hidden_states, self.layer_idx) + loraWeight.apply_lora_weight_up(up, hidden_states, self.layer_idx) t = gate * up down = self.down_proj(t) - if loraWeight: - loraWeight.apply_lora_weight_down(down, t, self.layer_idx) + loraWeight.apply_lora_weight_down(down, t, self.layer_idx) return down diff --git a/server/text_generation_server/models_flashinfer/custom_modeling/flashinfer_mistral_modeling.py b/server/text_generation_server/models_flashinfer/custom_modeling/flashinfer_mistral_modeling.py index 3057d32b..7b8c54c7 100644 --- a/server/text_generation_server/models_flashinfer/custom_modeling/flashinfer_mistral_modeling.py +++ b/server/text_generation_server/models_flashinfer/custom_modeling/flashinfer_mistral_modeling.py @@ -47,11 +47,6 @@ from text_generation_server.layers.rotary import PositionRotaryEmbedding from text_generation_server.layers.layernorm import FastRMSNorm -from text_generation_server.layers.flashinfer_attention import ( - FlashinferAttentionWrapper, - AttentionRotaryParams, -) - class FlashinferBatch: def __init__(self, seq_indptr, kv_page_indptr, 
kv_page_indices, kv_last_page_len): @@ -231,38 +226,177 @@ def forward( kvCachePool: KvCachePool, prefillBatchPosition: KvCacheBatchPosition, decodeBatchPosition: KvCacheBatchPosition, - loraWeight: BatchedModelLoraWeight | None, + lora: BatchedModelLoraWeight | None, ) -> torch.Tensor: - q_dim = ( - self.flashinferWrapper.num_attention_heads * self.flashinferWrapper.head_dim - ) - kv_dim = ( - self.flashinferWrapper.num_key_value_heads * self.flashinferWrapper.head_dim - ) - qkv = self.qkv_proj(hidden_states) + qkv = self.query_key_value(hidden_states) + + # qkv = qkv.to('cuda') + q_proj, k_proj, v_proj = qkv.split( - [q_dim, kv_dim, kv_dim], + [ + self.head_size * self.num_heads, + self.head_size * self.num_key_value_heads, + self.head_size * self.num_key_value_heads, + ], dim=1, ) - q = q_proj.contiguous() - k = k_proj.contiguous() - v = v_proj.contiguous() - loraWeight.apply_lora_weight_kvq(q, k, v, hidden_states, self.layer_idx) - attn_outputs_raw = self.flashinferWrapper.computeAttention( - q, - k, - v, - kvCachePool.cache_data[self.layer_idx], - kvCachePool.page_len, - prefillBatchPosition, - decodeBatchPosition, - self.rotaryParams, - ) - attn_outputs = self.o_proj(attn_outputs_raw) - loraWeight.apply_lora_weight_attn( - attn_outputs, attn_outputs_raw, self.layer_idx + + q_proj = q_proj.contiguous() + k_proj = k_proj.contiguous() + v_proj = v_proj.contiguous() + + # print(f"q proj {q_proj}") + # print(f"lora rank: {lora.rank}") + + if lora: + add_lora( + q_proj, + hidden_states, + lora.q.wa_ptr, + lora.q.wb_ptr, + lora.segment, + self.layer_idx, + lora.rank, + ) + add_lora( + k_proj, + hidden_states, + lora.k.wa_ptr, + lora.k.wb_ptr, + lora.segment, + self.layer_idx, + lora.rank, + ) + add_lora( + v_proj, + hidden_states, + lora.v.wa_ptr, + lora.v.wb_ptr, + lora.segment, + self.layer_idx, + lora.rank, + ) + + stack_attn_output = [] + workspace_buffer = torch.empty( + 32 * 1024 * 1024, dtype=torch.int8, device=kvCachePool.device ) - return attn_outputs + prefillTotalSeqLen = prefillBatchPosition.total_seq_len + if prefillTotalSeqLen > 0: + q = ( + q_proj[:prefillTotalSeqLen] + .view(prefillTotalSeqLen, self.num_heads, self.head_size) + .contiguous() + ) + k = ( + k_proj[:prefillTotalSeqLen] + .view(prefillTotalSeqLen, self.num_key_value_heads, self.head_size) + .contiguous() + ) + v = ( + v_proj[:prefillTotalSeqLen] + .view(prefillTotalSeqLen, self.num_key_value_heads, self.head_size) + .contiguous() + ) + + seq_indptr = prefillBatchPosition.seq_indptr.clone() + kv_page_indices = prefillBatchPosition.kv_page_indices.clone() + kv_page_indptr = prefillBatchPosition.kv_page_indptr.clone() + kv_last_page_len = prefillBatchPosition.kv_last_page_len.clone() + + flashinfer.append_paged_kv_cache( + k, + v, + seq_indptr, + kvCachePool.cache_data[self.layer_idx], + kv_page_indices, + kv_page_indptr, + kv_last_page_len, + ) + + prefill_wrapper = flashinfer.BatchPrefillWithPagedKVCacheWrapper( + workspace_buffer, "NHD" + ) + + prefill_wrapper.begin_forward( + seq_indptr, + kv_page_indptr, + kv_page_indices, + kv_last_page_len, + self.num_heads, + self.num_key_value_heads, + self.head_size, + ) + + attn_output_prefill = prefill_wrapper.forward( + q, + kvCachePool.cache_data[self.layer_idx], + causal=True, + pos_encoding_mode="ROPE_LLAMA", + ).view(prefillTotalSeqLen, self.hidden_size) + + prefill_wrapper.end_forward() + stack_attn_output.append(attn_output_prefill) + + decodeTotalSeqLen = decodeBatchPosition.total_seq_len + if decodeTotalSeqLen > 0: + q = ( + q_proj[prefillTotalSeqLen:] + 
.view(decodeTotalSeqLen, self.num_heads, self.head_size) + .contiguous() + ) + k = ( + k_proj[prefillTotalSeqLen:] + .view(decodeTotalSeqLen, self.num_key_value_heads, self.head_size) + .contiguous() + ) + v = ( + v_proj[prefillTotalSeqLen:] + .view(decodeTotalSeqLen, self.num_key_value_heads, self.head_size) + .contiguous() + ) + + flashinfer.append_paged_kv_cache( + k, + v, + decodeBatchPosition.seq_indptr, + kvCachePool.cache_data[self.layer_idx], + decodeBatchPosition.kv_page_indices, + decodeBatchPosition.kv_page_indptr, + decodeBatchPosition.kv_last_page_len, + ) + + decode_wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper( + workspace_buffer, "NHD" + ) + + decode_wrapper.begin_forward( + decodeBatchPosition.kv_page_indptr, + decodeBatchPosition.kv_page_indices, + decodeBatchPosition.kv_last_page_len, + self.num_heads, + self.num_key_value_heads, + self.head_size, + kvCachePool.page_len, + pos_encoding_mode="ROPE_LLAMA", + ) + + attn_output_decode = decode_wrapper.forward( + q, + kvCachePool.cache_data[self.layer_idx], + pos_encoding_mode="ROPE_LLAMA", + ).view(decodeTotalSeqLen, self.hidden_size) + + decode_wrapper.end_forward() + stack_attn_output.append(attn_output_decode) + + if len(stack_attn_output) == 1: + attn_output = stack_attn_output[0] + else: + attn_output = torch.cat(stack_attn_output, dim=0) + + o = self.o_proj(attn_output) + return o class MistralMLP(nn.Module): @@ -347,7 +481,6 @@ def __init__(self, flashinferWrapper: FlashinferAttentionWrapper, layer_id, conf prefix = f"model.layers.{layer_id}" self.self_attn = MistralAttention( prefix=f"{prefix}.self_attn", - flashinferWrapper=flashinferWrapper, config=config, weights=weights, layer_idx=layer_id, diff --git a/server/text_generation_server/models_flashinfer/custom_modeling/flashinfer_qwen2_modeling.py b/server/text_generation_server/models_flashinfer/custom_modeling/flashinfer_qwen2_modeling.py index 9c5a85b6..11e33dc3 100644 --- a/server/text_generation_server/models_flashinfer/custom_modeling/flashinfer_qwen2_modeling.py +++ b/server/text_generation_server/models_flashinfer/custom_modeling/flashinfer_qwen2_modeling.py @@ -32,11 +32,9 @@ Qwen2Config, PreTrainedModel, ) -from text_generation_server.layers.flashinfer_attention import ( - FlashinferAttentionWrapper, - AttentionRotaryParams, -) + from punica_kernels import ( + add_lora_sgmv_custom_cutlass as add_lora, rms_norm, ) @@ -236,17 +234,44 @@ def __init__(self, prefix, config, weights, layer_idx: int): config.intermediate_size // weights.process_group.size() ) - def forward(self, hidden_states, loraWeight: BatchedModelLoraWeight | None): + def forward(self, hidden_states, lora: BatchedModelLoraWeight | None): gate_up_states = self.gate_up_proj(hidden_states) gate_up_states = gate_up_states.view(-1, 2, self.intermediate_size) gate = gate_up_states[:, 0].contiguous() - loraWeight.apply_lora_weight_gate(gate, hidden_states, self.layer_idx) + if lora: + add_lora( + gate, + hidden_states, + lora.gate.wa_ptr, + lora.gate.wb_ptr, + lora.segment, + self.layer_idx, + lora.rank, + ) gate = self.act(gate) up = gate_up_states[:, 1].contiguous() - loraWeight.apply_lora_weight_up(up, hidden_states, self.layer_idx) + if lora: + add_lora( + up, + hidden_states, + lora.up.wa_ptr, + lora.up.wb_ptr, + lora.segment, + self.layer_idx, + lora.rank, + ) t = gate * up down = self.down_proj(t) - loraWeight.apply_lora_weight_down(down, t, self.layer_idx) + if lora: + add_lora( + down, + hidden_states, + lora.down.wa_ptr, + lora.down.wb_ptr, + lora.segment, + self.layer_idx, + 
lora.rank, + ) return down @@ -299,21 +324,19 @@ def _load_gqa(config, prefix: str, weights): class FlashQwen2Attention(nn.Module): - def __init__( - self, - prefix: str, - flashinferWrapper: FlashinferAttentionWrapper, - config: Qwen2Config, - weights, - layer_idx: int - ): + def __init__(self, prefix: str, config: Qwen2Config, weights, layer_idx: int): super().__init__() - - self.flashinferWrapper = flashinferWrapper - self.rotaryParams = AttentionRotaryParams( - rope_scale=None, rope_theta=config.rope_theta - ) - + self.num_heads = config.num_attention_heads + if self.num_heads % weights.process_group.size() != 0: + raise ValueError( + f"`num_heads` must be divisible by `num_shards` (got `num_heads`: {self.num_heads} " + f"and `num_shards`: {weights.process_group.size()}" + ) + self.num_qo_heads = self.num_heads // weights.process_group.size() + self.num_kv_heads = config.num_key_value_heads // weights.process_group.size() + self.config = config + self.hidden_size = config.hidden_size + self.head_dim = self.hidden_size // self.num_heads self.layer_idx = layer_idx self.qkv_proj = load_attention(config, prefix, weights) self.o_proj = TensorParallelRowLinear.load( @@ -323,6 +346,8 @@ def __init__( bias=False, ) + self.num_key_value_heads = config.num_key_value_heads + self.num_key_value_groups = self.num_heads // self.num_key_value_heads def forward( self, @@ -330,54 +355,181 @@ def forward( kvCachePool: KvCachePool, prefillBatchPosition: KvCacheBatchPosition, decodeBatchPosition: KvCacheBatchPosition, - loraWeight: BatchedModelLoraWeight | None, + lora: BatchedModelLoraWeight | None, ): - q_dim = ( - self.flashinferWrapper.num_attention_heads * self.flashinferWrapper.head_dim - ) - kv_dim = ( - self.flashinferWrapper.num_key_value_heads * self.flashinferWrapper.head_dim - ) qkv = self.qkv_proj(hidden_states) + q_proj, k_proj, v_proj = qkv.split( - [q_dim, kv_dim, kv_dim], + [ + self.head_dim * self.num_qo_heads, + self.head_dim * self.num_kv_heads, + self.head_dim * self.num_kv_heads, + ], dim=1, ) - q = q_proj.contiguous() - k = k_proj.contiguous() - v = v_proj.contiguous() - - loraWeight.apply_lora_weight_kvq(q, k, v, hidden_states, self.layer_idx) - - attn_outputs_raw = self.flashinferWrapper.computeAttention( - q, - k, - v, - kvCachePool.cache_data[self.layer_idx], - kvCachePool.page_len, - prefillBatchPosition, - decodeBatchPosition, - self.rotaryParams, - ) - attn_outputs = self.o_proj(attn_outputs_raw) - loraWeight.apply_lora_weight_attn( - attn_outputs, attn_outputs_raw, self.layer_idx + + q_proj = q_proj.contiguous() + k_proj = k_proj.contiguous() + v_proj = v_proj.contiguous() + + if lora: + add_lora( + q_proj, + hidden_states, + lora.q.wa_ptr, + lora.q.wb_ptr, + lora.segment, + self.layer_idx, + lora.rank, + ) + add_lora( + k_proj, + hidden_states, + lora.k.wa_ptr, + lora.k.wb_ptr, + lora.segment, + self.layer_idx, + lora.rank, + ) + add_lora( + v_proj, + hidden_states, + lora.v.wa_ptr, + lora.v.wb_ptr, + lora.segment, + self.layer_idx, + lora.rank, + ) + + stack_attn_output = [] + workspace_buffer = torch.empty( + 32 * 1024 * 1024, dtype=torch.int8, device=kvCachePool.device ) - return attn_outputs + prefillTotalSeqLen = prefillBatchPosition.total_seq_len + if prefillTotalSeqLen > 0: + # need to revisit if contiguous conversion is the best way + q = ( + q_proj[:prefillTotalSeqLen] + .view(prefillTotalSeqLen, self.num_qo_heads, self.head_dim) + .contiguous() + ) + k = ( + k_proj[:prefillTotalSeqLen] + .view(prefillTotalSeqLen, self.num_kv_heads, self.head_dim) + .contiguous() + 
) + v = ( + v_proj[:prefillTotalSeqLen] + .view(prefillTotalSeqLen, self.num_kv_heads, self.head_dim) + .contiguous() + ) + + seq_indptr = prefillBatchPosition.seq_indptr.clone() + kv_page_indices = prefillBatchPosition.kv_page_indices.clone() + kv_page_indptr = prefillBatchPosition.kv_page_indptr.clone() + kv_last_page_len = prefillBatchPosition.kv_last_page_len.clone() + + flashinfer.append_paged_kv_cache( + k, + v, + seq_indptr, + kvCachePool.cache_data[self.layer_idx], + kv_page_indices, + kv_page_indptr, + kv_last_page_len, + ) + + prefill_wrapper = flashinfer.BatchPrefillWithPagedKVCacheWrapper( + workspace_buffer, "NHD" + ) + + prefill_wrapper.begin_forward( + seq_indptr, + kv_page_indptr, + kv_page_indices, + kv_last_page_len, + self.num_qo_heads, + self.num_kv_heads, + self.head_dim, + ) + + attn_output_prefill = prefill_wrapper.forward( + q, + kvCachePool.cache_data[self.layer_idx], + causal=True, + pos_encoding_mode="ROPE_LLAMA", # this may need change + ).view(prefillTotalSeqLen, self.hidden_size) + prefill_wrapper.end_forward() + stack_attn_output.append(attn_output_prefill) + + decodeTotalSeqLen = decodeBatchPosition.total_seq_len + if decodeTotalSeqLen > 0: + q = ( + q_proj[prefillTotalSeqLen:] + .view(decodeTotalSeqLen, self.num_qo_heads, self.head_dim) + .contiguous() + ) + k = ( + k_proj[prefillTotalSeqLen:] + .view(decodeTotalSeqLen, self.num_kv_heads, self.head_dim) + .contiguous() + ) + v = ( + v_proj[prefillTotalSeqLen:] + .view(decodeTotalSeqLen, self.num_kv_heads, self.head_dim) + .contiguous() + ) + + flashinfer.append_paged_kv_cache( + k, + v, + decodeBatchPosition.seq_indptr, + kvCachePool.cache_data[self.layer_idx], + decodeBatchPosition.kv_page_indices, + decodeBatchPosition.kv_page_indptr, + decodeBatchPosition.kv_last_page_len, + ) + + decode_wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper( + workspace_buffer, "NHD" + ) + decode_wrapper.begin_forward( + decodeBatchPosition.kv_page_indptr, + decodeBatchPosition.kv_page_indices, + decodeBatchPosition.kv_last_page_len, + self.num_qo_heads, + self.num_kv_heads, + self.head_dim, + kvCachePool.page_len, + pos_encoding_mode="ROPE_LLAMA", + ) + + attn_output_decode = decode_wrapper.forward( + q, + kvCachePool.cache_data[self.layer_idx], + pos_encoding_mode="ROPE_LLAMA", + ).view(decodeTotalSeqLen, self.hidden_size) + + decode_wrapper.end_forward() + stack_attn_output.append(attn_output_decode) + + if len(stack_attn_output) == 1: + attn_outputs = stack_attn_output[0] + else: + attn_outputs = torch.cat(stack_attn_output, dim=0) + + # output projection + o = self.o_proj(attn_outputs) + return o class FlashQwen2Layer(nn.Module): - def __init__( - self, - flashinferWrapper: FlashinferAttentionWrapper, - layer_id, config, weights - ): + def __init__(self, layer_id, config, weights): super().__init__() self.layer_id = layer_id prefix = f"model.layers.{layer_id}" self.self_attn = FlashQwen2Attention( prefix=f"{prefix}.self_attn", - flashinferWrapper=flashinferWrapper, config=config, weights=weights, layer_idx=layer_id, @@ -402,7 +554,7 @@ def forward( kvCachePool: KvCachePool, prefillBatchPosition: KvCacheBatchPosition, decodeBatchPosition: KvCacheBatchPosition, - loraWeight: BatchedModelLoraWeight | None, + lora: BatchedModelLoraWeight | None, ): normed_hidden_states, res = self.input_layernorm(hidden_states, residual) @@ -411,14 +563,14 @@ def forward( kvCachePool, prefillBatchPosition, decodeBatchPosition, - loraWeight, + lora, ) normed_attn_res_output, attn_res = self.post_attention_layernorm( attn_output, res ) 
- mlp_output = self.mlp(normed_attn_res_output, loraWeight) + mlp_output = self.mlp(normed_attn_res_output, lora) return mlp_output, attn_res @@ -435,19 +587,10 @@ def __init__(self, config, weights): prefix="model.embed_tokens", weights=weights ) # self.embed_tokens.weight *= embed_norm - assert config.num_attention_heads % weights.process_group.size() == 0 - assert config.num_key_value_heads % weights.process_group.size() == 0 - num_attention_heads = config.num_attention_heads // weights.process_group.size() - num_key_value_heads = config.num_key_value_heads // weights.process_group.size() - - flashinferWrapper = FlashinferAttentionWrapper( - num_attention_heads, num_key_value_heads, config.hidden_size - ) self.layers = nn.ModuleList( [ FlashQwen2Layer( - flashinferWrapper, layer_id, config, weights, @@ -461,6 +604,9 @@ def __init__(self, config, weights): self.gradient_checkpointing = False + self.head_size = self.layers[0].self_attn.head_dim + self.num_heads = self.layers[0].self_attn.num_qo_heads + self.num_key_value_heads = self.layers[0].self_attn.num_kv_heads def forward( self, @@ -468,7 +614,7 @@ def forward( kvCachePool: KvCachePool, prefillBatchPosition: KvCacheBatchPosition, decodeBatchPosition: KvCacheBatchPosition, - loraWeight: BatchedModelLoraWeight, + lora: BatchedModelLoraWeight | None, ) -> torch.Tensor: hidden_states = self.embed_tokens(input_ids) residual = None @@ -479,7 +625,7 @@ def forward( kvCachePool, prefillBatchPosition, decodeBatchPosition, - loraWeight, + lora, ) hidden_states, _ = self.norm(hidden_states, residual) @@ -504,14 +650,10 @@ def forward( kvCachePool: KvCachePool, prefillBatchPosition: KvCacheBatchPosition, decodeBatchPosition: KvCacheBatchPosition, - loraWeight: BatchedModelLoraWeight, + lora: BatchedModelLoraWeight | None, ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]: hidden_states = self.model( - input_ids, - kvCachePool, - prefillBatchPosition, - decodeBatchPosition, - loraWeight + input_ids, kvCachePool, prefillBatchPosition, decodeBatchPosition, lora ) logits, speculative_logits = self.lm_head(hidden_states) return logits, speculative_logits diff --git a/server/text_generation_server/models_flashinfer/flashinfer_causal_lm.py b/server/text_generation_server/models_flashinfer/flashinfer_causal_lm.py index 93eeeff7..83be3de2 100644 --- a/server/text_generation_server/models_flashinfer/flashinfer_causal_lm.py +++ b/server/text_generation_server/models_flashinfer/flashinfer_causal_lm.py @@ -1,14 +1,14 @@ import torch import torch.distributed -from typing import Any, Optional +from typing import Any, TypedDict, Optional from text_generation_server.utils.lora_utils import ModelLoraManager, ModelConfigForLora from text_generation_server.utils.cache_manager_flashinfer import ( - getKvCacheBatchPosition, - KvCacheBatchPosition, + ModelKvCache, KvCachePool, - RequestKvCache, ) from text_generation_server.utils.tokens import ( + StopSequenceCriteria, + StoppingCriteria, FinishReason, ) from text_generation_server.layers.flashinfer_attention import find_padded_head_dim @@ -20,24 +20,119 @@ from opentelemetry import trace from typing import Optional, Tuple, List, Type, Dict from text_generation_server.models import Model +from text_generation_server.models.causal_lm import CausalLMBatch from text_generation_server.models.types import ( + Batch, Tokens, Generation, GeneratedText, ) +from text_generation_server.utils import ( + NextTokenChooser, + StoppingCriteria, +) from text_generation_server.utils.dist import MEMORY_FRACTION from dataclasses import 
dataclass -from collections.abc import Iterable -from text_generation_server.cache import Cache tracer = trace.get_tracer(__name__) +class TextGenerationChunk(TypedDict): + index: int + token_id: int + text: str + is_stop: bool + + +@dataclass +class FlashinferBatch(CausalLMBatch): + @classmethod + def Empty(cls, batch_id): + return cls( + batch_id=batch_id, + requests=None, + prefix_offsets=None, + read_offsets=None, + next_token_choosers=None, + stopping_criterias=None, + top_n_tokens=None, + top_n_tokens_tensor=None, + input_ids=None, + requests_idx_mapping=None, + attention_mask=None, + position_ids=None, + past_key_values=None, + all_input_ids=None, + input_lengths=None, + max_input_length=None, + padding_right_offset=None, + max_tokens=None, + ) + + @classmethod + def from_pb( + cls, + pb: generate_pb2.Batch, + tokenizer: PreTrainedTokenizerBase = None, + dtype: torch.dtype = None, + device: torch.device = "cuda", + ) -> "CausalLMBatch": + input_ids = [] + next_token_choosers = [] + stopping_criterias = [] + top_n_tokens = [] + prefix_offsets = [] + read_offsets = [] + + # Parse batch + for i, r in enumerate(pb.requests): + prompt = r.inputs + + next_token_choosers.append( + NextTokenChooser.from_pb(r.parameters, device, tokenizer) + ) + stopping_criteria = StoppingCriteria.from_pb( + r.stopping_parameters, tokenizer + ) + stopping_criterias.append(stopping_criteria) + top_n_tokens.append(r.top_n_tokens) + tokenized_inputs = tokenizer.encode(prompt) + input_len = len(tokenized_inputs) + prefix_offsets.append(input_len - 5) + read_offsets.append(input_len) + input_ids.append(tokenized_inputs) + + top_n_tokens_tensor = torch.tensor( + top_n_tokens, device=device, dtype=torch.int64 + ) + + return cls( + batch_id=pb.id, + requests=pb.requests, + requests_idx_mapping=None, + input_ids=input_ids, + attention_mask=None, + position_ids=None, + past_key_values=None, + all_input_ids=None, + input_lengths=None, + prefix_offsets=prefix_offsets, + read_offsets=read_offsets, + next_token_choosers=next_token_choosers, + stopping_criterias=stopping_criterias, + top_n_tokens=top_n_tokens, + top_n_tokens_tensor=top_n_tokens_tensor, + max_input_length=None, + padding_right_offset=None, + max_tokens=None, + ) + + class RequestContext: def __init__( self, - request_id: str, input_ids: list[int], + lora_id: str, tokenizer, *, temperature: float, @@ -46,12 +141,8 @@ def __init__( top_k: int, maxlen: int, stop_token_id: int, - is_stopped: bool, - request_kv_cache: RequestKvCache, prefill_logprobs: bool = True, - lora_id: str = "empty", ): - self.request_id = request_id self.temperature = temperature self.repetition_penalty = repetition_penalty self.top_p = top_p @@ -81,9 +172,6 @@ def __init__( self.tokenizer = tokenizer self.prefix_offset = 0 self.read_offset = 0 - self.is_stopped = is_stopped - self.prefill_tokens: Optional[Tokens] = None - self.request_kv_cache = request_kv_cache def get_next_token_id(self, logits: torch.Tensor) -> int: if self.logits_processor: @@ -106,34 +194,15 @@ def get_next_token_id(self, logits: torch.Tensor) -> int: def append_token(self, token_id: int): self.output_ids.append(token_id) - def get_stop_reason(self) -> FinishReason: + def is_stop(self) -> FinishReason: if len(self.output_ids) - self.prompt_len >= self.maxlen: return FinishReason.FINISH_REASON_LENGTH if self.output_ids[-1] == self.stop_token_id: return FinishReason.FINISH_REASON_EOS_TOKEN return None - -@dataclass(frozen=True) -class FlashinferBatch: - batch_id: int - is_prefill: bool - request_contexts: 
List[RequestContext] - - def to_pb(self) -> generate_pb2.CachedBatch: - - max_input_length = max([r.prompt_len for r in self.request_contexts]) - max_decode_tokens = max([r.maxlen for r in self.request_contexts]) - max_tokens = len(self.request_contexts) * (max_input_length + max_decode_tokens) - - return generate_pb2.CachedBatch( - id=self.batch_id, - request_ids=[ - request_context.request_id for request_context in self.request_contexts - ], - size=len(self.request_contexts), - max_tokens=max_tokens, - ) + def is_prefill(self) -> bool: + return len(self.output_ids) == self.prompt_len class FlashinferLM(Model): @@ -144,12 +213,11 @@ def __init__( config: PretrainedConfig, dtype: torch.dtype, device: torch.device, - lora_ids: List[str], + lora_ids: List[str] = None, ): self.device = device self.dtype = dtype self.model_config = config - self.batch_cache = Cache() if ( torch.cuda.is_available() @@ -199,7 +267,7 @@ def __init__( f" Number of Pages to Allocate: {num_pages_to_allocate}" ) - self.kvCachePool = KvCachePool( + kvCachePool = KvCachePool( max_pages=num_pages_to_allocate, num_layers=self.model_config.num_hidden_layers, num_heads=self.model_config.num_key_value_heads, @@ -209,6 +277,7 @@ def __init__( device=device, ) + self.modelKvCache = ModelKvCache(kvCachePool) self.model_config_for_lora = ModelConfigForLora( num_hidden_layers=config.num_hidden_layers, hidden_size=config.hidden_size, @@ -220,8 +289,9 @@ def __init__( self.loraManager = ModelLoraManager(self.model_config_for_lora, dtype) if lora_ids: self.loraManager.set_lora_weights( - lora_ids, self.model_config_for_lora, dtype + lora_ids, self.model_config_for_lora or {}, dtype ) + self.reqctx: dict[int, RequestContext] = {} super(FlashinferLM, self).__init__( model=model, @@ -231,6 +301,13 @@ def __init__( device=device, ) + def _find_padded_head_dim(self, head_dim): + flashInferDimensions = [64, 128, 256] + for dim in flashInferDimensions: + if head_dim <= dim: + return dim + raise ValueError("The head dimension is too large for FlashInfer") + def load_lora_adapters(self, lora_ids: List[str]): self.loraManager.set_lora_weights( lora_ids, @@ -244,152 +321,115 @@ def remove_lora_adapters(self, lora_ids: list[str] = None): def get_lora_adapters(self): return list(self.loraManager.lora_weights_cpu) - def decode_batch( - self, cachedBatchesPb: Iterable[generate_pb2.CachedBatch] - ) -> Tuple[List[Generation], Optional[FlashinferBatch], Tuple[int, int], int]: - start_concat = time.time_ns() - batch = self._convertCachedBatch(cachedBatchesPb) - concat_ns = time.time_ns() - start_concat - generations, next_batch, timings = self.generate_token(batch) - if next_batch: - self.batch_cache.set(next_batch) - return generations, batch, timings, concat_ns - - def prefill_batch( - self, batchPb: generate_pb2.Batch - ) -> Tuple[List[Generation], Optional[FlashinferBatch], Tuple[int, int]]: - batch = self._convertPbBatch(batchPb) - generations, next_batch, timings = self.generate_token(batch) - if next_batch: - self.batch_cache.set(next_batch) - return generations, batch, timings - - def clear_cache(self): - all_batches: List[FlashinferBatch] = self.batch_cache.get_all_values() - for batch in all_batches: - for request_context in batch.request_contexts: - request_context.request_kv_cache.release() - - self.batch_cache.clear() + def has_request(self): + return len(self.reqctx) > 0 - def _find_padded_head_dim(self, head_dim): - flashInferDimensions = [64, 128, 256] - for dim in flashInferDimensions: - if head_dim <= dim: - return dim - raise 
ValueError("The head dimension is too large for FlashInfer") - - def _convertPbBatch(self, batchPb: generate_pb2.Batch) -> FlashinferBatch: - request_contexts = [] - - for request in batchPb.requests: - prompt = request.inputs - input_ids = self.tokenizer.encode(prompt) - parameters = request.parameters - request_context = RequestContext( - request.id, - input_ids, - self.tokenizer, - temperature=parameters.temperature, - repetition_penalty=parameters.repetition_penalty, - top_p=parameters.top_p, - top_k=parameters.top_k, - maxlen=min(request.stopping_parameters.max_new_tokens, 4096), - stop_token_id=self.tokenizer.eos_token_id, - is_stopped=False, - request_kv_cache=RequestKvCache( - self.kvCachePool, - self.kvCachePool.page_len, - len(input_ids), - ), - prefill_logprobs=request.prefill_logprobs, - lora_id=request.lora_id, - ) - - request_contexts.append(request_context) + @property + def batch_type(self) -> Type[FlashinferBatch]: + return FlashinferBatch - return FlashinferBatch( - batch_id=batchPb.id, is_prefill=True, request_contexts=request_contexts + def decode(self, generated_ids: List[int]) -> str: + return self.tokenizer.decode( + generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False ) - def _convertCachedBatch( - self, cachedBatchesPb: Iterable[generate_pb2.CachedBatch] - ) -> FlashinferBatch: - batches: List[FlashinferBatch] = [] - for batch_pb in cachedBatchesPb: - batch = self.batch_cache.pop(batch_pb.id) - if batch is None: - raise ValueError(f"Batch ID {batch_pb.id} not found in cache.") - batches.append(batch) - - if len(batches) == 0: - raise ValueError("All batches are empty") - - request_contexts_combined: List[RequestContext] = [] - for batch in batches: - request_contexts_combined.extend(batch.request_contexts) - - return FlashinferBatch( - batch_id=batches[0].batch_id, - is_prefill=False, - request_contexts=request_contexts_combined, - ) + def add_request(self, batch: FlashinferBatch): + ids = [] + for r in range(len(batch.requests)): + id = batch.requests[r].id + # Router sends initial request in each iteration + if id not in self.reqctx: + lora_id = batch.requests[r].lora_id or "empty" + input = batch.input_ids[r] + parameters = batch.requests[r].parameters + stop = batch.requests[r].stopping_parameters + prefill_logprobs = batch.requests[r].prefill_logprobs + + if lora_id not in self.loraManager.lora_weights_cpu: + raise ValueError("Cannot find lora weights", lora_id) + + self.reqctx[id] = RequestContext( + input, + lora_id, + self.tokenizer, + temperature=parameters.temperature, + repetition_penalty=parameters.repetition_penalty, + top_p=parameters.top_p, + top_k=parameters.top_k, + maxlen=min(stop.max_new_tokens, 4096), + stop_token_id=self.tokenizer.eos_token_id, + prefill_logprobs=prefill_logprobs, + ) + ids.append(id) + return ids - def batch_type(self): - return FlashinferBatch + def warmup(self, batch: FlashinferBatch): + pass @tracer.start_as_current_span("generate_token") + @torch.no_grad() def generate_token( self, batch: FlashinferBatch ) -> Tuple[List[Generation], Optional[FlashinferBatch], Tuple[int, int]]: start = time.time_ns() - input_ids, lora_ids, lora_lens = [], [], [] - request_kv_caches = [] - for request_context in batch.request_contexts: - if not request_context.is_stopped: - if batch.is_prefill: - input_ids.extend(request_context.output_ids) - else: - input_ids.append(request_context.output_ids[-1]) - request_kv_caches.append(request_context.request_kv_cache) - if not batch.is_prefill: - 
request_context.request_kv_cache.increment() - - if lora_ids and lora_ids[-1] == request_context.lora_id: - lora_lens[-1] += 1 - elif request_context.lora_id: - lora_ids.append(request_context.lora_id) - lora_lens.append(1) - - input_ids_tensor = torch.tensor( + + if hasattr(batch, "requests") and batch.requests: + ids = self.add_request(batch) + + if not self.reqctx: + return None, batch, (0, 0) + + reqs = sorted( + self.reqctx.items(), + key=lambda req: (not req[1].is_prefill(), req[1].lora_id), + ) + + input_ids = [] + lora_ids, lora_lens = [], [] + batchKvCache = self.modelKvCache.getOrCreate(batch.batch_id) + prefill_reqIds = [] + decode_reqIds = [] + + for requestId, req in reqs: + req.prefill = req.is_prefill() + if req.prefill: + input_ids.extend(req.output_ids) + prefill_reqIds.append(requestId) + batchKvCache.create(requestId, req.prompt_len) + else: + input_ids.append(req.output_ids[-1]) + decode_reqIds.append(requestId) + batchKvCache.get(requestId).increment() + if lora_ids and lora_ids[-1] == req.lora_id: + lora_lens[-1] += 1 + else: + lora_ids.append(req.lora_id) + lora_lens.append(1) + + input_ids = torch.tensor( input_ids, dtype=torch.long, device=self.device, ) - request_kv_caches_prefill = request_kv_caches if batch.is_prefill else [] - request_kv_caches_decode = [] if batch.is_prefill else request_kv_caches - prefillBatchPosition: KvCacheBatchPosition = getKvCacheBatchPosition( - request_kv_caches_prefill, isPrefill=True, device=self.device + prefillBatchPosition = batchKvCache.getKvCacheBatchPosition( + prefill_reqIds, isPrefill=True ) - decodeBatchPosition: KvCacheBatchPosition = getKvCacheBatchPosition( - request_kv_caches_decode, isPrefill=False, device=self.device + decodeBatchPosition = batchKvCache.getKvCacheBatchPosition( + decode_reqIds, isPrefill=False ) - loraWeights = ( - self.loraManager.get_lora_batched_weights(lora_ids, lora_lens) - if lora_ids - else None - ) + # Forward pass raw_logits, _ = self.model( - input_ids_tensor, - self.kvCachePool, + input_ids, + self.modelKvCache.kvCachePool, prefillBatchPosition, decodeBatchPosition, - loraWeights, + self.loraManager.get_lora_batched_weights(lora_ids, lora_lens), ) start_decode = time.time_ns() + prefill_logits = ( raw_logits[prefillBatchPosition.seq_indptr[1:] - 1] if prefillBatchPosition.total_seq_len > 0 @@ -400,82 +440,58 @@ def generate_token( all_stop = True generations: List[Generation] = [] - num_stopped_requests = 0 - for i, request_context in enumerate(batch.request_contexts): - if request_context.is_stopped: - num_stopped_requests += 1 - continue - next_token_id = request_context.get_next_token_id( - logits[i - num_stopped_requests].unsqueeze(0) - ) - request_context.append_token(next_token_id) + for i, (reqid, reqctx) in enumerate(reqs): + next_token_id = reqctx.get_next_token_id(logits[i].unsqueeze(0)) + reqctx.append_token(next_token_id) # text = reqctx.decode_tokens() # todo: ?? 
- # special handling for ChatGLM - if "ChatGLM" in str(type(self.model)): - text = self.tokenizer.decode( - [next_token_id], - clean_up_tokenization_spaces=False, - skip_special_tokens=False, - ) - else: - text = self.tokenizer.decode( - next_token_id, - clean_up_tokenization_spaces=False, - skip_special_tokens=False, - ) + text = self.tokenizer.decode( + next_token_id, + clean_up_tokenization_spaces=False, + skip_special_tokens=False, + ) - stop_reason = request_context.get_stop_reason() - if stop_reason != None: + is_stop = reqctx.is_stop() + if is_stop != None: output_text = self.tokenizer.decode( - request_context.output_ids[request_context.prompt_len :], + reqctx.output_ids[reqctx.prompt_len :], clean_up_tokenization_spaces=False, skip_special_tokens=False, ) generated_text = GeneratedText( output_text, - len(request_context.output_ids) - request_context.prompt_len + 1, - stop_reason, + len(reqctx.output_ids) - reqctx.prompt_len + 1, + is_stop, None, ) - request_context.is_stopped = True - request_context.request_kv_cache.release() + self.reqctx.pop(reqid) + batchKvCache.release(reqid) else: generated_text = None all_stop = False # Prefill - if batch.is_prefill: # and request_context.prefill_logprobs: + if reqctx.prefill: # and reqctx.prefill_logprobs: # Remove generated token to only have prefill and add nan for first prompt token prefill_logprobs = [] # todo - prefill_token_ids = request_context.output_ids[ - : request_context.prompt_len - ] - # special handling for ChatGLM - if "ChatGLM" in str(type(self.model)): - prefill_texts = self.tokenizer.batch_decode( - [prefill_token_ids], - clean_up_tokenization_spaces=False, - skip_special_tokens=False, - ) - else: - prefill_texts = self.tokenizer.batch_decode( - prefill_token_ids, - clean_up_tokenization_spaces=False, - skip_special_tokens=False, - ) - request_context.prefill_tokens = Tokens( + prefill_token_ids = reqctx.output_ids[: reqctx.prompt_len] + prefill_texts = self.tokenizer.batch_decode( + prefill_token_ids, + clean_up_tokenization_spaces=False, + skip_special_tokens=False, + ) + reqctx.prefill_tokens = Tokens( prefill_token_ids, prefill_logprobs, prefill_texts, is_special=[], ) - request_context.prefix_offset = request_context.prompt_len + reqctx.prefix_offset = reqctx.prompt_len else: - request_context.prefill_tokens = None + reqctx.prefill_tokens = None generation = Generation( - request_context.request_id, - request_context.prefill_tokens, + reqid, + reqctx.prefill_tokens, Tokens( [next_token_id], [0], # prob @@ -492,6 +508,5 @@ def generate_token( decode_ns = time.time_ns() - start_decode # The router stops generation only when batch=None if all_stop: - return generations, None, (forward_ns, decode_ns) - else: - return generations, batch, (forward_ns, decode_ns) + batch = None + return generations, batch, (forward_ns, decode_ns) diff --git a/server/text_generation_server/models_flashinfer/flashinfer_qwen2.py b/server/text_generation_server/models_flashinfer/flashinfer_qwen2.py index 42ab6be9..69c83271 100644 --- a/server/text_generation_server/models_flashinfer/flashinfer_qwen2.py +++ b/server/text_generation_server/models_flashinfer/flashinfer_qwen2.py @@ -2,18 +2,20 @@ import torch.distributed from typing import Optional, List -from transformers import AutoTokenizer, AutoConfig from text_generation_server.models_flashinfer.flashinfer_causal_lm import FlashinferLM from text_generation_server.models_flashinfer.custom_modeling.flashinfer_qwen2_modeling import ( Qwen2Config, FlashQwen2ForCausalLM, ) + from 
text_generation_server.utils import ( initialize_torch_distributed, weight_files, Weights, ) +from transformers import AutoTokenizer, AutoConfig + class FlashinferQwen2(FlashinferLM): def __init__( diff --git a/server/text_generation_server/server_flashinfer.py b/server/text_generation_server/server_flashinfer.py index 3a636e02..b8127639 100644 --- a/server/text_generation_server/server_flashinfer.py +++ b/server/text_generation_server/server_flashinfer.py @@ -13,8 +13,7 @@ from text_generation_server.cache import Cache from text_generation_server.interceptor import ExceptionInterceptor -from text_generation_server.models_flashinfer import get_model -from text_generation_server.models_flashinfer.flashinfer_causal_lm import FlashinferLM +from text_generation_server.models_flashinfer import Model, get_model from text_generation_server.pb import generate_pb2_grpc, generate_pb2 from text_generation_server.tracing import UDSOpenTelemetryAioServerInterceptor @@ -35,7 +34,7 @@ def exit_gracefully(self, signum, frame): class TextGenerationService(generate_pb2_grpc.TextGenerationServiceServicer): def __init__( self, - model: FlashinferLM, + model: Model, cache: Cache, quantize: Optional[str], server_urls: List[str], @@ -61,26 +60,40 @@ async def ServiceDiscovery(self, request, context): return generate_pb2.ServiceDiscoveryResponse(urls=self.server_urls) async def ClearCache(self, request, context): - self.model.clear_cache() + if request.HasField("id"): + self.cache.delete(request.id) + else: + self.cache.clear() return generate_pb2.ClearCacheResponse() - # async def FilterBatch(self, request, context): - # batch = self.cache.pop(request.batch_id) - # if batch is None: - # raise ValueError(f"Batch ID {request.batch_id} not found in cache.") - # filtered_batch = batch.filter(request.request_ids) - # self.cache.set(filtered_batch) + async def FilterBatch(self, request, context): + batch = self.cache.pop(request.batch_id) + if batch is None: + raise ValueError(f"Batch ID {request.batch_id} not found in cache.") + filtered_batch = batch.filter(request.request_ids) + self.cache.set(filtered_batch) - # return generate_pb2.FilterBatchResponse(batch=filtered_batch.to_pb()) + return generate_pb2.FilterBatchResponse(batch=filtered_batch.to_pb()) async def Warmup(self, request, context): + + batch = self.model.batch_type.from_pb( + request.batch, self.model.tokenizer, self.model.dtype, self.model.device + ) + max_supported_total_tokens = self.model.warmup(batch) + return generate_pb2.WarmupResponse( - max_supported_total_tokens=request.max_total_tokens + max_supported_total_tokens=max_supported_total_tokens ) async def Prefill(self, request, context): start = time.time_ns() - generations, next_batch, timings = self.model.prefill_batch(request.batch) + batch = self.model.batch_type.from_pb( + request.batch, self.model.tokenizer, self.model.dtype, self.model.device + ) + + generations, next_batch, timings = self.model.generate_token(batch) + self.cache.set(next_batch) return generate_pb2.PrefillResponse( generations=[generation.to_pb() for generation in generations], batch=next_batch.to_pb() if next_batch else None, @@ -91,9 +104,30 @@ async def Prefill(self, request, context): async def Decode(self, request, context): start = time.time_ns() - generations, next_batch, timings, concat_ns = self.model.decode_batch( - request.batches - ) + if len(request.batches) == 0: + raise ValueError("Must provide at least one batch") + + batches = [] + for batch_pb in request.batches: + batch = self.cache.pop(batch_pb.id) + 
if batch is None: + raise ValueError(f"Batch ID {batch_pb.id} not found in cache.") + batches.append(batch) + + if len(batches) == 0: + raise ValueError("All batches are empty") + + if len(batches) > 1: + start_concat = time.time_ns() + batch = self.model.batch_type.concatenate(batches) + concat_ns = time.time_ns() - start_concat + else: + batch = batches[0] + concat_ns = None + + generations, next_batch, timings = self.model.generate_token(batch) + self.cache.set(next_batch) + return generate_pb2.DecodeResponse( generations=[generation.to_pb() for generation in generations], batch=next_batch.to_pb() if next_batch else None, diff --git a/server/text_generation_server/utils/cache_manager_flashinfer.py b/server/text_generation_server/utils/cache_manager_flashinfer.py index 7d34edcf..f0e27d49 100644 --- a/server/text_generation_server/utils/cache_manager_flashinfer.py +++ b/server/text_generation_server/utils/cache_manager_flashinfer.py @@ -80,6 +80,51 @@ def release(self): self.is_released = True +class BatchKvCache: + def __init__(self, kvCachePool: KvCachePool, page_len, device): + self.kvCachePool = kvCachePool + self.page_len = page_len + self.device = device + self.kvCacheDict: dict[int, RequestKvCache] = {} + + def get(self, req_id): + return self.kvCacheDict.get(req_id) + + def create(self, req_id, seq_init_len): + self.kvCacheDict[req_id] = RequestKvCache( + self.kvCachePool, self.page_len, seq_init_len + ) + return self.kvCacheDict[req_id] + + def release(self, req_id): + self.kvCacheDict[req_id].release() + del self.kvCacheDict[req_id] + + def increment(self): + for kvCache in self.kvCacheDict.values(): + kvCache.increment() + + def setRequestOrder(self, requestIds: List[int]): + self.requestIds = requestIds + + def getKvCacheBatchPosition(self, requestIds: List[int], isPrefill: bool): + kv_page_indices_list = [] + kv_page_indptr_list = [] + seq_indptr_list = [] + kv_last_page_len_list = [] + seq_lens_list = [] + cum_pages = 0 + cum_seq_len = 0 + for requestId in requestIds: + kvCache = self.kvCacheDict[requestId] + kv_page_indices_list.extend(kvCache.kv_page_indices) + kv_page_indptr_list.append(cum_pages) + seq_indptr_list.append(cum_seq_len) + kv_last_page_len_list.append(kvCache.kv_last_page_len) + seq_lens_list.append(kvCache.kv_len) + cum_pages += len(kvCache.kv_page_indices) + cum_seq_len += kvCache.kv_len if isPrefill else 1 + def getKvCacheBatchPosition( request_kv_caches: List[RequestKvCache], isPrefill: bool, device: torch.device ) -> KvCacheBatchPosition: @@ -94,31 +139,43 @@ def getKvCacheBatchPosition( kv_page_indices_list.extend(request_kv_cache.kv_page_indices) kv_page_indptr_list.append(cum_pages) seq_indptr_list.append(cum_seq_len) - kv_last_page_len_list.append(request_kv_cache.kv_last_page_len) - seq_lens_list.append(request_kv_cache.kv_len) - cum_pages += len(request_kv_cache.kv_page_indices) - cum_seq_len += request_kv_cache.kv_len if isPrefill else 1 - - kv_page_indptr_list.append(cum_pages) - seq_indptr_list.append(cum_seq_len) - kv_page_indices = torch.tensor( - kv_page_indices_list, dtype=torch.int32, device=device - ) - kv_page_indptr = torch.tensor(kv_page_indptr_list, dtype=torch.int32, device=device) - kv_last_page_len = torch.tensor( - kv_last_page_len_list, dtype=torch.int32, device=device - ) - seq_indptr = torch.tensor(seq_indptr_list, dtype=torch.int32, device=device) - seq_lens = torch.tensor( - seq_lens_list, - dtype=torch.int32, - device=device, - ) - return KvCacheBatchPosition( - seq_indptr=seq_indptr, - kv_page_indptr=kv_page_indptr, - 
kv_page_indices=kv_page_indices, - kv_last_page_len=kv_last_page_len, - seq_lens=seq_lens, - total_seq_len=cum_seq_len, - ) + kv_page_indices = torch.tensor( + kv_page_indices_list, dtype=torch.int32, device=self.device + ) + kv_page_indptr = torch.tensor( + kv_page_indptr_list, dtype=torch.int32, device=self.device + ) + kv_last_page_len = torch.tensor( + kv_last_page_len_list, dtype=torch.int32, device=self.device + ) + seq_indptr = torch.tensor( + seq_indptr_list, dtype=torch.int32, device=self.device + ) + seq_lens = torch.tensor( + seq_lens_list, + dtype=torch.int32, + device=self.device, + ) + return KvCacheBatchPosition( + seq_indptr=seq_indptr, + kv_page_indptr=kv_page_indptr, + kv_page_indices=kv_page_indices, + kv_last_page_len=kv_last_page_len, + seq_lens=seq_lens, + total_seq_len=cum_seq_len, + ) + + +class ModelKvCache: + def __init__(self, kvCachePool: KvCachePool): + self.kvCachePool = kvCachePool + self.device = kvCachePool.device + self.page_len = kvCachePool.page_len + self.batchKvCacheDict: dict[int, BatchKvCache] = {} + + def getOrCreate(self, batch_id): + batchKvCache = self.batchKvCacheDict.get(batch_id) or BatchKvCache( + self.kvCachePool, self.page_len, self.device + ) + self.batchKvCacheDict[batch_id] = batchKvCache + return batchKvCache diff --git a/server/text_generation_server/utils/lora_utils.py b/server/text_generation_server/utils/lora_utils.py index bb1984be..bde45ca7 100644 --- a/server/text_generation_server/utils/lora_utils.py +++ b/server/text_generation_server/utils/lora_utils.py @@ -351,8 +351,8 @@ def get_lora_batched_weights( self, lora_ids: List[str], lora_lens: List[int] ) -> BatchedModelLoraWeight: assert len(lora_ids) <= self.lora_cap - # for lora_id in lora_ids: - # assert lora_id in self.lora_weights_cpu + for lora_id in lora_ids: + assert lora_id in self.lora_weights_cpu loraweights = [] for lora_id in lora_ids: if lora_id and lora_id not in self.lora_weights_gpu: