tracker: `generate` composability refactor #30810

gante · 2024-05-14T15:45:40Z

`generate` + composability = more use cases with minimal rewrites

As I write this issue, generate is mostly a sequential monolith. Many internal blocks were carved into functions over the last two years, but navigating there as a beginner is still messy. It is also very challenging to adapt generate to different tasks and/or modalities, forcing us to overwrite the entire generate function (e.g. RAG, MusicGen). All these aspects make using, documenting, maintaining, and testing generate a challenge.

This issue is a tracker for the refactor of generate, where we aim to build the structure outlined in this board. Key ideas for this refactor:
👉 All models can use the base generate API
👉 Reduce if/else blocks
👉 Reduce the barriers to entry for new decoding methods/modalities/use cases
👉 Reduce per-model overwrites when possible
👉 Add unit tests
👉 Add documentation regarding the structure of generate

Tasks

1. Isolate prefill into a separate function, pulling it from the decoding methods. Note that
- a) prefill is done excluding the latest token (input_ids[:, -1:]), so we don't compute variables regarding the latest token twice;
- b) prefill only runs when use_cache=True and cache length < input length - 1;
- c) _expand_inputs_for_generation needs to be changed (it copied inputs before prefill, we will need to copy prefill outputs)
2. (depends on 1.) Separate generate on the 5 stages described in the diagram, passing around the data structures described therein
3. (depends on 1.) Streaming 2.0
- a) Add option to yield/yield from instead of return
- b) Deprecate the old streamer classes;
- c) Add a new class to print the stream into the screen. For beam methods, build a class that prints up to the point where all beams agree with each other.
- d) thoroughly document and communicate this feature
- e) enable streaming into the screen with pipeline
4. (depends on 2.) Separate stage 1 into a set of functions as described in the diagram. Add unit tests.
5. (depends on 2.) Separate stage 2 into a set of functions as described in the diagram. Add unit tests. Move the preparation of common model inputs here, such as position_ids.
6. (depends on 2.) Separate stage 3 into a set of functions as described in the diagram. Add unit tests. Deprecate LogitsWarper in this step (it's a copy of LogitsProcessor)
7. (depends on 2.) Separate stage 5 into a set of functions as described in the diagram. Add unit tests.
8. Add a new document page walking through the structure of generate

[From this point onwards the tasks are only a sketch, need more detailed planning when we get there]

9. Reduce if/elses through templates (e.g. LLMs have a certain default for prepare_inputs_for_generation, VLMs also have their special preprocessing steps, ...)
10. Play around with caching of some blocks to determine whether it speeds up generation
11. Rework prepare_inputs_for_generation ?
12. Remove generate from models that have a custom implementation
(other tasks, TBD)

The text was updated successfully, but these errors were encountered:

amyeroberts · 2024-05-15T10:23:29Z

Adding the WIP label, so you don't get pinged by the stale bot 🤖

dmarx · 2024-05-20T17:12:34Z

Could you elaborate on the "prefill" component? My impression is that this step converts the prompt into a KV cache, i.e. "pre-filling" the KV component for the tokens that are fixed. If that's correct, this component could probably serve double duty (as an exit point from the generate procedure) for users who just want logprobs/scores for the prompt.

dmarx · 2024-05-20T17:20:40Z

IMHO, upsteam of (part of?) the "generate outputs" step in the decoding loop should be a templated _prepare_outputs function whose job is just to attach output attributes to the appropriate output class. Design POC via #29545, concretely:
https://github.com/coreweave/transformers/blob/dmarx.output_streamer/src/transformers/generation/utils.py#L354-L378

gante · 2024-05-21T10:56:00Z

@dmarx

Could you elaborate on the "prefill" component? My impression is that this step converts the prompt into a KV cache, i.e. "pre-filling" the KV component for the tokens that are fixed. If that's correct, this component could probably serve double duty (as an exit point from the generate procedure) for users who just want logprobs/scores for the prompt.

Correct, it can be made a public function with that additional purpose. The difference between the prefill for generate and obtaining the scores for the prompt is that in the former, we only want to keep the past KV. The different output needs suggest to me that a stand-alone public function is preferable to an alternate exit to generate :D Added this to the notes of the prefill stage in the diagram.

IMHO, upsteam of (part of?) the "generate outputs" step in the decoding loop should be a templated _prepare_outputs function whose job is just to attach output attributes to the appropriate output class. Design POC via #29545, concretely:
https://github.com/coreweave/transformers/blob/dmarx.output_streamer/src/transformers/generation/utils.py#L354-L378

That's a good idea, adding it to the diagram!

Thank you for the suggestions 💛

jiqing-feng · 2024-07-11T08:41:17Z

Hi @gante . I see this issue has been open for 2 months. Do you have any plans to start the 1st step? If it relies on other tasks and cannot start recently, do you mind considering #31564 as the temporary solution?

dmarx · 2024-07-18T16:31:39Z

@gante maybe you could create sub-issues to track progress on dependent sub tasks and centralize visibility for any in-progress work?

tylerweitzman · 2024-09-24T14:57:17Z

@gante is there any workaround for this in the meantime? Would a custom streamer class be able to pass through the custom model outputs, it seems the streamer class may only be given the token and not other inputs right?

I have the same issue as #29545 which brought me here

gante · 2024-09-24T17:58:07Z

@tylerweitzman after this plan is complete, streaming should yield all model outputs (and not just the tokens).

The tricky part is timeline, as this competes with other projects 🤗 I'm currently working on #28981 and #32685, which have a higher priority

dmarx · 2024-12-14T08:38:34Z

@tylerweitzman workaround available here: #29545

amyeroberts added the WIP Label your PR/Issue with WIP for some long outstanding Issues/PRs that are work in progress label May 15, 2024

gante mentioned this issue May 16, 2024

Stream ModelOutputs #29545

Closed

26 tasks

This was referenced May 21, 2024

Improving memory efficiency further 🚀 #30860

Closed

Static cache + torch.compile: better documentation for prefill static sequence length #29151

Closed

gante mentioned this issue May 29, 2024

add stream to pipeline parameters #30487

Open

gante mentioned this issue Jun 26, 2024

Optimize 1st token for beam_search #31564

Closed

gante mentioned this issue Aug 2, 2024

GenerationConfig throws Object is not JSON serializable when setting constraints #31070

Closed

4 tasks

This was referenced Aug 5, 2024

Custom beam search scorer argument in generate function #32097

Open

SequenceBiasLogitsProcessor parameter cannot be specified in the the json config file #32416

Closed

gante mentioned this issue Aug 6, 2024

Docs: alert for the possibility of manipulating logits #32467

Merged

gante mentioned this issue Oct 17, 2024

tracker: generate compatibility with torch.compile #28981

Closed

33 tasks

This was referenced Nov 27, 2024

isin() received an invalid combination of arguments #31040

Closed

when model.generate with num_beams=2 and num_return_sequences=2,the output seqs are different from input_ids of stopping_criteria #34574

Open

gante mentioned this issue Jan 30, 2025

[generate] ✨ vectorized beam search ✨ #35802

Open

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tracker: `generate` composability refactor #30810

tracker: `generate` composability refactor #30810

gante commented May 14, 2024 •

edited

Loading

amyeroberts commented May 15, 2024

dmarx commented May 20, 2024

dmarx commented May 20, 2024 •

edited

Loading

gante commented May 21, 2024 •

edited

Loading

jiqing-feng commented Jul 11, 2024

dmarx commented Jul 18, 2024

tylerweitzman commented Sep 24, 2024 •

edited

Loading

gante commented Sep 24, 2024

dmarx commented Dec 14, 2024

tracker: generate composability refactor #30810

tracker: generate composability refactor #30810

Comments

gante commented May 14, 2024 • edited Loading

generate + composability = more use cases with minimal rewrites

Tasks

amyeroberts commented May 15, 2024

dmarx commented May 20, 2024

dmarx commented May 20, 2024 • edited Loading

gante commented May 21, 2024 • edited Loading

jiqing-feng commented Jul 11, 2024

dmarx commented Jul 18, 2024

tylerweitzman commented Sep 24, 2024 • edited Loading

gante commented Sep 24, 2024

dmarx commented Dec 14, 2024

tracker: `generate` composability refactor #30810

tracker: `generate` composability refactor #30810

gante commented May 14, 2024 •

edited

Loading

`generate` + composability = more use cases with minimal rewrites

dmarx commented May 20, 2024 •

edited

Loading

gante commented May 21, 2024 •

edited

Loading

tylerweitzman commented Sep 24, 2024 •

edited

Loading