[pipeline] A simple fix for half-precision & 8bit models #21479
Conversation
The documentation is not available anymore as the PR was closed or merged.
Thanks for the well-thought-out issue and proposed fix. I don't particularly like the fix because it depends on some weird internal and still forces users to use … Couldn't we just use …? Imo using …
Thanks for the feedback @Narsil!
Makes sense!
I would highly advise against it too. There's a limit to magic. Doing half precision on CPU should crash in a lot of places. We shouldn't upcast on behalf of a user that explicitly asked for half precision, imo. That's breaking user intent. Does …?
I see, makes sense!
I am not sure here, maybe @sgugger & @muellerzr know better.
If not, the pipeline could have the simplest heuristic.
What do you think:

```diff
diff --git a/src/transformers/pipelines/base.py b/src/transformers/pipelines/base.py
index 30402b36e..e698f1aa3 100644
--- a/src/transformers/pipelines/base.py
+++ b/src/transformers/pipelines/base.py
@@ -749,7 +749,7 @@ class Pipeline(_ScikitCompat):
         framework: Optional[str] = None,
         task: str = "",
         args_parser: ArgumentHandler = None,
-        device: Union[int, str, "torch.device"] = -1,
+        device: Union[int, str, "torch.device"] = None,
         torch_dtype: Optional[Union[str, "torch.dtype"]] = None,
         binary_output: bool = False,
         **kwargs,
@@ -764,6 +764,20 @@ class Pipeline(_ScikitCompat):
         self.image_processor = image_processor
         self.modelcard = modelcard
         self.framework = framework
+
+        # Special handling
+        if self.framework == "pt" and device is not None:
+            self.model = self.model.to(device=device)
+
+        if device is None:
+            # `accelerate` device map
+            hf_device_map = getattr(self.model, "hf_device_map", None)
+            if hf_device_map is not None:
+                # Take the first device used by `accelerate`.
+                device = next(iter(hf_device_map.values()))
+            else:
+                device = -1
+
         if is_torch_available() and self.framework == "pt":
             if isinstance(device, torch.device):
                 self.device = device
@@ -775,13 +789,10 @@ class Pipeline(_ScikitCompat):
                 self.device = torch.device(f"cuda:{device}")
             else:
                 self.device = device
+
         self.torch_dtype = torch_dtype
         self.binary_output = binary_output
-        # Special handling
-        if self.framework == "pt" and self.device.type != "cpu":
-            self.model = self.model.to(self.device)
-
         # Update config with task specific parameters
         task_specific_params = self.model.config.task_specific_params
         if task_specific_params is not None and task in task_specific_params:
```

Here we just modify the default device when the model uses `hf_device_map`.
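The default-device resolution in the proposed diff can be summarised as a torch-free sketch (the helper name `resolve_default_device` is hypothetical; the real code runs this logic inline in `Pipeline.__init__` before building `self.device`):

```python
def resolve_default_device(hf_device_map, device=None):
    """Mirror of the proposed default-device logic (illustrative only).

    If the caller passed an explicit ``device``, keep it. Otherwise, if the
    model was dispatched by accelerate (it exposes an ``hf_device_map``),
    take the first device in that map; else fall back to -1 (CPU).
    """
    if device is not None:
        return device
    if hf_device_map is not None:
        # Take the first device used by `accelerate` (dicts preserve
        # insertion order in Python 3.7+).
        return next(iter(hf_device_map.values()))
    return -1
```

With an accelerate-style map such as `{"transformer.h.0": 0, "transformer.h.1": 1}`, this picks device `0`; with no map and no explicit device, it keeps the old `-1` default.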
I think that would totally work @Narsil! Happy to change the PR with your proposed changes, let me know!
Sure. Let's update the doc too. |
Side notes:

```python
if set(device_map.values()) == {"cpu"} or set(device_map.values()) == {"cpu", "disk"}:
    main_device = "cpu"
else:
    main_device = [d for d in device_map.values() if d not in ["cpu", "disk"]][0]
```
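The quoted heuristic, wrapped as a standalone function for illustration (the function name `pick_main_device` is hypothetical; the body is the logic quoted above):

```python
def pick_main_device(device_map):
    """Pick a 'main' device from an accelerate-style device map.

    If every module sits on CPU (or CPU + disk offload), the main device
    is CPU; otherwise it is the first real (non-cpu, non-disk) device.
    """
    devices = set(device_map.values())
    if devices == {"cpu"} or devices == {"cpu", "disk"}:
        return "cpu"
    return [d for d in device_map.values() if d not in ["cpu", "disk"]][0]
```

So a fully offloaded map like `{"lm_head": "cpu", "h": "disk"}` resolves to `"cpu"`, while a sharded map like `{"h.0": "cpu", "h.1": 0, "h.2": 1}` resolves to GPU `0`.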
Thanks for the update.
I propose to simplify the examples a lot.
Then, regarding the actual code, I think your proposed change is entangling and confusing `device`/`self.device` more than it should. At least more than my original modification.
We could go even further in "cleanliness" and handle all that logic within the `pipeline` function and not in `Pipeline.__init__`, since at that point the model should be a black box (that includes moving the model on device).
Modifying the model-on-device logic will break calls like `TranslationPipeline(model=MyModel(), device=0)`. I don't think we can realistically break that, but we can probably figure out a non-breaking change that does handle `device_map` better.
src/transformers/pipelines/base.py (outdated):

```python
if self.framework == "pt" and device is not None:
    self.model = self.model.to(device=self.device)
```
Don't mix `self.device` and `device`. This is super error-prone.
The proposed change I made was at least explicit about its default value.
I really think this needs to be changed. Too many opportunities to introduce bugs later on.

- Set the default value (if no value provided)
- Handle `device` to create `self.device`
- Use `self.device` everywhere
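The three steps above could be sketched as follows (a hedged, torch-free illustration; `normalize_device` is a hypothetical helper, and the real code builds a `torch.device` rather than a string):

```python
def normalize_device(device=None):
    """Resolve a canonical device in the order the review suggests.

    1. Set the default value when nothing was provided.
    2. Handle ``device`` to create the canonical value (self.device).
    3. Callers then use only the canonical value everywhere.
    """
    if device is None:          # step 1: default
        device = -1
    if isinstance(device, int):  # step 2: canonicalize ints
        return "cpu" if device < 0 else f"cuda:{device}"
    return str(device)           # already a string / device-like value
```

The point of the ordering is that every branch after step 2 reads from one variable only, so there is no chance of mixing the raw argument with the resolved device.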
Yeah, I somehow didn't fully consider your proposition in #21479 (comment) - I think it's wiser to revert my changes with yours!
src/transformers/pipelines/base.py (outdated):

```python
hf_device_map = getattr(self.model, "hf_device_map", None)
if hf_device_map is not None:
```
There's probably a way to structure the code so this is written only once.
I think directly in the `pipeline` function we could warn when both `device` and `device_map` are set.
That prevents having to guess here. If you're splitting model loading and pipeline loading, then you should be aware of what you do, but we shouldn't actively depend on internals to seek out what's going on.
Essentially, when users use `pipeline(model=MyModel())`, the `model` is a black box to us; we shouldn't look at it. We're looking at it in my proposed change only when there's no device being sent.
And to be even purer, we could modify the `pipeline` function itself to check `hf_device_map` only when we do `from_pretrained`. That seems even cleaner, since we know this internal map could exist there (whereas we can't know if the user passes in a real object).
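The warn-on-conflict idea could look like this (a hedged sketch; `check_device_args` is a hypothetical helper, not the actual `pipeline()` signature, and the warning text is illustrative):

```python
import warnings


def check_device_args(device=None, device_map=None):
    """Warn instead of guessing when both placement arguments are given.

    Returns the argument that will take effect: an explicit ``device``
    wins over ``device_map`` in this sketch.
    """
    if device is not None and device_map is not None:
        warnings.warn(
            "Both `device` and `device_map` were passed; "
            "`device` will override the `device_map` placement."
        )
        return device
    return device if device is not None else device_map
```

This keeps the guessing out of `Pipeline.__init__`: by the time the constructor runs, there is exactly one source of truth for placement.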
💯 on this @Narsil
Agreed with all of @Narsil's comments!
docs/source/en/pipeline_tutorial.mdx (outdated):

```python
# pip install accelerate
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom", torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
```
Maybe use a smaller example, and say in a note that the user can replace it with BLOOM?
Sure!
Co-authored-by: Nicolas Patry <[email protected]>
Thanks for iterating!
Co-authored-by: Sylvain Gugger <[email protected]>
LGTM.
The failing test is independent of our PR! Merging!
…ce#21479)

* v1 fix
* adapt from suggestions
* make style
* fix tests
* add gpu tests
* update docs
* fix other tests
* Apply suggestions from code review

Co-authored-by: Nicolas Patry <[email protected]>

* better fix
* make fixup
* better example
* revert changes
* proposal
* more elegant solution
* Update src/transformers/pipelines/automatic_speech_recognition.py

Co-authored-by: Sylvain Gugger <[email protected]>

---------

Co-authored-by: Nicolas Patry <[email protected]>
Co-authored-by: Sylvain Gugger <[email protected]>
What does this PR do?
Currently on the `main` branch of `transformers`, if a user wants to run a `pipeline` using large models (thus, ideally loaded with `device_map=...`) and in half precision (or in int8), they may encounter some issues when calling `pipeline` with `top_p` & `top_k` sampling.

Snippet to reproduce & explanations:
This is because the `input_ids` are automatically set on `cpu`, since the argument `device` is not passed when initializing the `pipeline`. A model that is loaded with `device_map=...` (i.e. with `accelerate`) always sets the output tensor of the model to the `device` of the input tensor, thanks to the forward hooks. Therefore, when calling the top_k method, the output tensor is in fp16 (because the model has been loaded in fp16) and on `cpu`, hence the torch error above.

Currently a hack to fix this is to add
`device=0` when initializing the `pipeline`, but this leads to inconsistent and undesirable behaviours in some cases, for example when loading large models across several GPUs, since the call `model.to(self.device)` will break some internals (the hooks will still be there, but the weights will be set on the wrong devices). A snippet to reproduce below:

Adding this hack also breaks the usage of
`pipeline` with int8 models, since the `to` method is blocked for these models:

Thus, I propose to fix this by simply checking whether a model has been loaded with
`accelerate` by looking at the attribute `hf_device_map`, and setting the model on the correct device only if it has not been loaded with `accelerate` as backend. This fixes 3 bugs: using `pipeline` with an fp16 model loaded with `accelerate` without any error in multi-GPU usage, using `pipeline` with an fp16 model w/ `accelerate` & sampling strategies, and using `pipeline` with int8 models & sampling strategies.

cc @sgugger @Narsil
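The `hf_device_map` check described above can be condensed into a small predicate (a hedged sketch; `should_move_to_device` is a hypothetical helper name, and the real fix performs this check inline in the pipeline code):

```python
def should_move_to_device(model, framework="pt"):
    """Decide whether the pipeline may call ``model.to(device)``.

    Only PyTorch models that were NOT dispatched by accelerate (i.e. that
    carry no ``hf_device_map`` attribute) should be moved; accelerate has
    already placed the weights, and int8 models block ``.to`` entirely.
    """
    return framework == "pt" and getattr(model, "hf_device_map", None) is None
```

A plain model passes the check and gets moved as before, while an accelerate-dispatched model (with `hf_device_map` set) is left untouched, which avoids both the multi-GPU weight-shuffling bug and the blocked `.to` call on int8 models.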