Optimized inference pipeline for Nano #4360

Open
jason-dai opened this issue Apr 6, 2022 · 21 comments

@jason-dai
Contributor

  1. Current status

               FP32   BF16   INT8
    PyTorch     Y      N      Y
    ONNX        Y      N      Y
    OpenVINO    Y      N      N
    • Trainer.compile(…, onnx=T/F, quantize=T/F, openvino=T/F) - bind relevant methods/variables
    • Trainer.quantize(…) - generate quantized model (PyTorch/ONNX)
    • Model.eval(quantize=T/F) - forward using (quantized) PyTorch model
    • Model.eval_onnx(quantize=T/F)/eval_openvino()/exit_onnx()/exit_openvino() - forward using (quantized) ONNX/OpenVINO model
  2. Desired status

    • Support all combinations of the above table
    • Compile: Trainer.compile() – just bind all methods/variables?
    • Quantize: Trainer.quantize(precision=…, accelerator=…)
    • Forward: model.eval(precision=…, accelerator=…)? – need to call quantize() first?
    • Export/save: Trainer.openvino.export(precision=…)? – how about onnx/quantized? need to be consistent
    • Load: model.load()/model.load_quantized_state_dict()??? - need to have consistent APIs
    • Status: model.eval_status()? – every model should maintain current/default mode, and report here?
    • What are the interactions among these methods? Are any other methods needed?

@TheaperDeng @zhentaocc @yangw1234 @shane-huang

@jason-dai jason-dai added the Nano label Apr 6, 2022
@TheaperDeng
Contributor

TheaperDeng commented Apr 7, 2022

Talked with @zhentaocc.

This comment will be kept updated according to the latest comments.

Trainer.compile

We don't need this method for the inference API.


trainer.quantize / trainer.trace

Quantize/trace a model with a specific precision and accelerator, and return a new model that only handles inference for that specific accelerated configuration.

# for bf16 or int8 low precision models
new_model = trainer.quantize(model,
                             precision="bf16"/"int8",
                             accelerator=None(pytorch)/"onnxruntime"/"openvino",
                             method="eager"/"fx"/"ipex"/"qlinear"...,
                             backend="inc"/"pot",
                             **kargs_inc,
                             **kargs_pot)

# for fp32 models backended on "onnxruntime" or "openvino"
new_model = trainer.trace(model,
                          accelerator="onnxruntime"/"openvino",
                          **kargs_accelerator)

A normal user should take care of:

  • model: A model that is compiled by Trainer.compile

  • precision: one of "bf16"/"int8"

  • accelerator: one of "pytorch"/"onnxruntime"/"openvino"

An expert user should take care of:

  • method: detailed post-training quantization method defined by each backend (e.g. for pytorch we have "eager", "fx", "ipex"..., for onnxruntime we have "linear", "qlinear"...). A recommended value will be set by default according to the precision and accelerator the user sets.

  • backend: which tool we will use to do the quantization. A recommended value will be set by default according to the precision and accelerator the user sets.

  • **kargs_inc/**kargs_pot: different advanced settings for inc and pot respectively.
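
As a concrete illustration of the calls above (argument values are illustrative and rely on the recommended defaults; this is a sketch, not a fixed spec):

# Illustrative usage only; default method/backend are picked as described above.
int8_model = trainer.quantize(model, precision="int8", accelerator="onnxruntime")
fp32_ov_model = trainer.trace(model, accelerator="openvino")

# The returned models are then used like an ordinary PyTorch module for inference.
y_hat = int8_model(x)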


model.eval

Users no longer need to call this method, but it does no harm to call it.


model.status

A @property on the model.

Returns a dict showing which precision and which accelerator the user is using.

>>> model.status
{"precision": xxx, "accelerator": xxx}
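
A minimal sketch of how such a property could be implemented (attribute and class names here are assumptions, not the actual implementation):

import torch

class AcceleratedInferenceModel(torch.nn.Module):
    # ... forward() redirected to the accelerated backend ...

    @property
    def status(self):
        # Report the precision/accelerator this model was built with.
        return {"precision": self._precision, "accelerator": self._accelerator}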

@zhentaocc
Contributor

zhentaocc commented Apr 8, 2022

This comment will be kept updated according to the latest comments.

model.train()

This function should not be called on a model that is returned by trainer.trace/trainer.quantize. If called, an error will be raised.
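
For example, a sketch of the intended behavior (assuming the returned model subclasses torch.nn.Module; the message text is illustrative):

def train(self, mode=True):
    # Overridden on models returned by trainer.trace/trainer.quantize.
    if mode:
        raise RuntimeError("A traced/quantized model is inference-only and cannot "
                           "be switched back to training mode.")
    return self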

model.inference() -> trainer.inference()

After our discussion, this method will be removed since it is highly similar to trainer.predict, and PyTorch users are accustomed to writing their own inference loop via model(x).

trainer.save

trainer.save(model, dirname=…)

This function returns a dictionary indicating the saved paths, so that users understand which files they can take away for further deployment.

trainer.load

trainer.load(model, dirname=…)

Same as above.
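
A sketch of the intended behavior (file names and dict keys are illustrative, following the ONNX example later in this thread):

>>> trainer.save(onnx_int8_model, dirname="./saved")
{"meta_data_path": "./saved/model.meta",
 "onnx_file_path": "./saved/model.onnx"}

>>> loaded_model = trainer.load(onnx_int8_model, dirname="./saved")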

@TheaperDeng
Contributor

TheaperDeng commented Apr 8, 2022

Discussed with @shane-huang earlier; we proposed another design that separates the quantized models/inference sessions from the original model, while still providing the same usage as a PyTorch model.

This design features a different behavior in trainer.quantize and a new trainer.trace method.

In short, these two methods (trainer methods) will return a new model whose forward has been redirected.

# for bf16 or int8 low precision models
new_model = trainer.quantize(model,
                             precision="bf16"/"int8",
                             accelerator=None(pytorch)/"onnxruntime"/"openvino",
                             method="eager"/"fx"/"ipex"/"qlinear"...,
                             backend="inc"/"pot",
                             **kargs_inc,
                             **kargs_pot)

# for fp32 models backended on "onnxruntime" or "openvino"
new_model = trainer.trace(model,
                          accelerator="onnxruntime"/"openvino",
                          **kargs_accelerator)

yhat = new_model(x)  # x is a torch tensor

Consequently, some other methods will change.

# .eval() is only used to change the accelerator's setting
new_model.eval(**kargs_accelerator)

# trainer.save will save the model's state and a metadata file identifying what precision and accelerator are used
trainer.save(new_model, dirname="...")
new_model = trainer.load(dirname="...")
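
A minimal sketch of the kind of wrapper this implies, assuming the "onnxruntime" accelerator (class and attribute names are hypothetical, not the actual Nano implementation):

import onnxruntime as ort
import torch

class _ORTInferenceModel(torch.nn.Module):
    """forward() is redirected to an ONNX Runtime session built from the exported model."""

    def __init__(self, onnx_path):
        super().__init__()
        self.session = ort.InferenceSession(onnx_path)
        self.input_name = self.session.get_inputs()[0].name

    def forward(self, x):
        # Accept a torch tensor, run the ORT session, and return a torch tensor.
        outputs = self.session.run(None, {self.input_name: x.detach().cpu().numpy()})
        return torch.from_numpy(outputs[0])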

@jason-dai
Contributor Author

jason-dai commented Apr 9, 2022

  1. It may be a good idea to ask users to explicitly call quantize or trace to get the specific acceleration; e.g.,

    • model=trainer.trace(model, accelerator="openvino") will always run using FP32 on OpenVINO
    • model=trainer.quantize(model, precision="INT8", accelerator="onnx") will always run using INT8 on ONNXRT
    • Then there is no need to change eval?
    • And maybe we should make the resulting model un-trainable, which can make it easier to manage?
  2. If using PyTorch to run the model, we should probably set accelerator to None, instead of "pytorch"

  3. For Trainer.save, we need to make sure the saved model can also be loaded by standard tools (such as ONNXRT, OpenVINO, TorchServe, PyTorch, etc.)

@TheaperDeng
Contributor

  1. It may be a good idea to ask users to explicitly call quantize or trace to get the specific acceleration; e.g.,
  • model=trainer.trace(model, accelerator="openvino") will always run using FP32 on OpenVINO

Yes

  • model=trainer.quantize(model, precision="INT8", accelerator="onnx") will always run using INT8 on ONNXRT

Yes

  • Then there is no need to change eval?

There are some cases where a user might want to call .eval(), especially when they want to change the accelerator's (e.g. openvino/onnxruntime) options. They may call model.eval(session_option) to rebuild an inference session.

  • And maybe we should make the resulting model un-trainable, which can make it easier to manage?

Yes

  1. If using PyTorch to run the model, we should probably set accelerator to None, instead of "pytorch"

  2. For Trainer.save, we need to make sure the saved model can also be loaded by standard tools (such as ONNXRT, OpenVINO, TorchServe, PyTorch, etc.)

  1. We will provide clear documentation.
  2. We should give them a clear message when calling Trainer.save, e.g.
>>> trainer.save(model, dirname=".")
{"meta_data_path": "./model.meta",
 "onnx_file_path": "./model.onnx"}

@jason-dai
Contributor Author

  • Then there is no need to change eval?

There are some cases where a user might want to call .eval(), especially when they want to change the accelerator's (e.g. openvino/onnxruntime) options. They may call model.eval(session_option) to rebuild an inference session.

Do we really need to support this? The user can always call Trainer.quantize or Trainer.trace to generate a new model.

@TheaperDeng
Contributor

  • Then there is no need to change eval?

There are some cases where a user might want to call .eval(), especially when they want to change the accelerator's (e.g. openvino/onnxruntime) options. They may call model.eval(session_option) to rebuild an inference session.

Do we really need to support this? The user can always call Trainer.quantize or Trainer.trace to generate a new model.

Exactly, we (@zhentaocc) talked about this and we agree that users can always call Trainer.quantize or Trainer.trace again. We will update the detailed API later.

@zhentaocc
Contributor

The new OpenVINO API will be implemented in #4381.

@yangw1234
Contributor

Can we rename trace to optimize?

We are also planning to add the new ipex logic here, since in the new ipex version, the usage changes to "model = ipex.optimize(model)".

e.g.

model = Trainer.optimize(model, accelerator='ipex')

For accelerator ipex, the model can still be used for training; for accelerator onnx/openvino, the model can only be used for inference.

@jason-dai

@jason-dai
Contributor Author

Can we rename trace to optimize?

We are also planning to add the new ipex logic here, since in the new ipex version, the usage changes to "model = ipex.optimize(model)".

e.g.

model = Trainer.optimize(model, accelerator='ipex')

For accelerator ipex, the model can still be used for training; for accelerator onnx/openvino, the model can only be used for inference.

@jason-dai

We plan to use trace for the optimized inference pipeline.

@yangw1234
Contributor

Can we rename trace to optimize?
We are also planning to add the new ipex logic here, since in the new ipex version, the usage changes to "model = ipex.optimize(model)".
e.g.

model = Trainer.optimize(model, accelerator='ipex')

For accelerator ipex, the model can still be used for training; for accelerator onnx/openvino, the model can only be used for inference.
@jason-dai

We plan to use trace for the optimized inference pipeline.

Where do you suggest ipex should go? Trainer.compile(use_ipex=True) or Trainer.trace(use_ipex=True)? Or trainer = Trainer(use_ipex=True) (current API).

@zhentaocc
Contributor

zhentaocc commented Apr 21, 2022

Current API for model saving and loading. It can do:

  1. save/export a pytorch model in openvino format
  2. save a PytorchOpenVINOModel as a local openvino file

Do you think this is a bit confusing for users, since it accepts multiple types of models? @jason-dai
Otherwise we can keep trainer.save handling only 1, and if users need 2, they can call openvino_model.save(...) to do the same.

trainer.save

Save from a PytorchOpenVINOModel:

openvino_model = trainer.trace(model, x, accelerator='openvino')  # model is an nn.Module
trainer.save(openvino_model, 'model.xml')

or save from torch model directly:

trainer.save(model, 'model.xml', accelerator='openvino', input_sample=x)  # model is an nn.Module

trainer.load

openvino_model  = PytorchOpenVINOModel.load('model.xml')
openvino_model = Trainer.load('saved_openvino_model.xml', accelerator='openvino')

@jason-dai
Contributor Author

Can we rename trace to optimize?
We are also planning to add the new ipex logic here, since in the new ipex version, the usage changes to "model = ipex.optimize(model)".
e.g.

model = Trainer.optimize(model, accelerator='ipex')

For accelerator ipex, the model can still be used for training; for accelerator onnx/openvino, the model can only be used for inference.
@jason-dai

We plan to use trace for the optimized inference pipeline.

Where do you suggest ipex should go? Trainer.compile(use_ipex=True) or Trainer.trace(use_ipex=True)? Or trainer = Trainer(use_ipex=True) (current API).

If it's for training, can we set it in fit?

@yangw1234
Contributor

Can we rename trace to optimize?
We are also planning to add the new ipex logic here, since in the new ipex version, the usage changes to "model = ipex.optimize(model)".
e.g.

model = Trainer.optimize(model, accelerator='ipex')

For accelerator ipex, the model can still be used for training; for accelerator onnx/openvino, the model can only be used for inference.
@jason-dai

We plan to use trace for the optimized inference pipeline.

Where do you suggest ipex should go? Trainer.compile(use_ipex=True) or Trainer.trace(use_ipex=True)? Or trainer = Trainer(use_ipex=True) (current API).

If it's for training, can we set it in fit?

In theory, we can override pytorch-lightning Trainer's fit method to add a "use_ipex" flag, but I am kind of afraid the usage would diverge too much from the original pytorch-lightning and become too complex.

For original pytorch-lightning, the user sets all the parameters in Trainer's constructor and only passes model and data in fit.
E.g.

trainer = Trainer(accelerator='a', training_type='b', trick_1=True, trick_2=True, ...)
trainer.fit(model, data)

What we have changed is:

  1. added a compile method to Trainer
  2. added a trace method to Trainer
  3. added a few other parameters in the constructor (ddp_spawn).

So the problem I am afraid of is: if we change fit too, would it be too complex for the user to use?

For example, if the user wants to use some feature, there will be 4 possible places for him/her to look for a parameter. On the other hand, in the original pytorch-lightning case, he/she only has to look through the Trainer's constructor.

@jason-dai
Contributor Author

Can we rename trace to optimize?
We are also planning to add the new ipex logic here, since in the new ipex version, the usage changes to "model = ipex.optimize(model)".
e.g.

model = Trainer.optimize(model, accelerator='ipex')

For accelerator ipex, the model can still be used for training; for accelerator onnx/openvino, the model can only be used for inference.
@jason-dai

We plan to use trace for the optimized inference pipeline.

Where do you suggest ipex should go? Trainer.compile(use_ipex=True) or Trainer.trace(use_ipex=True)? Or trainer = Trainer(use_ipex=True) (current API).

If it's for training, can we set it in fit?

In theory, we can override pytorch-lightning Trainer's fit method to add a "use_ipex" flag, but I am kind of afraid the usage would diverge too much from the original pytorch-lightning and become too complex.

For original pytorch-lightning, the user sets all the parameters in Trainer's constructor and only passes model and data in fit. E.g.

trainer = Trainer(accelerator='a', training_type='b', trick_1=True, trick_2=True, ...)
trainer.fit(model, data)

What we have changed is:

  1. added a compile method to Trainer
  2. added a trace method to Trainer
  3. added a few other parameters in the constructor (ddp_spawn).

So the problem I am afraid of is: if we change fit too, would it be too complex for the user to use?

For example, if the user wants to use some feature, there will be 4 possible places for him/her to look for a parameter. On the other hand, in the original pytorch-lightning case, he/she only has to look through the Trainer's constructor.

I think we are adding new capabilities to Trainer; each new API is introduced for a new use case that is not originally supported in PTL:

  • compile to convert PyTorch model
  • trace for model optimizations (for inference or deployment)
  • quantize for model quantization

If a use case is already supported by PTL, we should follow its original API; so for training-specific optimization, which one is the preferred API to extend - Trainer.__init__ or Trainer.fit?

@zhentaocc
Contributor

For the IPEX plugin, I think a Callback-style implementation would be quite suitable for injecting extra code or accelerations.
You don't need to overwrite fit or init; just create a new callback and modify setup(..). Then for all of fit, val, test, and predict, the pieces in setup will be called before entering the loops. @jason-dai @yangw1234
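
A rough sketch of that idea (illustrative only; assumes intel_extension_for_pytorch is installed, and a real implementation would also need to handle the optimizer for training):

import intel_extension_for_pytorch as ipex
from pytorch_lightning import Callback

class IPEXCallback(Callback):
    def setup(self, trainer, pl_module, stage=None):
        # Apply IPEX optimizations in place before the fit/val/test/predict loops start.
        ipex.optimize(pl_module, inplace=True)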

@jason-dai
Contributor Author

For the IPEX plugin, I think a Callback-style implementation would be quite suitable for injecting extra code or accelerations. You don't need to overwrite fit or init; just create a new callback and modify setup(..). Then for all of fit, val, test, and predict, the pieces in setup will be called before entering the loops. @jason-dai @yangw1234

Who needs to write and specify the callback - we or the user?

@zhentaocc
Contributor

For the IPEX plugin, I think a Callback-style implementation would be quite suitable for injecting extra code or accelerations. You don't need to overwrite fit or init; just create a new callback and modify setup(..). Then for all of fit, val, test, and predict, the pieces in setup will be called before entering the loops. @jason-dai @yangw1234

Who needs to write and specify the callback - we or the user?

We do this.

@yangw1234
Contributor

For the IPEX plugin, I think a Callback-style implementation would be quite suitable for injecting extra code or accelerations. You don't need to overwrite fit or init; just create a new callback and modify setup(..). Then for all of fit, val, test, and predict, the pieces in setup will be called before entering the loops. @jason-dai @yangw1234

It seems to me that model = ipex.optimize(model) will create a new model; is the callback API capable of replacing the pl-module to be trained?

@zhentaocc
Contributor

For the IPEX plugin, I think a Callback-style implementation would be quite suitable for injecting extra code or accelerations. You don't need to overwrite fit or init; just create a new callback and modify setup(..). Then for all of fit, val, test, and predict, the pieces in setup will be called before entering the loops. @jason-dai @yangw1234

It seems to me that model = ipex.optimize(model) will create a new model; is the callback API capable of replacing the pl-module to be trained?

Possibly you can bind the ipex model to trainer.lightning_module? Or use in-place optimization: model = ipex.optimize(model, inplace=True)?

@jason-dai
Contributor Author

For the IPEX plugin, I think a Callback-style implementation would be quite suitable for injecting extra code or accelerations. You don't need to overwrite fit or init; just create a new callback and modify setup(..). Then for all of fit, val, test, and predict, the pieces in setup will be called before entering the loops. @jason-dai @yangw1234

Who needs to write and specify the callback - we or the user?

We do this.

It's unclear to me how the user would write their code to use IPEX.
