Optimized inference pipeline for Nano #4360

Open
jason-dai opened this issue Apr 6, 2022 · 21 comments

@jason-dai
Contributor

  1. Current status

               FP32   BF16   INT8
    PyTorch     Y      N      Y
    ONNX        Y      N      Y
    OpenVINO    Y      N      N
    • Trainer.compile(…, onnx=T/F, quantize=T/F, openvino=T/F) - bind relevant methods/variables
    • Trainer.quantize(…) - generate quantized model (PyTorch/ONNX)
    • Model.eval(quantize=T/F) - forward using (quantized) PyTorch model
    • Model.eval_onnx(quantize=T/F)/eval_openvino()/exit_onnx()/exit_openvino() - forward using (quantized) ONNX/OpenVINO model
  2. Desired status

    • Support all combinations of the above table
    • Compile: Trainer.compile() – just bind all methods/variables?
    • Quantize: Trainer.quantize(precision=…, accelerator=…)
    • Forward: model.eval(precision=…, accelerator=…)? – need to call quantize() first?
    • Export/save: Trainer.openvino.export(precision=…)? – how about onnx/quantized? need to be consistent
    • Load: model.load()/model.load_quantized_state_dict()??? - need to have consistent APIs
    • Status: model.eval_status()? – every model should maintain current/default mode, and report here?
    • What are the interactions among these methods? Are any other methods needed?

@TheaperDeng @zhentaocc @yangw1234 @shane-huang

@jason-dai jason-dai added the Nano label Apr 6, 2022
@TheaperDeng
Contributor

TheaperDeng commented Apr 7, 2022

Talked with @zhentaocc.

This comment will be kept updated according to the latest comments.

Trainer.compile

We don't need this method for the inference API.


trainer.quantize / trainer.trace

Quantize/trace a model with a specific precision and accelerator, and return a new model that only handles inference for that specific accelerated configuration.

# for bf16 or int8 low precision models
new_model = trainer.quantize(model,
                             precision="bf16"/"int8",
                             accelerator=None(pytorch)/"onnxruntime"/"openvino",
                             method="eager"/"fx"/"ipex"/"qlinear"...,
                             backend="inc"/"pot",
                             **kargs_inc,
                             **kargs_pot)

# for fp32 models backended on "onnxruntime" or "openvino"
new_model = trainer.trace(model,
                          accelerator="onnxruntime"/"openvino",
                          **kargs_accelerator)

A normal user should take care of:

  • model: A model that is compiled by Trainer.compile

  • precision: one of "bf16"/"int8"

  • accelerator: one of "pytorch"/"onnxruntime"/"openvino"

An expert user should take care of:

  • method: detailed post-training quantization method defined by each backend (e.g. for pytorch we have "eager", "fx", "ipex"..., for onnxruntime we have "linear", "qlinear"...). A recommended value will be set by default according to the precision and accelerator the user sets.

  • backend: which tool we will use to do the quantization. A recommended value will be set by default according to the precision and accelerator the user sets.

  • **kargs_inc/**kargs_pot: different advanced settings for inc and pot respectively.
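
As a concrete illustration of the calls above (argument values are illustrative and rely on the recommended defaults; this is a sketch, not a fixed spec):

# Illustrative usage only; default method/backend are picked as described above.
int8_model = trainer.quantize(model, precision="int8", accelerator="onnxruntime")
fp32_ov_model = trainer.trace(model, accelerator="openvino")

# The returned models are then used like an ordinary PyTorch module for inference.
y_hat = int8_model(x)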


model.eval

Users no longer need to call this method, but it does no harm to call it.


model.status

A @property on the model.

Returns a dict showing which precision and which accelerator the user is using.

>>> model.status
{"precision": xxx, "accelerator": xxx}
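
A minimal sketch of how such a property could be implemented (attribute and class names here are assumptions, not the actual implementation):

import torch

class AcceleratedInferenceModel(torch.nn.Module):
    # ... forward() redirected to the accelerated backend ...

    @property
    def status(self):
        # Report the precision/accelerator this model was built with.
        return {"precision": self._precision, "accelerator": self._accelerator}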

@zhentaocc
Contributor

zhentaocc commented Apr 8, 2022

This comment will be kept updated according to the latest comments.

model.train()

This function should not be called on a model that is returned by trainer.trace/trainer.quantize. If called, an error will be raised.
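
For example, a sketch of the intended behavior (assuming the returned model subclasses torch.nn.Module; the message text is illustrative):

def train(self, mode=True):
    # Overridden on models returned by trainer.trace/trainer.quantize.
    if mode:
        raise RuntimeError("A traced/quantized model is inference-only and cannot "
                           "be switched back to training mode.")
    return self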

model.inference() -> trainer.inference()

After our discussion, this method will be removed since it is highly similar to trainer.predict, and PyTorch users are accustomed to writing their own inference loop via model(x).

trainer.save

trainer.save(model, dirname=…)

This function returns a dictionary indicating the saved paths, so that users understand which files they can take away for further deployment.

trainer.load

trainer.load(model, dirname=…)

Same as above.
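
A sketch of the intended behavior (file names and dict keys are illustrative, following the ONNX example later in this thread):

>>> trainer.save(onnx_int8_model, dirname="./saved")
{"meta_data_path": "./saved/model.meta",
 "onnx_file_path": "./saved/model.onnx"}

>>> loaded_model = trainer.load(onnx_int8_model, dirname="./saved")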

@TheaperDeng
Contributor

TheaperDeng commented Apr 8, 2022

Discussed with @shane-huang earlier; we proposed another design that separates the quantized models/inference sessions from the original model, while still providing the same usage as a PyTorch model.

This design features a different behavior in trainer.quantize and a new trainer.trace method.

In short, these two methods (trainer methods) will return a new model whose forward has been redirected.

# for bf16 or int8 low precision models
new_model = trainer.quantize(model,
                             precision="bf16"/"int8",
                             accelerator=None(pytorch)/"onnxruntime"/"openvino",
                             method="eager"/"fx"/"ipex"/"qlinear"...,
                             backend="inc"/"pot",
                             **kargs_inc,
                             **kargs_pot)

# for fp32 models backended on "onnxruntime" or "openvino"
new_model = trainer.trace(model,
                          accelerator="onnxruntime"/"openvino",
                          **kargs_accelerator)

yhat = new_model(x)  # x is a torch tensor

Consequently, some other methods will change.

# .eval() is only used to change the accelerator's setting
new_model.eval(**kargs_accelerator)

# trainer.save will save the model's state and a metadata file identifying what precision and accelerator are used
trainer.save(new_model, dirname="...")
new_model = trainer.load(dirname="...")
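
A minimal sketch of the kind of wrapper this implies, assuming the "onnxruntime" accelerator (class and attribute names are hypothetical, not the actual Nano implementation):

import onnxruntime as ort
import torch

class _ORTInferenceModel(torch.nn.Module):
    """forward() is redirected to an ONNX Runtime session built from the exported model."""

    def __init__(self, onnx_path):
        super().__init__()
        self.session = ort.InferenceSession(onnx_path)
        self.input_name = self.session.get_inputs()[0].name

    def forward(self, x):
        # Accept a torch tensor, run the ORT session, and return a torch tensor.
        outputs = self.session.run(None, {self.input_name: x.detach().cpu().numpy()})
        return torch.from_numpy(outputs[0])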

@jason-dai
Contributor Author

jason-dai commented Apr 9, 2022

  1. It may be a good idea to ask users to explicitly call quantize or trace to get the specific acceleration; e.g.,

    • model=trainer.trace(model, accelerator="openvino") will always run using FP32 on OpenVINO
    • model=trainer.quantize(model, precision="INT8", accelerator="onnx") will always run using INT8 on ONNXRT
    • Then there is no need to change eval?
    • And maybe we should make the resulting model un-trainable, which can make it easier to manage?
  2. If using PyTorch to run the model, we should probably set accelerator to None, instead of "pytorch"

  3. For Trainer.save, we need to make sure the saved model can also be loaded by standard tools (such as ONNXRT, OpenVINO, TorchServe, PyTorch, etc.)

@TheaperDeng
Contributor

  1. It may be a good idea to ask users to explicitly call quantize or trace to get the specific acceleration; e.g.,
  • model=trainer.trace(model, accelerator="openvino") will always run using FP32 on OpenVINO

Yes

  • model=trainer.quantize(model, precision="INT8", accelerator="onnx") will always run using INT8 on ONNXRT

Yes

  • Then there is no need to change eval?

There are some cases where a user might want to call .eval(), especially when they want to change the accelerator's (e.g. openvino/onnxruntime) options. They may call model.eval(session_option) to rebuild an inference session.

  • And maybe we should make the resulting model un-trainable, which can make it easier to manage?

Yes

  1. If using PyTorch to run the model, we should probably set accelerator to None, instead of "pytorch"

  2. For Trainer.save, we need to make sure the saved model can also be loaded by standard tools (such as ONNXRT, OpenVINO, TorchServe, PyTorch, etc.)

  1. We will provide clear documentation.
  2. We should give them a clear message when calling Trainer.save, e.g.
>>> trainer.save(model, dirname=".")
{"meta_data_path": "./model.meta",
 "onnx_file_path": "./model.onnx"}

@jason-dai
Contributor Author

  • Then there is no need to change eval?

There are some cases where a user might want to call .eval(), especially when they want to change the accelerator's (e.g. openvino/onnxruntime) options. They may call model.eval(session_option) to rebuild an inference session.

Do we really need to support this? The user can always call Trainer.quantize or Trainer.trace to generate a new model.

@TheaperDeng
Contributor

  • Then there is no need to change eval?

There are some cases where a user might want to call .eval(), especially when they want to change the accelerator's (e.g. openvino/onnxruntime) options. They may call model.eval(session_option) to rebuild an inference session.

Do we really need to support this? The user can always call Trainer.quantize or Trainer.trace to generate a new model.

Exactly, we (@zhentaocc) talked about this and we agree that users can always call Trainer.quantize or Trainer.trace again. We will update the detailed API later.

@zhentaocc
Contributor

The new OpenVINO API will be implemented in #4381.

@yangw1234
Contributor

Can we rename trace to optimize?

We are also planning to add the new ipex logic here, since in the new ipex version, the usage changes to "model = ipex.optimize(model)".

e.g.

model = Trainer.optimize(model, accelerator='ipex')

For accelerator ipex, the model can still be used for training; for accelerator onnx/openvino, the model can only be used for inference.

@jason-dai

@jason-dai
Contributor Author

Can we rename trace to optimize?

We are also planning to add the new ipex logic here, since in the new ipex version, the usage changes to "model = ipex.optimize(model)".

e.g.

model = Trainer.optimize(model, accelerator='ipex')

For accelerator ipex, the model can still be used for training; for accelerator onnx/openvino, the model can only be used for inference.

@jason-dai

We plan to use trace for the optimized inference pipeline.

@yangw1234
Contributor

Can we rename trace to optimize?
We are also planning to add the new ipex logic here, since in the new ipex version, the usage changes to "model = ipex.optimize(model)".
e.g.

model = Trainer.optimize(model, accelerator='ipex')

For accelerator ipex, the model can still be used for training; for accelerator onnx/openvino, the model can only be used for inference.
@jason-dai

We plan to use trace for the optimized inference pipeline.

Where do you suggest ipex should go? Trainer.compile(use_ipex=True) or Trainer.trace(use_ipex=True)? Or trainer = Trainer(use_ipex=True) (current API).

@zhentaocc
Contributor

zhentaocc commented Apr 21, 2022

Current API for model saving and loading. It can do:

  1. save/export a pytorch model in openvino format
  2. save a PytorchOpenVINOModel as a local openvino file

Do you think this is a bit confusing for users, since it accepts multiple types of models? @jason-dai
Otherwise we can keep trainer.save handling only 1, and if users need 2, they can call openvino_model.save(...) to do the same.

trainer.save

Save from a PytorchOpenVINOModel:

openvino_model = trainer.trace(model, x, accelerator='openvino')  # model is an nn.Module
trainer.save(openvino_model, 'model.xml')

or save from torch model directly:

trainer.save(model, 'model.xml', accelerator='openvino', input_sample=x)  # model is an nn.Module

trainer.load

openvino_model  = PytorchOpenVINOModel.load('model.xml')
openvino_model = Trainer.load('saved_openvino_model.xml', accelerator='openvino')

@jason-dai
Contributor Author

Can we rename trace to optimize?
We are also planning to add the new ipex logic here, since in the new ipex version, the usage changes to "model = ipex.optimize(model)".
e.g.

model = Trainer.optimize(model, accelerator='ipex')

For accelerator ipex, the model can still be used for training; for accelerator onnx/openvino, the model can only be used for inference.
@jason-dai

We plan to use trace for the optimized inference pipeline.

Where do you suggest ipex should go? Trainer.compile(use_ipex=True) or Trainer.trace(use_ipex=True)? Or trainer = Trainer(use_ipex=True) (current API).

If it's for training, can we set it in fit?

@yangw1234
Contributor

Can we rename trace to optimize?
We are also planning to add the new ipex logic here, since in the new ipex version, the usage changes to "model = ipex.optimize(model)".
e.g.

model = Trainer.optimize(model, accelerator='ipex')

For accelerator ipex, the model can still be used for training; for accelerator onnx/openvino, the model can only be used for inference.
@jason-dai

We plan to use trace for the optimized inference pipeline.

Where do you suggest ipex should go? Trainer.compile(use_ipex=True) or Trainer.trace(use_ipex=True)? Or trainer = Trainer(use_ipex=True) (current API).

If it's for training, can we set it in fit?

In theory, we can override pytorch-lightning Trainer's fit method to add a "use_ipex" flag, but I am kind of afraid the usage would diverge too much from the original pytorch-lightning and become too complex.

For original pytorch-lightning, the user sets all the parameters in Trainer's constructor and only passes model and data in fit.
E.g.

trainer = Trainer(accelerator='a', training_type='b', trick_1=True, trick_2=True, ...)
trainer.fit(model, data)

What we have changed is:

  1. added a compile method to Trainer
  2. added a trace method to Trainer
  3. added a few other parameters in the constructor (ddp_spawn).

So the problem I am afraid of is: if we change fit too, would it be too complex for the user to use?

For example, if the user wants to use some feature, there will be 4 possible places for him/her to look for a parameter. On the other hand, in the original pytorch-lightning case, he/she only has to look through the Trainer's constructor.

@jason-dai
Contributor Author

Can we rename trace to optimize?
We are also planning to add the new ipex logic here, since in the new ipex version, the usage changes to "model = ipex.optimize(model)".
e.g.

model = Trainer.optimize(model, accelerator='ipex')

For accelerator ipex, the model can still be used for training; for accelerator onnx/openvino, the model can only be used for inference.
@jason-dai

We plan to use trace for the optimized inference pipeline.

Where do you suggest ipex should go? Trainer.compile(use_ipex=True) or Trainer.trace(use_ipex=True)? Or trainer = Trainer(use_ipex=True) (current API).

If it's for training, can we set it in fit?

In theory, we can override pytorch-lightning Trainer's fit method to add a "use_ipex" flag, but I am kind of afraid the usage would diverge too much from the original pytorch-lightning and become too complex.

For original pytorch-lightning, the user sets all the parameters in Trainer's constructor and only passes model and data in fit. E.g.

trainer = Trainer(accelerator='a', training_type='b', trick_1=True, trick_2=True, ...)
trainer.fit(model, data)

What we have changed is:

  1. added a compile method to Trainer
  2. added a trace method to Trainer
  3. added a few other parameters in the constructor (ddp_spawn).

So the problem I am afraid of is: if we change fit too, would it be too complex for the user to use?

For example, if the user wants to use some feature, there will be 4 possible places for him/her to look for a parameter. On the other hand, in the original pytorch-lightning case, he/she only has to look through the Trainer's constructor.

I think we are adding new capabilities to Trainer; each new API is introduced for a new use case that is not originally supported in PTL:

  • compile to convert PyTorch model
  • trace for model optimizations (for inference or deployment)
  • quantize for model quantization

If a use case is already supported by PTL, we should follow its original API; so for training-specific optimization, which one is the preferred API to extend - Trainer.__init__ or Trainer.fit?

@zhentaocc
Contributor

For the IPEX plugin, I think a Callback-style implementation would be quite suitable for injecting extra code or accelerations.
You don't need to overwrite fit or init; just create a new callback and modify setup(..). Then for all of fit, val, test, and predict, the pieces in setup will be called before entering the loops. @jason-dai @yangw1234
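
A rough sketch of that idea (illustrative only; assumes intel_extension_for_pytorch is installed, and a real implementation would also need to handle the optimizer for training):

import intel_extension_for_pytorch as ipex
from pytorch_lightning import Callback

class IPEXCallback(Callback):
    def setup(self, trainer, pl_module, stage=None):
        # Apply IPEX optimizations in place before the fit/val/test/predict loops start.
        ipex.optimize(pl_module, inplace=True)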

@jason-dai
Contributor Author

For the IPEX plugin, I think a Callback-style implementation would be quite suitable for injecting extra code or accelerations. You don't need to overwrite fit or init; just create a new callback and modify setup(..). Then for all of fit, val, test, and predict, the pieces in setup will be called before entering the loops. @jason-dai @yangw1234

Who needs to write and specify the callback - we or the user?

@zhentaocc
Contributor

For the IPEX plugin, I think a Callback-style implementation would be quite suitable for injecting extra code or accelerations. You don't need to overwrite fit or init; just create a new callback and modify setup(..). Then for all of fit, val, test, and predict, the pieces in setup will be called before entering the loops. @jason-dai @yangw1234

Who needs to write and specify the callback - we or the user?

We do this.

@yangw1234
Contributor

For the IPEX plugin, I think a Callback-style implementation would be quite suitable for injecting extra code or accelerations. You don't need to overwrite fit or init; just create a new callback and modify setup(..). Then for all of fit, val, test, and predict, the pieces in setup will be called before entering the loops. @jason-dai @yangw1234

It seems to me that model = ipex.optimize(model) will create a new model; is the callback API capable of replacing the pl-module to be trained?

@zhentaocc
Contributor

For the IPEX plugin, I think a Callback-style implementation would be quite suitable for injecting extra code or accelerations. You don't need to overwrite fit or init; just create a new callback and modify setup(..). Then for all of fit, val, test, and predict, the pieces in setup will be called before entering the loops. @jason-dai @yangw1234

It seems to me that model = ipex.optimize(model) will create a new model; is the callback API capable of replacing the pl-module to be trained?

Possibly you can bind the ipex model to trainer.lightning_module? Or use in-place optimization: model = ipex.optimize(model, inplace=True)?

@jason-dai
Contributor Author

For the IPEX plugin, I think a Callback-style implementation would be quite suitable for injecting extra code or accelerations. You don't need to overwrite fit or init; just create a new callback and modify setup(..). Then for all of fit, val, test, and predict, the pieces in setup will be called before entering the loops. @jason-dai @yangw1234

Who needs to write and specify the callback - we or the user?

We do this.

It's unclear to me how the user would write their code to use IPEX.
