INT6/INT4 support for model optimization #1100

Closed

JiaojiaoYe1994 opened this issue Jul 20, 2023 · 10 comments

Comments

@JiaojiaoYe1994

Dear Author,

Thank you so much for the project. I have read your results and have a question about the PTQ implementation: does it support INT4/INT6 or other precisions?

@hshen14 (Contributor) commented Jul 20, 2023

Yes, INT4/INT6/INT8 are supported in the context of weight-only post-training quantization.

@JiaojiaoYe1994 (Author)

> Yes, INT4/INT6/INT8 are supported in the context of weight-only post-training quantization.

I see. For example, if I use post-training static quantization, i.e. quantizing activations and weights at the same time, can I apply INT4/INT6/INT8 quantization?

@hshen14 (Contributor) commented Jul 21, 2023

> Yes, INT4/INT6/INT8 are supported in the context of weight-only post-training quantization.

> I see. For example, if I use post-training static quantization, i.e. quantizing activations and weights at the same time, can I apply INT4/INT6/INT8 quantization?

INT8 is supported for both activations and weights, while INT4/INT6 are weight-only.
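For reference, the two modes map to different PostTrainingQuantConfig settings. A minimal sketch based on the neural-compressor 2.x Python API (defaults and accepted values may vary between releases):

from neural_compressor.config import PostTrainingQuantConfig

# Post-training static quantization: INT8 for both activations and weights
# (requires calibration data).
static_int8_conf = PostTrainingQuantConfig(approach="static")

# Weight-only post-training quantization: sub-8-bit weights (e.g. INT4),
# activations stay in their original precision.
weight_only_int4_conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {
            "weight": {
                "bits": 4,         # 4-bit weights
                "group_size": 32,  # group-wise scales; -1 means per-channel
                "scheme": "sym",
                "algorithm": "RTN",
            },
        },
    },
)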

@paul-ang

What is the appropriate workflow to quantize the weights to INT4 and the activations to INT8? Are we able to achieve this in one .fit() session?

@hshen14 (Contributor) commented Sep 13, 2023

> What is the appropriate workflow to quantize the weights to INT4 and the activations to INT8? Are we able to achieve this in one .fit() session?

The quantization flow is essentially the same (driven by fit()), but an additional config is required. See the example: https://github.com/intel/neural-compressor/blob/master/docs/source/quantization_weight_only.md.
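For completeness, a sketch of how that config plugs into a single fit() call, following the linked weight-only doc. The toy model and the RTN choice are placeholders; algorithms such as AWQ/GPTQ additionally take a calibration dataloader:

import torch
from neural_compressor import quantization
from neural_compressor.config import PostTrainingQuantConfig

# Stand-in FP32 model; replace with the model you actually want to quantize.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
)

conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {
            "weight": {"bits": 4, "group_size": -1, "scheme": "sym", "algorithm": "RTN"},
        },
    },
)

# RTN is data-free, so no calib_dataloader is needed for this particular algorithm.
q_model = quantization.fit(model, conf)
q_model.save("./saved_int4_model")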

@paul-ang

I tried that example, but unfortunately it doesn't work. I used these two configurations:

Attempt 1

from neural_compressor.config import PostTrainingQuantConfig

qt_conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {  # re.match
            "weight": {
                "bits": 4,  # 1-8 bits
                "group_size": -1,  # -1 (per-channel)
                "scheme": "sym",
                "algorithm": "RTN",
            },
        },
    },
)

This raised a KeyError: 'default'.
Error trace:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/neural_compressor/quantization.py", line 223, in fit
    strategy.traverse()
  File "/usr/local/lib/python3.9/dist-packages/neural_compressor/strategy/auto.py", line 134, in traverse
    super().traverse()
  File "/usr/local/lib/python3.9/dist-packages/neural_compressor/strategy/strategy.py", line 482, in traverse
    self._prepare_tuning()
  File "/usr/local/lib/python3.9/dist-packages/neural_compressor/strategy/strategy.py", line 378, in _prepare_tuning
    self.capability = self.capability or self.adaptor.query_fw_capability(self.model)
  File "/usr/local/lib/python3.9/dist-packages/neural_compressor/utils/utility.py", line 301, in fi
    res = func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/neural_compressor/adaptor/pytorch.py", line 4834, in query_fw_capability
    return self._get_quantizable_ops(model.model)
  File "/usr/local/lib/python3.9/dist-packages/neural_compressor/adaptor/pytorch.py", line 1141, in _get_quantizable_ops
    else copy.deepcopy(capability["default"])
KeyError: 'default'

Attempt 2

qt_conf = PostTrainingQuantConfig(
    domain="object_detection",
    excluded_precisions=["bf16", "fp16"],
    approach="auto",
    quant_level=1,
    op_type_dict={
        "Conv": {
            "weight": {
                "bits": 4,
                "group_size": -1,
                "scheme": "sym",
                "algorithm": "RTN",
            }
        }
    },
)

This ran successfully, but I don't think the weights were quantized to 4 bits: there was no accuracy loss, and I also manually inspected the model.pt weights file.

I am using torch==2.0.1.
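As a side note on verifying this, here is a hypothetical sanity check (not part of the neural-compressor API): with symmetric per-channel RTN at 4 bits, each output channel of a quantized weight can hold at most 16 distinct values, even though the saved tensor is still stored as dequantized fp32. Assuming model.pt contains a plain state_dict:

import torch

def count_levels_per_channel(weight, max_rows=5):
    # A weight RTN-quantized to 4 bits with per-channel symmetric scales should
    # show at most 16 distinct values per output channel after dequantization.
    flat = weight.reshape(weight.shape[0], -1)
    for i in range(min(max_rows, flat.shape[0])):
        print(f"  channel {i}: {flat[i].unique().numel()} distinct values")

state_dict = torch.load("model.pt", map_location="cpu")  # assumes a state_dict was saved
for name, tensor in state_dict.items():
    if isinstance(tensor, torch.Tensor) and name.endswith("weight") and tensor.dim() >= 2:
        print(name)
        count_levels_per_channel(tensor)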

@xin3he (Contributor) commented Sep 21, 2023

@paul-ang Only Linear is supported in the weight-only approach. The issue you hit in case 1 is caused by a configuration mismatch for Conv2d. I will remove Conv2d when fetching quantizable ops.
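To restate that as a config sketch (based on this thread rather than the docs): restricting op_type_dict to Linear avoids matching Conv2d and should sidestep the KeyError above:

from neural_compressor.config import PostTrainingQuantConfig

qt_conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        "Linear": {  # weight-only mode currently covers Linear layers only
            "weight": {
                "bits": 4,
                "group_size": -1,
                "scheme": "sym",
                "algorithm": "RTN",
            },
        },
    },
)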

@paul-ang

Does this mean that quantizing Conv2d below 8 bits is not supported at the moment?

@xin3he (Contributor) commented Sep 21, 2023

> Does this mean that quantizing Conv2d below 8 bits is not supported at the moment?

Yes. Conv2d is usually not a memory-bound operator, so we only support Linear for Large Language Models. If you can point us to a model that uses Conv2d with large weights, we will consider supporting it in weight-only mode.
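Until then, a conv-heavy model can still go through regular INT8 static PTQ, which covers Conv2d at 8 bits. A minimal sketch with a toy model and random calibration data as placeholders:

import torch
from torch.utils.data import DataLoader, TensorDataset
from neural_compressor import quantization
from neural_compressor.config import PostTrainingQuantConfig

# Toy conv model and random calibration data standing in for a real detector.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3), torch.nn.ReLU(), torch.nn.Flatten())
calib_set = TensorDataset(torch.randn(8, 3, 32, 32), torch.zeros(8, dtype=torch.long))
calib_dataloader = DataLoader(calib_set, batch_size=4)

# Static PTQ quantizes both Conv2d and Linear activations/weights to INT8.
conf = PostTrainingQuantConfig(approach="static")
q_model = quantization.fit(model, conf, calib_dataloader=calib_dataloader)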

@xin3he (Contributor) commented Sep 21, 2023

@paul-ang Thank you for reporting the Conv2d issue. We have raised a PR to fix it.

xin3he closed this as completed Oct 31, 2023
chensuyue added a commit to chensuyue/lpot that referenced this issue Feb 21, 2024