
Add Int4CPULayout and update int4 woq #1278

Merged
merged 5 commits into pytorch:main from yanbing-j:yanbing/update_int4
Nov 27, 2024

Conversation

yanbing-j
Contributor

pytorch/pytorch#139611 has been merged into the PyTorch main branch.
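For context, the Int4CPULayout path added here depends on the CPU int4 op from that PR being present in the installed PyTorch build. A minimal availability check, assuming the op name from pytorch/pytorch#139611 (the helper itself is illustrative, not part of this PR):

import torch

def has_cpu_int4_mm() -> bool:
    # torch.ops.aten raises AttributeError for ops the build doesn't have,
    # so hasattr doubles as a runtime availability probe on older builds.
    return hasattr(torch.ops.aten, "_weight_int4pack_mm_for_cpu")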


pytorch-bot bot commented Nov 13, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1278

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 13, 2024
@yanbing-j yanbing-j marked this pull request as ready for review November 14, 2024 02:47
@jerryzh168
Contributor

We are doing a refactor of the file structure, by the way: #1234. It might be good to rebase after that lands.


__torch_function__ = torch._C._disabled_torch_function_impl

def get_plain(self) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
Contributor

We have an unpack op for the tensor core tiled layout now, so this can actually be replaced with a call to the op:

m.impl("torchao::unpack_tensor_core_tiled_layout", &_unpack_tensor_core_tiled_layout);
m.impl("torchao::dequantize_tensor_core_tiled_layout", &_dequantize_tensor_core_tiled_layout);

Do you plan to write similar ops for CPU?

Contributor Author

I have noticed this, but I don't have the bandwidth to do it these days. If this feature is not urgent for you, I can take this task.

cc @mingfeima

Contributor

That would be great, thanks @yanbing-j. This is not urgent.

Comment on lines 405 to 410
# if int_data_device_type == "mps":
# int_data = int_data.cpu()
if int_data_device_type != "cpu":
int_data = (int_data[::, ::2] << 4 | int_data[::, 1::2]).to(torch.uint8)
# if int_data_device_type == "mps":
# int_data = int_data.to(device="mps")
Contributor

Please remove the code that's commented out.

Is this equivalent to the previous code?

Contributor Author

According to #517 (comment), << can be used on the MPS backend, so there is no need to convert to CPU and use the CPU backend. Since I don't have an MPS machine, I want to use CI to check whether this works. Otherwise, I can switch to int_data = (torch.bitwise_left_shift(int_data[::, ::2], 4) | int_data[::, 1::2]).to(torch.uint8) instead.
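For reference, a self-contained sketch of the packing in question and the proposed functional fallback, assuming int_data holds one int4 value in [0, 15] per element and pairs along the last dim are packed high-nibble-first into one uint8:

import torch

int_data = torch.randint(0, 16, (4, 8), dtype=torch.int32)

# Shift-operator form (usable on MPS per the linked discussion):
packed = (int_data[::, ::2] << 4 | int_data[::, 1::2]).to(torch.uint8)

# Equivalent functional form, the proposed fallback:
packed_alt = (
    torch.bitwise_left_shift(int_data[::, ::2], 4) | int_data[::, 1::2]
).to(torch.uint8)
assert torch.equal(packed, packed_alt)

# Round trip: recover both nibbles from each byte.
assert torch.equal((packed >> 4).to(torch.int32), int_data[::, ::2])
assert torch.equal((packed & 0xF).to(torch.int32), int_data[::, 1::2])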

Contributor

oh I see, makes sense

Contributor

@jerryzh168 jerryzh168 left a comment

This can be a separate PR, but can you also help add support for conversion between the int4 tensor core tiled layout and the int4 CPU layout? We may need a separate util for this, like we discussed in the issue: #1117 (comment)

Right now we error out when converting between different devices:

if not is_device(torch.device(self.device).type, device):
    raise ValueError(
        f"TensorCoreTiledAQTTensorImpl does not support conversion from {self.device} to {device}"
    )

This is fine, I think; we just need separate utils if people want to do this move (see the sketch below). A test can be added in

class TestAffineQuantized(TestCase):
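As a rough illustration of the kind of util being discussed (all names hypothetical, not the actual torchao API): unpack to the plain representation, move across devices, then repack with the target layout:

def convert_int4_layout(source_impl, target_impl_cls, target_layout, device):
    # Hypothetical helper: layouts are device-specific, so the conversion
    # round-trips through the plain (unpacked) int data, scale, and zero point.
    int_data, scale, zero_point = source_impl.get_plain()
    int_data = int_data.to(device)
    scale, zero_point = scale.to(device), zero_point.to(device)
    # Assumes the target impl exposes a from_plain constructor, as the
    # layouts discussed in this PR do.
    return target_impl_cls.from_plain(int_data, scale, zero_point, target_layout)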

@jerryzh168 jerryzh168 added the topic: not user facing Use this tag if you don't want this PR to show up in release notes label Nov 15, 2024
@jerryzh168
Contributor

I think you should also unpin the PyTorch version to get the latest op changes: #1283

@yanbing-j
Contributor Author

Hi @jerryzh168, I have updated the PR to fix CI and include PyTorch nightly. Could you please take a look? I tested 2.3.0, 2.4.1, 2.5.1, and 2.6 on CPU locally.

@jerryzh168
Contributor

@yanbing-j we just landed a large refactor PR, can you rebase?

@yanbing-j
Contributor Author

@jerryzh168 I have rebased, could you please take a look?

@Jack-Khuu
Contributor

Thanks for looking into this @yanbing-j

Eagerly awaiting to pick it up in pytorch/torchchat#1367

@yanbing-j
Contributor Author

@jerryzh168 Please review again.


@yanbing-j
Contributor Author

Hi @jerryzh168, the 2 failures in CUDA nightly cannot be reproduced on an A100 with torch 2.6.0.dev20241119+cu124. And the CPU nightly failure is related to GLIBC; I don't know how to fix it.

@@ -70,6 +70,12 @@ jobs:
torch-spec: 'torch==2.5.1 --index-url https://download.pytorch.org/whl/cu121'
gpu-arch-type: "cuda"
gpu-arch-version: "12.1"
- name: CUDA Nightly
Contributor

@jerryzh168 jerryzh168 Nov 20, 2024

Why are these tests added? Can you rebase on main? I think we have some recent changes to the CI jobs.


@jerryzh168
Contributor

Also, do you know what's causing the XPU job errors in current main? https://github.com/pytorch/ao/actions/runs/11942397686/job/33289365532

@yanbing-j
Contributor Author

yanbing-j commented Nov 21, 2024

also do you know the issue with xpu job errors in current main: https://github.com/pytorch/ao/actions/runs/11942397686/job/33289365532

@jerryzh168 I saw that these XPU failures are related to Windows. Please loop in @EikanWang.

@jerryzh168
Contributor

the error still seems valid: https://github.com/pytorch/ao/actions/runs/11945348130/job/33299816407?pr=1278

@yanbing-j
Contributor Author

yanbing-j commented Nov 21, 2024

the error still seems valid: https://github.com/pytorch/ao/actions/runs/11945348130/job/33299816407?pr=1278

It cannot be reproduced on an A100 with torch 2.6.0.dev20241119+cu124. Let me try the latest one again.

@yanbing-j
Contributor Author

@jerryzh168 Sorry, I still cannot reproduce this on an A100. Could you please give it a try? Thanks!

$ python test/dtypes/test_affine_quantized.py TestAffineQuantizedBasic.test_flatten_unflatten_device_cpu_bfloat16
.
----------------------------------------------------------------------
Ran 1 test in 0.266s

OK

torch 2.6.0.dev20241120+cu124
torchao 0.7.0+git25b9460 /home/pt-gpu/yanbingj/ao (this is the commit of the yanbing/update_int4 branch, installed with pip install -e .)

@jerryzh168
Contributor

There is some issue with the PyTorch nightly version, I think. I saw Downloading https://download.pytorch.org/whl/nightly/cu121/torch-2.6.0.dev20241112%2Bcu121-cp39-cp39-linux_x86_64.whl (767.9 MB) in the log.

When installing locally, I also got: Successfully installed nvidia-cusparselt-cu12-0.6.2 torch-2.6.0.dev20241112+cu121

It looks like the latest cu121 build is 1112+cu121 in https://download.pytorch.org/whl/nightly/torch/ right now.

@yanbing-j
Contributor Author

there is some issue with pytorch nightly version I think, I saw: Downloading https://download.pytorch.org/whl/nightly/cu121/torch-2.6.0.dev20241112%2Bcu121-cp39-cp39-linux_x86_64.whl (767.9 MB) in the log,

when I'm installing locally, I also installed: Successfully installed nvidia-cusparselt-cu12-0.6.2 torch-2.6.0.dev20241112+cu121

looks like the latest cu121 is: 1112+cu121 in https://download.pytorch.org/whl/nightly/torch/ right now

Oh, you are right. For cu121, the latest is the 1112 nightly, which does not include pytorch/pytorch#139611 (merged into PyTorch on 20241112). And for cu124, the latest is 1121; that's why I cannot reproduce it.

So, can this PR be merged, since this is a platform-related issue and can be regarded as a known issue until CI upgrades to cu124? @jerryzh168

@jerryzh168
Contributor

jerryzh168 commented Nov 22, 2024

Let's upgrade CI to use 12.4 first; I heard 12.1 is deprecated in newer PyTorch versions: pytorch/pytorch#138609
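A sketch of what the updated nightly matrix entry could look like, mirroring only the fields visible in the workflow fragment quoted earlier in this thread (any other keys the real job uses are omitted, and the exact nightly index URL is an assumption):

- name: CUDA Nightly
  torch-spec: '--pre torch --index-url https://download.pytorch.org/whl/nightly/cu124'
  gpu-arch-type: "cuda"
  gpu-arch-version: "12.4"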

jerryzh168 added a commit that referenced this pull request Nov 23, 2024
* Update nightly job to use 12.4 since 12.1 is deprecated

#1278 (comment)

* skip failed tests
sunjiweiswift pushed a commit to sunjiweiswift/ao that referenced this pull request Nov 25, 2024
* Update nightly job to use 12.4 since 12.1 is deprecated

pytorch#1278 (comment)

* skip failed tests
@jerryzh168 jerryzh168 merged commit 719440e into pytorch:main Nov 27, 2024
3 checks passed
yanbing-j pushed a commit to yanbing-j/ao that referenced this pull request Dec 9, 2024
Update cli.py to make --device/--dtype pre-empt quantize dict-specified values (pytorch#1359)

* Update cli.py to make --device/--dtype pre-empt quantize dict-specified values

Users may expect that CLI parameters override the JSON, as per pytorch#1278.
Invert the logic, with a case split (see the sketch after this commit message):
1. If no value is specified, use the value from the quantize dict, if present; else
2. If a value is specified, override the respective handler if present.

* Fix typo in cli.py

fix typo

---------

Co-authored-by: Jack-Khuu <[email protected]>
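A small sketch of the precedence rule described in the commit message above (function and key names are illustrative, not the actual torchchat cli.py code):

def resolve_option(cli_value, quantize_dict, key, default):
    # Case 1: no CLI value given -> fall back to the quantize dict, then default.
    if cli_value is None:
        return quantize_dict.get(key, default)
    # Case 2: CLI value given -> it pre-empts the dict-specified value.
    quantize_dict[key] = cli_value
    return cli_value

# e.g. device = resolve_option(args.device, quantize_cfg, "device", "cpu")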
@@ -383,3 +393,251 @@ def get_plain(self) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:

def get_layout(self) -> Layout:
return self._layout


@dataclass(frozen=True)
Contributor

Oh sorry, I missed this one. It should have a separate file since it's a different layout. cc @yanbing-j, can you help move this to a separate file under the same directory (int4_cpu_layout.py)?

Contributor Author

@jerryzh168 Okay, here is the PR #1419.

@yanbing-j yanbing-j deleted the yanbing/update_int4 branch December 16, 2024 05:30