
[CI] Quantization workflow #29046

Merged: 25 commits from add-quantization-workflow into main on Feb 28, 2024

Conversation

@SunMarc (Member) commented Feb 15, 2024

What does this PR do?

This PR adds a workflow for the quantization tests plus the related Dockerfile. Since we merged the HfQuantizer PR, the community has started integrating their own quantizers into transformers (e.g. AQLM, and many more in the future). This will lead to many third-party libraries in the huggingface/transformers-all-latest-gpu Dockerfile. To limit the impact of these libraries on the transformers tests, I propose creating a separate Dockerfile and workflow.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@younesbelkada (Contributor) left a comment

Thank you very much! Can you try to trigger a run of the slow quantization tests and make sure we get the Slack notifications (after building the quantization Docker image)? 🙏
For that, you can comment out all jobs except the quantization tests job in self-scheduled.yml & build-docker.yml, and change the on: push condition to the name of the branch you are working on right now (there might be a better way to trigger only the quantization tests, but all tests live in the same workflow file); see the sketch below.
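
A minimal sketch of what that temporary trigger change could look like, assuming the working branch is add-quantization-workflow and that the change is reverted before merging (the runner label and the build steps below are placeholders, not the real workflow contents):

name: Build docker images (scheduled)

on:
  push:
    branches:
      - add-quantization-workflow   # temporarily trigger on the working branch instead of the usual schedule

jobs:
  latest-quantization-torch-docker:
    name: "Latest PyTorch + Quantization [dev]"
    runs-on: ubuntu-latest   # placeholder; the real workflow runs on self-hosted docker runners
    steps:
      - name: Check out code
        uses: actions/checkout@v3
      # ... build-and-push steps unchanged from the real workflow ...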

@younesbelkada (Contributor) left a comment

I left a comment! (This can be applied to all our workflow files that use that logic for checking out the current main branch.)

.github/workflows/self-scheduled.yml (review thread resolved)

@SunMarc (Member, Author) commented Feb 16, 2024

I see that the transformers-all-latest-gpu Docker image has not been updated for the last two days: the installation fails because the aqlm library requires at least Python 3.10, and we use 3.8 for now. We will have to account for that in the quantization Dockerfile.

Edit: I tried to install Python 3.10 but it didn't work (1.030 E: Unable to locate package python3.10). I found this tutorial, but I'm not sure it is the best way to install it.

@BlackSamorez (Contributor) commented Feb 18, 2024

The only reason aqlm requires python>=3.10 is a single match-case statement in a non-critical place.

I was able to run aqlm on Python 3.8 with no problem otherwise. I can replace the statement with an if-else and lower the requirement if necessary.

@younesbelkada (Contributor) commented:

@SunMarc thanks!
In TRL I build a Docker image with Python 3.10: https://github.com/huggingface/trl/blob/main/docker/trl-source-gpu/Dockerfile; maybe you can take some inspiration from that Dockerfile? 🙏 I am not sure why you are getting that error currently, as the commands look correct.
@BlackSamorez yes, it would be great if you could also support Python 3.8 for AQLM 🙏 Thanks!

@SunMarc (Member, Author) commented Feb 20, 2024

> I was able to run aqlm on python 3.8 no problem otherwise. I can replace the statement with an if-else statement and lower the requirement if necessary.

Yes, that would be best! I prefer to keep running the quantization tests with Python 3.8, since this is what we actually do for all transformers tests. Moreover, keeping the requirement low is better for users. LMK when it is done, @BlackSamorez!
Otherwise, I will modify the Dockerfile and use conda to install Python 3.10, as suggested by @younesbelkada.

@BlackSamorez (Contributor) commented:

@SunMarc aqlm will support python>=3.8 starting from version 1.0.2. I'm one PR away from releasing it.

@SunMarc (Member, Author) commented Feb 20, 2024

Perfect! I will wait for your PR to be merged and released then, if it doesn't take too much time. Please keep me updated! Otherwise, I can merge this PR without aqlm first and add it afterwards.

@BlackSamorez (Contributor) commented:

@SunMarc aqlm==1.0.2 is out. May I ask you to please update the docker images?

@SunMarc (Member, Author) commented Feb 20, 2024

I was able to build the image, but I don't have permission to push it, cc @ydshieh:

#22 ERROR: failed to push huggingface/transformers-quantization-latest-gpu: push access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed

@ydshieh (Collaborator) commented Feb 21, 2024

@SunMarc Did you build/push via transformers' GitHub Actions? If so, do you have a job run link?

@SunMarc (Member, Author) commented Feb 21, 2024

Yes! Here's the link to the job.

@younesbelkada (Contributor) commented:

It's quite strange, because the workflow was indeed able to log in (https://github.com/huggingface/transformers/actions/runs/7979164961/job/21802381289#step:5:1) but fails to push...

@younesbelkada (Contributor) commented:

In TRL and PEFT I can confirm the build & push works fine: https://github.com/huggingface/peft/actions/workflows/build_docker_images.yml / https://github.com/huggingface/trl/actions/workflows/docker-build.yml. So it's not a token issue, as we use the same one, unless the token has expired for transformers. cc @glegendre01

@ydshieh (Collaborator) commented Feb 21, 2024

Hi @SunMarc, it's because of some changes on the infra team's side regarding Docker Hub.

The repository huggingface/transformers-quantization-latest-gpu has to be created on Docker Hub first; only after that can you push to it.

I will ask for it.

@ydshieh (Collaborator) commented Feb 21, 2024

BTW, I will review this PR tomorrow or Friday 🙏

@SunMarc (Member, Author) commented Feb 21, 2024

Hi @ydshieh, I was able to run the tests and get the Slack notification. See the job here. Thanks for your help!

@ydshieh (Collaborator) left a comment

Thank you @SunMarc (and @younesbelkada) for this initiative.

I love this idea: the quantization tests keep growing, and failures in them are hard to track within the whole daily CI.

I have some suggestions and questions about the current version, though.

RUN python3 -m pip install --no-cache-dir https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.8/autoawq-0.1.8+cu118-cp38-cp38-linux_x86_64.whl

# For bettertransformer + gptq
# For bettertransformer
RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/optimum@main#egg=optimum

Collaborator:

Do we count this as non-quantization tests?

Contributor:

I think we need to keep that, as optimum is a hard requirement to use GPTQ.

docker/transformers-quantization-latest-gpu/Dockerfile (outdated review thread, resolved)
Comment on lines +23 to +27
RUN [ ${#PYTORCH} -gt 0 ] && VERSION='torch=='$PYTORCH'.*' || VERSION='torch'; echo "export VERSION='$VERSION'" >> ~/.profile
RUN echo torch=$VERSION
# `torchvision` and `torchaudio` should be installed along with `torch`, especially for nightly build.
# Currently, let's just use their latest releases (when `torch` is installed with a release version)
RUN python3 -m pip install --no-cache-dir -U $VERSION torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/$CUDA

Collaborator:

Make sure that at the end we do get torch 2.2; see my comment on the change in #29208. Otherwise, move this after the ./transformers[dev] step.

Member Author:

Done! I didn't have to move it after the ./transformers[dev] step.
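
One lightweight way to enforce that torch 2.2 expectation in CI is an explicit version assertion after the install steps; below is a minimal sketch of such an extra step for the quantization test job (illustrative only, not part of this PR):

      - name: Check torch version in the built image
        working-directory: /transformers
        run: python3 -c "import torch; v = torch.__version__; assert v.startswith('2.2'), v; print(v)"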


RUN python3 -m pip install --no-cache-dir -e ./transformers[dev]

RUN python3 -m pip uninstall -y flax jax

Collaborator:

We can also uninstall tensorflow, or just use ./transformers[torch] above (but we might need to install other things manually, not sure).

(Not a big deal, we can keep it if you are busy.)

@SunMarc (Member, Author) commented Feb 23, 2024:

I switched to transformers[dev-torch]!

.github/workflows/build-docker-images.yml (two outdated review threads, resolved)

  latest-quantization-torch-docker:
    name: "Latest Pytorch + Quantization [dev]"
    # Push CI doesn't need this image
    if: inputs.image_postfix != '-push-ci'

Collaborator:

Just want to know whether we intend to run the quantization tests on a daily basis, or whether we prefer to run them on each commit merged into main?

Member Author:

I think it is better to run them on a daily basis. There are many slow tests too, and I think the quantization tests are not easily broken by changes in transformers. The bigger issue is the third-party libraries that might introduce breaking changes. cc @younesbelkada

Contributor:

Yes, a daily basis sounds great!

Collaborator:

OK
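
For reference, running the job daily means driving it with a scheduled trigger instead of per-push events; here is a minimal sketch of such a workflow, where the cron time, runner labels, and test path are illustrative assumptions rather than the contents of this PR:

name: Self-scheduled quantization CI

on:
  schedule:
    - cron: "0 2 * * *"   # assumed: once per day at 02:00 UTC
  workflow_dispatch:       # allow manual runs for debugging

jobs:
  run_tests_quantization_torch_gpu:
    name: Quantization tests
    runs-on: [self-hosted, docker-gpu]                          # assumed runner labels
    container:
      image: huggingface/transformers-quantization-latest-gpu   # image built from this PR's Dockerfile
    steps:
      - name: Run quantization tests
        working-directory: /transformers
        run: python3 -m pytest -v tests/quantization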

Comment on lines +300 to +301
  run_tests_quantization_torch_gpu:
    name: Quantization tests

Collaborator:

I am not sure we intend to keep this in this workflow file. Isn't the goal to move it out to a separate workflow file?

cc @younesbelkada

Contributor:

I think we can keep it there for now; I'll move it out in my PR.

.github/workflows/self-scheduled.yml (review thread resolved)
@@ -1043,6 +1043,7 @@ def prepare_reports(title, header, reports, to_truncate=True):
"PyTorch pipelines": "run_tests_torch_pipeline_gpu",
"TensorFlow pipelines": "run_tests_tf_pipeline_gpu",
"Torch CUDA extension tests": "run_tests_torch_cuda_extensions_gpu_test_reports",
"Quantization tests": "run_tests_quantization_torch_gpu",

Collaborator:

This should be revised if we decide to move the quantization tests out of this workflow file.

@ydshieh (Collaborator) commented Feb 23, 2024

@SunMarc I am going to merge #29208. Once done, there might be a conflict with this PR, but it should be easy to resolve. Ping me otherwise.

@SunMarc requested a review from @ydshieh on February 23, 2024 at 16:41

@SunMarc (Member, Author) commented Feb 23, 2024

I've addressed all the comments. I think we are good to merge once I've checked that the Docker image builds correctly with torch 2.2.0 (currently facing some issues logging into Docker Hub).

EDIT: The image was built successfully here, cc @ydshieh.

@ArthurZucker (Collaborator) left a comment

Preemptively approving, LGTM! Thanks @SunMarc for splitting the workflows!

Comment on lines +316 to +318
      - name: Reinstall transformers in edit mode (remove the one installed during docker image build)
        working-directory: /transformers
        run: python3 -m pip uninstall -y transformers && python3 -m pip install -e .

Collaborator:

@ydshieh I do not understand why we don't:

  1. Install from main during the docker build with pip install git+https://github.com/huggingface/transformers@main (and get all the necessary dependencies)
  2. pip uninstall -y transformers in the docker image
  3. actions/checkout@v3
  4. pip install -e .

(As discussed offline with @younesbelkada, this should come in a separate PR; a rough sketch of these steps follows below.)
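
A rough sketch of those suggested steps as workflow steps (step names are illustrative; this describes the alternative under discussion, not what the PR actually implements):

      # Assumes the image was built with
      # pip install git+https://github.com/huggingface/transformers@main
      # so all dependencies are already present in the image.
      - name: Remove the transformers installed during the docker image build
        run: python3 -m pip uninstall -y transformers

      - name: Check out the current commit
        uses: actions/checkout@v3

      - name: Install transformers in editable mode from the checkout
        run: python3 -m pip install -e .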

@ydshieh (Collaborator) commented Feb 28, 2024:

There is no deep reason for what we are doing currently: we just install transformers during the docker image build and reuse it.

What you suggest (installing main during the docker image build) has the consequence that we can't easily experiment in a branch when we change setup.py.

By using actions/checkout@v3, the path to transformers would be different from the current approach, so we would have to change all the working-directory and artifact paths, otherwise the report will be empty.

@ydshieh (Collaborator) left a comment

Thank you for this work! I could RIP with the daily CI now.

Comment on lines 265 to 299
  latest-pytorch-deepspeed-amd:
    name: "PyTorch + DeepSpeed (AMD) [dev]"

#    runs-on: [self-hosted, docker-gpu, amd-gpu, single-gpu, mi210]
#    steps:
#      - name: Set up Docker Buildx
#        uses: docker/setup-buildx-action@v3
#      - name: Check out code
#        uses: actions/checkout@v3
#      - name: Login to DockerHub
#        uses: docker/login-action@v3
#        with:
#          username: ${{ secrets.DOCKERHUB_USERNAME }}
#          password: ${{ secrets.DOCKERHUB_PASSWORD }}
#      - name: Build and push
#        uses: docker/build-push-action@v5
#        with:
#          context: ./docker/transformers-pytorch-deepspeed-amd-gpu
#          build-args: |
#            REF=main
#          push: true
#          tags: huggingface/transformers-pytorch-deepspeed-amd-gpu${{ inputs.image_postfix }}
#      # Push CI images still need to be re-built daily
#      -
#        name: Build and push (for Push CI) in a daily basis
#        # This condition allows `schedule` events, or `push` events that trigger this workflow NOT via `workflow_call`.
#        # The later case is useful for manual image building for debugging purpose. Use another tag in this case!
#        if: inputs.image_postfix != '-push-ci'
#        uses: docker/build-push-action@v5
#        with:
#          context: ./docker/transformers-pytorch-deepspeed-amd-gpu
#          build-args: |
#            REF=main
#          push: true
#          tags: huggingface/transformers-pytorch-deepspeed-amd-gpu-push-ci
    runs-on: [self-hosted, docker-gpu, amd-gpu, single-gpu, mi210]
    steps:
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Check out code
        uses: actions/checkout@v3
      - name: Login to DockerHub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_PASSWORD }}
      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: ./docker/transformers-pytorch-deepspeed-amd-gpu
          build-args: |
            REF=main
          push: true
          tags: huggingface/transformers-pytorch-deepspeed-amd-gpu${{ inputs.image_postfix }}
      # Push CI images still need to be re-built daily
      -
        name: Build and push (for Push CI) in a daily basis
        # This condition allows `schedule` events, or `push` events that trigger this workflow NOT via `workflow_call`.
        # The later case is useful for manual image building for debugging purpose. Use another tag in this case!
        if: inputs.image_postfix != '-push-ci'
        uses: docker/build-push-action@v5
        with:
          context: ./docker/transformers-pytorch-deepspeed-amd-gpu
          build-args: |
            REF=main
          push: true
          tags: huggingface/transformers-pytorch-deepspeed-amd-gpu-push-ci

Collaborator:

This is irrelevant to this PR. We have (or at least had) issues building the AMD CI images, and we haven't really made progress on that.

Better to keep this as it is in this PR.

Member Author:

Oh, my bad! I'll revert the changes. Thanks for your careful review!

  latest-quantization-torch-docker:
    name: "Latest Pytorch + Quantization [dev]"
    # Push CI doesn't need this image
    if: inputs.image_postfix != '-push-ci'

Collaborator:

OK

@ydshieh (Collaborator) commented Feb 28, 2024

I would love to see the newly added job run successfully (having failing tests is fine), but I won't enforce it here.

@SunMarc merged commit f54d82c into main on Feb 28, 2024 (8 checks passed).
@SunMarc deleted the add-quantization-workflow branch on February 28, 2024 at 15:09.
itazap pushed a commit that referenced this pull request May 14, 2024
* [CI] Quantization workflow

* build dockerfile

* fix dockerfile

* update self-cheduled.yml

* test build dockerfile on push

* fix torch install

* udapte to python 3.10

* update aqlm version

* uncomment build dockerfile

* tests if the scheduler works

* fix docker

* do not trigger on psuh again

* add additional runs

* test again

* all good

* style

* Update .github/workflows/self-scheduled.yml

Co-authored-by: Younes Belkada <[email protected]>

* test build dockerfile with torch 2.2.0

* fix extra

* clean

* revert changes

* Revert "revert changes"

This reverts commit 4cb52b8.

* revert correct change

---------

Co-authored-by: Younes Belkada <[email protected]>