
[CI] Quantization workflow #29046

Merged: 25 commits from add-quantization-workflow into main on Feb 28, 2024

Conversation

@SunMarc (Member) commented Feb 15, 2024

What does this PR do?

This PR adds a workflow for the quantization tests plus the related Dockerfile. Since we merged the HfQuantizer PR, the community has started integrating their own quantizers into transformers (e.g. AQLM, and many more in the future). This will lead to many third-party libraries in the huggingface/transformers-all-latest-gpu Dockerfile. To limit the impact of these libraries on the transformers tests, I propose creating a separate Dockerfile and workflow.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@younesbelkada (Contributor) left a comment

Thank you very much! Can you try to trigger a run of the slow quantization tests and make sure we get the Slack notifications (after building the quantization Docker image)? 🙏
For that, you can comment out all jobs except the quantization tests job in self-scheduled.yml & build-docker.yml, and change the on: push condition to the name of the branch you are working on right now (there might be a better way to trigger only the quantization tests, but all tests live in the same workflow file); see the sketch below.
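
A minimal sketch of what that temporary trigger change could look like, assuming the working branch is add-quantization-workflow and that the change is reverted before merging (the runner label and the build steps below are placeholders, not the real workflow contents):

name: Build docker images (scheduled)

on:
  push:
    branches:
      - add-quantization-workflow   # temporarily trigger on the working branch instead of the usual schedule

jobs:
  latest-quantization-torch-docker:
    name: "Latest PyTorch + Quantization [dev]"
    runs-on: ubuntu-latest   # placeholder; the real workflow runs on self-hosted docker runners
    steps:
      - name: Check out code
        uses: actions/checkout@v3
      # ... build-and-push steps unchanged from the real workflow ...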

@younesbelkada (Contributor) left a comment

I left a comment! (This can be applied to all our workflow files that use that logic for checking out the current main branch.)

.github/workflows/self-scheduled.yml (review thread resolved)

@SunMarc (Member, Author) commented Feb 16, 2024

I see that the transformers-all-latest-gpu Docker image has not been updated for the last two days: the installation fails because the aqlm library requires at least Python 3.10, and we use 3.8 for now. We will have to account for that in the quantization Dockerfile.

Edit: I tried to install Python 3.10 but it didn't work (1.030 E: Unable to locate package python3.10). I found this tutorial, but I'm not sure it is the best way to install it.

@BlackSamorez (Contributor) commented Feb 18, 2024

The only reason aqlm requires python>=3.10 is a single match-case statement in a non-critical place.

I was able to run aqlm on Python 3.8 with no problem otherwise. I can replace the statement with an if-else and lower the requirement if necessary.

@younesbelkada (Contributor) commented:

@SunMarc thanks!
In TRL I build a Docker image with Python 3.10: https://github.com/huggingface/trl/blob/main/docker/trl-source-gpu/Dockerfile; maybe you can take some inspiration from that Dockerfile? 🙏 I am not sure why you are getting that error currently, as the commands look correct.
@BlackSamorez yes, it would be great if you could also support Python 3.8 for AQLM 🙏 Thanks!

@SunMarc (Member, Author) commented Feb 20, 2024

> I was able to run aqlm on python 3.8 no problem otherwise. I can replace the statement with an if-else statement and lower the requirement if necessary.

Yes, that would be best! I prefer to keep running the quantization tests with Python 3.8, since this is what we actually do for all transformers tests. Moreover, keeping the requirement low is better for users. LMK when it is done, @BlackSamorez!
Otherwise, I will modify the Dockerfile and use conda to install Python 3.10, as suggested by @younesbelkada.

@BlackSamorez (Contributor) commented:

@SunMarc aqlm will support python>=3.8 starting from version 1.0.2. I'm one PR away from releasing it.

@SunMarc (Member, Author) commented Feb 20, 2024

Perfect! I will wait for your PR to be merged and released then, if it doesn't take too much time. Please keep me updated! Otherwise, I can merge this PR without aqlm first and add it afterwards.

@BlackSamorez (Contributor) commented:

@SunMarc aqlm==1.0.2 is out. May I ask you to please update the docker images?

@SunMarc (Member, Author) commented Feb 20, 2024

I was able to build the image, but I don't have permission to push it, cc @ydshieh:

#22 ERROR: failed to push huggingface/transformers-quantization-latest-gpu: push access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed

@ydshieh (Collaborator) commented Feb 21, 2024

@SunMarc Did you build/push via transformers' GitHub Actions? If so, do you have a job run link?

@SunMarc (Member, Author) commented Feb 21, 2024

Yes! Here's the link to the job.

@younesbelkada (Contributor) commented:

It's quite strange, because the workflow was indeed able to log in (https://github.com/huggingface/transformers/actions/runs/7979164961/job/21802381289#step:5:1) but fails to push...

@younesbelkada (Contributor) commented:

In TRL and PEFT I can confirm the build & push works fine: https://github.com/huggingface/peft/actions/workflows/build_docker_images.yml / https://github.com/huggingface/trl/actions/workflows/docker-build.yml. So it's not a token issue, as we use the same one, unless the token has expired for transformers. cc @glegendre01

@ydshieh (Collaborator) commented Feb 21, 2024

Hi @SunMarc, it's because of some changes on the infra team's side regarding Docker Hub.

The repository huggingface/transformers-quantization-latest-gpu has to be created on Docker Hub first; only after that can you push to it.

I will ask for it.

@ydshieh (Collaborator) commented Feb 21, 2024

BTW, I will review this PR tomorrow or Friday 🙏

@SunMarc (Member, Author) commented Feb 21, 2024

Hi @ydshieh, I was able to run the tests and get the Slack notification. See the job here. Thanks for your help!

@ydshieh (Collaborator) left a comment

Thank you @SunMarc (and @younesbelkada) for this initiative.

I love this idea: the quantization tests keep growing, and failures in them are hard to track within the whole daily CI.

I have some suggestions and questions about the current version, though.

RUN python3 -m pip install --no-cache-dir https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.8/autoawq-0.1.8+cu118-cp38-cp38-linux_x86_64.whl

# For bettertransformer + gptq
# For bettertransformer
RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/optimum@main#egg=optimum

Collaborator:

Do we count this as non-quantization tests?

Contributor:

I think we need to keep that, as optimum is a hard requirement to use GPTQ.

docker/transformers-quantization-latest-gpu/Dockerfile (outdated review thread, resolved)
Comment on lines +23 to +27
RUN [ ${#PYTORCH} -gt 0 ] && VERSION='torch=='$PYTORCH'.*' || VERSION='torch'; echo "export VERSION='$VERSION'" >> ~/.profile
RUN echo torch=$VERSION
# `torchvision` and `torchaudio` should be installed along with `torch`, especially for nightly build.
# Currently, let's just use their latest releases (when `torch` is installed with a release version)
RUN python3 -m pip install --no-cache-dir -U $VERSION torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/$CUDA

Collaborator:

Make sure that at the end we do get torch 2.2; see my comment on the change in #29208. Otherwise, move this after the ./transformers[dev] step.

Member Author:

Done! I didn't have to move it after the ./transformers[dev] step.
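
One lightweight way to enforce that torch 2.2 expectation in CI is an explicit version assertion after the install steps; below is a minimal sketch of such an extra step for the quantization test job (illustrative only, not part of this PR):

      - name: Check torch version in the built image
        working-directory: /transformers
        run: python3 -c "import torch; v = torch.__version__; assert v.startswith('2.2'), v; print(v)"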


RUN python3 -m pip install --no-cache-dir -e ./transformers[dev]

RUN python3 -m pip uninstall -y flax jax

Collaborator:

We can also uninstall tensorflow, or just use ./transformers[torch] above (but we might need to install other things manually, not sure).

(Not a big deal, we can keep it if you are busy.)

@SunMarc (Member, Author) commented Feb 23, 2024:

I switched to transformers[dev-torch]!

.github/workflows/build-docker-images.yml (two outdated review threads, resolved)

  latest-quantization-torch-docker:
    name: "Latest Pytorch + Quantization [dev]"
    # Push CI doesn't need this image
    if: inputs.image_postfix != '-push-ci'

Collaborator:

Just want to know whether we intend to run the quantization tests on a daily basis, or whether we prefer to run them on each commit merged into main?

Member Author:

I think it is better to run them on a daily basis. There are many slow tests too, and I think the quantization tests are not easily broken by changes in transformers. The bigger issue is the third-party libraries that might introduce breaking changes. cc @younesbelkada

Contributor:

Yes, a daily basis sounds great!

Collaborator:

OK
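
For reference, running the job daily means driving it with a scheduled trigger instead of per-push events; here is a minimal sketch of such a workflow, where the cron time, runner labels, and test path are illustrative assumptions rather than the contents of this PR:

name: Self-scheduled quantization CI

on:
  schedule:
    - cron: "0 2 * * *"   # assumed: once per day at 02:00 UTC
  workflow_dispatch:       # allow manual runs for debugging

jobs:
  run_tests_quantization_torch_gpu:
    name: Quantization tests
    runs-on: [self-hosted, docker-gpu]                          # assumed runner labels
    container:
      image: huggingface/transformers-quantization-latest-gpu   # image built from this PR's Dockerfile
    steps:
      - name: Run quantization tests
        working-directory: /transformers
        run: python3 -m pytest -v tests/quantization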

Comment on lines +300 to +301
  run_tests_quantization_torch_gpu:
    name: Quantization tests

Collaborator:

I am not sure we intend to keep this in this workflow file. Isn't the goal to move it out to a separate workflow file?

cc @younesbelkada

Contributor:

I think we can keep it there for now; I'll move it out in my PR.

.github/workflows/self-scheduled.yml (review thread resolved)
@@ -1043,6 +1043,7 @@ def prepare_reports(title, header, reports, to_truncate=True):
"PyTorch pipelines": "run_tests_torch_pipeline_gpu",
"TensorFlow pipelines": "run_tests_tf_pipeline_gpu",
"Torch CUDA extension tests": "run_tests_torch_cuda_extensions_gpu_test_reports",
"Quantization tests": "run_tests_quantization_torch_gpu",

Collaborator:

This should be revised if we decide to move the quantization tests out of this workflow file.

@ydshieh (Collaborator) commented Feb 23, 2024

@SunMarc I am going to merge #29208. Once done, there might be a conflict with this PR, but it should be easy to resolve. Ping me otherwise.

@SunMarc requested a review from @ydshieh on February 23, 2024 at 16:41

@SunMarc (Member, Author) commented Feb 23, 2024

I've addressed all the comments. I think we are good to merge once I've checked that the Docker image builds correctly with torch 2.2.0 (currently facing some issues logging into Docker Hub).

EDIT: The image was built successfully here, cc @ydshieh.

@ArthurZucker (Collaborator) left a comment

Preemptively approving, LGTM! Thanks @SunMarc for splitting the workflows!

Comment on lines +316 to +318
      - name: Reinstall transformers in edit mode (remove the one installed during docker image build)
        working-directory: /transformers
        run: python3 -m pip uninstall -y transformers && python3 -m pip install -e .

Collaborator:

@ydshieh I do not understand why we don't:

  1. Install from main during the docker build with pip install git+https://github.com/huggingface/transformers@main (and get all the necessary dependencies)
  2. pip uninstall -y transformers in the docker image
  3. actions/checkout@v3
  4. pip install -e .

(As discussed offline with @younesbelkada, this should come in a separate PR; a rough sketch of these steps follows below.)
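
A rough sketch of those suggested steps as workflow steps (step names are illustrative; this describes the alternative under discussion, not what the PR actually implements):

      # Assumes the image was built with
      # pip install git+https://github.com/huggingface/transformers@main
      # so all dependencies are already present in the image.
      - name: Remove the transformers installed during the docker image build
        run: python3 -m pip uninstall -y transformers

      - name: Check out the current commit
        uses: actions/checkout@v3

      - name: Install transformers in editable mode from the checkout
        run: python3 -m pip install -e .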

@ydshieh (Collaborator) commented Feb 28, 2024:

There is no deep reason for what we are doing currently: we just install transformers during the docker image build and reuse it.

What you suggest (installing main during the docker image build) has the consequence that we can't easily experiment in a branch when we change setup.py.

By using actions/checkout@v3, the path to transformers would be different from the current approach, so we would have to change all the working-directory and artifact paths, otherwise the report will be empty.

@ydshieh (Collaborator) left a comment

Thank you for this work! I could RIP with the daily CI now.

Comment on lines 265 to 299
  latest-pytorch-deepspeed-amd:
    name: "PyTorch + DeepSpeed (AMD) [dev]"

#    runs-on: [self-hosted, docker-gpu, amd-gpu, single-gpu, mi210]
#    steps:
#      - name: Set up Docker Buildx
#        uses: docker/setup-buildx-action@v3
#      - name: Check out code
#        uses: actions/checkout@v3
#      - name: Login to DockerHub
#        uses: docker/login-action@v3
#        with:
#          username: ${{ secrets.DOCKERHUB_USERNAME }}
#          password: ${{ secrets.DOCKERHUB_PASSWORD }}
#      - name: Build and push
#        uses: docker/build-push-action@v5
#        with:
#          context: ./docker/transformers-pytorch-deepspeed-amd-gpu
#          build-args: |
#            REF=main
#          push: true
#          tags: huggingface/transformers-pytorch-deepspeed-amd-gpu${{ inputs.image_postfix }}
#      # Push CI images still need to be re-built daily
#      -
#        name: Build and push (for Push CI) in a daily basis
#        # This condition allows `schedule` events, or `push` events that trigger this workflow NOT via `workflow_call`.
#        # The later case is useful for manual image building for debugging purpose. Use another tag in this case!
#        if: inputs.image_postfix != '-push-ci'
#        uses: docker/build-push-action@v5
#        with:
#          context: ./docker/transformers-pytorch-deepspeed-amd-gpu
#          build-args: |
#            REF=main
#          push: true
#          tags: huggingface/transformers-pytorch-deepspeed-amd-gpu-push-ci
    runs-on: [self-hosted, docker-gpu, amd-gpu, single-gpu, mi210]
    steps:
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Check out code
        uses: actions/checkout@v3
      - name: Login to DockerHub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_PASSWORD }}
      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: ./docker/transformers-pytorch-deepspeed-amd-gpu
          build-args: |
            REF=main
          push: true
          tags: huggingface/transformers-pytorch-deepspeed-amd-gpu${{ inputs.image_postfix }}
      # Push CI images still need to be re-built daily
      -
        name: Build and push (for Push CI) in a daily basis
        # This condition allows `schedule` events, or `push` events that trigger this workflow NOT via `workflow_call`.
        # The later case is useful for manual image building for debugging purpose. Use another tag in this case!
        if: inputs.image_postfix != '-push-ci'
        uses: docker/build-push-action@v5
        with:
          context: ./docker/transformers-pytorch-deepspeed-amd-gpu
          build-args: |
            REF=main
          push: true
          tags: huggingface/transformers-pytorch-deepspeed-amd-gpu-push-ci

Collaborator:

This is irrelevant to this PR. We have (or at least had) issues building the AMD CI images, and we haven't really made progress on that.

Better to keep this as it is in this PR.

Member Author:

Oh, my bad! I'll revert the changes. Thanks for your careful review!

  latest-quantization-torch-docker:
    name: "Latest Pytorch + Quantization [dev]"
    # Push CI doesn't need this image
    if: inputs.image_postfix != '-push-ci'

Collaborator:

OK

@ydshieh (Collaborator) commented Feb 28, 2024

I would love to see the newly added job run successfully (having failing tests is fine), but I won't enforce it here.

@SunMarc merged commit f54d82c into main on Feb 28, 2024 (8 checks passed).
@SunMarc deleted the add-quantization-workflow branch on February 28, 2024 at 15:09.
itazap pushed a commit that referenced this pull request May 14, 2024
* [CI] Quantization workflow

* build dockerfile

* fix dockerfile

* update self-cheduled.yml

* test build dockerfile on push

* fix torch install

* udapte to python 3.10

* update aqlm version

* uncomment build dockerfile

* tests if the scheduler works

* fix docker

* do not trigger on psuh again

* add additional runs

* test again

* all good

* style

* Update .github/workflows/self-scheduled.yml

Co-authored-by: Younes Belkada <[email protected]>

* test build dockerfile with torch 2.2.0

* fix extra

* clean

* revert changes

* Revert "revert changes"

This reverts commit 4cb52b8.

* revert correct change

---------

Co-authored-by: Younes Belkada <[email protected]>