feat: significantly optimize the time consumption of clip vision #2474
In an inference service using ControlNet, I observed a significant increase in time consumption when using clip vision-related preprocessors (e.g. `adapter_clip_sd15`). The time fluctuates between 8 s and 250 s depending on machine load; a typical run takes around 10 s. My machine is equipped with an A10 GPU, a 96-core CPU, and 251 GB of memory.

Upon investigation, I found that a substantial portion of the time is spent in the following section:
`sd-webui-controlnet/annotator/clipvision/__init__.py`, line 101 (at commit `e7b5b60`)
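For context, the hot path is model construction from a config followed by a state-dict load. The snippet below is a hypothetical minimal reproduction of that pattern, not the repository's actual code; the config and checkpoint path are placeholders:

```python
import torch
from transformers import CLIPVisionConfig, CLIPVisionModelWithProjection

# Hypothetical reproduction of the slow path: constructing the model from a
# config randomly initializes every nn.Linear / nn.Embedding weight, even
# though load_state_dict() overwrites all of them immediately afterwards.
config = CLIPVisionConfig()  # placeholder config, not the repo's actual one
model = CLIPVisionModelWithProjection(config)  # slow: full random init

# state_dict = torch.load("clip_vision.pth")  # placeholder checkpoint path
# model.load_state_dict(state_dict)           # discards the random init
```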
Further exploration revealed that these time-consuming operations come primarily from the extensive weight initialization triggered during module construction, for example:
https://github.com/huggingface/transformers/blob/366c03271e01c86e9a573bb64481f185de11ef29/src/transformers/models/clip/modeling_clip.py#L333-L334
https://github.com/huggingface/transformers/blob/366c03271e01c86e9a573bb64481f185de11ef29/src/transformers/models/clip/modeling_clip.py#L241-L244
These include `nn.Linear` constructions, among others, each of which randomly initializes its parameters.
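To see that the cost is in initialization rather than allocation, one can compare a normal `nn.Linear` construction against `torch.nn.utils.skip_init`, which allocates the same parameters without running the init kernels. This is a minimal sketch; the layer size is arbitrary and exact timings depend on the machine:

```python
import time
import torch
import torch.nn as nn

def timed(fn, n=20):
    """Average wall-clock time of fn over n calls."""
    t0 = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - t0) / n

# Normal construction runs reset_parameters(), i.e. kaiming_uniform_
# on the weight and uniform_ on the bias.
t_init = timed(lambda: nn.Linear(4096, 4096))

# skip_init allocates uninitialized parameters and skips those kernels.
t_skip = timed(lambda: torch.nn.utils.skip_init(nn.Linear, 4096, 4096))

print(f"with init: {t_init * 1e3:.1f} ms, skipped: {t_skip * 1e3:.1f} ms")
```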
This issue is related to huggingface/transformers#21913.
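The workaround follows the idea discussed in that issue: since the checkpoint overwrites every parameter anyway, the random-initialization kernels can be skipped during construction. Below is a minimal sketch of one such approach, temporarily no-op'ing the `torch.nn.init` functions that `nn.Linear.reset_parameters()` calls; it illustrates the technique and is not necessarily the exact code in this PR:

```python
import contextlib
import torch

@contextlib.contextmanager
def no_weight_init():
    """Temporarily replace the init kernels invoked during module
    construction with no-ops. Safe only when a full state dict is
    loaded immediately afterwards."""
    saved = {
        "kaiming_uniform_": torch.nn.init.kaiming_uniform_,
        "uniform_": torch.nn.init.uniform_,
        "normal_": torch.nn.init.normal_,
    }
    try:
        for name in saved:
            # Init functions return the tensor they were given.
            setattr(torch.nn.init, name, lambda tensor, *a, **k: tensor)
        yield
    finally:
        for name, fn in saved.items():
            setattr(torch.nn.init, name, fn)

# Usage (model class, config, and checkpoint path are placeholders):
# with no_weight_init():
#     model = CLIPVisionModelWithProjection(config)
# model.load_state_dict(torch.load("clip_vision.pth"))
```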
After implementing this optimization, the average time consumption for clip vision is reduced to at most 1/6 of the previous time (a speedup of at least 6x). The results are highly stable, and the inference outcomes are unchanged. In my case, before the optimization, the clip vision preprocessing took at least 8 seconds even on a completely idle machine; now it takes only 1.23 seconds.