Replies: 2 comments
- wow, NF4 got a huge quality degradation compared to llm.int8().
- Related: #9165
Introduction
See relevant threads here first: #6500, #7023.
We have a number of pipelines that use a transformer-based backbone for the diffusion process:
With more and more coming, we might want to take advantage of the lower-precision computation capabilities explored in the transformer area (from the LLM world). Two widely popular studies in this area:
So, I decided to give these exotic data types and methods a try through the bitsandbytes library. Before setting your expectations too high, please be aware of what to expect here: #6500 (comment).

Experiments
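For context, NF4 (4-bit NormalFloat) quantizes weights block-wise: each block is scaled by its absolute maximum, and every value is snapped to the nearest entry of a fixed 16-level codebook derived from the normal distribution. Below is a minimal pure-Python sketch of that idea — the 16 levels follow the NF4 codebook from the QLoRA paper, but everything else is illustrative, not bitsandbytes' actual packed-kernel implementation:

```python
# Illustrative NF4-style block-wise quantization (not bitsandbytes' real kernel).

NF4_CODEBOOK = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]

def quantize_block(values):
    """Map a block of floats to 4-bit codebook indices plus a scale (absmax)."""
    absmax = max(abs(v) for v in values) or 1.0
    indices = []
    for v in values:
        normalized = v / absmax  # now in [-1, 1]
        # Snap to the nearest of the 16 NF4 levels.
        idx = min(range(16), key=lambda i: abs(NF4_CODEBOOK[i] - normalized))
        indices.append(idx)
    return indices, absmax

def dequantize_block(indices, absmax):
    """Recover approximate floats from the stored indices and scale."""
    return [NF4_CODEBOOK[i] * absmax for i in indices]

weights = [0.31, -0.42, 0.05, 0.0, -1.2, 0.77]
idx, scale = quantize_block(weights)
recovered = dequantize_block(idx, scale)
```

The per-block absmax is why NF4 stores a small amount of extra metadata alongside the 4-bit indices, and why quantization error grows with outliers inside a block.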
Code
Script to launch experiments in bulk
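A bulk launcher for sweeps like this is typically just a nested loop over checkpoint ids and quantization modes. A hypothetical sketch — the script name `benchmark_quant.py` and its flags are made up for illustration, and this version only prints the commands (dry-run) rather than executing them:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical checkpoint ids and quantization modes to sweep over.
CKPTS=("PixArt-alpha/PixArt-Sigma-XL-2-1024-MS" "stabilityai/stable-diffusion-3-medium-diffusers")
QUANTS=("nf4" "int8")

for ckpt in "${CKPTS[@]}"; do
  for quant in "${QUANTS[@]}"; do
    # Dry-run: print each command; drop the `echo` to actually launch.
    echo python benchmark_quant.py --ckpt_id "$ckpt" --quantization "$quant"
  done
done
```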
Quantitative results
PixArt-Sigma
SD3
👁️ Interesting to see that the correlation between memory reduction and latency improvement doesn't always carry over equally to the transformer backbone being used. For example, the memory reduction is far more pronounced for SD3 than for PixArt-Sigma. 👁️
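For reference, the latency side of such comparisons is usually measured with a few warmup runs followed by several timed runs; on GPU you would additionally call torch.cuda.synchronize() around the timer and read peak memory from torch.cuda.max_memory_allocated(). A minimal, framework-free sketch of that harness (all names here are illustrative, not the script used above):

```python
import statistics
import time

def benchmark_latency(fn, warmup=2, runs=5):
    """Time a callable: discard warmup runs, report the median of timed runs."""
    for _ in range(warmup):
        fn()  # compilation / cache effects land here, not in the timings
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()  # on GPU, call torch.cuda.synchronize() before and after
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Usage with a stand-in workload; swap in e.g. `lambda: pipe(prompt)` in practice.
median_s = benchmark_latency(lambda: sum(i * i for i in range(100_000)))
```

Reporting the median rather than the mean keeps one slow outlier run from skewing the comparison between quantization modes.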
Visual results
All the results below were generated with the prompt "a golden vase with different flowers"; the default pipeline call arguments were left unchanged.