Performance Profiling Report (Forge/A1111/ComfyUI) #716

huchenlei · 2024-05-07T01:21:16Z

huchenlei
May 7, 2024

This thread will be used for performance profiling analysis for Forge/A1111/ComfyUI. Hopefully people can submit traces, screenshots to help us better understand why A1111 is slow.

I will start with a simple profiling task:

Steps: 4, Sampler: DPM++ 2M, Schedule type: Karras, CFG scale: 7, Seed: 2271253335, Size: 1024x1024, Model hash: aeb7e9e689, Model: juggernautXL_v8Rundiffusion, Clip skip: 2, Version: v1.9.3-2-g3d3fc81f

Hardware

GPU: RTX 4090 (24GB VRAM)
CPU: Ryzen 5 3600
Memory: 32 GB

A1111 Setup

Commandline args: --opt-split-attention --xformers

Profile https://github.com/AUTOMATIC1111/stable-diffusion-webui/blob/1c0a0c4c26f78c32095ebc7f8af82f5c04fca8c0/modules/txt2img.py#L102-L120

def txt2img(id_task: str, request: gr.Request, *args):
    from torch.profiler import profile, record_function, ProfilerActivity
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True,
             profile_memory=True, # Track memory allocation
             with_stack=True) as prof:
        with record_function("model_inference"):
            p = txt2img_create_processing(id_task, request, *args)

            with closing(p):
                processed = modules.scripts.scripts_txt2img.run(p, *p.script_args)

                if processed is None:
                    processed = processing.process_images(p)

    # Print profiling results
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

    # Export to Chrome trace format
    prof.export_chrome_trace("trace.json")

Commandline report

-------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls
-------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                  model_inference        35.11%        2.138s       100.00%        6.091s        6.091s     901.211ms        15.27%        5.901s        5.901s       2.46 Mb     -60.42 Mb       1.59 Mb     -20.56 Gb             1
                               CheckpointFunction        17.81%        1.085s        44.61%        2.717s       7.809ms     130.432ms         2.21%        3.044s       8.747ms       2.38 Mb           0 b       2.33 Gb     -87.20 Gb           348
                                     aten::conv2d         0.08%       4.695ms        14.48%     882.125ms       3.500ms       5.522ms         0.09%     504.723ms       2.003ms           0 b		0 b       6.48 Gb      -1.00 Mb           252
                                         aten::to         0.88%      53.445ms        12.19%     742.475ms      80.634us      36.702ms         0.62%        1.933s     209.895us      20.56 Mb		0 b      40.77 Gb           0 b          9208
                                     aten::linear         1.76%     107.050ms        11.72%     713.731ms     130.243us      75.593ms         1.28%        1.583s     288.831us           0 b		0 b      45.85 Gb      -6.54 Gb          5480
                                aten::convolution         0.05%       3.229ms        11.47%     698.461ms       2.863ms     773.000us         0.01%     444.533ms       1.822ms           0 b		0 b       6.32 Gb           0 b           244
                               aten::_convolution         0.16%       9.658ms        11.41%     695.232ms       2.849ms       2.508ms         0.04%     443.760ms       1.819ms           0 b		0 b       6.32 Gb     -20.00 Mb           244
                                   aten::_to_copy         2.44%     148.609ms        11.31%     689.030ms      76.457us     117.232ms         1.99%        1.896s     210.387us      20.56 Mb     -17.87 Kb      40.77 Gb           0 b          9012
                          aten::cudnn_convolution        11.11%     676.928ms        11.11%     676.928ms       2.774ms     414.661ms         7.03%     414.661ms       1.699ms           0 b		0 b       6.33 Gb       6.33 Gb           244
                                      aten::copy_         8.46%     515.292ms         8.54%     520.123ms      55.332us        1.781s        30.17%        1.945s     206.937us      36.12 Kb     -66.21 Kb           0 b           0 b          9400
-------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 6.091s
Self CUDA time total: 5.901s

huchenlei · 2024-05-07T01:37:52Z

huchenlei
May 7, 2024
Author

Forge Setup

Commandline args --xformers.

Profile:

stable-diffusion-webui-forge/modules/txt2img.py

Lines 104 to 122 in 29be1da

    
           def txt2img_function(id_task: str, request: gr.Request, *args): 
        
               p = txt2img_create_processing(id_task, request, *args) 
        
               with closing(p): 
        
                   processed = modules.scripts.scripts_txt2img.run(p, *p.script_args) 
        
                   if processed is None: 
        
                       processed = processing.process_images(p) 
        
               shared.total_tqdm.clear() 
        
               generation_info_js = processed.js() 
        
               if opts.samples_log_stdout: 
        
                   print(generation_info_js) 
        
               if opts.do_not_show_images: 
        
                   processed.images = [] 
        
               return processed.images + processed.extra_images, generation_info_js, plaintext_to_html(processed.info), plaintext_to_html(processed.comments, classname="comments")

def txt2img_function(id_task: str, request: gr.Request, *args):
    from torch.profiler import profile, record_function, ProfilerActivity
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True,
             profile_memory=True, # Track memory allocation
             with_stack=True) as prof:
        with record_function("model_inference"):
            p = txt2img_create_processing(id_task, request, *args)

            with closing(p):
                processed = modules.scripts.scripts_txt2img.run(p, *p.script_args)

                if processed is None:
                    processed = processing.process_images(p)

    # Print profiling results
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

    # Export to Chrome trace format
    prof.export_chrome_trace("trace.json")

Commandline report

-------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls
-------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                  model_inference        35.95%        1.591s       100.00%        4.426s        4.426s     831.669ms        18.82%        4.420s        4.420s           0 b    -150.23 Mb     162.78 Mb     -81.29 Gb             1
                                     aten::conv2d         0.07%       3.212ms        30.42%        1.346s       5.518ms     792.000us         0.02%     689.083ms       2.824ms           0 b		0 b       6.34 Gb           0 b           244
                                aten::convolution         0.06%       2.651ms        30.35%        1.343s       5.505ms     765.000us         0.02%     688.291ms       2.821ms           0 b		0 b       6.34 Gb           0 b           244
                               aten::_convolution         0.22%       9.583ms        30.29%        1.341s       5.494ms       2.085ms         0.05%     687.526ms       2.818ms           0 b		0 b       6.34 Gb     -20.00 Mb           244
                          aten::cudnn_convolution        29.92%        1.324s        29.92%        1.324s       5.428ms     616.326ms        13.94%     616.326ms       2.526ms           0 b		0 b       6.34 Gb       6.34 Gb           244
                                     aten::linear         1.41%      62.405ms         8.65%     383.066ms     109.447us      28.014ms         0.63%        1.289s     368.310us           0 b		0 b      24.71 Gb           0 b          3500
                                         aten::to         0.25%      10.958ms         7.19%     318.440ms     160.423us      10.009ms         0.23%     511.684ms     257.775us      30.10 Mb		0 b       5.65 Gb           0 b          1985
                                      aten::copy_         7.02%     310.626ms         7.07%     312.946ms      68.225us     672.258ms        15.21%     687.226ms     149.820us      36.63 Kb     -65.70 Kb           0 b           0 b          4587
                                   aten::_to_copy         0.57%      25.422ms         6.95%     307.482ms     168.761us      26.565ms         0.60%     501.675ms     275.343us      30.10 Mb       8.45 Kb       5.65 Gb           0 b          1822
                                     aten::arange         0.02%     863.000us         4.28%     189.549ms       5.265ms     218.000us         0.00%     381.000us      10.583us      11.00 Kb		0 b      12.00 Kb           0 b            36
-------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 4.426s
Self CUDA time total: 4.420s

Note

Very interestingly although Forge's total duration is longer than A1111, the 1-step duration for steps after step 1 are significantly shorter than A1111. From about 500ms -> about 250ms.

0 replies

huchenlei · 2024-05-07T02:07:21Z

huchenlei
May 7, 2024
Author

ComfyUI Setup

No extra commandline args
Version: 16eabdf

Profile https://github.com/comfyanonymous/ComfyUI/blob/c61eadf69a3ba4033dcf22e2e190fd54f779fc5b/nodes.py#L1321-L1344
Here the code to profile is not very obvious, as ComfyUI code will likely cover model loading, which will make the trace file very big and exceed size that browser WASM can handle. So here I am only sampling the KSampler node's execution.

class KSampler:
    @classmethod
    def INPUT_TYPES(s):
        return {"required":
                    {"model": ("MODEL",),
                    "seed": ("INT", {"default": 0, "min": 0, "max": 0xffffffffffffffff}),
                    "steps": ("INT", {"default": 20, "min": 1, "max": 10000}),
                    "cfg": ("FLOAT", {"default": 8.0, "min": 0.0, "max": 100.0, "step":0.1, "round": 0.01}),
                    "sampler_name": (comfy.samplers.KSampler.SAMPLERS, ),
                    "scheduler": (comfy.samplers.KSampler.SCHEDULERS, ),
                    "positive": ("CONDITIONING", ),
                    "negative": ("CONDITIONING", ),
                    "latent_image": ("LATENT", ),
                    "denoise": ("FLOAT", {"default": 1.0, "min": 0.0, "max": 1.0, "step": 0.01}),
                     }
                }

    RETURN_TYPES = ("LATENT",)
    FUNCTION = "sample"

    CATEGORY = "sampling"

    def sample(self, model, seed, steps, cfg, sampler_name, scheduler, positive, negative, latent_image, denoise=1.0):
        from torch.profiler import profile, record_function, ProfilerActivity
        with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
                record_shapes=True,
                profile_memory=True, # Track memory allocation
                with_stack=True) as prof:
            with record_function("model_inference"):
                result = common_ksampler(model, seed, steps, cfg, sampler_name, scheduler, positive, negative, latent_image, denoise=denoise)

        # Print profiling results
        print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

        # Export to Chrome trace format
        prof.export_chrome_trace("trace.json")
        return result

Workflow:

Commandline report

-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem    # of Calls 

-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
              model_inference        31.71%     636.933ms       100.00%        2.009s        2.009s      50.677ms         2.52%        2.007s        2.007s     256.00 Kb    -356.15 Kb             1  
                 aten::conv2d         0.09%       1.786ms        22.35%     448.849ms       2.200ms     619.000us         0.03%     339.828ms       1.666ms           0 b           0 b           204  
            aten::convolution         0.09%       1.754ms        22.26%     447.063ms       2.191ms     644.000us         0.03%     339.209ms       1.663ms           0 b           0 b           204  
           aten::_convolution         0.31%       6.291ms        22.17%     445.309ms       2.183ms       1.367ms         0.07%     338.565ms       1.660ms           0 b           0 b           204  
      aten::cudnn_convolution        21.66%     435.069ms        21.66%     435.069ms       2.133ms     303.645ms        15.13%     303.645ms       1.488ms           0 b           0 b           204  
                 aten::linear         2.87%      57.584ms        18.24%     366.279ms     123.243us      21.559ms         1.07%     747.946ms     251.664us           0 b           0 b          2972  
                aten::reshape         4.30%      86.276ms         8.82%     177.253ms      19.988us      47.665ms         2.37%     315.879ms      35.620us           0 b           0 b          8868  
                  aten::addmm         6.66%     133.693ms         7.13%     143.118ms     110.772us     492.740ms        24.55%     501.712ms     388.322us           0 b           0 b          1292  
                 aten::matmul         2.10%      42.269ms         5.55%     111.407ms      66.314us      10.407ms         0.52%     163.644ms      97.407us           0 b           0 b          1680  
                  aten::clone         1.71%      34.440ms         4.48%      90.001ms      38.660us      64.363ms         3.21%     261.469ms     112.315us           0 b           0 b          2328  
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.009s
Self CUDA time total: 2.007s

0 replies

huchenlei · 2024-05-07T02:24:50Z

huchenlei
May 7, 2024
Author

All experiments above are cold start 1st run after UI launch. Subsequent runs will have 1st step denoise duration more aligned with following steps.

Some intersting observations:

A1111 has almost 5x aten.to calls comparing to Forge. 9000calls vs 1900 calls.
Trace file size is also different. A1111 295MB vs Forge 126 MB. This can be a good indicator that A1111 is doing extra function calls, and having more events to trace.
A1111 calls more aten::linear than Forge. 5000 calls vs 3500 calls.

1 reply

huchenlei May 7, 2024
Author

Traces uploaded: https://huggingface.co/datasets/huchenlei/StableDiffusionUIPerfTraces

huchenlei · 2024-05-07T03:37:58Z

huchenlei
May 7, 2024
Author

The main diff in denosing step performance diff seem to come from cross attn function duration.
A1111 2.5ms vs Forge 1.35 ms.

Perfetto queries
A1111

      INCLUDE PERFETTO MODULE slices.slices;

      
      
      SELECT
        sum(dur) / count(1) as dur,
        count(1) as count

      FROM _slice_with_thread_and_process_info
      WHERE name = 'modules/sd_hijack_optimizations.py(480): xformers_attention_forward'

dur | count -- | -- 2546642 | 560

Forge

      INCLUDE PERFETTO MODULE slices.slices;

      
      
      SELECT
sum(dur) / count(1) as dur,
count(1) as count
      FROM _slice_with_thread_and_process_info
       WHERE name = 'ldm_patched/ldm/modules/attention.py(388): forward'

dur | count -- | -- 1369350 | 560

0 replies

huchenlei · 2024-05-07T03:52:27Z

huchenlei
May 7, 2024
Author

Problem 1

A1111 is using rearrange to perform tensor reshape. However, rearrange involves parsing of a semantics string. This can be very costly. The parsing does not happen very often as the string does seem to be cached, but the cache size obviously is not large enough that we see following slice often in the trace. That Attention func call takes ~6ms, which is way above average.

A1111 impl:

def xformers_attention_forward(self, x, context=None, mask=None, **kwargs):
    h = self.heads
    q_in = self.to_q(x)
    context = default(context, x)

    context_k, context_v = hypernetwork.apply_hypernetworks(shared.loaded_hypernetworks, context)
    k_in = self.to_k(context_k)
    v_in = self.to_v(context_v)

    q, k, v = (rearrange(t, 'b n (h d) -> b n h d', h=h) for t in (q_in, k_in, v_in))
    del q_in, k_in, v_in

    dtype = q.dtype
    if shared.opts.upcast_attn:
        q, k, v = q.float(), k.float(), v.float()

    out = xformers.ops.memory_efficient_attention(q, k, v, attn_bias=None, op=get_xformers_flash_attention_op(q, k, v))

    out = out.to(dtype)

    out = rearrange(out, 'b n h d -> b n (h d)', h=h)
    return self.to_out(out)

Introduced in AUTOMATIC1111/stable-diffusion-webui#1851

Forge impl:

def attention_xformers(q, k, v, heads, mask=None):
    b, _, dim_head = q.shape
    dim_head //= heads
    if BROKEN_XFORMERS:
        if b * heads > 65535:
            return attention_pytorch(q, k, v, heads, mask)

    q, k, v = map(
        lambda t: t.unsqueeze(3)
        .reshape(b, -1, heads, dim_head)
        .permute(0, 2, 1, 3)
        .reshape(b * heads, -1, dim_head)
        .contiguous(),
        (q, k, v),
    )

    if mask is not None:
        pad = 8 - q.shape[1] % 8
        mask_out = torch.empty([q.shape[0], q.shape[1], q.shape[1] + pad], dtype=q.dtype, device=q.device)
        mask_out[:, :, :mask.shape[-1]] = mask
        mask = mask_out[:, :, :mask.shape[-1]]

    out = xformers.ops.memory_efficient_attention(q, k, v, attn_bias=mask)

    out = (
        out.unsqueeze(0)
        .reshape(b, heads, -1, dim_head)
        .permute(0, 2, 1, 3)
        .reshape(b, -1, heads * dim_head)
    )
    return out

Forge explicitly calls reshape and permute, which is explicit and not involving of parsing a string.

2 replies

huchenlei May 7, 2024
Author

After replacing rearrange call with torch native op, performance of attn goes down significantly to ~1.7ms.

dur | count -- | -- 1747571 | 560

wfjsw May 7, 2024

8.9 sec. (einops original) vs. 8.6 sec. (torch non-contiguous) vs. 8.4 sec. (torch contiguous)

wfjsw · 2024-05-07T04:17:39Z

wfjsw
May 7, 2024

top - a1111, bottom - forge
The part marked by green is the 3rd U-Net denoising step.

Profile settings

A1111: ddb28b33a3561a360b429c76f28f7ff1ffe282a0
Forge: 29be1da7cf2b5dccfc70fbdd33eb35c56a31ffb7

Model: SD 1.5 (CLIP SKIP 1, VAE None)
Preview per 10 iters

Tool: NVIDIA Nsight Systems 2024.1.1

A1111 commandline: D:\stable-diffusion-webui\env\python.exe webui.py --xformers
Forge commandline: D:\stable-diffusion-webui\env\python.exe webui.py --always-gpu --xformers --vae-in-fp16

Collect: CUDA trace

Environment Variables:

CUDA_LAUNCH_BLOCKING = 1

Check the Start profiling manually checkbox and start profiling just before generation.

0 replies

huchenlei · 2024-05-07T14:29:57Z

huchenlei
May 7, 2024
Author

Problem 2

A1111's linear call in attn block is taking way longer than forge's. A1111 240ms vs Forge 71ms. From A1111 trace, you can see that there are a bunch of calls to .to, which causes the difference.

A1111

Forge

These tables are inspecting a particular nn.Module.LayerNorm module. There are 6 calls:

first 2 calls: CLIP token process
remaining 4 calls: BasicTransformerBlock forward call once per each iteration.

You can see that the costs in CLIP token process are comparable, but A1111 has extra overhead in all subsequent runs.
The overhead has 2 components:

extra .to calls
calls to Lora/networks.py::network_apply_weights

Forge by default does not do any casting during inference. All dtype casting happens before start of inference.

6 replies

huchenlei May 13, 2024
Author

It seems like the overly large amount of .to calls are caused by precision casting. During inference, x has float32 dtype while weights have float16 dtype.

wfjsw May 13, 2024

this is a behavior of autocast
torch is expected to use float for layernorm even for half inputs: pytorch/pytorch#66707

huchenlei May 14, 2024
Author

It seems like for forge, all tensors are float16. This includes both input and output of norm1 and to_*.

wfjsw May 14, 2024

layer_norm returns float16 when it is not under torch.autocast. Looking at the pytorch source code, it could be implicitly casting it back.

huchenlei May 15, 2024
Author

I tried disable autocasting, and also prevent manual casting. Instead, we do everything in fp16 now in every operation. By avoid casting, A1111's performance is finally on par with ComfyUI and Forge:

Next steps should be prepare proper fix to all issues mentioned in this report and get these PRs merged.

huchenlei · 2024-05-07T14:48:16Z

huchenlei
May 7, 2024
Author

Problem 3

Between calls to 2 BasicTransformerBlock, there is no extra operation in Forge, but there is significant overhead in A1111.

Forge

A1111

Each of those extra args checks takes ~1ms overhead. Considering there are 70 calls to BasicaTransformerBlocks per denosing step. If we do a 20 step run, the total overhead would be 1 * 70 * 20 = 1400ms = 1.4s.

https://github.com/Stability-AI/generative-models/blob/fbdc58cab9f4ee2be7a5e1f2e2787ecd9311942f/sgm/modules/diffusionmodules/util.py#L154C1-L204C42

def checkpoint(func, inputs, params, flag):
    """
    Evaluate a function without caching intermediate activations, allowing for
    reduced memory at the expense of extra compute in the backward pass.
    :param func: the function to evaluate.
    :param inputs: the argument sequence to pass to `func`.
    :param params: a sequence of parameters `func` depends on but does not
                   explicitly take as arguments.
    :param flag: if False, disable gradient checkpointing.
    """
    if flag:
        args = tuple(inputs) + tuple(params)
        return CheckpointFunction.apply(func, len(inputs), *args)
    else:
        return func(*inputs)


class CheckpointFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, run_function, length, *args):
        ctx.run_function = run_function
        ctx.input_tensors = list(args[:length])
        ctx.input_params = list(args[length:])
        ctx.gpu_autocast_kwargs = {
            "enabled": torch.is_autocast_enabled(),
            "dtype": torch.get_autocast_gpu_dtype(),
            "cache_enabled": torch.is_autocast_cache_enabled(),
        }
        with torch.no_grad():
            output_tensors = ctx.run_function(*ctx.input_tensors)
        return output_tensors

    @staticmethod
    def backward(ctx, *output_grads):
        ctx.input_tensors = [x.detach().requires_grad_(True) for x in ctx.input_tensors]
        with torch.enable_grad(), torch.cuda.amp.autocast(**ctx.gpu_autocast_kwargs):
            # Fixes a bug where the first op in run_function modifies the
            # Tensor storage in place, which is not allowed for detach()'d
            # Tensors.
            shallow_copies = [x.view_as(x) for x in ctx.input_tensors]
            output_tensors = ctx.run_function(*shallow_copies)
        input_grads = torch.autograd.grad(
            output_tensors,
            ctx.input_tensors + ctx.input_params,
            output_grads,
            allow_unused=True,
        )
        del ctx.input_tensors
        del ctx.input_params
        del output_tensors
        return (None, None) + input_grads

3 replies

huchenlei May 8, 2024
Author

From chatgpt:

Understanding the checkpoint Function

The function checkpoint is structured to conditionally modify the behavior of a function func based on whether gradient checkpointing is enabled (flag parameter). Here’s a breakdown of its key components and functionality:

func: The function representing a part of a neural network, typically a block or layer that processes input tensors.
inputs: These are the tensors directly passed to func.
params: These are additional parameters required by func but are not directly passed as arguments (often parameters of the model or a particular block).
flag: A boolean to enable or disable checkpointing.
When checkpointing is enabled (flag=True), the function combines inputs and params into a single tuple and then uses a custom autograd function (CheckpointFunction) to manage the forward and backward passes.

Role of CheckpointFunction

CheckpointFunction is a custom PyTorch autograd function where:

forward method: It runs the func in a no-grad context to prevent PyTorch from storing intermediate activations. It saves only the inputs and parameters necessary for the backward pass, significantly reducing memory usage.
backward method: When gradients need to be computed, it re-runs the forward pass (this time enabling gradients) to recompute the intermediate activations and then calculates gradients for both inputs and parameters. This is where the trade-off comes in: you save memory but have to do extra computation during the backward pass.

Why parameters() Is Invoked Often

In the context of using checkpoint in a large model:

Initialization and Setup: If params is expected to include all or many parameters of a model or a specific block, it is common to gather these using the parameters() function. This might be where you see frequent calls to parameters().
Dependency Handling: The function checkpoint is designed to handle arbitrary functions func that might depend on certain model parameters indirectly. The frequent invocation of parameters() ensures that all necessary parameters are passed to maintain correct dependencies and state during both forward and backward computations.

Practical Implications

Efficient Memory Management: By using checkpointing, you can train larger models or use larger batch sizes than would otherwise be possible with your available hardware, as it allows you to manage GPU memory more efficiently.
Increased Computational Overhead: The trade-off is that your training might take longer because of the need to recompute certain parts of the forward pass during backpropagation.

huchenlei May 8, 2024
Author

The fix is fairly simple:
In sgm's SpatialTransformer class, set checkpoint = False. The default value is True.

thiagojramos May 10, 2024

I asked ChatGPT to explain it to me, but I think I kind of understood... Or not (I'm not good with Lego):

huchenlei · 2024-05-08T00:22:49Z

huchenlei
May 8, 2024
Author

With Problem1 and Problem 3 fixed. We are going from 580ms per iteration to 420ms per iteration already! 🎉

0 replies

huchenlei · 2024-05-08T00:30:05Z

huchenlei
May 8, 2024
Author

Problem 4

In each denosing step, A1111 seem to call torch.state_dict, while forge does not. This is extra 50ms per iteration.

Forge

A1111

1 reply

huchenlei May 8, 2024
Author

Culprit is following code:

def apply_model(self: sgm.models.diffusion.DiffusionEngine, x, t, cond):
    sd = self.model.state_dict()   # <----- Here!!!!!
    diffusion_model_input = sd.get('diffusion_model.input_blocks.0.0.weight', None)
    if diffusion_model_input is not None:
        if diffusion_model_input.shape[1] == 9:
            x = torch.cat([x] + cond['c_concat'], dim=1)

    return self.model(x, t, cond)

The code was introduced in AUTOMATIC1111/stable-diffusion-webui#14390 to support inpaint model. I have to say it is really dumb considering state_dict is a very expensive function.

huchenlei · 2024-05-08T00:42:08Z

huchenlei
May 8, 2024
Author

With Problem 1, 3, 4 fixed, we are going from 580ms to 370ms per itereation.

0 replies

huchenlei · 2024-05-08T02:51:00Z

huchenlei
May 8, 2024
Author

Problem 5

Obviously test for nan is not free in A1111

The solution is simple. Add --disable-nan-check to A1111 args will cut this cost.

Going from 345ms/it to 325ms/it. 20ms/it cut!

0 replies

wywywywy · 2024-05-08T11:58:53Z

wywywywy
May 8, 2024

Nice work!

What's the final goal of this investigation? Are you planning to upstream all these fixes back to the A1111 repo?

8 replies

Gushousekai195 May 9, 2024

The Layer Diffusion extension tho

altoiddealer May 9, 2024

It’s not like Forge will be deleted from the internet… or layerdiffuse will become incompatible with current Forge, that dev is on vacation like forge dev

AG-w May 16, 2024

is it even possible for forge user to go back to a1111 when some newer forge extensions are using comfyUI code?

ThereforeGames May 16, 2024

Incredible work. 🙂

drhead May 16, 2024

is it even possible for forge user to go back to a1111 when some newer forge extensions are using comfyUI code?

Extensions will have to be ported back to mainline A1111. With Forge being effectively completely abandoned and unmaintained, this was going to have to happen sooner or later anyways. Bringing over the performance improvements from Forge (it's actually looking like exceeding them is fairly realistic at this point) and the genuinely useful utilities like the unet patcher will facilitate this over time.

altoiddealer · 2024-05-08T14:20:03Z

altoiddealer
May 8, 2024

Your rapid progress in this endeavor is startling.

The emergence of Forge should have spurred Automatic1111 to conduct a similar investigation, finding the same flaws. Do they really need a flashing neon sign above the issues? This makes it look like either you dance circles around them from a coding perspective, or their priorities are completely out of whack.

5 replies

freecoderwaifu May 8, 2024

It's open source, it shouldn't be a competition, it should be overall contribution. Though healthy competition is still good, even if some people always want to turn it into a more spiteful one.

One team works on adding overall features, another works on optimization, that's how open source can work.

altoiddealer May 8, 2024

My point is, when Forge was released it was made quite obvious that A1111 could benefit from potential memory management / performance gains. But nothing happened. Now, seemingly overnight we have a Forge dev making rapid progress trying to do the work for them.

wfjsw May 8, 2024

when Forge was released it was made quite obvious that A1111 could benefit from potential memory management / performance gains

No, as you can see on this repo, Forge pulled in a big pile of crap from ComfyUI, and as A1111 is still regularly maintained, it would be impossible to merge such big changes back, let alone this breaks extensions (despite people say otherwise). Also the size of the changes deters most of the attempts to try to compare A1111 with Forge in the scope of codes.

altoiddealer May 8, 2024

As I can see in this discussion thread, huchenlei tweaked a few of A111's functions and reduced iteration speeds from 580ms to 325ms

wfjsw May 8, 2024

But these discoveries are not directly based on code comparison between these two. Instead, one has to profile and nitpick into irregularities and navigate into both side. It is HARD because both repos are cluttered into different ways.

I will not continue on this topic as it doesn't make much sense here.

drhead · 2024-05-10T21:23:31Z

drhead
May 10, 2024

If it's possible, do you think you could look into layout optimization? Conv operations on Nvidia cards with tensor cores have faster kernels available with the channels_last memory format. However, just setting the model and inputs to channels last as a whole in eager mode often doesn't work well with anything that isn't a pure CNN since other ops are better off in contiguous format and will be casted back, and with SD this often means you break even. If you profile a compiled model, though, you will very plainly see that it does choose to use channels_last where appropriate. You could probably realize most of the benefit by having channels last only cast conv layers and nearby bias/norm layers to channels last and casting the intermediate tensors to channels last before those layers and cast it back to continuous afterwards.

Great work on these optimizations so far! I look forward to having both good optimizations and proper maintenance in one frontend.

2 replies

drhead May 16, 2024

From doing profiling of my own, I have found that --opt-channelslast now gives a small but noticeable speed boost on a 3090. With the changes described above and the additional changes I made below, I now can peak at 23 25it/s for sampling 512x512 with CFG, and ~~6.8it/s~~ 7.37it/s for a batch of 4 512x512 with CFG. My baseline was about 6it/s for batch 4 before these changes.

drhead May 17, 2024

just a final note on this one: channelslast seems to help now overall, but this issue needs to be solved in pytorch before we'll see the same speedups you usually see on CNNs: pytorch/pytorch#111824 (fortunately it seems that this is being worked on)

Most of the extra aten::copy_ ops that CL adds are immediately before or after a native_group_norm op because that doesn't support CL, so having that solved will streamline inference much further. I wouldn't be too surprised if I hit 8 it/s when that happens.

drhead · 2024-05-16T04:46:20Z

drhead
May 16, 2024

Two blocking calls need to be eliminated to allow torch to properly fill its dispatch queue:

ldm/modules/diffusionmodules/util.py:timestep_embedding(): Replace:

        freqs = torch.exp(
            -math.log(max_period) * torch.arange(start=0, end=half, dtype=torch.float32) / half
        ).to(device=timesteps.device)

with:

        freqs = torch.exp(
            -math.log(max_period) * torch.arange(start=0, end=half, dtype=torch.float32, device=timesteps.device) / half
        )

This creates the tensor on device and avoids an unnecessary memory copy that will happily block dispatch.

k-diffusion/k_diffusion/sampling.py: All comparison statements that govern control flow (for example if old_denoised is None or sigmas[i + 1] == 0 in sample_dpmpp_2m) need to be changed to use CPU tensors instead of GPU tensors. Preferably set up whatever tensors will be needed during the sampling loop before it starts. If this isn't done, there will be a blocking call on every sampling step.

If these changes are made, torch will be able to line up several sampling steps worth of instructions ahead of time, which will eliminate most overhead.

1 reply

feffy380 May 17, 2024

Seems like these require patching dependencies rather than A1111 itself

huchenlei · 2024-05-17T03:20:36Z

huchenlei
May 17, 2024
Author

I have all PRs bundled here for anyone who wants to try it out: AUTOMATIC1111/stable-diffusion-webui#15821

It is unclear when AUTO will come back and get them merged though.

0 replies

strawberrymelonpanda · 2024-05-18T19:37:20Z

strawberrymelonpanda
May 18, 2024

I have to ask, is it possible to also mimic Forge's better memory handling?

Aside from the speedups, the smarter moving of checkpoints around in memory was super useful for me on a RTX 2080, and still useful sometimes on a RTX 3090. I sometimes see my VRAM spike to max when doing HR Fix on large images with WebUI, but Forge handls pretty much any size with ease. (even with NeverOOM off in Forge, and Tiled VAE on in WebUI)

2 replies

Koitenshin Jun 2, 2024

I wish we could get ComfyUI's memory handling with Automatic's interface. ComfyUI is a major pain to use but I can generate 3840x2160 (4K Ultra HD) images on ComfyUI using a measly 3060TI (8GB VRAM).

ExissNA Jun 13, 2024

I will say this. With as much as I dislike Comfy's interface, it is the first UI I used that actually worked (I'm software illiterate a lot of times) and it was fast. I basically get the speed and usefulness of a layout I prefer (along with an extensions to give me display images of my 20K+ resources I have downloaded to help me remember what they are/do instead of just file names, the secondary reason I continue to use Forge/A1111), and the speed of fast gens, with Forge.

aolko · 2024-06-02T17:45:33Z

aolko
Jun 2, 2024

I've noticed that if you have some extensions installed the whole UI becomes sluggish, slow, and laggy. Even the main queue dies (generate button is not responding), in this case helps agent scheduler.

3 replies

huchenlei Jun 2, 2024
Author

Can you share the list of extensions you enabled?

aolko Jun 2, 2024

Can you share the list of extensions you enabled?

Sure, let me load up an instance with extensions

sysinfo-2024-06-02-20-22.json

aolko Jun 13, 2024

so what should i do @huchenlei ?

MrKrzYch00 · 2024-06-12T22:37:22Z

MrKrzYch00
Jun 12, 2024

Requested: #801 (reply in thread)

System:

CPU: Ryzen 1600X (ZP-B1) [@3875Mhz]
Memory: 32GB G-Skill F4-3200CL14-16GFX [@3133Mhz, dual-ch]
GPU#2: Nvidia RTX 3060 12GB VRAM (AI/computing dedicated) [PCI Express x8 Gen3]
OS: Windows 7 Pro 64-bit
Python: 3.10.14
PyTorch: torch: 2.3.0+cu118 (CUDA DLL 11.8 -> 11.4)

Sampling setup:

txt2img
Positive: 367 Tokens
Negative: 29 Tokens
Model: custom SD 1.5 FP16
Resolution: 512x672
CFG: 30
Sampler: DPM++ 3M SDE Auto/Normal
Steps: 10
no previews

(none uses xformers)
Forge args: --device-id 0 --disable-xformers --opt-split-attention
A1111 args: --device-id 0 --opt-split-attention
Comfy args: --disable-cuda-malloc --disable-xformers

Above 10 steps the profiling process began to slow down, which makes it not suitable for long runs.

(fixed trace logs in reply below)

21 replies

MrKrzYch00 Jun 15, 2024

Now that I'm checking out of curiosity. With --xformers it seems that A1111 is no more printing No module 'xformers'. Proceeding without it. and gradio reports this package being used. Even though that message seemed to me more like it wanted to use xformers but couldn't find the package, hence didn't by default because there is "no module", but maybe I'm wrong about what this message is meant to mean (though it did/does confuse me).

Strangely enough, Forge detects them by default but crashes with ModuleNotFoundError: import of xformers halted; None in sys.modules at a bit later time (before loading the site), and requires --xformers as well to finish loading without a crash. When I initially installed this package I thought that I broke my environment completely so I'm glad that the solution was as simple as to pass either --xformers or --disable-xformers to Forge.

I ran additional test for A1111 with: --opt-sdp-attention --precision half I think you meant to be comparable with the right side of the table (and with how Forge works for me) and added that half precision arg according to what we discussed about Forge using FP16:

100%|███████████████████████████████████████████████| 100/100 [00:12<00:00,  7.81it/s]
Total progress: 100%|███████████████████████████████| 100/100 [00:13<00:00,  7.56it/s]
Total progress: 100%|███████████████████████████████| 100/100 [00:13<00:00,  7.74it/s]

I guess with that A1111 is actually faster than Forge by a small margin. That's a surprise. Now the question would be: is it faster like this (with these args) on Nvidia GTX 3060 12GB VRAM or on my particular system...?

wfjsw Jun 15, 2024

Even though that message seemed to me more like it wanted to use xformers but couldn't find the package, hence didn't by default because there is "no module"

This is actually a pretty historical issue. Back in the old days people don't want to use xformers because it alters the image. The old version of LDM does not have xformers built in, so A1111 hacked into it and injected the xformers so everyone can use it. Later on Stability released a new version that supports xformers. However the way it supports that is by checking if xformers was installed into the environment and use it if it is found. People sometimes wants to turn off the xformers for various reasons and now they couldn't do that without actually uninstalling xformers, as it will always load that if it is installed, regardless of the --xformers flag. To bring back the old behavior, a patch was written (a friend and I did that together) so that xformers will be prohibited to load when --xformers is not present, thus the message you saw (it is printed by LDM).

Forge does not suffer from this problem, but for some reason it did not remove the patch so it breaks when there is no --xformers and it is assuming that xformers actually exists.

is it faster like this (with these args) on Nvidia GTX 3060 12GB VRAM or on my particular system...?

With a larger VRAM you usually won't see much difference while on a card with smaller VRAM the forge's auto lowvram will kick in and make it difficult to do sensible comparison.

MrKrzYch00 Jun 15, 2024

I see. With --xformers in A1111 I get pretty much the same speed as with SDP, while it's slower in Forge. The result is exactly the same with xformers for me - I'm still having --precision half added. I remember that in Forge I had to change RNG from GPU to NV for xformers to produce the same result. In A1111 I'm having GPU RNG now and the resulting png is exactly the same.

altoiddealer Jun 15, 2024

In Forge xformers is enabled by default.
Instead, you need to explicitly disable it, if desired

MrKrzYch00 Jun 15, 2024

While there is some speed increase from picking the right Cross attention optimization in A1111, there actually seems to be a tiny bit of difference in file size and also a small visual difference in results between them. They are not 1:1, very close to each other, but not perfectly the same. If it is unavoidable, hopefully that doesn't render a completely unexpected result in very rare cases. I tested it with no additional command args, just testing each setting in webui.

Koitenshin · 2024-06-16T16:51:36Z

Koitenshin
Jun 16, 2024

If we're talking about weird crap with A1111 and the command flags, using --use-ipex on my nvidia 3060TI (8GB of VRAM) gives me a slight speed increase. I tested every single CLI argument that I thought would give me a speed increase a month ago. Here's the weird results. Tests were run with an SD 1.5 based model producing 1080x1920 images using KHRFix. XX = "don't use, no speed increase", while the numbers did give a speed increase. The full test (which I haven't run yet) is 255 possible combinations of the numbered CLI arguments.

1. XX BASELINE (NO FLAGS)                         3.48 s/it		3.74 s/it		3.60 s/it
2. XX --no-half                                   8.91 s/it		9.70 s/it		OOM
3. XX --no-half-vae                               3.77 s/it		3.78 s/it		OOM
4. XX --medvram                                   3.58 s/it		3.57 s/it		3.60 s/it
5. 01 --medvram-sdxl                              3.29 s/it		3.28 s/it		3.60 s/it		ENABLE
6. XX --lowvram                                   5.18 s/it		5.09 s/it		5.52 s/it
7. 02 --lowram                                    3.34 s/it		3.38 s/it		3.58 s/it		ENABLE
8. 03 --always-batch-cond-uncond                  3.40 s/it		3.52 s/it		3.66 s/it		ENABLE
9. XX --unload-gfpgan                             3.59 s/it		3.68 s/it		3.70 s/it
10. XX --precision full                            Runtime Error w fp16
11. 04 --precision autocast                        3.40 s/it		3.38 s/it		3.59 s/it		ENABLE
12. 05 --upcast-sampling                           3.33 s/it		3.56 s/it		3.57 s/it		ENABLE


13. 06 --xformers                                  1.11 s/it		1.22 s/it		1.27 s/it		ENABLE
14. XX --force-enable-xformers (requires ^)        1.12 s/it		1.19 s/it		1.27 s/it
15. @@ --xformers-flash-attention (For SD2.0?)     3.38 s/it		3.58 s/it		3.59 s/it
16. XX --opt-split-attention                       3.53 s/it		3.55 s/it		3.61 s/it
17. XX --opt-sub-quad-attention                    5.55 s/it		5.54 s/it		6.55 s/it
18. 07 --opt-split-attention-invokeai              3.45 s/it		3.62 s/it		3.58 s/it		ENABLE
19. XX --opt-split-attention-v1                    OOM
20. 08 --opt-sdp-attention                         1.61 s/it		1.72 s/it		1.86 s/it		ENABLE
21. 09 --opt-sdp-no-mem-attention                  1.26 s/it		1.33 s/it		1.44 s/it		ENABLE
22. XX --disable-opt-split-attention               OOM


23. 10 --disable-nan-check                         3.29 s/it		3.39 s/it		3.57 s/it		ENABLE
24. 11 --use-ipex                                  3.37 s/it		3.40 s/it		3.60 s/it		ENABLE
25. XX --disable-model-loading-ram-optimization    3.68 s/it		3.77 s/it		3.62 s/it
26. XX --opt-channelslast (nVidia 16xx & ^)        3.64 s/it		3.75 s/it		3.67 s/it
27. 12 --disable-console-progressbars              3.32 s/it		3.32 s/it		3.32 s/it		ENABLE
28. XX --no-hashing                                3.57 s/it		3.62 s/it		3.61 s/it
29. XX --skip-load-model-at-start                  3.44 a/it		3.49 s/it		3.58 s/it		
30. XX --no-prompt-history                         3.52 s/it		3.50 s/it		3.60 s/it

0 replies

FurkanGozukara · 2024-10-27T10:30:42Z

FurkanGozukara
Oct 27, 2024

This is a gold thread thank you for work

0 replies

MatthewKhouzam · 2024-12-03T16:15:55Z

MatthewKhouzam
Dec 3, 2024

Hey all, this may be late to the party, I am a maintainer of Eclipse Trace Compass, a free open source trace viewer designed to scale. I would like to suggest trying out tc for this, it can injest the json traces and correlate them well. I will make a custom stable-diffusion video soon to illustrate how to. Until then, check this out. https://www.youtube.com/watch?v=YCdzmcpOrK4 as a way to handle the trace event traces.

2 replies

Gushousekai195 Dec 3, 2024

What’s this have to do with AI?

MatthewKhouzam Dec 4, 2024

This is a tool that can be used to read the profiling output when it gets very large. This is a support tool. I feel I got off on the wrong foot here. I just saw SD was using a file format we can handle, and we wanted to help.

The discussion is "Performance profiling report" not AI, did I put this in the wrong place? I can remove my post.

Performance Profiling Report (Forge/A1111/ComfyUI) #716

Hardware

A1111 Setup

Replies: 23 comments · 57 replies

huchenlei May 7, 2024 Author

Forge Setup

Note

huchenlei May 7, 2024 Author

ComfyUI Setup

huchenlei May 7, 2024 Author

huchenlei May 7, 2024 Author

huchenlei May 7, 2024 Author

huchenlei May 7, 2024 Author

Problem 1

huchenlei May 7, 2024 Author

huchenlei May 7, 2024 Author

Problem 2

A1111

Forge

huchenlei May 13, 2024 Author

huchenlei May 14, 2024 Author

huchenlei May 15, 2024 Author

huchenlei May 7, 2024 Author

Problem 3

Forge

A1111

huchenlei May 8, 2024 Author

Understanding the checkpoint Function

Role of CheckpointFunction

Why parameters() Is Invoked Often

Practical Implications

huchenlei May 8, 2024 Author

huchenlei May 8, 2024 Author

huchenlei May 8, 2024 Author

Problem 4

Forge

A1111

huchenlei May 8, 2024 Author

huchenlei May 8, 2024 Author

huchenlei May 8, 2024 Author

Problem 5

huchenlei May 17, 2024 Author

huchenlei Jun 2, 2024 Author

Replies: 23 comments 57 replies

huchenlei
May 7, 2024
Author

huchenlei
May 7, 2024
Author

huchenlei
May 7, 2024
Author

huchenlei May 7, 2024
Author

huchenlei
May 7, 2024
Author

huchenlei
May 7, 2024
Author

huchenlei May 7, 2024
Author

huchenlei
May 7, 2024
Author

huchenlei May 13, 2024
Author

huchenlei May 14, 2024
Author

huchenlei May 15, 2024
Author

huchenlei
May 7, 2024
Author

huchenlei May 8, 2024
Author

huchenlei May 8, 2024
Author

huchenlei
May 8, 2024
Author

huchenlei
May 8, 2024
Author

huchenlei May 8, 2024
Author

huchenlei
May 8, 2024
Author

huchenlei
May 8, 2024
Author

huchenlei
May 17, 2024
Author

huchenlei Jun 2, 2024
Author