
[js/web] JSEP LayerNormalization and InstanceNormalizations kernels #16830

Merged: 15 commits into microsoft:main on Aug 8, 2023

Conversation

dakenf (Contributor) commented Jul 24, 2023

Description

Added two kernels for Layer and Instance norm

Also raised the `maxBufferSize` limit when requesting the GPU device: by default it is capped at 256 MB, which fails when allocating a 600 MB buffer while running fp32 StableDiffusion weights.
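
For reference, a minimal sketch of what requesting the device with raised limits can look like (an illustration using the standard WebGPU API, not the exact PR code):

```ts
// Sketch: request the adapter's own maximums instead of the WebGPU
// defaults, which cap buffer sizes at 256 MB.
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter!.requestDevice({
  requiredLimits: {
    maxBufferSize: adapter!.limits.maxBufferSize,
    maxStorageBufferBindingSize: adapter!.limits.maxStorageBufferBindingSize,
  },
});
```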

Motivation and Context

These two are used in StableDiffusion and many other networks

satyajandhyala added the ep:WebGPU (ort-web webgpu provider) label on Jul 24, 2023
dakenf (Contributor, Author) commented Jul 25, 2023

> Can we verify/enable some of these tests if not all?

I've enabled all instance norm tests (there were just two) and all layer norm tests except some "expanded" ones, which don't actually test the op and fail for some reason. They are:

- test_layer_normalization_3d_axis_negative_1_epsilon_expanded
- test_layer_normalization_3d_axis_negative_2_epsilon_expanded
- test_layer_normalization_3d_axis_negative_3_epsilon_expanded
- test_layer_normalization_3d_axis0_epsilon_expanded
- test_layer_normalization_3d_axis2_epsilon_expanded
- test_layer_normalization_4d_axis_negative_2_expanded
- test_layer_normalization_4d_axis2_expanded

dakenf (Contributor, Author) commented Jul 27, 2023

@fs-eire @satyajandhyala can you please advise me on a weird issue I'm getting with LayerNorm?
In the spec and in the tests it requires three outputs: the result, the mean, and the inverse standard deviation. And that works fine.
But when I run it with the StableDiffusion text encoder or UNet, it throws an error:

```
Writable storage buffer binding aliasing found between [BindGroup] set at bind group index 0, binding index 4, and [BindGroup] set at bind group index 0, binding index 5, with overlapping ranges (offset: 0, size: 1280) and (offset: 0, size: 1280) in [Buffer].
 - While encoding [ComputePassEncoder].DispatchWorkgroups(8, 1, 1).
```

So if I comment out the last binding it works perfectly fine, but then the tests fail because they expect three outputs:

```ts
@group(0) @binding(0) var<storage, read> x : array<${dataType}>;
@group(0) @binding(1) var<storage, read> scale : array<${dataType}>;
@group(0) @binding(2) var<storage, read> bias : array<${dataType}>;
@group(0) @binding(3) var<storage, read_write> output : array<${dataType}>;
@group(0) @binding(4) var<storage, read_write> meanDataOutput : array<${dataType}>;
// COMMENT OUT THIS: @group(0) @binding(5) var<storage, read_write> invStdOutput : array<${dataType}>;

${shaderHelper.mainStart(workgroupLimits)}
  // ... shader code ...

  meanDataOutput[global_idx] = mean;
  invStdOutput[global_idx] = 1 / meanSquare;
}`;
return {
  ...metadata,
  outputs: [
    {dims: outputShape, dataType: inputs[0].dataType, gpuDataType: GpuDataType.default},
    {dims: meanInvStdDevDim, dataType: inputs[0].dataType, gpuDataType: GpuDataType.default},
    // COMMENT OUT THIS: {dims: meanInvStdDevDim, dataType: inputs[0].dataType, gpuDataType: GpuDataType.default},
  ],
  getShaderSource,
  dispatchGroup: () => (dispatchGroup),
};
```

And I have another, unrelated issue. After adding Gemm for opset 13 (currently JSEP has a kernel only for opset 11 and falls back to CPU), I get this error:

```
[Buffer] usage (BufferUsage::(Storage|BufferUsage::80000000)) includes writable usage and another usage in the same synchronization scope.
 - While validating compute pass usage.
```

I guess it hits the maximum buffer count/sizes. Is there an easy way to dispose of unused buffers during execution? It uses more than 8 GB of video memory: 3 GB for the UNet weights, and it allocates more than 5 GB during OrtRun.

fs-eire (Contributor) commented Jul 28, 2023

The output count needs to match the actual number of the node's outputs in the graph. I am not sure whether a kernel context can get the output count; if not, we need to add it.
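
For illustration, a minimal sketch of that gating in the LayerNorm kernel (`outputCount` is a placeholder name for whatever the kernel context ends up exposing):

```ts
// Hypothetical sketch: declare only the outputs the node actually has,
// so no two bindings alias the same placeholder buffer.
const outputs = [
  {dims: outputShape, dataType: inputs[0].dataType, gpuDataType: GpuDataType.default},
];
if (outputCount > 1) {
  outputs.push({dims: meanInvStdDevDim, dataType: inputs[0].dataType, gpuDataType: GpuDataType.default});
}
if (outputCount > 2) {
  outputs.push({dims: meanInvStdDevDim, dataType: inputs[0].dataType, gpuDataType: GpuDataType.default});
}
```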

satyajandhyala (Contributor) commented Jul 28, 2023

> @fs-eire @satyajandhyala can you please advise me on a weird issue I'm getting with LayerNorm? [...]

Does it help to specify the actual size, if possible, in addition to dataType for x, scale, bias, output, etc.?

dakenf (Contributor, Author) commented Jul 28, 2023

> Does it help to specify the actual size, if possible, in addition to dataType for x, scale, bias, output, etc.?

The suggestion above helped. I've added `OutputCount` to the serialized kernel context and it started to work fine.
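
Roughly, the shader source can then be generated to emit the optional bindings and stores only when those outputs exist (a sketch under the same assumption, not the exact PR code):

```ts
// Sketch: gate the optional binding declarations and stores on the
// node's actual output count (outputCount and dataType assumed given).
const getShaderSource = (outputCount: number, dataType: string) => `
  @group(0) @binding(3) var<storage, read_write> output : array<${dataType}>;
  ${outputCount > 1 ? `@group(0) @binding(4) var<storage, read_write> meanDataOutput : array<${dataType}>;` : ''}
  ${outputCount > 2 ? `@group(0) @binding(5) var<storage, read_write> invStdOutput : array<${dataType}>;` : ''}
  // ... main body ...
  ${outputCount > 1 ? 'meanDataOutput[global_idx] = mean;' : ''}
  ${outputCount > 2 ? 'invStdOutput[global_idx] = 1 / meanSquare;' : ''}
`;
```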

fs-eire (Contributor) commented Jul 30, 2023

I have a question regarding the optional inputs/outputs of an operator. Take LayerNormalization, for example: is it possible that a graph contains a LayerNormalization node whose output[0] and output[2] exist, but output[1] is omitted? If this can happen, then defining 'outputCount' may not be sufficient; it would be ambiguous, because outputCount == 2 could mean either output[0]+output[1] or output[0]+output[2].

dakenf (Contributor, Author) commented Jul 31, 2023

> I have a question regarding the optional inputs/outputs of an operator. [...]

From what I see in the code, if output[1] does not exist, the node will still have three outputs, but context->OutputType(1) will return nullptr for the second one. But I might be wrong. The CPU execution provider does not have any checks; it always has three outputs.

Also, the bias input for LayerNorm is optional, so I will add a check for it.

guschmue (Contributor) commented Aug 1, 2023

/azp run ONNX Runtime Web CI Pipeline

azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

guschmue (Contributor) commented Aug 2, 2023

OK, some reviewers got added to your Chromium PR - hope that works out.

dakenf (Contributor, Author) commented Aug 2, 2023

Perfect, thanks. If I manage to fix memory.grow above 4 GB, there will be no reason to use the 32-bit version with JSEP. Right now the only way to use more memory is to specify a fixed size with the INITIAL_MEMORY linker flag.

guschmue (Contributor) commented Aug 2, 2023

Yes. I'm sure there will be some more complications along the way, but it will be needed now that WebGPU is here and shows the possibilities of running things like SD or LLMs in the browser.

guschmue (Contributor) commented Aug 4, 2023

/azp run ONNX Runtime Web CI Pipeline

azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

satyajandhyala (Contributor) commented:

@dakenf The latest change to the webgpu-operator.md document removed some comments that existed previously. I am not sure whether that is intentional. We don't edit this file directly; it should be generated using `npm run build:doc` in the onnxruntime/js/web folder. The comments are inserted by generate-webgpu-operator-md.ts. After making any necessary changes to that TypeScript file, you need to run `npm run prepare` to compile/generate the JavaScript file.

satyajandhyala (Contributor) commented:

@dakenf Some unresolved/valid comments are blocking this PR from being merged. Please address these comments so that the PR can be merged.

satyajandhyala (Contributor) commented:

@dakenf Click on the "view" link in "Unresolved conversations" to go to the unresolved comment directly.

dakenf (Contributor, Author) commented Aug 4, 2023

> @dakenf Click on the "view" link in "Unresolved conversations" to go to the unresolved comment directly.

I think it's just the GitHub interface lagging, because this is what I get when I click on it:

[screenshot]

But the actual code is already fixed (with int32_t and axis = -1) if you go here: https://github.com/microsoft/onnxruntime/pull/16830/files

[screenshot]

Or am I looking in the wrong place?

satyajandhyala (Contributor) commented:

@dakenf Thanks for clarifying. I marked the conversation resolved.

guschmue (Contributor) commented Aug 7, 2023

/azp run ONNX Runtime Web CI Pipeline

azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

guschmue (Contributor) commented Aug 7, 2023

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

azure-pipelines
Azure Pipelines successfully started running 9 pipeline(s).

guschmue (Contributor) commented Aug 7, 2023

/azp run Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed

azure-pipelines
Azure Pipelines successfully started running 6 pipeline(s).

guschmue (Contributor) commented Aug 7, 2023

going to take this for a quick test before I merge it

guschmue merged commit c3f0425 into microsoft:main on Aug 8, 2023
jchen351 pushed a commit that referenced this pull request on Aug 12, 2023.
kleiti pushed a commit to kleiti/onnxruntime that referenced this pull request on Mar 22, 2024.
siweic0 pushed a commit to siweic0/onnxruntime-web that referenced this pull request on May 9, 2024.