Implement texture mipmap generation via compute pipelines #5757

Open · wants to merge 39 commits into master
@frenzibyte (Member) commented Apr 26, 2023

Preface

In #5508, custom mipmap generation was implemented to reduce the overhead of uploading small textures to an atlas.

The implementation revolved around binding one level of a texture as a sampler and another level as a render target (FBO), then using a fragment shader to downsample the larger level and store the result in the smaller one.

This was pretty straightforward on OpenGL and Metal, but D3D11 and Vulkan don't support binding the same texture as both an input and an output of a shader.

That was worked around by allocating a separate texture to bind as the shader input, but that required preparing the texture content for each level, which meant switching between encoders on Metal (something Veldrid's Vulkan implementation doesn't appear to handle well).

Compute-based mipmap generation

After exploring compute shaders for the past week, I've come up with a simple implementation that avoids creating extra resources as much as possible.

The concept boils down to a compute shader accepting an input texture, a linear sampler, and a uniform buffer supplying the uploaded region. The compute shader downsamples the texture according to the specified uploaded region and the current invocation ID, and stores the result in the output texture.
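As a rough sketch of the per-invocation logic described above (in Python with illustrative names, not the PR's actual shader; the real kernel operates on GPU textures and uses a linear sampler to average):

```python
# Hypothetical sketch: each invocation (gx, gy) writes one texel of the
# smaller mip level by averaging the 2x2 block of the larger level,
# restricted to the uploaded region supplied via the uniform buffer.

def downsample_region(src, region, gx, gy):
    """src: 2D list of floats (larger mip level).
    region: (x, y, width, height) of the uploaded area in the smaller level.
    (gx, gy): invocation ID within the dispatched grid."""
    x0, y0, w, h = region
    if gx >= w or gy >= h:
        return None  # invocation falls outside the uploaded region; do nothing
    dx, dy = x0 + gx, y0 + gy  # target texel in the smaller level
    sx, sy = 2 * dx, 2 * dy    # top-left of the 2x2 block in the larger level
    texels = [src[sy][sx], src[sy][sx + 1],
              src[sy + 1][sx], src[sy + 1][sx + 1]]
    return sum(texels) / 4.0   # box filter, as a linear sampler midpoint fetch would

src = [[float(x + y * 4) for x in range(4)] for y in range(4)]
print(downsample_region(src, (0, 0, 2, 2), 0, 0))  # average of 0, 1, 4, 5 = 2.5
```

The bounds check at the top mirrors what a real shader must do when the dispatched grid is larger than the uploaded region.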

For Veldrid, this is implemented by splitting each texture into texture views in VeldridTextureResources, binding one view for sampling and another for output, and supplying a reusable uniform buffer with region data.

For Legacy OpenGL, this is implemented similarly, but texture views aren't required since each texture acts as its own sampler, so the sampler properties can be adjusted for each dispatch call without creating extra resources.

Compute shaders are shown to be supported by 80% of systems according to this game, but the rendering-based path still exists as a fallback for older hardware, since it still performs ~10x better than the driver implementation. It could be further improved by using texture views and caching VBOs, but I'm leaving it as-is in this PR.

Better rectangle merging implementation

After writing a test scene for mipmap generation, it turned out there was a high overhead coming from the rectangle merging logic added in #5508.

The loop works well for a few overlapping regions, but becomes very expensive when a high number of uploads is queued across different regions.

The overhead can get as high as ~500ms, which is 100x higher than the overhead of mipmap generation itself.

After discussing this on Discord with @smoogipoo, I've come up with a different implementation that follows the nature of texture atlases and produces as few rectangles as possible to simplify the mipmap generation process.

The implementation accepts all uploaded regions and produces at most two rectangles: one covering the top-left uploaded region and all regions horizontally after it, and another covering all regions horizontally before it.
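The merging strategy above can be sketched as follows (a hedged approximation with illustrative names and plain `(x, y, w, h)` tuples; the actual implementation operates on osu-framework's rectangle types):

```python
# Illustrative sketch: merge uploaded regions into at most two bounding
# rectangles, split around the top-left uploaded region's x coordinate.

def merge_upload_regions(regions):
    """regions: list of (x, y, w, h) tuples; returns at most two rectangles."""
    if not regions:
        return []
    # the "top-left" region: smallest y, then smallest x
    top_left = min(regions, key=lambda r: (r[1], r[0]))
    after = [r for r in regions if r[0] >= top_left[0]]   # at/after top-left, horizontally
    before = [r for r in regions if r[0] < top_left[0]]   # before top-left, horizontally

    def bounds(rs):
        # axis-aligned bounding rectangle of a group of regions
        x1 = min(r[0] for r in rs)
        y1 = min(r[1] for r in rs)
        x2 = max(r[0] + r[2] for r in rs)
        y2 = max(r[1] + r[3] for r in rs)
        return (x1, y1, x2 - x1, y2 - y1)

    return [bounds(rs) for rs in (after, before) if rs]

# Two uploads continuing a row, plus one wrapping to the next row:
print(merge_upload_regions([(10, 0, 5, 5), (15, 0, 5, 5), (0, 5, 5, 5)]))
# two rectangles: one for the row tail, one for the wrapped upload
```

Because atlas uploads fill rows left-to-right and wrap, this split yields two tight rectangles rather than one large bounding box spanning mostly untouched texels.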

[Before / After comparison screenshots]

(each highlighted box represents a region that's worked on by the mipmap generation logic)

On master, mipmaps would be generated for every single uploaded region (including the padding around each texture), as can be seen above; this resulted in too many vertices and unnecessary fills.

With the new implementation, all uploaded regions are merged into at most two rectangles, dispatching as many threads as possible without overlapping or unnecessary work.

The new implementation is also ~75x faster than the original logic, which should reduce stutters in osu! when starting gameplay, as there are a lot of textures getting added to the atlas at once.


As an aside, I have noticed both the computing and rendering implementations missing certain textures when testing with osu!, I'll leave this PR blocked from merge for now until I have the time to investigate it or someone else can look into it.

Requires testing on all configurations below:

  • Windows (Legacy OpenGL)
    • Computing path (requires GL_ARB_compute_shader)
    • Rendering path
  • Windows (Veldrid OpenGL)
    • Computing path (requires GL_ARB_compute_shader and GL_ARB_texture_view)
    • Rendering path
  • Windows (Direct3D 11)
    • Computing path (requires feature level 11_0+)
    • Rendering path
  • macOS (x64, Legacy OpenGL)
  • macOS (x64, Veldrid OpenGL)
  • macOS (x64, Metal)
  • macOS (M1, Legacy OpenGL)
  • macOS (M1, Veldrid OpenGL)
  • macOS (M1, Metal)
  • Linux (Legacy OpenGL)
    • Computing path (requires GL_ARB_compute_shader)
    • Rendering path
  • Linux (Veldrid OpenGL)
    • Computing path (requires GL_ARB_compute_shader and GL_ARB_texture_view)
    • Rendering path
  • Android (Legacy OpenGL)
    • Computing path (requires GL_ARB_compute_shader)
    • Rendering path
  • iOS (Legacy OpenGL)
    • Computing path (requires GL_ARB_compute_shader)
    • Rendering path
  • iOS (Veldrid OpenGL)
    • Computing path (requires GL_ARB_compute_shader and GL_ARB_texture_view)
    • Rendering path
  • iOS (Metal)

Optional configurations:

  • Windows (Vulkan)
  • Linux (Vulkan)

Implementation notes

  • Direct3D 11 doesn't like reading from `image2D`-like types with complicated formats like rgba8. We want compatibility with 10_0 feature levels, which use Shader Model 4 (cs_4_0); Shader Model 4 only supports UAV resources of type `RWByteAddressBuffer` or `RWStructuredBuffer` (i.e. structured buffers). The overhead required to make this work on older Direct3D versions is quite bad compared to using the framebuffer method or the driver implementation.
  • Metal mostly supports this (needs testing on multiple devices for confirmation), and D3D11 also supports it according to documentation (cs_5_0 profile only).
  • Dispatching is incorrect for texture dimensions that are not a multiple of the threadgroup size; using the texture width is more accurate.
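The threadgroup-sizing issue can be illustrated with ceiling division; a minimal sketch, assuming an 8×8 threadgroup size (`GROUP_SIZE` and `dispatch_size` are illustrative names, not from the PR):

```python
# Illustrative sketch: round the number of threadgroups up so texels of a
# texture whose size is not a multiple of the group size are still covered.

GROUP_SIZE = 8  # assumed local workgroup dimension (8x8)

def dispatch_size(width, height):
    """Number of threadgroups needed to cover a width x height region."""
    groups_x = (width + GROUP_SIZE - 1) // GROUP_SIZE   # ceiling division
    groups_y = (height + GROUP_SIZE - 1) // GROUP_SIZE
    return groups_x, groups_y

print(dispatch_size(100, 64))  # (13, 8): 100 is not a multiple of 8
```

With rounded-up dispatch sizes, the shader itself must mask out-of-range invocations (as in the bounds check shown earlier), since the final groups partially fall outside the region.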
Status: On hold