Optimize base and shadow meshes for vertex cache #94241
This could also benefit #94097 when using complex PrimitiveMeshes. Like
Would procedural geometry use ImporterMesh or SurfaceTool? Asking because SurfaceTool already exposes
At this point, I expect to use both for procedural generation and CSG, but I am in favour of this.
Looks great! I'm glad to see that the performance benefits are so tangible
#68959 should be tested with this.
@mrjustaguy That issue should not be affected by this change in isolation for two reasons: 1) this PR only adds the relevant functionality to the glTF import path. Adding it to .obj is just a matter of adding the following to the .obj importer, but I'm worried this will cause conflicts with #94108, so I'd rather do that separately, or as part of that change if this ends up getting merged first:
+ for (int i = 0; i < r_meshes.size(); i++) {
+     r_meshes.get(i)->optimize_indices_for_cache();
+ }
Edit: yeah, confirmed that shadow meshes help on that file: the depth pre-pass drops from 0.45 ms to 0.24 ms on a 4090. Without this change but with shadow mesh creation, the depth pre-pass drops to 0.26 ms, so this PR gives a small improvement for shadow meshes even in this edge case, which is nice.
I mean, that was really a stress test to compare Godot 3 with Godot 4 primitive performance. Though I think there have been a few relevant optimizations since I last looked at it, so I don't know how Godot 4 compares to 3 today in that respect.
Procedural geometry generation is done with SurfaceTool, but ImporterMesh also exposes similar functions so that import scripts can make use of it.
Needs a rebase to resolve merge conflicts after some initial merges in 4.4.
Rebased vs master.
Thanks!
Previously, vertex cache optimization was run for the LOD meshes, but was never run for the base mesh or for the shadow meshes, including the shadow LOD chain (the shadow LOD chain would sometimes get implicitly optimized for vertex cache as a byproduct of base LOD optimization, but not always). This could significantly affect the rendering performance of geometry-heavy scenes, especially for depth or shadow passes where the fragment load is light.
This PR unconditionally runs the optimization for the base mesh before further processing, and for any generated shadow index buffers; if the meshoptimizer module is not loaded, we silently skip the processing. Note that this is the same algorithm we already use for LOD index buffers.
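To make the "run if available, otherwise silently skip" behaviour concrete, here is a minimal standalone sketch. It is not the actual Godot code: `Surface`, `optimize_surface_for_cache`, the hook signature and `reverse_triangles` are all hypothetical names made up for illustration, and the stand-in optimizer just reverses triangle order instead of calling meshoptimizer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical hook signature: write a reordered copy of `src` into `dst`.
using OptimizeVertexCacheFunc = void (*)(uint32_t *dst, const uint32_t *src, size_t index_count, size_t vertex_count);

// Stays null unless an optimizer module registers itself at startup.
OptimizeVertexCacheFunc optimize_vertex_cache_func = nullptr;

// Hypothetical surface holding a base index buffer and a shadow index buffer.
struct Surface {
	size_t vertex_count = 0;
	std::vector<uint32_t> indices; // base mesh triangle list
	size_t shadow_vertex_count = 0;
	std::vector<uint32_t> shadow_indices; // shadow mesh triangle list
};

// Reorders one triangle list in place; does nothing when no hook is registered.
void optimize_index_buffer(std::vector<uint32_t> &indices, size_t vertex_count) {
	assert(indices.size() % 3 == 0);
	if (optimize_vertex_cache_func == nullptr || indices.empty()) {
		return; // Module not loaded (or nothing to do): silently skip.
	}
	std::vector<uint32_t> reordered(indices.size());
	optimize_vertex_cache_func(reordered.data(), indices.data(), indices.size(), vertex_count);
	indices.swap(reordered);
}

// Runs the same reordering on the base mesh and on the shadow index buffer.
void optimize_surface_for_cache(Surface &surface) {
	optimize_index_buffer(surface.indices, surface.vertex_count);
	optimize_index_buffer(surface.shadow_indices, surface.shadow_vertex_count);
}

// Stand-in "optimizer" for demonstration only: reverses the triangle order.
void reverse_triangles(uint32_t *dst, const uint32_t *src, size_t index_count, size_t /*vertex_count*/) {
	const size_t triangle_count = index_count / 3;
	for (size_t t = 0; t < triangle_count; t++) {
		for (size_t k = 0; k < 3; k++) {
			dst[t * 3 + k] = src[(triangle_count - 1 - t) * 3 + k];
		}
	}
}
```

With the hook left null the call is a no-op, so import behaves as before on builds without the optimizer; once a hook is registered, both index buffers go through the same reordering.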
I generally treat this optimization as "always on, do no harm" - it only changes the order of triangles, which is generally indeterminate on import, and it is fairly quick. For a sense of scale, this is ~6x faster than tangent generation and ~25x faster than LOD generation (before my previous optimization PR, so maybe ~10x after?), and consequently should not change the import time much. I've tested this with the DragonAttenuation model (https://github.com/KhronosGroup/glTF-Sample-Models/tree/main/2.0/DragonAttenuation) and didn't see the overall import time change in a statistically measurable way. The appearance of any model should be the same: this only changes the submitted triangle order within each mesh, which has no impact on opaque meshes and should not make transparent meshes worse, since the order of triangles on them could not be relied upon anyway.
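The "only changes the order of triangles" property can be checked mechanically. The sketch below is an illustration rather than anything from the PR; `Triangle`, `canonical`, `sorted_triangles` and `same_triangles` are hypothetical helper names. It compares two index buffers as unordered collections of triangles, treating rotations of a triangle as equal but not flipped winding.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

using Triangle = std::array<uint32_t, 3>;

// Picks the lexicographically smallest rotation, so (a,b,c), (b,c,a) and (c,a,b)
// compare equal while the flipped winding (a,c,b) does not.
Triangle canonical(uint32_t a, uint32_t b, uint32_t c) {
	return std::min({ Triangle{ a, b, c }, Triangle{ b, c, a }, Triangle{ c, a, b } });
}

// Converts a triangle list into a sorted list of canonical triangles.
std::vector<Triangle> sorted_triangles(const std::vector<uint32_t> &indices) {
	std::vector<Triangle> triangles;
	for (size_t i = 0; i + 2 < indices.size(); i += 3) {
		triangles.push_back(canonical(indices[i], indices[i + 1], indices[i + 2]));
	}
	std::sort(triangles.begin(), triangles.end());
	return triangles;
}

// True when both buffers describe the same triangles, possibly in a different order.
bool same_triangles(const std::vector<uint32_t> &a, const std::vector<uint32_t> &b) {
	return a.size() == b.size() && sorted_triangles(a) == sorted_triangles(b);
}
```

A check like this could be run on an index buffer before and after reordering to confirm that the geometry itself is untouched.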
As with any hardware performance optimization, this is hard to measure well. On a scene with 28 clones of the model above, with some objects closer to the camera (LOD 0) and some further away, my aggregate measurements on an NVIDIA RTX 4090 make that scene ~17% faster in terms of full frame time to render. Most of the gains are just from the shadow mesh optimization (it's something like 11% for shadow mesh optimization and 6% extra on top from base mesh optimization) - depth pre-pass and shadow passes tend to be vertex/raster bound, and the shadow mesh is rendered multiple times, so that makes sense. Note that other meshes may display no performance gains (for example, if a mesh is fairly low-poly, or if the scene has been preprocessed with tools like gltfpack that generate optimal order, the gains will be small to non-existent), and could also display larger performance gains (as the original order can be more pathologically bad depending on the exporter). Realistically, I would not expect a double-digit performance improvement here on realistic scenes, but the gains are free.
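To illustrate why the original triangle order matters, here is a rough sketch of a common offline metric, the average cache miss ratio (ACMR: simulated vertex transforms per triangle). It assumes a simple FIFO cache of fixed size, which is only a model and not how any specific GPU behaves; `fifo_acmr` is a hypothetical helper and is not part of the PR or of meshoptimizer.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Simulates a FIFO post-transform cache and returns misses per triangle.
// Lower is better; 3.0 means no vertex reuse at all.
double fifo_acmr(const std::vector<uint32_t> &indices, size_t cache_size) {
	std::deque<uint32_t> cache;
	size_t misses = 0;
	for (uint32_t index : indices) {
		if (std::find(cache.begin(), cache.end(), index) != cache.end()) {
			continue; // Hit: the vertex was transformed recently.
		}
		misses++;
		cache.push_back(index);
		if (cache.size() > cache_size) {
			cache.pop_front(); // Evict the oldest entry.
		}
	}
	const size_t triangle_count = indices.size() / 3;
	return triangle_count == 0 ? 0.0 : double(misses) / double(triangle_count);
}
```

For example, with a 4-entry cache, the two quads `{0,1,2, 2,1,3, 4,5,6, 6,5,7}` give an ACMR of 2.0, while interleaving the same triangles as `{0,1,2, 4,5,6, 2,1,3, 6,5,7}` gives 2.25, even though the rendered geometry is identical.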
The measurements quoted above were taken with VSync disabled using full-frame FPS. If we measure the GPU time of the individual passes (using Godot's Visual Profiler), the relative gains are more significant. Note that I'm using the numbers as displayed by the profiler (2 decimal digits); my GPU is clearly too fast for this 😝: