Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up line rendering with WebGL2 instancing #6162

Closed
wants to merge 10 commits into from

Conversation

davepagurek
Copy link
Contributor

@davepagurek davepagurek commented May 26, 2023

Resolves #6091

Changes

Currently, drawing strokes in WebGL mode is fairly slow. Since each segment/cap/join of a stroke is made of at least 2 triangles, we had to duplicate data per vertex of each triangle on the CPU.

Now that WebGL2 is enabled by default, we can use instanced rendering, which lets us repeat attributes per instance being rendered. This lets the GPU do the copying for us.

Here's a stress test example: https://editor.p5js.org/davepagurek/sketches/cWr5RmNAK

If you toggle between the 1.6.0 CDN <script> tag to the uploaded p5.min.js, the frame rate goes up from ~40fps to ~60fps on Chrome my M1 Macbook Pro (closer to ~18fps and ~40fps in Firefox.)

If you toggle version: 2 to version: 1, it goes back to copying the vertex data on the CPU and the frame rate drops to ~35fps. This is slower than before since the format required for instancing is a bit less efficient when done on the CPU, but WebGL2 support is pretty broad now so this should only happen on legacy code.

Future work

The new bottleneck in line rendering seems to be libtess. There are a few things we might consider doing:

  • Update how curveDetail works to maybe be relative to curve length (or curvature) to save vertices
  • Move away from libtess, maybe to mapbox/earcut which apparently is faster, but would require extra work to tesselate shapes that aren't on the XY plane

PR Checklist

  • npm run lint passes
  • [Inline documentation] is included / updated
  • [Unit tests] are included / updated

@davepagurek davepagurek requested a review from aferriss May 26, 2023 21:31
@davepagurek davepagurek marked this pull request as draft May 26, 2023 22:16
@davepagurek davepagurek changed the title Speed up line rendering with WebGL2 instancing [WIP] Speed up line rendering with WebGL2 instancing May 26, 2023
@davepagurek davepagurek changed the title [WIP] Speed up line rendering with WebGL2 instancing Speed up line rendering with WebGL2 instancing May 26, 2023
@davepagurek davepagurek marked this pull request as ready for review May 26, 2023 22:48
@@ -8,6 +8,28 @@ p5.RenderBuffer = class {
this.attr = attr; // the name of the vertex attribute
this._renderer = renderer;
this.map = map; // optional, a transformation function to apply to src
this._namespace = undefined;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you leave some short comments here about the purpose of these properties. I know they're also defined in the setters down below but I find it helpful to view it all in one place.

@@ -27,45 +49,77 @@ p5.RenderBuffer = class {
model = geometry;
}

let geometryData = geometry;
if (this._namespace) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there some redundancy of if statements and loops here? Any reason to not just combine into one if and one for

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

combining them should be fine!

this._offset = 0;
this._divisor = undefined;
}
namespace(namespace) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why wouldn't we just add these properties to the constructor instead of setting them through chaining functions during construction?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly because it gets hard to read with a long chain of parameters, and because it's valid to have some of these without others and having undefined in the middle of the list feels weird. I could also refactor this to use something like an options object for the constructor, which could serve a similar purpose?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think an options object sounds like it would be much cleaner here if you're open to doing that refactor.

if (divisor !== undefined) {
this._renderer.GL.vertexAttribDivisor(loc, divisor);
} else if (this._renderer.webglVersion === constants.WEBGL2) {
this._renderer.GL.vertexAttribDivisor(loc, 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand what this divisor is doing, though I assume it's ok to have it be zero?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The divisor is how many instances get the same attribute value in a row before moving on to the next value in the buffer. The default behavior is the same as 0, which means don't repeat anything, just keep pulling new values from the buffer like normal. For attributes that we want to repeat, we call set a different divisor to reuse the value for the whole instance or for multiple instances.

I can also add something like this in a comment explaining what it is so that it's less necessary to have a bunch of MDN tabs open to understand what the code is trying to do 😅

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense to me, thanks for the explanation!

vec4 posp = uModelViewMatrix * aPosition;
vec4 posqIn = uModelViewMatrix * (aPosition + vec4(aTangentIn, 0));
vec4 posqOut = uModelViewMatrix * (aPosition + vec4(aTangentOut, 0));
vec4 position = abs(aSide) == 1. ? aFromPosition : aToPosition;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way we can avoid all this branching? Maybe some clever math or predetermine the from / to position for the variable before reaching the shader?

Or is this just needless optimization?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if I can think of a way to get from/to working before reaching the shader. I was hoping there would be a way to repeat attributes for a few vertices in an instance to use just one position attribute, but it looks like all we can do is repeat the attribute for every vertex in the instance.

I think I can make that happen with mix() to avoid the conditional! Not sure if that'll be faster or not but I can check and see.

Copy link
Contributor

@aferriss aferriss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @davepagurek This is really complicated stuff and seems like it could be a big win for perf! I do worry a little about regressing performance on webGL1 though. Any way to prevent that?

I think another good test would be to make sure that lines drawn in webGL1 match those in webGL2

@aferriss
Copy link
Contributor

On the demo you linked, I'm actually seeing better performance from the CDN p5, using both webGL1 and webGL2. Your version was giving around 15fps, the CDN was close to 60 when drawing the different shapes.

I also got an error when trying to use your version with webGL1, not sure if maybe that script got built before all of your changes were made, but just want to flag that.

@davepagurek
Copy link
Contributor Author

On the demo you linked, I'm actually seeing better performance from the CDN p5, using both webGL1 and webGL2. Your version was giving around 15fps, the CDN was close to 60 when drawing the different shapes.

I've noticed sometimes changes to the html file don't update immediately for me until I save, I wonder if maybe that's happening? Do you mind checking again and comparing this version (instancing, webgl2) with this version (using the CDN, no instancing, only webgl1)? If it's still slower with instancing, that's definitely a problem -- what browser/OS do you use?

I also got an error when trying to use your version with webGL1, not sure if maybe that script got built before all of your changes were made, but just want to flag that.

Weird, I don't get that on Chrome/Firefox on my Mac, mind sending me the stacktrace of the error?

@davepagurek
Copy link
Contributor Author

I do worry a little about regressing performance on webGL1 though. Any way to prevent that?

This data format requires a tad more copying of values than before now that there are two position attributes instead of one. One option is to use a different line shader with different attributes in WebGL1, but I was hoping to avoid that because the maintenance cost goes up. I can see if there's anything I can do to speed up the copying in the WebGL1 case though. I can do a bit more debugging and see if something like using a fixed sized array upfront can help (since we can calculate the buffer sizes needed ahead of time before sorting the data into the non-instanced format)

@aferriss
Copy link
Contributor

Weird, I don't get that on Chrome/Firefox on my Mac, mind sending me the stacktrace of the error?

The error is TypeError: r._makeStrokeBufferData is not a function at undefined:2:860144 and only occurs when drawing 3d geometry with calls like sphere or box.

I'm using chrome 113 on Mac OSX 13.13

I'm reliably seeing very low framerates when testing with just lines using your build, and around 20fps using the CDN build. I made a copy with some average framerate calculations here as well.
https://editor.p5js.org/aferriss/sketches/HP0KpYo1J

@davepagurek
Copy link
Contributor Author

Thanks! ok, found a fix for the crash for WebGL1 mode. Eventually I'll add some more testing for WebGL1 mode for retained and immediate mode, but it sounds like before getting to that there's some more fundamental performance things to work out.

Also thanks for the p5 editor link, I think the difference in my testing was on single shapes with lots of lines vs many shapes with single lines. It seems like when rendering shapes with a lot of data (the single large begin/endShape in my editor link), the bottleneck is in duplicating the data, which this change addresses, but when rendering a single line, the fact that there are more buffers involved means that the overhead of flattening multiple arrays and allocating more Float32Arrays adds up. I'm going to spend some more time seeing if there's a way to keep the performance improvements for large shapes without compromising drawing many lines so much.

@davepagurek
Copy link
Contributor Author

OK I'm back with some updated! I have a new version here for testing: https://editor.p5js.org/davepagurek/sketches/tcug_H1Z8

Updates I made to speed things up:

  • No longer checking if face culling is enabled when drawing strokes. To make strokes work regardless of whether or not face culling is enabled, I updated the winding order of lines to be consistent with faces.
  • Using the same strides/divisors for all stroke buffers. This seems to be a bit faster than changing it up between segments/caps/joins (maybe the WebGL driver internally does some deduplication of commands?)
  • Avoiding flattening data when putting it into RenderBuffers. This is done by just writing them directly into an array rather than keeping around subarrays, and since the data is never read by other code (just sent right to the GPU), there is little cost in code readability.
  • Storing stroke data directly in Float32Arrays. I made a new p5.DataVector class similar to a C++ vector that doubles its capacity when data grows long. This way we avoid allocating new data every frame and having to GC it, which seems to be a big bottleneck.
    • As an aside, I also shrink the capacity down to be the nearest power of 2 before sending the data to the GPU, since it seems that even when you specify a count in gl.bufferData, it sends the whole Flaot32Array. Hopefully the power-of-2 sizes leave enough wiggle room to avoid too many resizes. It's great for drawing the same sort of shape many times in a row though!

Here's how it stands up against 1.6.0 in WebGL1 and 2 modes:

TestPreviousUpdated
ChromeFirefoxWebGL 1WebGL 2
ChromeFirefoxChromeFirefox
1000 lines 60fps16fps(↓)28fps(↑)24fps(=)60fps(↑)34fps
Large curve 23fps10fps(↓)12fps(=)11fps(↑)39ps(↑)25fps
Retained+Immediate Shapes 60fps60fps(=)60fps(=)60fps(=)60fps(=)60fps

Comparing against 1.6.0:

  • In WebGL2 mode, everything is around 2x as fast, or capped out at the 60fps max
  • In WebGL1 mode, Chrome is always 2x slower, while Firefox is around the same or marginally better

Personally, I think I'm OK with the WebGL1 tradeoffs, as I think its main use cases are:

  • old browsers, e.g. Safari on my old Mac which I won't update the OS of, but it still has access to up-to-date Chrome and Firefox with WebGL2 support anyway
  • shader compatibility with old shaders that used extensions (which could be used in WebGL2 by default but would require updating the GLSL syntax), but shader-based sketches seem not to rely much on geometry and lines

If you're OK with the performance tradeoffs for WebGL1, I can continue on and make another update with the extra comments/refactoring/tests we talked about earlier.

@aferriss
Copy link
Contributor

aferriss commented Jun 8, 2023

Thanks for the updates @davepagurek ! I'll take some time to look through your new commits and test perf again. For my own clarity, in your table does the previous column indicate p5 perf prior to any of your changes, or do they represent the perf from your initial round of commits? If it's the latter, including perf from prior to making any of these changes would be helpful to see.

It looks like there are wins everywhere except for drawing in chrome using webGL1, and a more than 2x slow down still seems like a large regression to me. Do you have a sense of why it's such a substantial hit for that platform?

It'd be really great if we could test on some mobile devices, safari, and some windows machines as well if possible.

@davepagurek
Copy link
Contributor Author

The "previous" column is for v1.6.0.

The slowdown is mostly because the data layout is a little different (there's a to and from coordinate instead of just a single vertex position) because it's optimizing for the webgl2 case. There's a tradeoff to make between webgl1 speed and code maintenance, because we could use a separate shader and data storage format if we're using webgl1, but then we have to keep track of the extra code. This implementation uses the same data layout for both, and has an extra for loop to mimic on the cpu what webgl2 instanced rendering does on the gpu, opting to compromise speed to simplify solve already kind of complex code.

I think it is definitely possible to use ifdefs to use a different shader layout and use different buffer structure in webgl1 mode if we want to treat it less like a graceful degradation and more like a primary usage target if we're ok with the added complexity and maintenance.

@aferriss
Copy link
Contributor

aferriss commented Jun 8, 2023

Thanks for the extra explanation. It sounds to me like the extra complexity may not be worth the amount of devices we'd be continuing to support. It's a little unclear how many samples they have but data from here seems to suggest that almost 98% of devices are able to support webGL2. Might be good to get some additional input from some of the other maintainers as well.

@davepagurek
Copy link
Contributor Author

davepagurek commented Jun 8, 2023

Also, I got someone to test on windows for me, this is what they reported back:

AMD Ryzen 9 5950X (16-Core/3.40 GHz), AMD Radeon RX 6600 - all modes using default settings as well.

chrome (114.0.5735.110)
many lines - 11-13 fps (1.6.0) / 5 fps (p5.js in sketch)
large curve - 21-22 fps / 45-46 fps
retained shapes - 57 fps / 60 fps

firefox (114.0)
many lines - 3 fps / 5-6 fps
large curve - 14 fps / 33-36 fps
retained shapes - 60 fps / 60 fps

It looks like the many lines one is still slower in Chrome, and actually quite a bit slower than on my Mac even on the previous version of p5. Since a single large curve runs reasonably fast, it seems like the bottleneck there is likely the data transfer between CPU and GPU.

I'll also try to run some tests on Safari and my phone later!

@davepagurek
Copy link
Contributor Author

I did some more testing on Safari on my mac and Chrome on my Android phone:

TestPreviousUpdated
SafariMobile ChromeWebGL1WebGL2
SafariMobile ChromeSafariMobile Chrome
1000 lines26fps6fps18fps4fps10fps2fps
Large curve20fps9fps15fps4fps41fps18fps
Retained+Immediate Shapes
60fps60fps60fps60fps60fps60fps

It seems like for both of these, rendering many lines gets a bit slower, and rendering large shapes gets faster. Interestingly, safari is actually faster in WebGL 1 mode than WebGL 2.

It seems like there isn't much of a pattern. I suppose it depends on what pipelines the hardware is optimized for under the hood? Mobile seems to be extremely slow at large volumes of back-and-forth between the CPU and GPU, so rendering that many lines is never fast.

So it looks like this instancing update is not a clear win across the board, but rather, this will be a question of what we want to optimize for. I think I'm slightly more inclined to optimize for larger shapes rather than many lines:

  • In 2D mode people often use many small lines because:
    • There isn't a way of doing line gradients. In WebGL mode, we can do line gradients, so that should be the recommended approach
    • They want to emulate textured brushes with many small strokes. This is also valid, but probably better served by a shader, although that's a pretty big skill jump to make. This is an area I'd like to make more accessible going forward regardless, though
  • In 2D mode, curves are relatively cheap. In WebGL, they're significantly more expensive than straight lines, and drawing a few curves can quickly land you in something that performs as well as the "large curve" test case

...but also we don't need to necessarily optimize for that using the method in this PR (although I don't have other ideas off the top of my head just yet.)

What are your thoughts @aferriss? Also @stalgiag if you have any thoughts that would also be great!

@davepagurek
Copy link
Contributor Author

I think I'm going to close this one in favour of the more general performance improvements in #6230, and also the more manual approach to instancing that #6276 starts to add.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve WebGL2 line performance using drawArraysInstanced()
2 participants