-
-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up line rendering with WebGL2 instancing #6162
Conversation
@@ -8,6 +8,28 @@ p5.RenderBuffer = class { | |||
this.attr = attr; // the name of the vertex attribute | |||
this._renderer = renderer; | |||
this.map = map; // optional, a transformation function to apply to src | |||
this._namespace = undefined; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you leave some short comments here about the purpose of these properties. I know they're also defined in the setters down below but I find it helpful to view it all in one place.
@@ -27,45 +49,77 @@ p5.RenderBuffer = class { | |||
model = geometry; | |||
} | |||
|
|||
let geometryData = geometry; | |||
if (this._namespace) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there some redundancy of if statements and loops here? Any reason to not just combine into one if
and one for
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
combining them should be fine!
this._offset = 0; | ||
this._divisor = undefined; | ||
} | ||
namespace(namespace) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why wouldn't we just add these properties to the constructor instead of setting them through chaining functions during construction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly because it gets hard to read with a long chain of parameters, and because it's valid to have some of these without others and having undefined
in the middle of the list feels weird. I could also refactor this to use something like an options object for the constructor, which could serve a similar purpose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think an options object sounds like it would be much cleaner here if you're open to doing that refactor.
src/webgl/p5.Shader.js
Outdated
if (divisor !== undefined) { | ||
this._renderer.GL.vertexAttribDivisor(loc, divisor); | ||
} else if (this._renderer.webglVersion === constants.WEBGL2) { | ||
this._renderer.GL.vertexAttribDivisor(loc, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand what this divisor is doing, though I assume it's ok to have it be zero?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The divisor is how many instances get the same attribute value in a row before moving on to the next value in the buffer. The default behavior is the same as 0, which means don't repeat anything, just keep pulling new values from the buffer like normal. For attributes that we want to repeat, we call set a different divisor to reuse the value for the whole instance or for multiple instances.
I can also add something like this in a comment explaining what it is so that it's less necessary to have a bunch of MDN tabs open to understand what the code is trying to do 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense to me, thanks for the explanation!
vec4 posp = uModelViewMatrix * aPosition; | ||
vec4 posqIn = uModelViewMatrix * (aPosition + vec4(aTangentIn, 0)); | ||
vec4 posqOut = uModelViewMatrix * (aPosition + vec4(aTangentOut, 0)); | ||
vec4 position = abs(aSide) == 1. ? aFromPosition : aToPosition; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any way we can avoid all this branching? Maybe some clever math or predetermine the from / to position for the variable before reaching the shader?
Or is this just needless optimization?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if I can think of a way to get from/to working before reaching the shader. I was hoping there would be a way to repeat attributes for a few vertices in an instance to use just one position attribute, but it looks like all we can do is repeat the attribute for every vertex in the instance.
I think I can make that happen with mix()
to avoid the conditional! Not sure if that'll be faster or not but I can check and see.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @davepagurek This is really complicated stuff and seems like it could be a big win for perf! I do worry a little about regressing performance on webGL1 though. Any way to prevent that?
I think another good test would be to make sure that lines drawn in webGL1 match those in webGL2
On the demo you linked, I'm actually seeing better performance from the CDN p5, using both webGL1 and webGL2. Your version was giving around 15fps, the CDN was close to 60 when drawing the different shapes. I also got an error when trying to use your version with webGL1, not sure if maybe that script got built before all of your changes were made, but just want to flag that. |
I've noticed sometimes changes to the html file don't update immediately for me until I save, I wonder if maybe that's happening? Do you mind checking again and comparing this version (instancing, webgl2) with this version (using the CDN, no instancing, only webgl1)? If it's still slower with instancing, that's definitely a problem -- what browser/OS do you use?
Weird, I don't get that on Chrome/Firefox on my Mac, mind sending me the stacktrace of the error? |
This data format requires a tad more copying of values than before now that there are two position attributes instead of one. One option is to use a different line shader with different attributes in WebGL1, but I was hoping to avoid that because the maintenance cost goes up. I can see if there's anything I can do to speed up the copying in the WebGL1 case though. I can do a bit more debugging and see if something like using a fixed sized array upfront can help (since we can calculate the buffer sizes needed ahead of time before sorting the data into the non-instanced format) |
The error is I'm using chrome 113 on Mac OSX 13.13 I'm reliably seeing very low framerates when testing with just lines using your build, and around 20fps using the CDN build. I made a copy with some average framerate calculations here as well. |
Thanks! ok, found a fix for the crash for WebGL1 mode. Eventually I'll add some more testing for WebGL1 mode for retained and immediate mode, but it sounds like before getting to that there's some more fundamental performance things to work out. Also thanks for the p5 editor link, I think the difference in my testing was on single shapes with lots of lines vs many shapes with single lines. It seems like when rendering shapes with a lot of data (the single large begin/endShape in my editor link), the bottleneck is in duplicating the data, which this change addresses, but when rendering a single line, the fact that there are more buffers involved means that the overhead of flattening multiple arrays and allocating more |
OK I'm back with some updated! I have a new version here for testing: https://editor.p5js.org/davepagurek/sketches/tcug_H1Z8 Updates I made to speed things up:
Here's how it stands up against 1.6.0 in WebGL1 and 2 modes:
Comparing against 1.6.0:
Personally, I think I'm OK with the WebGL1 tradeoffs, as I think its main use cases are:
If you're OK with the performance tradeoffs for WebGL1, I can continue on and make another update with the extra comments/refactoring/tests we talked about earlier. |
Thanks for the updates @davepagurek ! I'll take some time to look through your new commits and test perf again. For my own clarity, in your table does the previous column indicate p5 perf prior to any of your changes, or do they represent the perf from your initial round of commits? If it's the latter, including perf from prior to making any of these changes would be helpful to see. It looks like there are wins everywhere except for drawing in chrome using webGL1, and a more than 2x slow down still seems like a large regression to me. Do you have a sense of why it's such a substantial hit for that platform? It'd be really great if we could test on some mobile devices, safari, and some windows machines as well if possible. |
The "previous" column is for v1.6.0. The slowdown is mostly because the data layout is a little different (there's a to and from coordinate instead of just a single vertex position) because it's optimizing for the webgl2 case. There's a tradeoff to make between webgl1 speed and code maintenance, because we could use a separate shader and data storage format if we're using webgl1, but then we have to keep track of the extra code. This implementation uses the same data layout for both, and has an extra for loop to mimic on the cpu what webgl2 instanced rendering does on the gpu, opting to compromise speed to simplify solve already kind of complex code. I think it is definitely possible to use ifdefs to use a different shader layout and use different buffer structure in webgl1 mode if we want to treat it less like a graceful degradation and more like a primary usage target if we're ok with the added complexity and maintenance. |
Thanks for the extra explanation. It sounds to me like the extra complexity may not be worth the amount of devices we'd be continuing to support. It's a little unclear how many samples they have but data from here seems to suggest that almost 98% of devices are able to support webGL2. Might be good to get some additional input from some of the other maintainers as well. |
Also, I got someone to test on windows for me, this is what they reported back:
It looks like the many lines one is still slower in Chrome, and actually quite a bit slower than on my Mac even on the previous version of p5. Since a single large curve runs reasonably fast, it seems like the bottleneck there is likely the data transfer between CPU and GPU. I'll also try to run some tests on Safari and my phone later! |
I did some more testing on Safari on my mac and Chrome on my Android phone:
It seems like for both of these, rendering many lines gets a bit slower, and rendering large shapes gets faster. Interestingly, safari is actually faster in WebGL 1 mode than WebGL 2. It seems like there isn't much of a pattern. I suppose it depends on what pipelines the hardware is optimized for under the hood? Mobile seems to be extremely slow at large volumes of back-and-forth between the CPU and GPU, so rendering that many lines is never fast. So it looks like this instancing update is not a clear win across the board, but rather, this will be a question of what we want to optimize for. I think I'm slightly more inclined to optimize for larger shapes rather than many lines:
...but also we don't need to necessarily optimize for that using the method in this PR (although I don't have other ideas off the top of my head just yet.) What are your thoughts @aferriss? Also @stalgiag if you have any thoughts that would also be great! |
Resolves #6091
Changes
Currently, drawing strokes in WebGL mode is fairly slow. Since each segment/cap/join of a stroke is made of at least 2 triangles, we had to duplicate data per vertex of each triangle on the CPU.
Now that WebGL2 is enabled by default, we can use instanced rendering, which lets us repeat attributes per instance being rendered. This lets the GPU do the copying for us.
Here's a stress test example: https://editor.p5js.org/davepagurek/sketches/cWr5RmNAK
If you toggle between the 1.6.0 CDN
<script>
tag to the uploaded p5.min.js, the frame rate goes up from ~40fps to ~60fps on Chrome my M1 Macbook Pro (closer to ~18fps and ~40fps in Firefox.)If you toggle
version: 2
toversion: 1
, it goes back to copying the vertex data on the CPU and the frame rate drops to ~35fps. This is slower than before since the format required for instancing is a bit less efficient when done on the CPU, but WebGL2 support is pretty broad now so this should only happen on legacy code.Future work
The new bottleneck in line rendering seems to be libtess. There are a few things we might consider doing:
curveDetail
works to maybe be relative to curve length (or curvature) to save verticesPR Checklist
npm run lint
passes