SoftGPU: Rasterize triangles in chunks of 4 pixels #9635
Conversation
I like it. Good prep for mipmapping. As this clearly shows, doing SIMD where you simply write it like the straightline code, but with one component from each pixel in each lane, becomes quite easy, and once fully applied there's no way this won't be faster than doing it a single pixel at a time. Buildbots are a little unhappy though.
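To make the "write it like the straightline code" idea concrete, here is a minimal sketch (the names Vec4f and Interpolate are illustrative, not PPSSPP's actual types) where each lane carries one pixel of a 2x2 quad, so four pixels are computed by code that reads like the single-pixel version:

```cpp
#include <cassert>

// Hypothetical 4-wide lane type: one value per pixel in a 2x2 quad.
struct Vec4f {
	float v[4];
	Vec4f operator+(const Vec4f &o) const {
		Vec4f r;
		for (int i = 0; i < 4; i++) r.v[i] = v[i] + o.v[i];
		return r;
	}
	Vec4f operator*(const Vec4f &o) const {
		Vec4f r;
		for (int i = 0; i < 4; i++) r.v[i] = v[i] * o.v[i];
		return r;
	}
};

// Scalar code would be: float value = a * w0 + b * w1;
// The 4-wide version reads identically, but computes four pixels at once.
Vec4f Interpolate(Vec4f a, Vec4f w0, Vec4f b, Vec4f w1) {
	return a * w0 + b * w1;
}
```

Each operator maps directly onto one SSE/NEON instruction, which is why the straightline formulation vectorizes so cleanly.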
Force-pushed cd9d58f to 66eee29.
Not very optimal yet.
This is significantly faster.
Force-pushed 66eee29 to 2dc1118.
Sure, although it can be a bit of a pain. We still have some sort of skew issue - a (0, 11)-(0, 11) 1:1 draw doesn't actually draw 1:1 in Crisis Core (The cross button symbol in the bottom right - in nearest.) I guess that means #8282 didn't handle all the cases. But, probably better to start fixing these things in a four-pixel pipeline anyway. -[Unknown]
GPU/Math3D.h (outdated)
@@ -634,6 +634,13 @@ class Vec4
 		return Vec4(VecClamp(x, l, h), VecClamp(y, l, h), VecClamp(z, l, h), VecClamp(w, l, h));
 	}

+	Vec4 Reciprocal() const
+	{
+		// In case we use doubles, maintain accuracy.
Can you clarify this? If T is a double, 1.0f will just be automatically cast to 1.0 and the division will be performed at double precision. Is this intended or not? I'm confused :)
Sorry, was worrying about the accuracy problems and trying to mess with values to fix things - removed the comment.
-[Unknown]
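For reference, here is a minimal stand-in illustrating the precision question (this is a sketch, not the actual GPU/Math3D.h class): when T is double, the 1.0f literal converts to 1.0 exactly under the usual arithmetic conversions, so the divide runs at double precision automatically.

```cpp
#include <cassert>

// Minimal stand-in for the templated Vec4 in GPU/Math3D.h, sketched to show
// the conversion behavior only.
template <typename T>
struct Vec4 {
	T x, y, z, w;
	Vec4 Reciprocal() const {
		// 1.0f promotes to T before each division, so with T = double the
		// division itself is performed at double precision.
		return Vec4{1.0f / x, 1.0f / y, 1.0f / z, 1.0f / w};
	}
};
```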
Hm, I was thinking (this is not a call for action, just thoughts for the future): instead of using Vec4 across the lanes everywhere, an alternate way of formulating the math might be to use Vec2, Vec3, and Vec4 composed out of __m128. Like, Vec4<__m128>, then you can still perform "vector operations" and use various operator overloads etc while ignoring the fact that you're doing it for four pixels at a time. And you'd have a type like "scalar" which would be used where a single float is currently used, just an __m128 with overloads like a Vec4 just not named so. Not sure how confusing that would be though.
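A rough sketch of that formulation, assuming SSE and entirely hypothetical names (Wide, Vec3T): the component type of the vector becomes a 4-lane value, and the same templated math works whether T is a plain float or a "scalar of four pixels".

```cpp
#include <cassert>
#include <xmmintrin.h>

// "Scalar" wrapper: an __m128 holding one value per pixel, with overloads
// so it behaves like a float in expressions.
struct Wide {
	__m128 v;
	Wide operator+(Wide o) const { return {_mm_add_ps(v, o.v)}; }
	Wide operator*(Wide o) const { return {_mm_mul_ps(v, o.v)}; }
};

template <typename T>
struct Vec3T {
	T x, y, z;
	// Written once; works for T = float or T = Wide (four pixels at a time.)
	T Dot(const Vec3T &o) const { return x * o.x + y * o.y + z * o.z; }
};

using Vec3Wide = Vec3T<Wide>;  // four pixels' worth of vec3s
```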
Force-pushed 2dc1118 to 4fb7e43.
Well, that sounds like it'd make the non-SSE paths more complicated. I'd actually like to move the pipeline to jit, at first built as a chain of func calls (like vertexjit or more like MIPS Comp_Generic really), in steps. Then we can construct a key, select a jit program from a cache, and run it. In that scenario, it might be ideal to use 16 u8s for colors, or maybe two pairs of 8 u16s to simplify blending. But not sure. Want to be mindful of available regs. -[Unknown]
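The "chain of func calls" stage could look roughly like this sketch (all names are hypothetical, and the placeholder stage bodies do nothing real; the actual key would pack the whole pixel state): a state key selects a prebuilt list of stage functions from a cache, which a jit can later replace with one flat program.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

struct PixelState { bool texEnabled; };  // real state: blend mode, tests, etc.
struct Quad { float color; };            // stand-in for four pixels of data

using StageFunc = void (*)(Quad &);

static void ApplyTexture(Quad &q) { q.color *= 2.0f; }  // placeholder work
static void ApplyBlend(Quad &q) { q.color += 1.0f; }    // placeholder work

struct PipelineCache {
	std::map<uint32_t, std::vector<StageFunc>> cache;

	static uint32_t MakeKey(const PixelState &s) {
		return s.texEnabled ? 1u : 0u;  // real key packs the whole state
	}

	const std::vector<StageFunc> &Get(const PixelState &s) {
		uint32_t key = MakeKey(s);
		auto it = cache.find(key);
		if (it == cache.end()) {
			std::vector<StageFunc> stages;
			if (s.texEnabled) stages.push_back(&ApplyTexture);
			stages.push_back(&ApplyBlend);
			it = cache.emplace(key, std::move(stages)).first;
		}
		return it->second;
	}
};

void RunPipeline(PipelineCache &pc, const PixelState &s, Quad &q) {
	for (StageFunc f : pc.Get(s)) f(q);
}
```

The point of starting with function pointers is that each stage can be replaced by emitted code one at a time, without changing the surrounding structure.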
Yeah, good points - lots of this can definitely be done at 16-bit or 8-bit.
A little experiment: Just wanted to try it quickly and texel lookup was a nice self-contained piece. A bit underwhelming (considering ApplyTexturing is typically 20-40% of wall time), about 10% FPS improvement at best. Not terribly optimal though, and obviously would want to at least decode 16-bit directly to xmms (maybe via a jit ABI, and 4 texels at a time.) The best profiling results were SampleNearest 21% -> SamplerJit 9% in Hexyz Force (at barely 64 FPS.) Probably need a "texture cache" for better performance... -[Unknown]
If ApplyTexturing was 20% and you got a 10% total improvement, that means that you approximately doubled the speed of texel fetching, which isn't too bad still. But yeah, would also have expected a little better than that...
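That estimate is just Amdahl's law; a quick check of the arithmetic:

```cpp
#include <cassert>

// Amdahl's law: if a stage takes fraction p of total time and that stage
// is sped up by factor s, the overall speedup is 1 / ((1 - p) + p / s).
double OverallSpeedup(double p, double s) {
	return 1.0 / ((1.0 - p) + p / s);
}
```

With p = 0.2 and the stage doubled (s = 2), the overall speedup is 1/0.9, roughly 11%, which matches the ~10% FPS improvement observed.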
I wonder if, rather than the winding order linear sampling currently uses, we always sampled from even UVs. If we did that, it should be possible to simply calculate all 4 addresses after the first one without much effort... -[Unknown]
Not sure I understand what you mean. The texture coordinates are often very dissimilar from the pixel locations on the screen, imagine any perspective mapping or a rotated mapping. Of course when drawing 1:1 rectangles, there are many possible optimizations including skipping the UV calculations altogether.
I mean when doing linear sampling (the 4 samples used to interpolate.) Currently it does winding: https://github.com/hrydgard/ppsspp/blob/master/GPU/Software/Rasterizer.cpp#L209 I don't mean when drawing multiple pixels, this is for just one pixel. -[Unknown]
Right, some simplification may be possible. You only need to calculate one address to fetch from, and then just offset by 1 horizontally and (texw) vertically to get the other three - if it weren't for wrapping and clamping, which might have you fetch from either the same address, or alternatively from the other side of the texture. Not sure how to do this in the most elegant way.
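A sketch of the addressing being described, with the wrap/clamp cases handled per coordinate (assumes power-of-two texture sizes, as on the PSP; names are illustrative, not the Rasterizer.cpp code):

```cpp
#include <algorithm>
#include <cassert>

// Clamping pins an out-of-range coordinate to the edge texel; wrapping
// (with power-of-two sizes) sends it to the other side of the texture.
int WrapOrClamp(int coord, int size, bool clamp) {
	if (clamp)
		return std::min(std::max(coord, 0), size - 1);
	return coord & (size - 1);  // power-of-two wrap
}

// Linear index of a texel: the other three bilinear taps are at u+1, v+1,
// but each coordinate must go through WrapOrClamp first.
int TexelIndex(int u, int v, int texw, int texh, bool clamp) {
	return WrapOrClamp(v, texh, clamp) * texw + WrapOrClamp(u, texw, clamp);
}
```

This is why the "+1 and +texw" shortcut breaks down at edges: clamping can make two taps land on the same address, and wrapping can jump to the far side.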
Well, my point is that if the U and V are even, then you're guaranteed:
- U+1 and V+1 don't wrap or clamp (sizes are powers of two, so an even coordinate is never the last texel except at a 1x1 level), and
- when swizzled, +1 stays within the same tile.

(you are not guaranteed these things if U or V are odd - in that case +1 might go to a new tile, when swizzled, so you end up needing to re-examine U and V.) Wrapping and clamping won't cause problems in that case unless it's a 1x1 mip level, which can be special cased (since they are power of two sized.) In that case one might early-out of linear sampling anyway. So if we (in linear filtering only) always sample based on even and odd UVs, things get much simpler for sampling all four at once. -[Unknown]
But that doesn't really work, does it? Let's imagine a one-dimensional texture t[], and your sole texture coordinate U is, say, 1.75 - linear sampling then needs lerp(t[1], t[2], 0.75), whose base index is odd.
Or are you saying that we'll get around that by rewriting that equation to lerp(t[2], t[1], 0.25) by xoring the indices by the low bit and 1.0-x the lerp factor?
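The rewrite in that question relies on the identity lerp(a, b, x) == lerp(b, a, 1 - x), which is easy to check:

```cpp
#include <cassert>

// Standard linear interpolation between a and b by factor x in [0, 1].
float Lerp(float a, float b, float x) {
	return a + (b - a) * x;
}
```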
D'oh, right. I wasn't thinking about the lerp that comes later, of course. I'm stupid. -[Unknown]
Interestingly, I found that with samplerjit, the thread loop (which is really naive) is mostly just waiting longer. I wonder if I have a threading bug somehow, or if it's just showing the naivety of slicing by y... We could probably "bin" and trivially discard based on say 60x68 tiles or something, right? -[Unknown]
That would explain some lack of speedup, yeah... Tiled binning is a good way to go for multithreading rendering, definitely better than slicing by Y if you have many small triangles, which we generally do. Finding the optimal tile size is gonna be quite some trial and error though, I'm sure.
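A minimal sketch of tile binning with trivial bounding-box discard (all names are hypothetical, and the 64x16 tile size is arbitrary, not a tuned value): each triangle's bounding box selects the tiles it may touch, so worker threads only rasterize triangles binned to their tiles.

```cpp
#include <cassert>
#include <vector>

struct BBox { int x0, y0, x1, y1; };  // inclusive pixel bounds

struct Binner {
	int tileW, tileH, tilesX, tilesY;
	std::vector<std::vector<int>> bins;  // triangle indices per tile

	Binner(int w, int h, int tw, int th)
		: tileW(tw), tileH(th),
		  tilesX((w + tw - 1) / tw), tilesY((h + th - 1) / th),
		  bins(tilesX * tilesY) {}

	// Append the triangle to every tile its bounding box overlaps; tiles
	// outside the box are trivially discarded for this triangle.
	void Add(int triIndex, const BBox &b) {
		int tx0 = b.x0 / tileW, ty0 = b.y0 / tileH;
		int tx1 = b.x1 / tileW, ty1 = b.y1 / tileH;
		for (int ty = ty0; ty <= ty1; ty++)
			for (int tx = tx0; tx <= tx1; tx++)
				bins[ty * tilesX + tx].push_back(triIndex);
	}
};
```

Each worker can then process whole tiles independently, which balances load better than fixed Y-slices when triangles are small.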
A bit better now with linear in the jit, but just not much faster... master...unknownbrackets:samplerjit Fewer rounding errors this way, though. -[Unknown]
Currently, this is generally a bit slower, but it's a step in the right direction.
The last commit shows the benefit of this change in one area. Sample performance change in Tales of Destiny 2 (before this PR -> after this PR w/throughmode perf):
130% -> 200% - Logos during intro
35% -> 70% - Load save screen
64% -> 50% - 3D overworld
-[Unknown]