CPU bicubic texture upscaler speed-up(?) #15774
Comments
Fixed that sentence for you. It also pretty much explains why most people have lost interest in CPU texture scaling nowadays. It's not just because texture scaling can be done on the GPU, which may still have some future simply due to performance; even that fades next to texture replacement, which works great and, for a few years now, has been easy to do without artistic skills by using any AI trained to upscale a specific graphic style.

Automatic texture upscaling is mostly a quick and lazy fix which many people don't like, but they sometimes use it anyway as one of the easier ways to work around texture glitches when rendering PSP games above native res. As far as I know, bicubic does nothing for those problems.

To avoid requiring CPUs from the future, PPSSPP limits the upscaling both per frame and by texture longevity (textures that refresh too often are just skipped), and the textures that do get scaled are cached. So the reason it's fast enough for many people is that it doesn't do anything 99% of the time, which is also why people who know about it will continue to say it's not fast enough even on modern hardware.

IMO, from all my testing, bicubic doesn't do much as a texture scaler for PSP games other than adding cost; really, the only reason it exists seems to be that it's part of one of the xBRZ variants. :c Would be cool to have something that makes no visible change other than dealing with those common UI issues when rendering games above x1 res, but again, bicubic unfortunately doesn't help with that.
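In outline, that throttling might look something like the sketch below. This is a hypothetical illustration, not PPSSPP's actual code; `TexEntry`, `budgetLeft`, and the thresholds are invented:

```cpp
// Hypothetical sketch of the throttling described above; not PPSSPP's
// actual code (TexEntry, budgetLeft, and the thresholds are invented).
struct TexEntry {
    int refreshCount = 0;     // how often this texture's contents changed
    bool scaledCached = false;
};

bool ShouldScaleThisFrame(TexEntry &tex, int &budgetLeft) {
    if (tex.scaledCached)
        return false;         // already scaled and cached, nothing to do
    if (tex.refreshCount > 8)
        return false;         // refreshes too often, scaling would be wasted
    if (budgetLeft <= 0)
        return false;         // per-frame budget spent, defer to a later frame
    --budgetLeft;
    tex.scaledCached = true;  // cache the result so most frames do nothing
    return true;
}
```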
So, the bicubic texture upscaler doesn't get much love simply because it's just not that important? That... makes sense. It indeed does not seem to do much for textures in my limited testing.
This CPU scaling stuff is very old, and the comments along with the functionality itself were contributed, so I don't really endorse the statement that "even most powerful CPU's are not powerful enough to upscale textures in real time". We already have a GPU-based scaling path, although it's currently Vulkan-only; it performs much better. Future scaling work should probably go in that direction for the other backends too... Also, I don't think any of our current scaling algorithms are generally very good. Some of them do look pretty good in 2D games, but overall the benefit just isn't that big IMHO... I would merge a PR that just speeds up our bicubic as-is though :)
Non-separable filter and all? Hm, might be doable. |
No, I mean it's fine to replace the implementation too, of course, as long as it looks about the same. I meant I won't require new work in this area to be for the GPU.
So I grabbed a random DOOM2-ish door texture, and upscaled it x4.

[Result image comparison omitted.] You might want to open the images in a new tab.
As I said, I'm not confident touching the C++ codebase, but I may try...
I like the sharpness of Catmull-Rom best for sure! But compared to just using the original, it's not really that much of an improvement...
Added 'bilinear' to the table above, which is what "just using the original" would probably look like if the game itself decided to set texture filtering to linear. |
I think it's great to improve these, and I think the CPU scaling can be improved and probably can't be removed.

That said, just to set expectations: a lot of games have complex effects, use palettes in ways that are annoying for texture scaling or texture caching, or perform realtime CPU readback effects (like a sepia filter). Many of these aren't throughout the game, but might just be at certain points - like only right after you defeat this boss, or only when you're walking around in that lava area, etc.

It would require a greater reduction than 20ns -> 2ns for me to claim that any top-end PC can handle realtime texture scaling in games throughout. And I've definitely told people the opposite - that the current implementation (really xBRZ, anyway) can't handle realtime scaling even on the 8-core 4GHz CPU they're so proud of. Because it can't.

That doesn't mean it's a waste to improve it. I wouldn't say the current software renderer is gonna run most games at full, realtime speed, either - even on someone's nifty new PC. That didn't stop me from making it a lot faster recently. Maybe it'll eventually run full speed (especially with more work and more core counts), and the same is possibly true of CPU texture scaling.
PPSSPP doesn't go crazy with complex C++, generally. There are only a few templates and basic STL usage. The scaler code I think uses some template params, mostly for optimization. At the end of the day, as long as you can give it a function that takes an array of pixels in, an array of pixels to write to, a width, and a y range (for threading), you'll be fine. If you've got it mostly working but have questions, you can ask here or on Discord.

Note: for the purposes of PPSSPP, you can assume SSE2 is supported if the CPU is Intel. Supporting a non-SSE2 Intel CPU would require changes in a lot of places.

Also, it looks like your benchmark doesn't use threads currently. It probably scales just fine, but a PPSSPP implementation should ideally use threads, or else a 12-thread (or better) CPU might be worse off after all. That's the main thing I'd suggest trying to support. Though maybe there's something to be said for considering an alternate threading model: is it better to upscale 12 textures on 12 threads at once, or 1 texture on 12 threads? Might be an interesting thing to benchmark, especially with varied texture sizes.

Might also be interesting to see if clang can be coaxed into vectorizing well enough for NEON, etc.

-[Unknown]
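To make that contract concrete, here is a minimal sketch, assuming 32-bit RGBA pixels; `ScaleRows4x` is an invented name, and the nearest-neighbor body is just a placeholder where a real scaler would evaluate its filter:

```cpp
#include <cstdint>

// Sketch of the interface described above (hypothetical name): the scaler
// gets source pixels, a destination buffer, the source width, and a row
// range so one texture can be split across threads.
void ScaleRows4x(const uint32_t *src, uint32_t *dst, int width,
                 int y0, int y1) {
    const int factor = 4;
    for (int y = y0; y < y1; ++y) {
        for (int x = 0; x < width; ++x) {
            uint32_t c = src[y * width + x];
            // Nearest-neighbor placeholder; a real implementation would
            // compute filtered samples here.
            for (int dy = 0; dy < factor; ++dy)
                for (int dx = 0; dx < factor; ++dx)
                    dst[(y * factor + dy) * width * factor +
                        (x * factor + dx)] = c;
        }
    }
}
```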
It's the 'codebase' part that deters me -- the build process and all ('C++' meaning 'code that actually needs compilation').
I would probably just replace the insides of ... That said, the multi-threaded performance was indeed not benchmarked.
Since my code works on 16x16 dst blocks (anything smaller is temporarily padded), there is a perf hit for both small textures and thin slices. Multi-threading might produce such slices, since ...
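One way to avoid handing threads thin slices would be to carve the row range in multiples of the block height; a sketch with invented names, not code from either implementation:

```cpp
#include <algorithm>

// Sketch: split 'height' rows across 'threads' workers in multiples of the
// scaler's block height, so no thread gets a thin slice (except the last,
// which takes the remainder). y0/y1 receive each worker's row range.
void SliceRows(int height, int threads, int blockH, int *y0, int *y1) {
    int blocks = (height + blockH - 1) / blockH;          // total dst blocks
    int perThread = (blocks + threads - 1) / threads;     // blocks per worker
    for (int t = 0; t < threads; ++t) {
        y0[t] = std::min(t * perThread * blockH, height);
        y1[t] = std::min((t + 1) * perThread * blockH, height);
    }
}
```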
I'll take you up on that, I guess. So I compiled PPSSPP from ..., and got a single link-time warning: [warning text omitted]

[Resulting screenshots omitted.] All settings are default, other than ...
That said, I did implement my proposed changes, and it seems to work, i.e. it looks about the same as without them (so the artifacts are still present). I can create a pull request if desired.
I have been messing around with texture loading lately, trying to reduce the amount of duplicated code between the backends. I might have made a mistake causing those glitches, will take a look.
While waiting for the artifacts thing to be resolved, I'll comment on:
I have a rather limited knowledge of ARM, but it seems that both GCC and clang do autovectorize the upscaler with ...
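As an aside, the loops that autovectorizers handle well for NEON/SSE tend to look like the toy example below: flat arrays, `__restrict` to rule out aliasing, simple per-element math. This is only an illustration of the pattern, not the actual upscaler loop:

```cpp
#include <cstddef>
#include <cstdint>

// Toy example of an autovectorization-friendly loop: averages two rows of
// 8-bit samples. __restrict promises the buffers don't alias, which lets
// GCC/clang emit packed averaging instructions (e.g. NEON urhadd).
void AverageRows(const uint8_t *__restrict a, const uint8_t *__restrict b,
                 uint8_t *__restrict out, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = (uint8_t)((a[i] + b[i] + 1) >> 1);  // rounding average
}
```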
You can press ...
It's been a while since that was set; it actually started life as a larger minimum dependent on threads: ... It might actually be better in all cases for it to be 16.
That just seems weird.
You can recompile FFmpeg by running ...
Great, that sounds positive. I think there were some fixes for texture scaling bugs; I have been a bit busy to keep up the last couple of weeks, but it might already be better now. -[Unknown]
Yes, this is surprising to me too. GCC also somehow figured the value of ...
It might just be a false positive in ...
The corrupted texture scaling should be fixed now. |
I'm still seeing the glitches: both in ...
The issue seems to be fixed now, so the implementation is done: 8x8 blocks, edges in 'clamp' mode (because that's what PPSSPP did earlier; would 'wrap' be better?).
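For reference, the two edge modes in question, written as index mappers. This is a generic sketch; the actual implementations in PPSSPP and in the PR may differ:

```cpp
// 'clamp' repeats the border texel; 'wrap' tiles the texture, which can
// matter for tiled textures sampled across their edges.
inline int ClampIndex(int i, int n) {
    return i < 0 ? 0 : (i >= n ? n - 1 : i);
}
inline int WrapIndex(int i, int n) {
    int m = i % n;
    return m < 0 ? m + n : m;  // keep the result in [0, n) for negative i
}
```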
Well, this can be closed.
What should happen
Disclaimer: I'm not sure these ramblings actually belong in the issue tracker. Also, it's possible that I messed something up and the following is moot.
I'm only tackling (bi)cubic upscaling, not xBRZ or hybrid.
So, after reading some claims that "even most powerful CPU's are not powerful enough to upscale textures in real time", I decided to take a look at PPSSPP's CPU texture upscaler (this is it, right?), and... it does not seem very fast. Obviously, all the cool kids use the GPU to upscale their textures, but that's not the topic here. So I wrote an implementation of a cubic upscaler and benchmarked it against PPSSPP's (and against stb_image_resize.h for good measure).
Here are the results on a test machine (`Intel(R) Xeon(R) E-2286G CPU @ 4.00GHz`):

[Benchmark results table omitted.]

Timings are reported in nanoseconds per dst pixel (otherwise, the metric is scale-dependent). All tests are single-threaded. The benchmark is compiled with `-march=native -O3`.

Some observations in no particular order:

- My implementation supports both `wrap` and `clamp` modes (as well as `zero`) for out-of-bounds samples (I think PPSSPP only does `clamp`). It also supports arbitrary positive integer scale (not limited to x2..x5 like PPSSPP's current implementation). The upscaling is done with a separable Mitchell-Netravali class filter with runtime-configurable B, C parameters (the kernel is sketched after this list).
- The code (`cubic_upscale.c`) actually compiles fine as both C and C++.
- ... (`stbir_resize_region` can be used).
- The speed-up may not be enough to entirely eliminate the stutter during upscaling, but it may help.
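For reference, the Mitchell-Netravali kernel family mentioned above: B = C = 1/3 is Mitchell's recommended filter, B = 0, C = 0.5 is the Catmull-Rom spline compared earlier in this thread, and B = 1, C = 0 is the cubic B-spline. A straightforward per-tap evaluation:

```cpp
#include <cmath>

// Mitchell-Netravali kernel; B and C select the family member
// (B=1/3, C=1/3: Mitchell; B=0, C=0.5: Catmull-Rom; B=1, C=0: B-spline).
double MitchellNetravali(double x, double B, double C) {
    x = std::fabs(x);
    if (x < 1.0)
        return ((12 - 9 * B - 6 * C) * x * x * x +
                (-18 + 12 * B + 6 * C) * x * x +
                (6 - 2 * B)) / 6.0;
    if (x < 2.0)
        return ((-B - 6 * C) * x * x * x +
                (6 * B + 30 * C) * x * x +
                (-12 * B - 48 * C) * x +
                (8 * B + 24 * C)) / 6.0;
    return 0.0;  // kernel support is [-2, 2]
}
```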
I'm not confident enough to muck with the PPSSPP C++ codebase, so I tried making my upscaler reasonably self-contained. Hopefully, if someone is interested, it should be trivial to integrate.
Here is the code for both the upscaler (`cubic_upscale.c`, `cubic_upscale.h`) and the benchmark (`stb_image*` omitted, to fit into the online compiler's code size limit): https://onlinegdb.com/7g36Bdzhco
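For anyone reproducing the numbers, the "nanoseconds per dst pixel" metric can be computed with a harness along these lines. This is a sketch, not the actual benchmark code; `upscale_x4` is an invented stand-in for whichever scaler is being measured:

```cpp
#include <chrono>
#include <cstdint>
#include <vector>

// Stand-in for the scaler under test (invented name, defined elsewhere).
void upscale_x4(const uint32_t *src, uint32_t *dst, int w, int h);

// Returns nanoseconds per destination pixel, the metric used above.
double BenchNsPerDstPixel(int w, int h, int iters) {
    std::vector<uint32_t> src(size_t(w) * h, 0xFF00FF00u);  // dummy RGBA input
    std::vector<uint32_t> dst(size_t(w) * 4 * h * 4);       // x4 destination
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        upscale_x4(src.data(), dst.data(), w, h);
    auto t1 = std::chrono::steady_clock::now();
    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    return ns / (double(iters) * dst.size());
}
```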
Note: to get a meaningful benchmark in the online compiler you need to enable optimizations (cog button -> `Extra Compiler Flags`), e.g. `-march=native -O3`.

Who would this benefit
Platform (if relevant)
No response
Games this would be useful in
Other emulators or software with a similar feature
No response