GPU readback ideas and plans #16900

hrydgard · 2023-02-03T08:45:51Z

A huge performance/accuracy problem happens when games try to read back memory from the GPU to make it accessible to the CPU.

Readbacks are needed for some games to render at all (Tactics Ogre, which also does some redundant readbacks though), or to simulate the CPU reading from the depth buffer (Syphon Filter, Wipeout, for lens flares, not yet actually implemented), or for automatic brightness adjustment (Motorstorm), etc.

On the PC and even more so on mobile devices, it's essential to have a frame or two "in progress", pipelined between the CPU and GPU at the same time. The CPU runs a frame or two ahead of the GPU. Stopping this pipeline to read back data basically tells the system to sleep, so not only do we lose time waiting, but CPUs and GPUs get clocked down and performance gets even worse.

RPCS3 has a funny trick where when they stop to wait for the GPU to catch up, they give the CPU some useless work to do in the meantime, reducing performance drops. This we could also do, but I have some other ideas.

Many of the uses of readbacks (excluding Tactics Ogre's) are pretty fuzzy - lens flares in Wipeout and Syphon Filter would still look OK if the readback was a frame late or so. If we could simply add a frame or two of latency, readbacks would no longer require stopping the GPU and waiting, but things would still look generally fine.

I have two different ideas for enabling this:

Readback queues - maintain a history of frame data for each framebuffer that we read back images from. When asked for a readback, start one but store its result in the queue, and instead pop off an "old" image whose readback has already completed and write that to PSP RAM/VRAM. This can be implemented even in OpenGL using PBOs, and should work okay in D3D11 as well I think. Vulkan can handle this no problem. This is pretty safe because the copy to memory happens when the game expects it, although the data won't be as fresh as the game thinks. That's probably generally OK though, at least in some games.
"Loose" async readback - simply set a fence right after the copy and have a CPU thread wait for that, and as soon as the GPU finishes with the readback the CPU can start writing it into PSP memory, fully asynchronously. This is considerably less safe (maybe a game just stopped using some VRAM as a depth buffer and starts using it for something else with the CPU, and then the readback hits a bit late) but might still be OK. Only implementable with Vulkan, of our current backends.
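
A minimal sketch of the readback-queue idea, to make the data flow concrete. This is not PPSSPP code: `ReadbackQueue`, `Image`, and `Submit` are hypothetical names, and the "GPU copy" is stood in for by a plain byte vector. Each request pushes the newly started readback and hands back the one started `latency` requests earlier, which by then should have completed without a stall.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-in for the pixels produced by one readback.
using Image = std::vector<uint8_t>;

// Delays readback results by a fixed number of requests.
class ReadbackQueue {
public:
    explicit ReadbackQueue(size_t latency) : latency_(latency) {}

    // Called when the game requests a readback. `fresh` represents the copy
    // that was just submitted to the GPU. Returns the image submitted
    // `latency` requests ago, or nothing while the queue is still filling.
    std::optional<Image> Submit(Image fresh) {
        pending_.push_back(std::move(fresh));
        if (pending_.size() <= latency_)
            return std::nullopt;
        Image old = std::move(pending_.front());
        pending_.pop_front();
        return old;
    }

private:
    size_t latency_;
    std::deque<Image> pending_;
};
```

With a latency of 2, the first two requests return nothing (the caller would have to fall back to a blocking readback or leave memory untouched), and the third returns the image from the first request.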

Also there's a minor point that I haven't addressed - the correct time to stop the CPU and wait for the GPU is not exactly when the readback happens, as we do now, but the next time the game calls sceGeDrawSync because that's how the game knows that the GPU is done. Or I guess there's a signal mechanism too...

Implementation plan:

Properly implement blocking readbacks of depth buffers (we do have a path for this but it's not yet used, and it can't be relied upon on mobile since it depends on depth->depth stretch blits, which are not universally supported).
Add a compat setting to enable depth readbacks for the games that need it

Then we have the two options outlined above, which will be implemented for both color and depth readbacks:

We might want both, the first one when using Vulkan and the other one for the other backends that can support it.

ghost · 2023-02-03T09:29:10Z

Related #16714, #11669, and #16537?

unknownbrackets · 2023-02-03T15:00:55Z

I would definitely want this to be optional in some way. I don't think I'd want this, especially in some games, depending on how it's implemented. I know we used to use PBOs and pull the previous frame so I remember some of the problems it caused. Incorrect text, flickering, etc. will be the result in some games - kinda like the absolutely safe and perfect vertex caching.

Of course it can work fine in some games, but I expect it to be most safe in the type of games I don't really like playing anyway, so I'm biased.

In my mind the safest way to do this, which would probably not be "ideal", would be to use frame-late readbacks in these conditions:

The game has been doing readbacks every flip for at least 3 flips in a row (including the current one.)
The readback is from depth (I've yet to see a case where depth specifically couldn't be async, outside debug readbacks obviously.)
The previous readback was never actually read by the CPU (i.e. after protecting the region) even after a single readback, and was within the last 10 flips.

But all of these would make it stutter initially so I anticipate that your planned implementation will do this based on a compat flag, and because it will help mobile a lot, probably be enabled everywhere except a few games someone has specifically realized the graphics bugs are caused by this behavior.
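
The conditions listed above can be written down as a small predicate. This is only a sketch: the struct and field names are hypothetical, and it assumes the three conditions are alternatives (any one of them is enough), which the comment does not state explicitly.

```cpp
// Hypothetical per-framebuffer bookkeeping; none of these names are from PPSSPP.
struct ReadbackHistory {
    int consecutiveFlipsWithReadback = 0;  // includes the current flip
    bool isDepth = false;                  // the readback targets a depth buffer
    bool hasPreviousReadback = false;
    bool previousWasReadByCPU = false;     // e.g. detected by protecting the region
    int flipsSincePreviousReadback = 0;
};

// Assumption: the conditions are ORed together.
bool CanUseFrameLateReadback(const ReadbackHistory &h) {
    // The game has done readbacks for at least 3 flips in a row.
    if (h.consecutiveFlipsWithReadback >= 3)
        return true;
    // Depth readbacks have so far always tolerated being async.
    if (h.isDepth)
        return true;
    // The last readback was recent and the CPU never looked at it.
    if (h.hasPreviousReadback && !h.previousWasReadByCPU &&
        h.flipsSincePreviousReadback <= 10)
        return true;
    return false;
}
```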

-[Unknown]

hrydgard · 2023-02-03T15:06:16Z

Yeah the primary use cases for this are Syphon Filter style lensflares, and Motorstorm's brightness adaptation. I'm not sure there are others that are safe enough.

Compat flag or enabling by trapping CPU reads from the swizzled VRAM regions will be a must, indeed - can't enable globally.

unknownbrackets · 2023-02-03T15:30:42Z

Well, there are games that literally do a readback every frame for the sole purpose of having a screenshot ready in case you decide to save. They never read it, or they read it only to memcpy it somewhere else. That's a use case that would be fine. Unfortunately, iirc, some of the same games also do other readbacks that wouldn't be as safe at other times.

One thing async readbacks could be useful for (but this would only hurt performance) is to handle expiring framebuffers better. There are many cases where games render to an area once or for a while, and then stop. For example, the discolored arms in the NBA games, Me and My Katamari, etc. But in other cases, games will reuse an old framebuffer after a while (Danganronpa, #8359.) If we async a safe region every frame, but only populated VRAM on the next draw, it would solve a bunch of these issues. That said, it'd definitely be slower than what we do now (not downloading.)

-[Unknown]

sum2012 · 2023-02-05T14:18:52Z

Does Dangan Ronpa also read back memory from the GPU?
If yes, it would be better to remove the hack and use the new logic.

hrydgard · 2023-02-05T15:15:41Z

Yes it does, the only hacky part is that we force a readback there where our normal checking doesn't detect one. But yes, Dangan Ronpa could probably fairly safely use these delayed readbacks that I'm implementing, for better performance.

hrydgard · 2023-02-05T16:03:47Z

Alright, I've gotten delayed depth readbacks working in Syphon Filter. With a full 3 frames of buffering, this works a lot worse than I expected. Turns out the game does really tight depth comparisons of the computed depth of each light, compared to a few samples of the depth buffer. This works great when standing still, but when moving around, the depths are quite a bit off from what the game expects (since it's delayed by 3 frames) and the result is unstable, flickery lights. Not good.

I think the fully-async method will have better results, but the only way to make the strictly frame-delayed variant look good is to reduce the amount of buffered frames, or to add a rather big fudge factor to the readback values to compensate, which will have its own issues.

hrydgard · 2023-02-06T16:09:05Z

Unfortunately, I'm struggling to implement this in any sane way for OpenGL. The issue is that to map the buffer for reading, we have to be on the GL thread, not the main thread. So this requires another level of staging data, which gets quite ugly. I have something that works but it's unexpectedly slow.

Currently leaning towards initially making this stuff Vulkan-only...

hrydgard · 2023-02-08T09:16:44Z

In #16916 I'm making the Dangan Ronpa readback delayed like this, improving performance on Vulkan.

With the decoupling of the operation and the actual readback, we can imagine a bunch of other timing modes, that might be suitable for various games.

Wait for the GPU and perform readbacks at the very end of the frame or at sceGeDrawSync time, instead of immediately
Always wait for readbacks from the previous frame.

Some of these will be mostly equivalent to reducing the amount of frames of buffered graphics commands though. It might be that we should simply enforce it to be 1 or 2 for problematic readback situations.

unknownbrackets · 2023-02-08T14:52:40Z

Something that would be slightly unsafe but might work, which I've considered doing for software rendering as a speedhack (I've already tested, and there it does make it faster.)

Mark any readback during rendering "pending" either as a flag or in an array:

Depth readback, where required.
Block transfer command readback.
First-frame readback.

Afterward, if the game:

Renders something else to a block transfer src
Calls sceGeListSync (either wait mode or if returning a complete status)
Calls sceGeDrawSync (either wait mode or if returning a complete status)
Calls sceGeEdramGetAddr
Maybe if texsync or texflush is used - have read unconfirmed assertion that one of these stalls for block transfer, would need to test (possibly wouldn't matter, maybe better to just deal with overlapping block transfers smartly...)

Synchronously download the pending operations. On mobile this could be bad as it might group our stalls more, though.

Note: texture cache might need to look at the pending readbacks when building textures.

This sounds a bit similar to what you're saying, but I think in many games it wouldn't have an observable difference (as long as the game is written to "correctly" wait for sync before peeking with the CPU.) I did a bit of this as a test in softgpu (not the block transfer part, which still immediately stalled) and didn't see any issues in several games. But technically it could mean bugs for people with different CPU speeds/core counts etc...
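
The pending-then-flush scheme described above could be sketched roughly as follows. Everything here is an assumption for illustration: `PendingReadbacks`, `Mark`, `Flush`, and `Overlaps` are hypothetical names, the actual download is injected as a callback, and the trigger points (sceGeListSync, sceGeDrawSync, and so on) are simply wherever the caller chooses to invoke `Flush`.

```cpp
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// A region of emulated VRAM whose download has been deferred.
struct PendingReadback {
    uint32_t addr;
    uint32_t size;
};

class PendingReadbacks {
public:
    // `download` performs the real (blocking) GPU-to-memory copy.
    explicit PendingReadbacks(std::function<void(const PendingReadback &)> download)
        : download_(std::move(download)) {}

    // Called during rendering instead of stalling immediately.
    void Mark(uint32_t addr, uint32_t size) {
        pending_.push_back(PendingReadback{addr, size});
    }

    // Called at a sync point: downloads everything deferred so far, in order.
    void Flush() {
        std::vector<PendingReadback> batch;
        batch.swap(pending_);
        for (const PendingReadback &r : batch)
            download_(r);
    }

    // Lets other code (e.g. a texture cache) ask whether a region is stale.
    bool Overlaps(uint32_t addr, uint32_t size) const {
        const uint64_t end = uint64_t(addr) + size;
        for (const PendingReadback &r : pending_) {
            const uint64_t rEnd = uint64_t(r.addr) + r.size;
            if (r.addr < end && addr < rEnd)
                return true;
        }
        return false;
    }

private:
    std::function<void(const PendingReadback &)> download_;
    std::vector<PendingReadback> pending_;
};
```

Grouping the downloads this way means a single stall per sync point rather than one per readback, which is also why it might concentrate stalls on mobile as noted above.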

-[Unknown]

hrydgard · 2023-02-16T11:01:42Z

Yeah, something like that could work, but not sure how large the benefit would be. Likely worth trying.

Unrelated note, in #11669, it's mentioned that Ys Seven does a readback for the minimap a lot. Likely a good candidate for async/delayed readbacks.

unknownbrackets · 2023-02-17T03:53:50Z

I believe that game also does some readbacks in only certain areas for fire effects, etc.

-[Unknown]

sum2012 · 2024-01-22T12:30:12Z

Can we enable GPU readback for only a random proportion of frames (0.5, 0.3, etc.)?

hrydgard · 2024-01-22T12:45:17Z

That's an interesting idea, I think it has come up before but I forgot about it. However, it feels like it could lead to quite uneven framerates - still, that might be better than every frame being slow, for games where it would work.
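
For illustration, here is a sketch of performing only a fixed proportion of readbacks. Note that it swaps the random selection suggested above for a deterministic accumulator, on the assumption that evenly spaced readbacks would give less uneven frame pacing than random ones. `FractionalReadback` and `ShouldReadBack` are hypothetical names, not anything in PPSSPP.

```cpp
#include <cassert>
#include <cstdint>

// Allows `numerator` out of every `denominator` readback requests through,
// spaced as evenly as integer arithmetic permits.
class FractionalReadback {
public:
    FractionalReadback(uint32_t numerator, uint32_t denominator)
        : num_(numerator), den_(denominator) {
        assert(denominator > 0 && numerator <= denominator);
    }

    // Call once per readback request; true means "actually do this one".
    bool ShouldReadBack() {
        acc_ += num_;
        if (acc_ >= den_) {
            acc_ -= den_;
            return true;
        }
        return false;
    }

private:
    uint32_t num_;
    uint32_t den_;
    uint32_t acc_ = 0;
};
```

With a ratio of 1/2, every second request is performed; with 3/10, three out of every ten are. Skipped requests would leave the previous data in PSP memory, so this only suits games that tolerate stale readbacks.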

hrydgard added the GE emulation Backend-independent GPU issues label Feb 3, 2023

hrydgard added this to the v1.15.0 milestone Feb 3, 2023

hrydgard mentioned this issue Feb 3, 2023

Minor refactors, preparing for improved readbacks #16902

Closed

This was referenced Feb 3, 2023

Depth readback with built-in stretchblit #16905

Merged

Fix Syphon Filter lens flares by adding compat flag to read back the depth buffer #16907

Merged

Prepare for adding async readback (use VMA for readback allocs, add a param) #16910

Merged

hrydgard mentioned this issue Feb 5, 2023

Some more plumbing of parameters, preparing for readback stuff #16914

Merged

This was referenced Feb 5, 2023

Implement delayed depth readbacks, Vulkan only #16916

Merged

Move GLFrameData out of GLRenderManager. #16919

Merged

hrydgard added the GPU readback Issue related to readbacks from the GPU to CPU label Feb 16, 2023

hrydgard modified the milestones: v1.15.0, v1.16.0 Mar 23, 2023

bassforte123 mentioned this issue Apr 15, 2023

Graphical glitches related to save states and minimizing #13529

Open

hrydgard modified the milestones: v1.16.0, v1.17.0 Aug 11, 2023

hrydgard modified the milestones: v1.17.0, v1.18.0 Dec 5, 2023

anr2me mentioned this issue Dec 25, 2023

MotorStorm Artic Edge too slow #18607

Open

sum2012 mentioned this issue Feb 1, 2024

Add "Half skip" Mode #18807

Closed

hrydgard modified the milestones: v1.18.0, v1.19.0 Apr 9, 2024

Repository owner deleted a comment from terremoth May 18, 2024
