GPU readback ideas and plans #16900

Open
4 of 7 tasks
hrydgard opened this issue Feb 3, 2023 · 14 comments
Labels
GE emulation Backend-independent GPU issues GPU readback Issue related to readbacks from the GPU to CPU
Milestone

Comments

@hrydgard
Owner

hrydgard commented Feb 3, 2023

A huge performance/accuracy problem happens when games try to read back memory from the GPU to make it accessible to the CPU.

Readbacks are needed for some games to render at all (Tactics Ogre, which also does some redundant readbacks though), or to simulate the CPU reading from the depth buffer (Syphon Filter, Wipeout, for lens flares, not yet actually implemented), or for automatic brightness adjustment (Motorstorm), etc.

On the PC and even more so on mobile devices, it's essential to have a frame or two "in progress", pipelined between the CPU and GPU at the same time. The CPU runs a frame or two ahead of the GPU. Stopping this pipeline to read back data basically tells the system to sleep, so not only do we lose time waiting, but CPUs and GPUs get clocked down and performance gets even worse.

RPCS3 has a funny trick where when they stop to wait for the GPU to catch up, they give the CPU some useless work to do in the meantime, reducing performance drops. This we could also do, but I have some other ideas.

Many of the uses of readbacks (excluding Tactics Ogre's) are pretty fuzzy - lens flares in Wipeout and Syphon Filter would still look OK if the readback was a frame late or so. If we could simply add a frame or two of latency, readbacks would no longer require stopping the GPU and waiting, but things would still look generally fine.

I have two different ideas for enabling this:

  • Readback queues - maintain a history of frame data for each framebuffer that we read back images from. When asked for a readback, start one but store its result in the queue, and instead pop off an "old" image whose readback has now completed, and write that to PSP RAM/VRAM. This can be implemented even in OpenGL using PBOs, and should work okay in D3D11 as well I think. Vulkan can handle this no problem. This is pretty safe because the copy to memory happens when the game expects it, although the data won't be as fresh as the game thinks. That's probably generally OK though, at least in some games.
  • "Loose" async readback - simply set a fence right after the copy and have a CPU thread wait for that, and as soon as the GPU finishes with the readback the CPU can start writing it into PSP memory, fully asynchronously. This is considerably less safe (maybe a game just stopped using some VRAM as a depth buffer and starts using it for something else with the CPU, and then the readback hits a bit late) but might still be OK. Only implementable with Vulkan, of our current backends.
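
To make the readback-queue idea concrete, here is a minimal Python sketch of the bookkeeping only. It is not PPSSPP code: `ReadbackQueue`, `delay_frames` and the string "images" are hypothetical stand-ins, and a real backend would queue GPU staging buffers (PBOs or Vulkan buffers) instead.

```python
from collections import deque


class ReadbackQueue:
    """Per-framebuffer queue of in-flight readbacks (hypothetical model)."""

    def __init__(self, delay_frames):
        self.delay_frames = delay_frames
        self.pending = deque()

    def request(self, gpu_image):
        """Queue a copy of the current image and return the oldest one
        once it has aged `delay_frames` requests, or None while the
        queue is still filling up."""
        self.pending.append(gpu_image)
        if len(self.pending) > self.delay_frames:
            return self.pending.popleft()
        return None


queue = ReadbackQueue(delay_frames=2)
results = [queue.request(f"frame{i}") for i in range(5)]
print(results)  # [None, None, 'frame0', 'frame1', 'frame2']
```

The first `delay_frames` requests have nothing to return, so an implementation would have to decide what to do there (for example, block once, or leave memory untouched).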

Also there's a minor point that I haven't addressed - the correct time to stop the CPU and wait for the GPU is not exactly when the readback happens, as we do now, but the next time the game calls sceGeDrawSync because that's how the game knows that the GPU is done. Or I guess there's a signal mechanism too...

Implementation plan:

  • Properly implement blocking readbacks of depth buffers (we do have a path for this, but it's not yet used, and it can't be relied upon on mobile since it depends on depth-to-depth stretch blits, which are not universally supported).
  • Add a compat setting to enable depth readbacks for the games that need it

Then we have the two options outlined above, which will be implemented for both color and depth readbacks:

  • Implement loose async readbacks
  • Implement readback queues
    • Vulkan
    • OpenGL (might not, initially)
    • D3D11 (might not, initially)

We might want both, the first one when using Vulkan and the other one for the other backends that can support it.
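
As a rough illustration of the "loose" async variant, the sketch below models the fence with a `threading.Event` and PSP memory with a dict. All names (`start_async_readback`, `psp_memory`, `staging`) are invented for the example, and a real Vulkan implementation would wait on a `VkFence` instead. It only shows the ordering: the emulated CPU is never blocked, and the write lands whenever the GPU signals.

```python
import threading

psp_memory = {}                 # stand-in for PSP RAM/VRAM, keyed by address
memory_lock = threading.Lock()


def start_async_readback(address, fence, staging):
    """Spawn a waiter that copies `staging` into PSP memory once `fence` signals."""
    def waiter():
        fence.wait()            # blocks only this helper thread
        with memory_lock:
            psp_memory[address] = staging["data"]

    thread = threading.Thread(target=waiter)
    thread.start()
    return thread


fence = threading.Event()
staging = {}
worker = start_async_readback(0x04000000, fence, staging)

# The emulated CPU keeps running here; nothing has been written yet.
written_before_signal = 0x04000000 in psp_memory

# Later the "GPU" finishes the copy and signals the fence.
staging["data"] = b"depth-pixels"
fence.set()
worker.join()
```

The unsafe part mentioned above is visible here: nothing stops the game from reusing that address for something else between the request and the write.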

@hrydgard hrydgard added the GE emulation Backend-independent GPU issues label Feb 3, 2023
@hrydgard hrydgard added this to the v1.15.0 milestone Feb 3, 2023
@ghost

ghost commented Feb 3, 2023

Related #16714, #11669, and #16537?

@unknownbrackets
Collaborator

I would definitely want this to be optional in some way. I don't think I'd want this, especially in some games, depending on how it's implemented. I know we used to use PBOs and pull the previous frame so I remember some of the problems it caused. Incorrect text, flickering, etc. will be the result in some games - kinda like the absolutely safe and perfect vertex caching.

Of course it can work fine in some games, but I expect it to be most safe in the type of games I don't really like playing anyway, so I'm biased.

In my mind the safest way to do this, which would probably not be "ideal", would be to use frame-late readbacks in these conditions:

  • The game has been doing readbacks every flip for at least 3 flips in a row (including the current one.)
  • The readback is from depth (I've yet to see a case where depth specifically couldn't be async, outside debug readbacks obviously.)
  • The previous readback was never actually read by the CPU (i.e. after protecting the region) even after a single readback, and was within the last 10 flips.
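
One possible reading of these conditions, written as a predicate in Python. This assumes all three must hold at once (the most conservative interpretation; the list does not say whether they are meant as AND or OR), and the `ReadbackHistory` fields are hypothetical names for state the emulator would have to track.

```python
from dataclasses import dataclass


@dataclass
class ReadbackHistory:
    consecutive_flips_with_readback: int  # includes the current flip
    is_depth: bool
    previous_was_read_by_cpu: bool        # detected by protecting the region
    flips_since_previous: int


def can_use_frame_late(history):
    """True only if every condition from the list is satisfied (assumed AND)."""
    return (
        history.consecutive_flips_with_readback >= 3
        and history.is_depth
        and not history.previous_was_read_by_cpu
        and history.flips_since_previous <= 10
    )


steady = ReadbackHistory(3, True, False, 1)
first_time = ReadbackHistory(1, True, False, 1)
print(can_use_frame_late(steady), can_use_frame_late(first_time))  # True False
```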

But all of these would make it stutter initially so I anticipate that your planned implementation will do this based on a compat flag, and because it will help mobile a lot, probably be enabled everywhere except a few games someone has specifically realized the graphics bugs are caused by this behavior.

-[Unknown]

@hrydgard
Owner Author

hrydgard commented Feb 3, 2023

Yeah, the primary use cases for this are Syphon Filter style lens flares, and Motorstorm's brightness adaptation. I'm not sure there are others that are safe enough.

Compat flag or enabling by trapping CPU reads from the swizzled VRAM regions will be a must, indeed - can't enable globally.

@unknownbrackets
Collaborator

Well, there are games that literally do a readback every frame for the sole purpose of having a screenshot ready in case you decide to save. They never read it, or they read it only to memcpy it somewhere else. That's a use case that would be fine. Unfortunately, iirc, some of the same games also do other readbacks that wouldn't be as safe at other times.

One thing async readbacks could be useful for (but this would only hurt performance) is to handle expiring framebuffers better. There are many cases where games render to an area once or for a while, and then stop. For example, the discolored arms in the NBA games, Me and My Katamari, etc. But in other cases, games will reuse an old framebuffer after a while (Danganronpa, #8359.) If we async a safe region every frame, but only populated VRAM on the next draw, it would solve a bunch of these issues. That said, it'd definitely be slower than what we do now (not downloading.)

-[Unknown]

@sum2012
Collaborator

sum2012 commented Feb 5, 2023

Does Dangan Ronpa also read back memory from the GPU?
If yes, it would be better to remove the hack and use the new logic.

@hrydgard
Owner Author

hrydgard commented Feb 5, 2023

Yes it does, the only hacky part is that we force a readback there where our normal checking doesn't detect one. But yes, Dangan Ronpa could probably fairly safely use these delayed readbacks that I'm implementing, for better performance.

@hrydgard
Owner Author

hrydgard commented Feb 5, 2023

Alright, I've gotten delayed depth readbacks working in Syphon Filter. With a full 3 frames of buffering, this works a lot worse than I expected. Turns out the game does really tight depth comparisons of the computed depth of each light, compared to a few samples of the depth buffer. This works great when standing still, but when moving around, the depths are quite a bit off from what the game expects (since it's delayed by 3 frames) and the result is unstable, flickery lights. Not good.

I think the fully-async method will have better results, but the only way to make the strictly frame-delayed variant look good is to reduce the amount of buffered frames, or to add a rather big fudge factor to the readback values to compensate, which will have its own issues.
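
A toy model of why the delay hurts, in Python. Everything here is an assumption for illustration: the visibility test (`abs(sampled - light) <= tolerance`), the depth units and the speeds are invented, not taken from the game. It only shows that a stale sample passes a tight comparison when standing still and fails when the camera moves, unless the tolerance is widened a lot.

```python
def light_visible(light_depth, sampled_depth, tolerance):
    """Assumed form of the game's test: the sample must match the light's depth."""
    return abs(sampled_depth - light_depth) <= tolerance


def surface_depth(frame, start=100.0, speed=0.0):
    """Depth of the sampled surface as the camera approaches at `speed` per frame."""
    return start - speed * frame


def visible_with_delay(frame, delay, speed, tolerance=0.5):
    current = surface_depth(frame, speed=speed)          # what the game computes
    stale = surface_depth(frame - delay, speed=speed)    # what the readback holds
    return light_visible(current, stale, tolerance)


standing_still = visible_with_delay(frame=10, delay=3, speed=0.0)
moving = visible_with_delay(frame=10, delay=3, speed=1.0)
moving_with_fudge = visible_with_delay(frame=10, delay=3, speed=1.0, tolerance=4.0)
print(standing_still, moving, moving_with_fudge)  # True False True
```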

@hrydgard
Owner Author

hrydgard commented Feb 6, 2023

Unfortunately, I'm struggling to implement this in any sane way for OpenGL. The issue is that to map the buffer for reading, we have to be on the GL thread, not the main thread. This requires another level of staging data, which gets quite ugly. I have something that works, but it's unexpectedly slow.

Currently leaning towards initially making this stuff Vulkan-only...

@hrydgard
Owner Author

hrydgard commented Feb 8, 2023

In #16916 I'm making the Dangan Ronpa readback delayed like this, improving performance on Vulkan.

With the decoupling of the operation and the actual readback, we can imagine a bunch of other timing modes, that might be suitable for various games.

  • Wait for the GPU and perform readbacks at the very end of the frame or at sceGeDrawSync time, instead of immediately
  • Always wait for readbacks from the previous frame.

Some of these will be mostly equivalent to reducing the number of frames of buffered graphics commands, though. It might be that we should simply force it to 1 or 2 for problematic readback situations.

@unknownbrackets
Collaborator

unknownbrackets commented Feb 8, 2023

Here's something that would be slightly unsafe but might work, which I've considered doing for software rendering as a speedhack (I've already tested it, and there it does make things faster.)

Mark any readback during rendering "pending" either as a flag or in an array:

  • Depth readback, where required.
  • Block transfer command readback.
  • First-frame readback.

Afterward, if the game:

  • Renders something else to a block transfer src
  • Calls sceGeListSync (either wait mode or if returning a complete status)
  • Calls sceGeDrawSync (either wait mode or if returning a complete status)
  • Calls sceGeEdramGetAddr
  • Maybe if texsync or texflush is used - I have read an unconfirmed assertion that one of these stalls for block transfer, would need to test (possibly wouldn't matter, maybe better to just deal with overlapping block transfers smartly...)

Synchronously download the pending operations. On mobile this could be bad as it might group our stalls more, though.

Note: texture cache might need to look at the pending readbacks when building textures.
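
A small Python sketch of the pending-readback bookkeeping described above, under the assumption that each listed event simply flushes everything. `PendingReadbacks` and its method names are hypothetical, the addresses are arbitrary, and the texsync/texflush case is left out since it is unconfirmed.

```python
class PendingReadbacks:
    """Tracks readbacks that were requested but not yet downloaded."""

    def __init__(self):
        self.pending = []      # (kind, address) pairs awaiting download
        self.downloaded = []   # pairs in the order they reached memory

    def mark(self, kind, address):
        self.pending.append((kind, address))

    def flush(self):
        """Synchronously download everything that is pending."""
        self.downloaded.extend(self.pending)
        self.pending.clear()

    # Sync points from the list above; each one forces a flush.
    def on_list_sync(self):
        self.flush()

    def on_draw_sync(self):
        self.flush()

    def on_edram_get_addr(self):
        self.flush()

    def on_render_to(self, address):
        # Only matters if the target is the source of a pending readback.
        if any(addr == address for _, addr in self.pending):
            self.flush()


tracker = PendingReadbacks()
tracker.mark("depth", 0x04110000)
tracker.mark("block_transfer", 0x04088000)
stalls_before_sync = len(tracker.downloaded)
tracker.on_draw_sync()
```

With this structure the stall moves from the readback command to the next sync point, which is where a correctly written game expects the data to be ready.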

This sounds a bit similar to what you're saying, but I think in many games it wouldn't have an observable difference (as long as the game is written to "correctly" wait for sync before peeking with the CPU.) I did a bit of this as a test in softgpu (not the block transfer part, which still immediately stalled) and didn't see any issues in several games. But technically it could mean bugs for people with different CPU speeds/core counts etc...

-[Unknown]

@hrydgard hrydgard added the GPU readback Issue related to readbacks from the GPU to CPU label Feb 16, 2023
@hrydgard
Owner Author

hrydgard commented Feb 16, 2023

Yeah, something like that could work, but not sure how large the benefit would be. Likely worth trying.

Unrelated note: in #11669, it's mentioned that Ys Seven does a readback for the minimap a lot. Likely a good candidate for async/delayed readbacks.

@unknownbrackets
Collaborator

I believe that game also does some readbacks in only certain areas for fire effects, etc.

-[Unknown]

@sum2012
Collaborator

sum2012 commented Jan 22, 2024

Can we enable GPU readback for only a random proportion of frames (0.5, 0.3, etc.)?

@hrydgard
Owner Author

That's an interesting idea, I think it has come up before but I forgot about it. However, it feels like it could lead to quite uneven framerates - still, that might be better than every frame being slow, for games where it would work.
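
A minimal Python sketch of what a proportional readback could look like, assuming the proportion applies per request and that skipped requests reuse the last real result. `SampledReadback` and its fields are made up for the example.

```python
import random


class SampledReadback:
    """Performs a real readback only for a random fraction of requests."""

    def __init__(self, proportion, seed=0):
        self.proportion = proportion
        self.rng = random.Random(seed)
        self.last_result = None
        self.real_readbacks = 0

    def read(self, do_real_readback):
        # Always do the first one so there is something to reuse.
        if self.last_result is None or self.rng.random() < self.proportion:
            self.last_result = do_real_readback()
            self.real_readbacks += 1
        return self.last_result


never = SampledReadback(proportion=0.0)
always = SampledReadback(proportion=1.0)
for frame in range(10):
    never.read(lambda: frame)
    always.read(lambda: frame)

print(never.real_readbacks, always.real_readbacks)  # 1 10
```

Frames that hit the real readback still pay the full stall, which is where the uneven framerate would come from.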

@hrydgard hrydgard modified the milestones: v1.18.0, v1.19.0 Apr 9, 2024
Repository owner deleted a comment from terremoth May 18, 2024