Refactor VRAM download, optimize rendered CLUTs #8389
This way the GPU doesn't think it needs to load anything, it's all being overwritten. If we're only using part of the framebuffer, the other parts don't matter.
This avoids triggering logic that tries to get the sizing right, or optimize frequent copies. CLUTs often get estimated wrong, so it's better to copy just the correct range, always.
Sometimes we don't need the full width, such as when we're downloading a CLUT. In Brave Story, the CLUTs overlap in detected width, so this is a real improvement.
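As a rough illustration of downloading only part of a row, the sketch below converts a byte range within a framebuffer row into the pixel columns that cover it. The name `clut_pixel_range` and its parameters are hypothetical and not taken from the PPSSPP source; it assumes the CLUT lives in a single row with a fixed number of bytes per pixel.

```python
def clut_pixel_range(byte_offset: int, byte_count: int, bytes_per_pixel: int) -> tuple[int, int]:
    """Return (x, width) in pixels covering the requested byte range.

    Hypothetical helper: rounds outward so partially covered pixels are included.
    """
    first = byte_offset // bytes_per_pixel
    if byte_count <= 0:
        return (first, 0)
    # Ceiling division gives the exclusive end pixel.
    end = (byte_offset + byte_count + bytes_per_pixel - 1) // bytes_per_pixel
    return (first, end - first)
```

With 2 bytes per pixel, a 128-byte CLUT starting 128 bytes into the row would map to 64 pixels starting at x=64, rather than the whole detected width.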
In Brave Story, the game reloads the CLUT frequently, but doesn't actually render to the CLUT that often. It also switches between a few different rendered CLUTs - so caching that we've downloaded is a HUGE win. In case someone reading this message is interested, it actually renders these CLUT tables from what appears to be a color wheel. Crazy huh?
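The caching idea can be sketched roughly as follows: remember that a framebuffer has already been downloaded, and forget that fact whenever it is rendered to again. The class `RenderedClutSource` and its members are hypothetical names for illustration, not the actual PPSSPP structures.

```python
class RenderedClutSource:
    """Hypothetical stand-in for a framebuffer that a game uses as a CLUT."""

    def __init__(self, pixels: bytes):
        self.pixels = pixels
        self.downloaded = False  # assumed flag: cleared whenever we render
        self.download_count = 0
        self._memory_copy = b""

    def render(self, pixels: bytes) -> None:
        # Rendering changes the contents, so any previous download is stale.
        self.pixels = pixels
        self.downloaded = False

    def load_clut(self) -> bytes:
        # Only pay for the readback when the contents may have changed.
        if not self.downloaded:
            self._memory_copy = bytes(self.pixels)
            self.download_count += 1
            self.downloaded = True
        return self._memory_copy
```

Under this sketch, a game that reloads the same CLUT many times between renders triggers one download, and switching between several rendered CLUTs only costs a download for the ones that were actually redrawn.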
Hmm, something wrong in 28a07c7. -[Unknown] |
Okay, so this probably explains my troubles with this game in the other pull. I assumed it would start the CLUT at x=0. This one is using several CLUTs on the same row. I already tried adjusting it to load using an offset, but I have a bug somewhere. It only works if I download the full 384 bytes the first time... hmm. -[Unknown] |
D'oh, and right after I posted that I saw the typo. -[Unknown] |
Used by Kurohyo 2. Highly unlikely to be a mis-estimate within stride.
9000bf1 to c370c05
Sorry, I'm doing a terrible job testing. I just forgot to make sure it took the closest one, should be good now in both. -[Unknown] |
c370c05 to 6b98b99
Hm, I'm sure I'm missing something, but why not look for an exact match? |
Kurohyo loads from an offset X, so an exact match doesn't work. It loads 64 pixels of bytes or something at the same interval from a framebuffer that is 1px tall. In contrast, Brave Story uses multiple framebuffers right next to each other, at 0x400 bytes apart. -[Unknown] |
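A minimal sketch of why an exact match is not enough: pick the framebuffer that contains the CLUT address with the smallest offset, and keep that offset for the download. The function `find_clut_source`, the address-to-size mapping, and the example addresses in the note below are all assumptions made for illustration, not the real lookup code.

```python
from typing import Optional


def find_clut_source(clut_addr: int, framebuffers: dict[int, int]) -> Optional[tuple[int, int]]:
    """Pick the framebuffer containing clut_addr with the smallest offset.

    framebuffers maps a start address to a size in bytes (both assumed).
    Returns (framebuffer_start, byte_offset), or None if nothing contains it.
    """
    best = None
    for start, size in framebuffers.items():
        offset = clut_addr - start
        # Prefer the closest framebuffer that still covers the address.
        if 0 <= offset < size and (best is None or offset < best[1]):
            best = (start, offset)
    return best
```

With two overlapping framebuffers 0x400 bytes apart (the Brave Story case), a CLUT load at the second one's start resolves to that framebuffer at offset 0. With a single 1px-tall framebuffer (the Kurohyo case), a load partway into the row resolves to that framebuffer with a nonzero offset.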
By the way, that's the same reason my on GPU stuff doesn't work for the game; since it never downloads the CLUT for the offsets, and doesn't match the framebuffer for them, those all render with a garbage CLUT. I'm pretty sure applying this offset will help there, but have not done it yet. -[Unknown] |
Looks good now. |
Ah I see, right. |
Refactor VRAM download, optimize rendered CLUTs
@unknownbrackets Strangely, this badly breaks Ridge Racer on my S6 - lots of blackness and missing road. (To bisect back here I had to retroactively fix a bug where we would generate different shader versions for VS and FS). |
What caused that was that -[Unknown] |
Well, reverting this merge fixes Ridge Racer, so somehow it does. But I agree, I don't think it downloads at all... Oh, and it works fine on desktop, only on my S6 is it broken (haven't tried other android devices yet) |
This fixes #8252. I set out to fix that and I ended up greatly improving the performance of Brave Story in battles with the CLUT download.
Turns out, we were downloading too much, and doing it over and over. This optimization is huge for Brave Story, pretty much as good as #8246... but note that, without a bunch more work, that one won't work with texture scaling.
Anyway, the only real downside of this is that it adds another flag that needs to be reset on every render. I think the driver overhead is larger and this is fairly cheap, hopefully.
Improvement:
Boss battle (fairly complicated enemies): 192% -> 696%
Rabbit battle: 469% -> 745%
These changes don't prevent #8246, but for reference, the same numbers with that branch:
Boss battle: 640%
Rabbit battle: 800%
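To put the numbers above in relative terms, here is a small worked calculation, assuming the percentages are emulation speed so that a ratio of two values is a speedup factor. The helper `speedup` is only for illustration.

```python
def speedup(before_percent: float, after_percent: float) -> float:
    """Ratio of two speed percentages; above 1.0 means faster."""
    return after_percent / before_percent


boss = speedup(192, 696)            # 3.625x from this pull
rabbit = speedup(469, 745)          # about 1.59x from this pull
boss_vs_8246 = speedup(640, 696)    # above 1: this pull is ahead in the boss battle
rabbit_vs_8246 = speedup(800, 745)  # below 1: #8246 is ahead in the rabbit battle
```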
Based on the way the game renders (using a few different CLUTs for a set of textures that draw the enemy), we end up re-rendering several textures many times. In the boss battle especially. This is why the changes in this pull are actually faster for that example.
-[Unknown]