Optimize AnimationMixer with new BitSet class. #92257
Conversation
Does this limit the number of processes to 32 or 64 for multiplexing in a single frame? We would need to make sure there are no use cases that exceed that. If it does, we may need to be able to combine multiple BitSets.
This BitSet is flexible; it doesn't have a size limit.
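For illustration, a growable bitset along these lines might look roughly like the sketch below (this is only an illustration of the idea, not the actual class from this PR; it uses std::vector where the real class would use Godot containers):

```cpp
// Rough sketch of a growable bitset: maps an index to a word and a mask,
// and grows its backing storage on demand, so there is no fixed size limit.
#include <cstdint>
#include <vector>

class GrowableBitSet {
	std::vector<uint32_t> words;
	uint32_t count = 0;

public:
	void resize(uint32_t p_count) {
		count = p_count;
		words.assign((p_count + 31) / 32, 0); // all bits default to false
	}

	void set(uint32_t p_index, bool p_value) {
		uint32_t &word = words[p_index / 32];
		uint32_t mask = 1u << (p_index % 32);
		word = p_value ? (word | mask) : (word & ~mask);
	}

	bool get(uint32_t p_index) const {
		return (words[p_index / 32] & (1u << (p_index % 32))) != 0;
	}
};
```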
We don't have animation benchmarks in https://github.com/godotengine/godot-benchmarks yet, so I suppose we need to benchmark a more minimal comparison between the old manual approach and the BitSet approach. This likely means writing a microbenchmark and running it there.
I was getting deja vu, but then I realized I added an equivalent in 3.x three years ago for portals. 😁 That code is >20 years old though, and this PR seems more Godot-like. 👍
Add that one to the pile of "things I accidentally did that lawnjelly already did in 3.x". 😄 |
Don't worry; if this is neater I'll likely backport it to replace the old one, as long as it runs fast. 👍
I like it, but I'm curious if it makes a difference. I second waiting for performance profiles.
There's been a Twitter thread about at least 500 characters animating, each with 50+ bone tracks.
Looks good to me. Tested locally with a modified TPS demo with 256 player characters and 4 enemies, on an i7-1165G7 (integrated GPU) with the power profile set to "High Performance".
On master, my scene had a consistent 28-30 FPS (pretty consistently 29, but jumping to 30 or 28 for a moment).
With this PR I get a consistent 30-31 FPS (pretty consistently 30, but jumping to 31 for a moment).
My demo scene uses a 0.25 resolution scale to avoid a GPU bottleneck.
As you can see, in both cases the frame time is not super consistent, so it is pretty difficult to be confident. At the same time, removing allocations is always a good thing, so I suggest we go ahead with this PR.
for (const AnimationInstance &ai : animation_instances) {
	Ref<Animation> a = ai.animation_data.animation;
	real_t weight = ai.playback_info.weight;
	Vector<real_t> track_weights = ai.playback_info.track_weights;
	Vector<int> processed_indices;
	processed_indices.clear();
	processed_indices.resize(track_count);
Is there a maximum track count? If so, and this is called often, you can also allocate on the stack with alloca. This is done in a lot of performance-sensitive cases. (I haven't profiled this.)
(BTW not trying to suggest a hard requirement here, as I haven't examined it; this is just for ideas.)
The reuse of processed_indices for all animation_instances should already reduce allocations significantly.
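For reference, stack allocation with alloca would look roughly like the sketch below. This is only an illustration of the suggestion; the helper and its arguments are hypothetical, and alloca is only safe when the size has a small, known bound:

```cpp
#include <alloca.h> // POSIX; on Windows this is <malloc.h> and _alloca
#include <cstring>

// Hypothetical helper: marks which tracks were processed without heap allocation.
void process_tracks(int track_count) {
	// Stack allocation, freed automatically when the function returns.
	// Only reasonable if track_count has a known, modest upper bound.
	bool *processed = static_cast<bool *>(alloca(track_count * sizeof(bool)));
	std::memset(processed, 0, track_count * sizeof(bool));

	for (int i = 0; i < track_count; i++) {
		if (processed[i]) {
			continue; // already handled by an earlier animation instance
		}
		// ... blend the track here ...
		processed[i] = true;
	}
}
```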
clear() and resize() won't actually affect the capacity of the vector in most cases. clear() just sets the size to 0, and then resize() sets the new size and clears all bits to their default (false), which is why these two steps are paired. Neither will reallocate anything unless the capacity needs to increase. There's no hard maximum on track count as far as I can tell.
The only allocations happen when, across multiple calls (in the same thread), track_count increases enough to justify growing the capacity; I believe growth is either by powers of two or by a 1.5 ratio. I've talked with @RandomShaper in the past about alloca, but it comes with the drawback of potentially introducing a scalability problem if the function ever needs a very large amount of data and stack memory is very limited.
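The same pattern can be observed with a standard vector (a sketch only; the comment above describes the equivalent behavior for the Godot vector in question):

```cpp
#include <cstdio>
#include <vector>

int main() {
	std::vector<int> v;
	v.resize(100);
	std::printf("capacity after first resize: %zu\n", v.capacity());

	v.clear();     // size -> 0, capacity unchanged, no deallocation
	v.resize(100); // size -> 100, new elements value-initialized to 0, no reallocation
	std::printf("capacity after clear+resize: %zu\n", v.capacity());
}
```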
Yes, and anyway we can apply an enhancement later for all the thread_local vectors such that they are backed by memory from an arena allocator or something, getting the best of both worlds.
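To illustrate the arena idea (a rough sketch only, not anything that exists in Godot): a per-thread bump allocator that hands out scratch memory and is reset once per frame, so per-call scratch work never touches the general-purpose heap:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Per-thread bump arena: hands out scratch memory and is reset once per frame.
struct ScratchArena {
	std::vector<uint8_t> storage;
	size_t offset = 0;

	explicit ScratchArena(size_t p_bytes) : storage(p_bytes) {}

	// p_align must be a power of two.
	void *alloc(size_t p_bytes, size_t p_align = alignof(std::max_align_t)) {
		size_t aligned = (offset + p_align - 1) & ~(p_align - 1);
		if (aligned + p_bytes > storage.size()) {
			return nullptr; // caller would fall back to the heap
		}
		offset = aligned + p_bytes;
		return storage.data() + aligned;
	}

	void reset() { offset = 0; } // call once per frame
};

thread_local ScratchArena scratch_arena(64 * 1024);
```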
Agree this looks much better than before, whether or not it is a massive bottleneck.
Also bear in mind that sometimes it is worth having a fixed-size bitset class (e.g. via a template, or by passing in the data to use) to avoid dynamic allocation and use the stack instead.
Yup, the std bitset is actually fixed-size and allows skipping the heap allocation. The one I took inspiration from here is basically QBitArray. We could probably just have another template with a fixed size that uses a primitive type to offer further optimizations if needed, if the max size can be known ahead of time.
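A fixed-size variant along those lines might look like the sketch below; this is only an illustration of the idea (similar in spirit to std::bitset), not a proposed Godot class:

```cpp
#include <cstdint>

// Fixed-size bitset over a single primitive word: no heap allocation at all.
// BIT_COUNT must be known at compile time and fit in the word type.
template <uint32_t BIT_COUNT, typename WordType = uint64_t>
class FixedBitSet {
	static_assert(BIT_COUNT <= sizeof(WordType) * 8, "BIT_COUNT must fit in WordType");
	WordType bits = 0;

public:
	void set(uint32_t p_index, bool p_value) {
		WordType mask = WordType(1) << p_index;
		bits = p_value ? (bits | mask) : (bits & ~mask);
	}

	bool get(uint32_t p_index) const {
		return (bits >> p_index) & WordType(1);
	}

	void clear() { bits = 0; }
};

// Usage: FixedBitSet<64> processed; processed.set(3, true);
```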
Force-pushed from 4386849 to a4b8b64.
Rebased and added @lawnjelly's suggestion to use inline.
}

_FORCE_INLINE_ void set(uint32_t p_index, bool p_value) {
	CRASH_BAD_UNSIGNED_INDEX(p_index, count);
Looks like a redundant check since LocalVector operator[] has the same one.
CC @DarioSamo Do you want to go ahead and remove the redundant check before we merge?
I can no longer do this PR as @TokageItLab has changed the logic behind it to use hashes instead of indices. It needs to be re-evaluated completely from scratch. It's not possible to use a BitSet with hashes.
It's no longer possible to merge this PR as it needs to be re-evaluated against the changes done in #94716. However, we can probably salvage the BitSet class if we're interested in it.
Ah, ya. Looking at the new implementation, I think the performance will be as bad as before, but we can't apply the BitSet optimization anymore (unless we want a bitset that covers all possible hashes). Based on the description of #94716 it sounds like we can't go back to indexing by track indices. I'll close this; the BitSet class itself is salvageable, as we might find another use for it.
This optimization went a bit out of scope as it adds a new BitSet class we've recognized we could use in different places. For example, the D3D12 driver currently relies on implementing this structure manually (@RandomShaper will recall that we needed to fix it in his PR as well).
The optimization is fairly straightforward and shouldn't cause any behavior differences as long as the BitSet class is implemented correctly; I added some tests to verify that (although they might not be very extensive).
There are two problems with the function, and it causes a non-negligible amount of allocations if the project uses a lot of animation tracks:
1. A new vector for tracking processed tracks is allocated on every call.
2. Checking whether a track was already processed requires a linear lookup through that data.
Both of these are pretty straightforward to solve by using a BitSet stored in TLS instead, so allocation is only performed when the capacity needs to increase, and the contents are retained across multiple function calls. The second problem is resolved by reducing the linear lookup to a single index and mask check thanks to the BitSet; see the sketch below.
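As a rough sketch of that pattern (hypothetical names; the real code lives in AnimationMixer and uses Godot containers):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical illustration: a reusable per-thread bit array replaces both the
// per-call allocation and the linear "was this track already processed?" scan.
void blend_tracks(uint32_t track_count) {
	thread_local std::vector<uint32_t> processed_bits; // reused across calls (TLS)

	processed_bits.clear();
	processed_bits.resize((track_count + 31) / 32, 0); // grows capacity at most once

	for (uint32_t i = 0; i < track_count; i++) {
		uint32_t mask = 1u << (i % 32);
		if (processed_bits[i / 32] & mask) {
			continue; // single index + mask check, no linear search
		}
		// ... blend track i ...
		processed_bits[i / 32] |= mask;
	}
}
```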
I do believe we should benchmark this one properly, so I'm interested in hearing if @Calinou or someone else in charge of benchmarks has a good test case for this.