Optimize AnimationMixer with new BitSet class. #92257

Closed

Conversation

@DarioSamo (Contributor) commented May 22, 2024

This optimization went a bit out of scope, as it adds a new BitSet class that we've recognized could be used in different places. For example, the D3D12 driver currently implements this structure manually (@RandomShaper will recall that we needed to fix it in his PR as well).

The optimization is fairly straightforward and shouldn't cause any behavior differences as long as the BitSet class is implemented correctly, which I added some tests to verify (although they might not be very extensive).

There are two problems with the function, and they cause a non-negligible number of allocations if the project uses a lot of animation tracks:

  • It uses a CoW Vector that must always be allocated from scratch and gets elements added as they're processed.
  • It uses a linear search on that Vector to check whether an index was already added.

Both of these are pretty straightforward to solve by using a thread-local BitSet instead, so an allocation is only performed when the capacity needs to grow, and the contents are retained across multiple function calls. The second problem is resolved by reducing the linear lookup to a single index and mask check on the BitSet.
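
As a rough illustration of that pattern (the names below are illustrative, not the actual BitSet API added by this PR), a minimal standalone sketch:

```cpp
// Minimal sketch of the index-and-mask pattern; SimpleBitSet is a stand-in, not Godot's BitSet.
#include <cstdint>
#include <vector>

struct SimpleBitSet {
	std::vector<uint32_t> words; // backing storage; only grows when the capacity must increase

	void resize(uint32_t p_bit_count) {
		// assign() reuses the existing capacity, so repeated calls rarely allocate.
		words.assign((p_bit_count + 31) / 32, 0u);
	}
	void set(uint32_t p_index) {
		words[p_index / 32] |= (1u << (p_index % 32));
	}
	bool get(uint32_t p_index) const {
		return (words[p_index / 32] & (1u << (p_index % 32))) != 0u;
	}
};

int main() {
	SimpleBitSet processed;
	processed.resize(1000); // no 32/64-bit limit; the word count scales with the requested size
	processed.set(5);
	return processed.get(5) ? 0 : 1;
}
```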

I do believe we should benchmark this one properly, so I'm interested in hearing if @Calinou or someone else in charge of benchmarks has a good test case for this.

@TokageItLab (Member)

Does this limit the number of processes to 32 or 64 for multiplexing in a single frame? We would need to make sure there are no use cases that exceed that; if there are, we may need to be able to combine multiple BitSets.

@DarioSamo (Contributor, author)

> Does this limit the number of processes to 32 or 64 for multiplexing in a single frame? We would need to make sure there are no use cases that exceed that; if there are, we may need to be able to combine multiple BitSets.

This BitSet is flexible; it doesn't have a size limit.

@Calinou (Member) commented May 22, 2024

> I do believe we should benchmark this one properly, so I'm interested in hearing if @Calinou or someone else in charge of benchmarks has a good test case for this.

We don't have animation benchmarks in https://github.com/godotengine/godot-benchmarks yet, so I suppose we need to benchmark a more minimal comparison between the old manual approach and using BitSet. This likely means writing a microbenchmark and running it in main/main.cpp.
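
In the meantime, a rough standalone microbenchmark of that comparison could look like the sketch below (standard-library types stand in for Godot's Vector/BitSet, and the sizes are arbitrary):

```cpp
// Rough comparison: linear search over processed indices vs. a single index-and-mask check.
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
	const int track_count = 1024;
	const int iterations = 1000;
	using clock = std::chrono::steady_clock;

	// Old approach: linear search over the processed indices, vector rebuilt every pass.
	auto t0 = clock::now();
	uint64_t added_linear = 0;
	for (int iter = 0; iter < iterations; iter++) {
		std::vector<int> processed;
		for (int i = 0; i < track_count; i++) {
			bool found = false;
			for (int idx : processed) {
				if (idx == i) {
					found = true;
					break;
				}
			}
			if (!found) {
				processed.push_back(i);
				added_linear++;
			}
		}
	}
	auto t1 = clock::now();

	// New approach: reused bit array, one index-and-mask check per track.
	uint64_t added_bitset = 0;
	std::vector<uint32_t> words((track_count + 31) / 32, 0u);
	for (int iter = 0; iter < iterations; iter++) {
		std::fill(words.begin(), words.end(), 0u);
		for (int i = 0; i < track_count; i++) {
			uint32_t &w = words[i / 32];
			const uint32_t mask = 1u << (i % 32);
			if (!(w & mask)) {
				w |= mask;
				added_bitset++;
			}
		}
	}
	auto t2 = clock::now();

	printf("linear: %lld us, bitset: %lld us (added %llu vs %llu)\n",
			(long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count(),
			(long long)std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count(),
			(unsigned long long)added_linear, (unsigned long long)added_bitset);
	return 0;
}
```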

@lawnjelly (Member)

I was getting déjà vu, but then I realized I added an equivalent in 3.x about 3 years ago for portals: 😁
https://github.com/godotengine/godot/blob/3.x/core/bitfield_dynamic.h

That code is >20 years old though and this PR seems more Godot-like. 👍

@DarioSamo (Contributor, author)

> I was getting déjà vu, but then I realized I added an equivalent in 3.x about 3 years ago for portals

Add that one to the pile of "things I accidentally did that lawnjelly already did in 3.x". 😄

@lawnjelly (Member)

Don't worry, if this is neater I'll likely backport it to replace the old one, as long as it runs fast. 👍

@fire (Member) left a comment

I like it, but I'm curious whether it makes a difference. I second waiting for performance profiles.

@fire (Member) commented May 22, 2024

There's been a Twitter thread about at least 500 characters animating, each with 50+ bone tracks:

https://vxtwitter.com/duroxxigar/status/1779235340353450330

@clayjohn (Member) left a comment

Looks good to me. Tested locally with a modified TPS demo with 256 player characters and 4 enemies, on an i7-1165G7 (integrated GPU) with the power profile set to "High Performance".

On master, my scene had a consistent 28-30 FPS (pretty consistently 29, but jumping to 30 or 28 for a moment).

With this PR I get a consistent 30-31 FPS (pretty consistently 30, but jumping to 31 for a moment).

My demo scene (using 0.25 resolution scale to avoid a GPU bottleneck):
[screenshot]

Master:
[screenshot]

This PR:
[screenshot]

As you can see, in both cases the frame time is not super consistent, so it is pretty difficult to be confident. At the same time, removing allocations is always a good thing, so I suggest we go ahead with this PR.

for (const AnimationInstance &ai : animation_instances) {
	Ref<Animation> a = ai.animation_data.animation;
	real_t weight = ai.playback_info.weight;
	Vector<real_t> track_weights = ai.playback_info.track_weights;
	Vector<int> processed_indices;
	processed_indices.clear();
	processed_indices.resize(track_count);

@lawnjelly (Member) commented May 23, 2024

Is there a maximum track count? If so, and this is called often, you can also allocate on the stack with alloca. This is done in a lot of performance-sensitive cases. (I haven't profiled this.)

(BTW, I'm not trying to suggest a hard requirement here, as I haven't examined it; this is just for ideas.)

The reuse of processed_indices for all animation_instances should already reduce allocations significantly.
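
For reference, the alloca idea could look roughly like this (hypothetical names and bound; the PR itself reuses a thread-local structure instead):

```cpp
// Illustrative sketch of stack allocation via alloca; not the code in this PR.
#include <cstdint>
#include <cstring>
#include <alloca.h> // POSIX header; MSVC uses <malloc.h> and _alloca()

// Hypothetical helper, for illustration only.
void process_tracks(uint32_t track_count) {
	// Stack allocation avoids heap traffic entirely, but is only safe with a known,
	// modest upper bound on track_count, since stack space is limited.
	const uint32_t word_count = (track_count + 31) / 32;
	uint32_t *processed = (uint32_t *)alloca(word_count * sizeof(uint32_t));
	memset(processed, 0, word_count * sizeof(uint32_t));
	// ... mark a track as handled with: processed[i / 32] |= 1u << (i % 32);
}

int main() {
	process_tracks(128);
	return 0;
}
```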

@DarioSamo (Contributor, author) commented May 23, 2024

clear() and resize() won't actually affect the capacity of the vector in most cases. What this does is set the size to 0; the resize then sets the new size and clears all bits to their default (false), which is why the two steps are paired. Clear and resize won't reallocate anything unless the capacity needs to increase. There's no hard maximum on track count as far as I can tell.

The only allocations happen when, across multiple calls (in the same thread), track_count increases enough to justify growing the capacity. I believe growth is either by powers of two or by a 1.5 ratio. I've talked with @RandomShaper in the past about alloca, but it comes with the drawback of potentially introducing a scalability problem if the function ever needs a very large amount of data, since stack memory is very limited.
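
To illustrate that reuse pattern (std::vector stands in for the actual BitSet, and the function name is hypothetical):

```cpp
// Sketch of the thread-local reuse described above; not the actual AnimationMixer code.
#include <cstdint>
#include <vector>

void blend_process(uint32_t track_count) {
	// One backing buffer per thread, kept alive across calls.
	thread_local std::vector<uint32_t> processed_words;
	processed_words.clear(); // size -> 0, capacity is kept
	processed_words.resize((track_count + 31) / 32, 0u); // reallocates only if the capacity must grow
	// ... index-and-mask checks against processed_words while blending tracks ...
}

int main() {
	blend_process(256); // may allocate on the first call
	blend_process(256); // reuses the same buffer; no allocation
	return 0;
}
```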

Member

Yes, and anyway we can apply an enhancement later for all the thread_local vectors such that they are backed by memory from an arena allocator or something, getting the best of both worlds.

@lawnjelly (Member) left a comment

Agreed, this looks much better than before, whether or not it is a massive bottleneck.

Also bear in mind that sometimes it is worth having a fixed-size bitset class (e.g. by template, or by passing in the data to use) to avoid dynamic allocation and use the stack instead.

@DarioSamo (Contributor, author) commented May 23, 2024

> Also bear in mind that sometimes it is worth having a fixed-size bitset class (e.g. by template, or by passing in the data to use) to avoid dynamic allocation and use the stack instead.

Yup, std::bitset is actually fixed-size and allows skipping the heap allocation. The one I took inspiration from here is basically QBitArray. We could probably add another template with a fixed size backed by a primitive type to offer further optimizations if the maximum size can be known ahead of time.
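
A sketch of what such a fixed-size variant could look like (the template is hypothetical and not part of this PR; std::bitset is shown alongside for comparison):

```cpp
// Fixed-size bitsets live entirely on the stack, so there is no heap allocation at all.
#include <bitset>
#include <cstdint>

// Hypothetical single-word template, for illustration only.
template <uint32_t N>
struct FixedBitSet {
	static_assert(N <= 64, "single-word variant; larger sizes would need an array of words");
	uint64_t word = 0;

	void set(uint32_t p_index, bool p_value) {
		const uint64_t mask = uint64_t(1) << p_index;
		word = p_value ? (word | mask) : (word & ~mask);
	}
	bool get(uint32_t p_index) const { return (word >> p_index) & 1u; }
};

int main() {
	std::bitset<256> std_bits; // size fixed at compile time
	std_bits.set(42);
	FixedBitSet<64> small_bits; // primitive-backed, also heap-free
	small_bits.set(3, true);
	return (std_bits.test(42) && small_bits.get(3)) ? 0 : 1;
}
```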

core/templates/bit_set.h (review thread resolved, outdated)
@DarioSamo force-pushed the animation_mixer_alloc branch from 4386849 to a4b8b64 on May 28, 2024.
@DarioSamo (Contributor, author)

Rebased and added @lawnjelly's suggestion for inline.

}

_FORCE_INLINE_ void set(uint32_t p_index, bool p_value) {
	CRASH_BAD_UNSIGNED_INDEX(p_index, count);

Contributor

Looks like a redundant check, since LocalVector's operator[] has the same one.

Member

CC @DarioSamo Do you want to go ahead and remove the redundant check before we merge?

Contributor (author)

I can no longer do this PR, as @TokageItLab has changed the logic behind it to use hashes instead of indices. It needs to be re-evaluated completely from scratch; it's not possible to use a BitSet with hashes.

@DarioSamo marked this pull request as draft on August 16, 2024.
@DarioSamo (Contributor, author) commented Aug 16, 2024

It's no longer possible to merge this PR as it needs to be re-evaluated against the changes done in master. Conceptually speaking, it can no longer work as the logic was changed to keep track of hashes instead of indices, requiring a different optimization altogether.

However, we can probably salvage the BitSet class if we're interested in it.

@clayjohn (Member)

> It's no longer possible to merge this PR as it needs to be re-evaluated against the changes done in master. Conceptually speaking, it can no longer work as the logic was changed to keep track of hashes instead of indices, requiring a different optimization altogether.
>
> However, we can probably salvage the BitSet class if we're interested in it.

Ah, yeah. Looking at the new implementation, I think the performance will be as bad as before, but we can't apply the BitSet optimization anymore (unless we want a bitset that covers the full uint32_t range).

Based on the description of #94716 it sounds like we can't go back to indexing by blend_idx either.

I'll close this as salvageable, as we might find another use for the BitSet class.

@clayjohn closed this on Aug 16, 2024.
@clayjohn removed this from the 4.4 milestone on Aug 16, 2024.