Optimize AnimationMixer with new BitSet class. #92257

Closed

Conversation

@DarioSamo (Contributor) commented May 22, 2024

This optimization went a bit out of scope, as it adds a new BitSet class that we've recognized could be used in different places. For example, the D3D12 driver currently implements this structure manually (@RandomShaper will recall that we needed to fix it in his PR as well).

The optimization is fairly straightforward and shouldn't cause any behavior differences as long as the BitSet class is implemented correctly, which I added some tests to verify (although they might not be very extensive).

There are two problems with the function, and they cause a non-negligible number of allocations if the project uses a lot of animation tracks:

  • It uses a CoW Vector that must always be allocated from scratch and gets elements added as they're processed.
  • It uses a linear search on that Vector to check whether an index was already added.

Both of these are pretty straightforward to solve by using a thread-local BitSet instead, so an allocation is only performed when the capacity needs to grow, and the contents are retained across multiple function calls. The second problem is resolved by reducing the linear lookup to a single index and mask check on the BitSet.
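
As a rough illustration of that pattern (the names below are illustrative, not the actual BitSet API added by this PR), a minimal standalone sketch:

```cpp
// Minimal sketch of the index-and-mask pattern; SimpleBitSet is a stand-in, not Godot's BitSet.
#include <cstdint>
#include <vector>

struct SimpleBitSet {
	std::vector<uint32_t> words; // backing storage; only grows when the capacity must increase

	void resize(uint32_t p_bit_count) {
		// assign() reuses the existing capacity, so repeated calls rarely allocate.
		words.assign((p_bit_count + 31) / 32, 0u);
	}
	void set(uint32_t p_index) {
		words[p_index / 32] |= (1u << (p_index % 32));
	}
	bool get(uint32_t p_index) const {
		return (words[p_index / 32] & (1u << (p_index % 32))) != 0u;
	}
};

int main() {
	SimpleBitSet processed;
	processed.resize(1000); // no 32/64-bit limit; the word count scales with the requested size
	processed.set(5);
	return processed.get(5) ? 0 : 1;
}
```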

I do believe we should benchmark this one properly, so I'm interested in hearing if @Calinou or someone else in charge of benchmarks has a good test case for this.

@TokageItLab (Member)

Does this limit the number of processes to 32 or 64 for multiplexing in a single frame? We would need to make sure there are no use cases that exceed that; if there are, we may need to be able to combine multiple BitSets.

@DarioSamo (Contributor, author)

> Does this limit the number of processes to 32 or 64 for multiplexing in a single frame? We would need to make sure there are no use cases that exceed that; if there are, we may need to be able to combine multiple BitSets.

This BitSet is flexible; it doesn't have a size limit.

@Calinou (Member) commented May 22, 2024

> I do believe we should benchmark this one properly, so I'm interested in hearing if @Calinou or someone else in charge of benchmarks has a good test case for this.

We don't have animation benchmarks in https://github.com/godotengine/godot-benchmarks yet, so I suppose we need to benchmark a more minimal comparison between the old manual approach and using BitSet. This likely means writing a microbenchmark and running it in main/main.cpp.
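
In the meantime, a rough standalone microbenchmark of that comparison could look like the sketch below (standard-library types stand in for Godot's Vector/BitSet, and the sizes are arbitrary):

```cpp
// Rough comparison: linear search over processed indices vs. a single index-and-mask check.
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
	const int track_count = 1024;
	const int iterations = 1000;
	using clock = std::chrono::steady_clock;

	// Old approach: linear search over the processed indices, vector rebuilt every pass.
	auto t0 = clock::now();
	uint64_t added_linear = 0;
	for (int iter = 0; iter < iterations; iter++) {
		std::vector<int> processed;
		for (int i = 0; i < track_count; i++) {
			bool found = false;
			for (int idx : processed) {
				if (idx == i) {
					found = true;
					break;
				}
			}
			if (!found) {
				processed.push_back(i);
				added_linear++;
			}
		}
	}
	auto t1 = clock::now();

	// New approach: reused bit array, one index-and-mask check per track.
	uint64_t added_bitset = 0;
	std::vector<uint32_t> words((track_count + 31) / 32, 0u);
	for (int iter = 0; iter < iterations; iter++) {
		std::fill(words.begin(), words.end(), 0u);
		for (int i = 0; i < track_count; i++) {
			uint32_t &w = words[i / 32];
			const uint32_t mask = 1u << (i % 32);
			if (!(w & mask)) {
				w |= mask;
				added_bitset++;
			}
		}
	}
	auto t2 = clock::now();

	printf("linear: %lld us, bitset: %lld us (added %llu vs %llu)\n",
			(long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count(),
			(long long)std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count(),
			(unsigned long long)added_linear, (unsigned long long)added_bitset);
	return 0;
}
```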

@lawnjelly (Member)

I was getting déjà vu, but then I realized I added an equivalent in 3.x about 3 years ago for portals: 😁
https://github.com/godotengine/godot/blob/3.x/core/bitfield_dynamic.h

That code is >20 years old though and this PR seems more Godot-like. 👍

@DarioSamo (Contributor, author)

> I was getting déjà vu, but then I realized I added an equivalent in 3.x about 3 years ago for portals

Add that one to the pile of "things I accidentally did that lawnjelly already did in 3.x". 😄

@lawnjelly (Member)

Don't worry, if this is neater I'll likely backport it to replace the old one, as long as it runs fast. 👍

@fire (Member) left a comment

I like it, but I'm curious whether it makes a difference. I second waiting for performance profiles.

@fire (Member) commented May 22, 2024

There's been a Twitter thread about at least 500 characters animating, each with 50+ bone tracks:

https://vxtwitter.com/duroxxigar/status/1779235340353450330

@clayjohn (Member) left a comment

Looks good to me. Tested locally with a modified TPS demo with 256 player characters and 4 enemies, on an i7-1165G7 (integrated GPU) with the power profile set to "High Performance".

On master, my scene had a consistent 28-30 FPS (pretty consistently 29, but jumping to 30 or 28 for a moment).

With this PR I get a consistent 30-31 FPS (pretty consistently 30, but jumping to 31 for a moment).

My demo scene (using 0.25 resolution scale to avoid a GPU bottleneck):
[screenshot]

Master:
[screenshot]

This PR:
[screenshot]

As you can see, in both cases the frame time is not super consistent, so it is pretty difficult to be confident. At the same time, removing allocations is always a good thing, so I suggest we go ahead with this PR.

for (const AnimationInstance &ai : animation_instances) {
	Ref<Animation> a = ai.animation_data.animation;
	real_t weight = ai.playback_info.weight;
	Vector<real_t> track_weights = ai.playback_info.track_weights;
	Vector<int> processed_indices;
	processed_indices.clear();
	processed_indices.resize(track_count);

@lawnjelly (Member) commented May 23, 2024

Is there a maximum track count? If so, and this is called often, you can also allocate on the stack with alloca. This is done in a lot of performance-sensitive cases. (I haven't profiled this.)

(BTW, I'm not trying to suggest a hard requirement here, as I haven't examined it; this is just for ideas.)

The reuse of processed_indices for all animation_instances should already reduce allocations significantly.
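
For reference, the alloca idea could look roughly like this (hypothetical names and bound; the PR itself reuses a thread-local structure instead):

```cpp
// Illustrative sketch of stack allocation via alloca; not the code in this PR.
#include <cstdint>
#include <cstring>
#include <alloca.h> // POSIX header; MSVC uses <malloc.h> and _alloca()

// Hypothetical helper, for illustration only.
void process_tracks(uint32_t track_count) {
	// Stack allocation avoids heap traffic entirely, but is only safe with a known,
	// modest upper bound on track_count, since stack space is limited.
	const uint32_t word_count = (track_count + 31) / 32;
	uint32_t *processed = (uint32_t *)alloca(word_count * sizeof(uint32_t));
	memset(processed, 0, word_count * sizeof(uint32_t));
	// ... mark a track as handled with: processed[i / 32] |= 1u << (i % 32);
}

int main() {
	process_tracks(128);
	return 0;
}
```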

@DarioSamo (Contributor, author) commented May 23, 2024

clear() and resize() won't actually affect the capacity of the vector in most cases. What this does is set the size to 0; the resize then sets the new size and clears all bits to their default (false), which is why the two steps are paired. Clear and resize won't reallocate anything unless the capacity needs to increase. There's no hard maximum on track count as far as I can tell.

The only allocations happen when, across multiple calls (in the same thread), track_count increases enough to justify growing the capacity. I believe growth is either by powers of two or by a 1.5 ratio. I've talked with @RandomShaper in the past about alloca, but it comes with the drawback of potentially introducing a scalability problem if the function ever needs a very large amount of data, since stack memory is very limited.
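
To illustrate that reuse pattern (std::vector stands in for the actual BitSet, and the function name is hypothetical):

```cpp
// Sketch of the thread-local reuse described above; not the actual AnimationMixer code.
#include <cstdint>
#include <vector>

void blend_process(uint32_t track_count) {
	// One backing buffer per thread, kept alive across calls.
	thread_local std::vector<uint32_t> processed_words;
	processed_words.clear(); // size -> 0, capacity is kept
	processed_words.resize((track_count + 31) / 32, 0u); // reallocates only if the capacity must grow
	// ... index-and-mask checks against processed_words while blending tracks ...
}

int main() {
	blend_process(256); // may allocate on the first call
	blend_process(256); // reuses the same buffer; no allocation
	return 0;
}
```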

Member

Yes, and anyway we can apply an enhancement later for all the thread_local vectors such that they are backed by memory from an arena allocator or something, getting the best of both worlds.

@lawnjelly (Member) left a comment

Agreed, this looks much better than before, whether or not it is a massive bottleneck.

Also bear in mind that sometimes it is worth having a fixed-size bitset class (e.g. by template, or by passing in the data to use) to avoid dynamic allocation and use the stack instead.

@DarioSamo (Contributor, author) commented May 23, 2024

> Also bear in mind that sometimes it is worth having a fixed-size bitset class (e.g. by template, or by passing in the data to use) to avoid dynamic allocation and use the stack instead.

Yup, std::bitset is actually fixed-size and allows skipping the heap allocation. The one I took inspiration from here is basically QBitArray. We could probably add another template with a fixed size backed by a primitive type to offer further optimizations if the maximum size can be known ahead of time.
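
A sketch of what such a fixed-size variant could look like (the template is hypothetical and not part of this PR; std::bitset is shown alongside for comparison):

```cpp
// Fixed-size bitsets live entirely on the stack, so there is no heap allocation at all.
#include <bitset>
#include <cstdint>

// Hypothetical single-word template, for illustration only.
template <uint32_t N>
struct FixedBitSet {
	static_assert(N <= 64, "single-word variant; larger sizes would need an array of words");
	uint64_t word = 0;

	void set(uint32_t p_index, bool p_value) {
		const uint64_t mask = uint64_t(1) << p_index;
		word = p_value ? (word | mask) : (word & ~mask);
	}
	bool get(uint32_t p_index) const { return (word >> p_index) & 1u; }
};

int main() {
	std::bitset<256> std_bits; // size fixed at compile time
	std_bits.set(42);
	FixedBitSet<64> small_bits; // primitive-backed, also heap-free
	small_bits.set(3, true);
	return (std_bits.test(42) && small_bits.get(3)) ? 0 : 1;
}
```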

core/templates/bit_set.h (review thread resolved, outdated)
@DarioSamo force-pushed the animation_mixer_alloc branch from 4386849 to a4b8b64 on May 28, 2024.
@DarioSamo (Contributor, author)

Rebased and added @lawnjelly's suggestion for inline.

}

_FORCE_INLINE_ void set(uint32_t p_index, bool p_value) {
	CRASH_BAD_UNSIGNED_INDEX(p_index, count);

Contributor

Looks like a redundant check, since LocalVector's operator[] has the same one.

Member

CC @DarioSamo Do you want to go ahead and remove the redundant check before we merge?

Contributor (author)

I can no longer do this PR, as @TokageItLab has changed the logic behind it to use hashes instead of indices. It needs to be re-evaluated completely from scratch; it's not possible to use a BitSet with hashes.

@DarioSamo marked this pull request as draft on August 16, 2024.
@DarioSamo (Contributor, author) commented Aug 16, 2024

It's no longer possible to merge this PR as it needs to be re-evaluated against the changes done in master. Conceptually speaking, it can no longer work as the logic was changed to keep track of hashes instead of indices, requiring a different optimization altogether.

However, we can probably salvage the BitSet class if we're interested in it.

@clayjohn (Member)

> It's no longer possible to merge this PR as it needs to be re-evaluated against the changes done in master. Conceptually speaking, it can no longer work as the logic was changed to keep track of hashes instead of indices, requiring a different optimization altogether.
>
> However, we can probably salvage the BitSet class if we're interested in it.

Ah, yeah. Looking at the new implementation, I think the performance will be as bad as before, but we can't apply the BitSet optimization anymore (unless we want a bitset that covers the full uint32_t range).

Based on the description of #94716 it sounds like we can't go back to indexing by blend_idx either.

I'll close this as salvageable, as we might find another use for the BitSet class.

@clayjohn closed this on Aug 16, 2024.
@clayjohn removed this from the 4.4 milestone on Aug 16, 2024.