Speed up creating and extending packed arrays from iterators up to 63× #1023

ttencate · 2025-01-20T20:20:11Z

This uses the iterator size hint to pre-allocate, which leads to 63× speedup in the best case. If the hint is pessimistic, it reads into a buffer to avoid repeated push() calls, which is still 44x as fast as the previous implementation.

GodotRust · 2025-01-20T20:30:26Z

API docs are being generated and will be shortly available at: https://godot-rust.github.io/docs/gdext/pr-1023

This uses the iterator size hint to pre-allocate, which leads to 63× speedup in the best case. If the hint is pessimistic, it reads into a buffer to avoid repeated push() calls, which is still 44x as fast as the previous implementation.

Bromeon

Thanks a lot, this sounds like a great improvement! 🚀

Could you elaborate the role of the intermediate stack buffer? Since it's possible to resize the packed array based on size_hint(), why not do that and write directly from the iterator to self.as_mut_slice()?

Also, ParamType::owned_to_arg() is no longer occurring in the resulting code, is that not necessary for genericity?

godot-core/src/builtin/collections/packed_array.rs

ttencate · 2025-01-21T15:21:37Z

Could you elaborate the role of the intermediate stack buffer? Since it's possible to resize the packed array based on size_hint(), why not do that and write directly from the iterator to self.as_mut_slice()?

That's what the "Fast part" does. The buffer is only needed if there are more items after that.

I guess there might be iterators whose size_hint() gets updated intelligently during iteration, but mostly I'd expect it to decrement down to 0 every time next() is called. So once we have processed size_hint() items, we have no idea how many more are coming, and that's when we proceed to the "Slower part" and start using the buffer.

The alternative (which I implemented initially) is to grow the array in increments of 32 elements, write to as_mut_slice(), and then shrink it again at the end. The problem with that is that you might end up over-allocating, because Godot extends the memory allocation in powers of two. In some cases, this would make the returned array use twice as much memory as in the current implementation – memory that is potentially retained for a very long time. The buffer avoids that nicely, at the expense of some extra copying.

Also, ParamType::owned_to_arg() is no longer occurring in the resulting code, is that not necessary for genericity?

Apparently not. We only implement Extend<$Element>, not something like Extend<E> where E: Into<$Element>.

Bromeon · 2025-01-21T16:37:07Z

So once we have processed size_hint() items, we have no idea how many more are coming, and that's when we proceed to the "Slower part" and start using the buffer.

If that's the slow part that only happens on "bad" implementations of size_hint(), is the complexity really necessary or can we fall back to the naive approach for remaining elements (less maintenance, lower risks of bugs)?

Do you know how often this occurs in practice?

ttencate · 2025-01-21T17:44:47Z

There are at least two categories of iterators that are common in the wild, for which we'd want good performance:

Exact size is known, e.g. just iterating over a Vec or BTreeSet. This uses the fast part and does nothing on the slower part, since the iterator is finished.
Exact size is not known and the lower bound is 0, e.g. Filter, FlatMap, FromFn. This does nothing on the fast part and then finishes the iterator through the slower part. Note that the slower path is only 1.5× slower than the fast path, and still 44× as fast as naive repeated push() calls.

This PR is sufficient to handle them both efficiently. We could eliminate the fast part (case 1) and not lose a lot of performance (maybe incur some memory fragmentation), but that's actually the straightforward and obvious part, so the maintainability gain is small.

This PR also happens to deal efficiently with anything in between, i.e. iterators that report a nonzero lower bound but may return more elements. One example of those would be a Chain of the above two cases.

Bromeon · 2025-01-21T18:39:17Z

Sounds good, thanks for elaborating! The 2kB buffer (512 ints) is probably also not a big issue, even on mobile/Wasm?

ttencate · 2025-01-21T19:11:19Z

A cursory search shows stack sizes of at least 1 MB on all platforms. If it becomes a problem after all, it's easy enough to adjust.

Dheatly23 · 2025-01-22T01:21:41Z

godot-core/src/builtin/collections/packed_array.rs

+                while let Some(item) = iter.next() {
+                    buf[0].write(item);
+                    let mut buf_len = 1;
+                    for (src, dst) in iter::zip(&mut iter, buf.iter_mut().skip(1)) {


If buffer is full, then iterator is advanced but the item is discarded.

Reference: https://doc.rust-lang.org/src/core/iter/adapters/zip.rs.html#165-170

Suggested change

for (src, dst) in iter::zip(&mut iter, buf.iter_mut().skip(1)) {

for (dst, src) in iter::zip(buf.iter_mut().skip(1), &mut iter) {

😱 Yikes, great catch! Maybe this is why I intuitively wrote it in a more explicit way to begin with. iter::zip looks so symmetrical (which is why I prefer it over Iterator::zip) but in this case, that's misleading.

I've updated the test to catch this, and rewrote the loop to be more explicit. The new test also caught another bug that all three of us missed: len += buf_len; was missing at the end of the loop. But I'm confident that it is correct now.

ttencate force-pushed the perf/packed_array_extend branch from 30677e3 to f2e267d Compare January 20, 2025 20:34

Bromeon reviewed Jan 21, 2025

View reviewed changes

godot-core/src/builtin/collections/packed_array.rs Outdated Show resolved Hide resolved

godot-core/src/builtin/collections/packed_array.rs Outdated Show resolved Hide resolved

godot-core/src/builtin/collections/packed_array.rs Outdated Show resolved Hide resolved

Bromeon added c: core Core components performance Performance problems and optimizations labels Jan 21, 2025

Dheatly23 reviewed Jan 21, 2025

View reviewed changes

godot-core/src/builtin/collections/packed_array.rs Outdated Show resolved Hide resolved

godot-core/src/builtin/collections/packed_array.rs Outdated Show resolved Hide resolved

ttencate added 3 commits January 21, 2025 16:26

Rewrap lines

c801b5d

Clean up inner loop

26b20d3

Use move_from_slice

c45917c

ttencate requested review from Bromeon and Dheatly23 January 21, 2025 15:50

Dheatly23 reviewed Jan 22, 2025

View reviewed changes

Improve tests and fix two bugs

e8efd08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speed up creating and extending packed arrays from iterators up to 63× #1023

Speed up creating and extending packed arrays from iterators up to 63× #1023

ttencate commented Jan 20, 2025

GodotRust commented Jan 20, 2025

Bromeon left a comment

ttencate commented Jan 21, 2025

Bromeon commented Jan 21, 2025

ttencate commented Jan 21, 2025

Bromeon commented Jan 21, 2025

ttencate commented Jan 21, 2025

Dheatly23 Jan 22, 2025 •

edited

Loading

ttencate Jan 22, 2025

	for (src, dst) in iter::zip(&mut iter, buf.iter_mut().skip(1)) {
	for (dst, src) in iter::zip(buf.iter_mut().skip(1), &mut iter) {

Speed up creating and extending packed arrays from iterators up to 63× #1023

Are you sure you want to change the base?

Speed up creating and extending packed arrays from iterators up to 63× #1023

Conversation

ttencate commented Jan 20, 2025

GodotRust commented Jan 20, 2025

Bromeon left a comment

Choose a reason for hiding this comment

ttencate commented Jan 21, 2025

Bromeon commented Jan 21, 2025

ttencate commented Jan 21, 2025

Bromeon commented Jan 21, 2025

ttencate commented Jan 21, 2025

Dheatly23 Jan 22, 2025 • edited Loading

Choose a reason for hiding this comment

ttencate Jan 22, 2025

Choose a reason for hiding this comment

Dheatly23 Jan 22, 2025 •

edited

Loading