Inefficient codegen collecting slice iterator into array #126000

orlp · 2024-06-04T23:40:41Z

While working on Itertools::collect_array for itertools I wanted to compare the efficiency of

pub fn try_into(sl: &[u32]) -> Option<[u32; N]> {
    Some(*<&[u32; N]>::try_from(sl.get(0..N)?).unwrap())
}

pub fn collect_array(sl: &[u32]) -> Option<[u32; N]> {
    sl.iter().copied().collect_array()
}

You can see my comparison here: https://rust.godbolt.org/z/qPW3K8aTx.

I was rather shocked, both look good for N = 4, but for N = 16 we see the following nice implementation for try_into:

try_into:
        mov     rax, rdi
        xor     ecx, ecx
        cmp     rdx, 15
        jbe     .LBB0_2
        vmovups ymm0, ymmword ptr [rsi + 32]
        vmovups ymmword ptr [rax + 36], ymm0
        vmovups ymm0, ymmword ptr [rsi]
        vmovups ymmword ptr [rax + 4], ymm0
        mov     ecx, 1
.LBB0_2:
        mov     dword ptr [rax], ecx
        vzeroupper
        ret

But the following abomination for collect_array:

collect_array:
        mov     rax, rdi
        xor     ecx, ecx
        test    rdx, rdx
        je      .LBB1_14
        cmp     rdx, 1
        je      .LBB1_14
        cmp     rdx, 2
        je      .LBB1_14
        cmp     rdx, 3
        je      .LBB1_14
        cmp     rdx, 4
        je      .LBB1_14
        cmp     rdx, 5
        je      .LBB1_14
        cmp     rdx, 6
        je      .LBB1_14
        cmp     rdx, 7
        je      .LBB1_14
        cmp     rdx, 8
        je      .LBB1_14
        cmp     rdx, 9
        je      .LBB1_14
        cmp     rdx, 10
        je      .LBB1_14
        cmp     rdx, 11
        je      .LBB1_14
        and     rdx, -4
        cmp     rdx, 12
        je      .LBB1_14
        push    rbp
        push    r15
        push    r14
        push    r12
        push    rbx
        mov     ecx, dword ptr [rsi]
        mov     edx, dword ptr [rsi + 4]
        mov     edi, dword ptr [rsi + 8]
        mov     r8d, dword ptr [rsi + 12]
        mov     r9d, dword ptr [rsi + 16]
        mov     r10d, dword ptr [rsi + 20]
        mov     r11d, dword ptr [rsi + 24]
        mov     ebx, dword ptr [rsi + 28]
        mov     ebp, dword ptr [rsi + 32]
        mov     r14d, dword ptr [rsi + 36]
        mov     r15d, dword ptr [rsi + 40]
        mov     r12d, dword ptr [rsi + 44]
        mov     dword ptr [rax + 4], ecx
        mov     dword ptr [rax + 8], edx
        mov     dword ptr [rax + 12], edi
        mov     dword ptr [rax + 16], r8d
        mov     dword ptr [rax + 20], r9d
        mov     dword ptr [rax + 24], r10d
        mov     dword ptr [rax + 28], r11d
        mov     dword ptr [rax + 32], ebx
        mov     dword ptr [rax + 36], ebp
        mov     dword ptr [rax + 40], r14d
        mov     dword ptr [rax + 44], r15d
        mov     dword ptr [rax + 48], r12d
        vmovups xmm0, xmmword ptr [rsi + 48]
        vmovups xmmword ptr [rax + 52], xmm0
        mov     ecx, 1
        pop     rbx
        pop     r12
        pop     r14
        pop     r15
        pop     rbp
.LBB1_14:
        mov     dword ptr [rax], ecx
        ret

While it not folding the consecutive address movs into efficient SIMD movs is disappointing, I would argue that there's probably a bug somewhere since it compares rdx THIRTEEN TIMES IN A ROW to ultimately just check if it is < 16.

This doesn't appear to be a recent regression, the same happens in 1.60 through nightly, and it happens on both x86 as well as ARM.

The text was updated successfully, but these errors were encountered:

orlp · 2024-06-04T23:45:46Z

And in case anyone wonders, the exact same bad code is generated if one uses arrayvec::ArrayVec rather than my ArrayBuilder:

pub fn next_array<I, T, const N: usize>(it: &mut I) -> Option<[T; N]>
where
    I: Iterator<Item = T>,
{
    let mut builder = arrayvec::ArrayVec::new();
    for _ in 0..N {
        builder.push(it.next()?);
    }
    builder.into_inner().ok()
}

the8472 · 2024-06-05T17:32:33Z

You could look at what next_chunk does in std, it uses MaybeUninit internally. And Copied has some specialization on top.
Though itertool's job actually should be easier here since you're just returning an Option instead of a Result with a remainder.

DianQK · 2024-06-13T14:09:03Z

I reduced to the following code, but this is another issue:

#[no_mangle]
pub fn src(sl: &[u32]) -> Option<u32> {
    let mut r = 0;
    let it = &mut sl.iter();
    for _ in 0..2 {
        r += it.next()?;
    }
    Some(r)
}

#[no_mangle]
pub fn tgt(sl: &[u32]) -> Option<u32> {
    let mut r = 0;
    let it = &mut sl.iter();
    r += it.next()?;
    r += it.next()?;
    Some(r)
}

https://rust.godbolt.org/z/596qb413n

orlp added the C-bug Category: This is a bug. label Jun 4, 2024

rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Jun 4, 2024

orlp mentioned this issue Jun 5, 2024

Add next_array and collect_array rust-itertools/itertools#560

Merged

saethlin changed the title ~~Inefficient codegen collecting slice iterator into array, arguably bugged~~ Inefficient codegen collecting slice iterator into array Jun 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inefficient codegen collecting slice iterator into array #126000

Inefficient codegen collecting slice iterator into array #126000

orlp commented Jun 4, 2024 •

edited

Loading

orlp commented Jun 4, 2024

the8472 commented Jun 5, 2024

DianQK commented Jun 13, 2024

Inefficient codegen collecting slice iterator into array #126000

Inefficient codegen collecting slice iterator into array #126000

Comments

orlp commented Jun 4, 2024 • edited Loading

orlp commented Jun 4, 2024

the8472 commented Jun 5, 2024

DianQK commented Jun 13, 2024

orlp commented Jun 4, 2024 •

edited

Loading