Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inefficient codegen collecting slice iterator into array #126000

Open
orlp opened this issue Jun 4, 2024 · 3 comments
Open

Inefficient codegen collecting slice iterator into array #126000

orlp opened this issue Jun 4, 2024 · 3 comments
Labels
A-codegen Area: Code generation A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. I-heavy Issue: Problems and improvements with respect to binary size of generated code. I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@orlp
Copy link
Contributor

orlp commented Jun 4, 2024

While working on Itertools::collect_array for itertools I wanted to compare the efficiency of

pub fn try_into(sl: &[u32]) -> Option<[u32; N]> {
    Some(*<&[u32; N]>::try_from(sl.get(0..N)?).unwrap())
}

pub fn collect_array(sl: &[u32]) -> Option<[u32; N]> {
    sl.iter().copied().collect_array()
}

You can see my comparison here: https://rust.godbolt.org/z/qPW3K8aTx.

I was rather shocked, both look good for N = 4, but for N = 16 we see the following nice implementation for try_into:

try_into:
        mov     rax, rdi
        xor     ecx, ecx
        cmp     rdx, 15
        jbe     .LBB0_2
        vmovups ymm0, ymmword ptr [rsi + 32]
        vmovups ymmword ptr [rax + 36], ymm0
        vmovups ymm0, ymmword ptr [rsi]
        vmovups ymmword ptr [rax + 4], ymm0
        mov     ecx, 1
.LBB0_2:
        mov     dword ptr [rax], ecx
        vzeroupper
        ret

But the following abomination for collect_array:

collect_array:
        mov     rax, rdi
        xor     ecx, ecx
        test    rdx, rdx
        je      .LBB1_14
        cmp     rdx, 1
        je      .LBB1_14
        cmp     rdx, 2
        je      .LBB1_14
        cmp     rdx, 3
        je      .LBB1_14
        cmp     rdx, 4
        je      .LBB1_14
        cmp     rdx, 5
        je      .LBB1_14
        cmp     rdx, 6
        je      .LBB1_14
        cmp     rdx, 7
        je      .LBB1_14
        cmp     rdx, 8
        je      .LBB1_14
        cmp     rdx, 9
        je      .LBB1_14
        cmp     rdx, 10
        je      .LBB1_14
        cmp     rdx, 11
        je      .LBB1_14
        and     rdx, -4
        cmp     rdx, 12
        je      .LBB1_14
        push    rbp
        push    r15
        push    r14
        push    r12
        push    rbx
        mov     ecx, dword ptr [rsi]
        mov     edx, dword ptr [rsi + 4]
        mov     edi, dword ptr [rsi + 8]
        mov     r8d, dword ptr [rsi + 12]
        mov     r9d, dword ptr [rsi + 16]
        mov     r10d, dword ptr [rsi + 20]
        mov     r11d, dword ptr [rsi + 24]
        mov     ebx, dword ptr [rsi + 28]
        mov     ebp, dword ptr [rsi + 32]
        mov     r14d, dword ptr [rsi + 36]
        mov     r15d, dword ptr [rsi + 40]
        mov     r12d, dword ptr [rsi + 44]
        mov     dword ptr [rax + 4], ecx
        mov     dword ptr [rax + 8], edx
        mov     dword ptr [rax + 12], edi
        mov     dword ptr [rax + 16], r8d
        mov     dword ptr [rax + 20], r9d
        mov     dword ptr [rax + 24], r10d
        mov     dword ptr [rax + 28], r11d
        mov     dword ptr [rax + 32], ebx
        mov     dword ptr [rax + 36], ebp
        mov     dword ptr [rax + 40], r14d
        mov     dword ptr [rax + 44], r15d
        mov     dword ptr [rax + 48], r12d
        vmovups xmm0, xmmword ptr [rsi + 48]
        vmovups xmmword ptr [rax + 52], xmm0
        mov     ecx, 1
        pop     rbx
        pop     r12
        pop     r14
        pop     r15
        pop     rbp
.LBB1_14:
        mov     dword ptr [rax], ecx
        ret

While it not folding the consecutive address movs into efficient SIMD movs is disappointing, I would argue that there's probably a bug somewhere since it compares rdx THIRTEEN TIMES IN A ROW to ultimately just check if it is < 16.

This doesn't appear to be a recent regression, the same happens in 1.60 through nightly, and it happens on both x86 as well as ARM.

@orlp orlp added the C-bug Category: This is a bug. label Jun 4, 2024
@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Jun 4, 2024
@orlp
Copy link
Contributor Author

orlp commented Jun 4, 2024

And in case anyone wonders, the exact same bad code is generated if one uses arrayvec::ArrayVec rather than my ArrayBuilder:

pub fn next_array<I, T, const N: usize>(it: &mut I) -> Option<[T; N]>
where
    I: Iterator<Item = T>,
{
    let mut builder = arrayvec::ArrayVec::new();
    for _ in 0..N {
        builder.push(it.next()?);
    }
    builder.into_inner().ok()
}

@the8472
Copy link
Member

the8472 commented Jun 5, 2024

You could look at what next_chunk does in std, it uses MaybeUninit internally. And Copied has some specialization on top.
Though itertool's job actually should be easier here since you're just returning an Option instead of a Result with a remainder.

@saethlin saethlin added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code. A-codegen Area: Code generation I-heavy Issue: Problems and improvements with respect to binary size of generated code. and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Jun 13, 2024
@DianQK
Copy link
Member

DianQK commented Jun 13, 2024

I reduced to the following code, but this is another issue:

#[no_mangle]
pub fn src(sl: &[u32]) -> Option<u32> {
    let mut r = 0;
    let it = &mut sl.iter();
    for _ in 0..2 {
        r += it.next()?;
    }
    Some(r)
}

#[no_mangle]
pub fn tgt(sl: &[u32]) -> Option<u32> {
    let mut r = 0;
    let it = &mut sl.iter();
    r += it.next()?;
    r += it.next()?;
    Some(r)
}

https://rust.godbolt.org/z/596qb413n

@saethlin saethlin changed the title Inefficient codegen collecting slice iterator into array, arguably bugged Inefficient codegen collecting slice iterator into array Jun 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-codegen Area: Code generation A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. I-heavy Issue: Problems and improvements with respect to binary size of generated code. I-slow Issue: Problems and improvements with respect to performance of generated code.
Projects
None yet
Development

No branches or pull requests

5 participants