
Implement hstack and vstack. #1198

Open · wants to merge 4 commits into base: main
Conversation

@aweinstock314 (Contributor)

The implementation uses a trait to fold over tuples, summing the dimensions in one direction and checking for equality in the other, and then uses `fixed_{rows,columns}_mut` if the dimensions are static, or `{rows,columns}_mut` if the dimensions are dynamic, together with `copy_from` to construct the output matrix.

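As a hand-unrolled illustration of that strategy for two statically sized inputs (illustrative only; the actual implementation does this generically via the tuple fold):

use nalgebra::SMatrix;

// The stacked dimension sums (2 + 4 = 6), the other dimension must match
// (3 columns, checked by the types here), and each block is copied into
// place with `fixed_rows_mut` + `copy_from`.
fn vstack2(a: &SMatrix<f64, 2, 3>, b: &SMatrix<f64, 4, 3>) -> SMatrix<f64, 6, 3> {
    let mut out = SMatrix::<f64, 6, 3>::zeros();
    out.fixed_rows_mut::<2>(0).copy_from(a);
    out.fixed_rows_mut::<4>(2).copy_from(b);
    out
}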
@aweinstock314 (Contributor Author)

Partially addresses #1146.

@Andlon (Collaborator) commented Jan 31, 2023

Thanks, matrix concatenation/stacking is sorely lacking in nalgebra. There's already an ongoing PR for concatenating matrices (#1080), but unfortunately the author of that PR, @birktj, seems to have had a change in priorities and did not pursue it further. Do you have some thoughts on how the approach taken in that PR, and the comments left there, relate to what you propose here?

@aweinstock314 (Contributor Author)

I don't see a reason not to merge both.

The `cat!` macro is terser (e.g. `cat![a, b; c, d]` is shorter than `vstack((&hstack((&a, &b)), &hstack((&c, &d))))`), and potentially more efficient in that case, since currently the `hstack` calls allocate temporary matrices. That could probably be improved in this approach with something like `vstack_lazy((&hstack_lazy((&a, &b)), &hstack_lazy((&c, &d)))).build()`, but that would be even more verbose.

`hstack`/`vstack` don't require any proc macros, so they're available to users of nalgebra who disable the `macros` feature, and I'd imagine they probably compile faster. They also behave identically to the numpy functions of the same names (up to the distinction of taking references to matrices instead of matrices), which should help with familiarity and with porting numpy code to nalgebra.
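For instance, a hypothetical porting example (function names per this PR, assumed to be in scope; numpy's `np.hstack((a, b))` on a 2×2 and a 2×3 yields a 2×5, and the same shapes line up here):

use nalgebra::SMatrix;

// numpy: np.hstack((a, b)) with a.shape == (2, 2) and b.shape == (2, 3)
// yields shape (2, 5); the tuple-of-references call mirrors it.
fn port_of_np_hstack(a: &SMatrix<f64, 2, 2>, b: &SMatrix<f64, 2, 3>) -> SMatrix<f64, 2, 5> {
    hstack((a, b))
}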

Regarding codegen, `hstack`/`vstack` do use `fixed_{rows,columns}_mut` when applicable, which might be more consistently optimized than `cat!`, which uses `generic_slice_mut` (though it looks like the shape is inserted into the generated code as a let binding immediately before the call, which is probably similarly optimizable).

…onst parameters to the end of declarations to support the rustc version used by the cuda test suite.
@Andlon (Collaborator) commented Feb 2, 2023

This isn't articulated anywhere, but I have been thinking about stacking/concatenation lately, and I was in any case starting to prefer `stack!` over `bcat!` (which was never a good name, to be honest). I was thinking that `vstack!` and `hstack!` would then just defer to `stack!`, roughly as sketched below.
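For example (a hypothetical sketch, assuming `stack!` keeps `matrix!`'s convention of `;` between rows and `,` between columns; this is not code from #1080):

// `hstack!` is a one-row `stack!`; `vstack!` is a one-column `stack!`.
macro_rules! hstack {
    ($($block:expr),* $(,)?) => { stack![$($block),*] };
}

macro_rules! vstack {
    ($($block:expr),* $(,)?) => { stack![$($block);*] };
}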

My personal opinion is that nalgebra should only have one way of doing stacking/concatenation with macros; I think having two distinctly different approaches is problematic for the user experience, especially if they have different semantics or limitations/restrictions.

I think you make a good point about compile-times, so I decided to quickly test it. I compiled the current version of nalgebra (0.32.1) in two different ways (with `cargo clean` in between) and recorded the Cargo timings:

cargo build --timings --no-default-features
cargo build --timings --no-default-features --features macros

Here are the results without the macros feature:
[Cargo timings screenshot]

Here are the results including the macros feature:
[Cargo timings screenshot]

We see that the build times are almost the same, because `syn` and `nalgebra-macros` can be compiled at the same time as `simba`. Results might differ on other setups (for example on a dual-core laptop), but it appears that the `macros` feature does not add substantially to compile times, at least on my computer.

I must also admit that I actually prefer the proc macro approach. The `macro_rules!` machinery is very hard to parse and to work with in comparison, and I worry about its long-term maintainability.

Based on all of this, I would at least on a purely technical level prefer to rename the WIP bcat macro to stack and implement hstack and vstack in terms of stack. At the same time, I'd hate to see the time you've spent on this go to waste, and I really do not want to discourage you from making future contributions. If I had known you were working on this we could perhaps have had parts of this discussion before you put the effort in to implement it... In any case, @sebcrozet has the final word here, and this is just my opinion. There might also be other concerns that I have not considered.

@Andlon (Collaborator) commented Feb 2, 2023

A small note on my compile-time measurements: looking at the results again, it's clear that there's quite a bit of parallel compilation of small crates going on, which is why the compile times are so similar. I have a 12-core Ryzen 3900x, so I suspect the numbers might look very different single-threaded or on, say, the dual-core laptop I mentioned.

(This doesn't change my opinion, but I wanted to be more transparent/clear on the measurements I made)

@aweinstock314 (Contributor Author)

Since you refer to them as "vstack! and hstack!" and say that "nalgebra should only have one way of doing stacking/concatenation with macros", I think it's worth clarifying that `vstack` and `hstack` aren't macros in this approach; they're generic functions.
The only `macro_rules!` macro here is `impl_visit_tuple`, which expands into

impl<A, Func: Visitor<A>> VisitTuple<Func> for (A,) { /* ... */ }
impl<B, A, Func: Visitor<B>> VisitTuple<Func> for (B, A) { /* ... */ }
impl<C, B, A, Func: Visitor<C>> VisitTuple<Func> for (C, B, A) { /* ... */ }

and so on, currently up to 8-tuples (these could be written manually if it would be clearer without `macro_rules!`; I wrote the initial implementations for up to 4-tuples manually before writing the macro, in order to figure out what code the macro should generate).

The `VisitTuple` trait only needs to be defined once, and both `vstack` and `hstack` are implemented by taking trait bounds on it. This is what permits `vstack((&a,))`, `vstack((&a, &b))`, `vstack((&a, &b, &c))` and so on to work as trait functions instead of macros.
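To make the fold concrete, here is a self-contained toy version (the trait names mirror the ones above, but the bodies are illustrative, not the PR's actual code):

trait Visitor<A> {
    type Output;
    fn visit(self, x: A) -> Self::Output;
}

trait VisitTuple<Func> {
    type Output;
    fn visit(visitor: Func, tuple: Self) -> Self::Output;
}

// Base case: a 1-tuple applies the visitor to its only element.
impl<A, Func: Visitor<A>> VisitTuple<Func> for (A,) {
    type Output = Func::Output;
    fn visit(visitor: Func, (a,): Self) -> Self::Output {
        visitor.visit(a)
    }
}

// Step case: visit the head, then fold the visitor's successor state
// over the rest of the tuple.
impl<B, A, Func: Visitor<B>> VisitTuple<Func> for (B, A)
where
    (A,): VisitTuple<Func::Output>,
{
    type Output = <(A,) as VisitTuple<Func::Output>>::Output;
    fn visit(visitor: Func, (b, a): Self) -> Self::Output {
        <(A,) as VisitTuple<_>>::visit(visitor.visit(b), (a,))
    }
}

// Toy visitor that sums integers, to show state threading through the fold.
struct Sum(u32);
impl Visitor<u32> for Sum {
    type Output = Sum;
    fn visit(self, x: u32) -> Sum {
        Sum(self.0 + x)
    }
}

fn main() {
    let Sum(total) = <(u32, u32) as VisitTuple<_>>::visit(Sum(0), (1, 2));
    assert_eq!(total, 3);
}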

If it's desired for one of these systems to be defined in terms of the other, I could implement lazy versions of `vstack`/`hstack` with the `Block` trait from the `bcat!` gist (https://gist.github.com/rust-play/221a1aab484d6432d1203c92fcefcae4) in terms of `VisitTuple`, as something like

impl<T, R, C, X> Block<T> for VCat<X> where
    X: Copy +
        VisitTuple<VStackShapeInit, Output=VStackShape<R, C>> +
        VisitTuple<VCatBlocks<...>, Output=<...>>
{
    type Rows = R;
    type Cols = C;
    fn shape(&self) -> (Self::Rows, Self::Cols) {
        let shape = <X as VisitTuple<_>>::visit(VStackShapeInit, *self);
        (shape.r, shape.c)
    }

    fn populate<S>(&self, m: &mut Matrix<T, Self::Rows, Self::Cols, S>) {
        /* probably implementable directly in terms of `VStack` instead of a new `VCatBlocks` if
        pub struct VStack<T, R, C, S, R2> {
            out: Matrix<T, R, C, S>,
            current_row: R2,
        }
        is replaced with
        pub struct VStack<'a, T, R, C, S, R2> {
            out: &'a mut Matrix<T, R, C, S>,
            current_row: R2,
        }
        */
    }
}

This would make it so that bcat![a, b, c; d, e, f; g, h, i] could expand into

VCat((  
    HCat((&a, &b, &c)),
    HCat((&d, &e, &f)),
    HCat((&g, &h, &i))
)).build()

where

trait Block<T> {
    /* ...as above... */
    fn build(&self) -> MatrixMN<T, Self::Rows, Self::Cols> {
        let mut m = allocate_block_output(self);
        self.populate(&mut m);
        m
    }
}

is an ordinary provided method that can also be used by strict versions of `vstack`/`hstack`. This addresses the issue of the current `vstack`/`hstack` allocating temporary matrices, while still allowing numpy-compatible function signatures that share the machinery and work without `feature="macros"` (defining `vstack`/`hstack` in terms of the current `cat!` would not allow them to work without `feature="macros"`).

…HStack`/`VStack` to facilitate implementing lazy stacking.
@aweinstock314 (Contributor Author)

Update: after having added the necessary lifetime parameters to the `HStack`/`VStack` structs, I'm more confident that the `Block` impl for tuples of arbitrary size would look like:

impl<T, R, C, X> Block<T> for VCat<X> where
    X: Copy +
        VisitTuple<VStackShapeInit, Output=VStackShape<R, C>>
{
    type Rows = R;
    type Cols = C;
    fn shape(&self) -> (Self::Rows, Self::Cols) {
        let shape = <X as VisitTuple<_>>::visit(VStackShapeInit, *self);
        (shape.r, shape.c)
    }

    fn populate<S>(&self, m: &mut Matrix<T, Self::Rows, Self::Cols, S>) where
        X: for<'a> VisitTuple<VStack<'a, T, R, C, S, Const<0>>, Output=VStack<'a, T, R, C, S, R>>
    {
        let vstack_visitor = VStack { out: m, current_row: Const::<0> };
        let _ = <X as VisitTuple<_>>::visit(vstack_visitor, *self);
    }
}

@Andlon (Collaborator) commented Feb 3, 2023

@aweinstock314: thanks for clarifying; indeed I hadn't realized that `vstack` and `hstack` are functions, not macros.

I'm impressed by what you're proposing here. At the same time, I feel that there are many reasons to prefer the more general stack! macro (and implement hstack! and vstack! in terms of this). Here are some off the top of my head:

  • We can maintain consistent syntax with `matrix!`
  • `hstack((&a, &b, &c))` seems far more noisy than `hstack!(a, b, c)` to me (the macro could take its arguments by reference without requiring explicit `&`; see the discussion in Add macro for concatenating matrices #1080)
  • `hstack!`, `vstack!` and `stack!` would all share the same underlying restrictions/semantics, since they're all just special cases of `stack!`
  • My impression is that the macro-based solution is both more maintainable and extensible.

Although merging both could be an option, I'm still personally reluctant to have two incompatible ways of stacking matrices in nalgebra 🤔 I think this could confuse users.

Overall I'm not personally swayed by the compile-time argument, but I'd be happy for others to chime in here with some perspectives.

…rmediate allocations when building a matrix with a mix of horizontal and vertical stacking.
@aweinstock314 (Contributor Author)

I've implemented the lazy versions of `hstack` and `vstack`, and verified that they optimize properly.

#[inline(never)]
#[no_mangle]
fn codegen_hvstack(
    a: &Matrix4<u32>,
    b: &Matrix4<u32>,
    c: &Matrix4<u32>,
    d: &Matrix4<u32>,
) -> SMatrix<u32, 8, 8> {
    vstack((HStackLazy((a, b)), HStackLazy((c, d))))
}

compiles to

0000000000008dc0 <codegen_hvstack>:
    8dc0:   48 89 f8                mov    %rdi,%rax
    8dc3:   0f 10 06                movups (%rsi),%xmm0
    8dc6:   0f 11 07                movups %xmm0,(%rdi)
    8dc9:   0f 10 46 10             movups 0x10(%rsi),%xmm0
    8dcd:   0f 11 47 20             movups %xmm0,0x20(%rdi)
    8dd1:   0f 10 46 20             movups 0x20(%rsi),%xmm0
    8dd5:   0f 11 47 40             movups %xmm0,0x40(%rdi)
    8dd9:   0f 10 46 30             movups 0x30(%rsi),%xmm0
    8ddd:   0f 11 47 60             movups %xmm0,0x60(%rdi)
    8de1:   0f 10 02                movups (%rdx),%xmm0
    8de4:   0f 11 87 80 00 00 00    movups %xmm0,0x80(%rdi)
    8deb:   0f 10 42 10             movups 0x10(%rdx),%xmm0
    8def:   0f 11 87 a0 00 00 00    movups %xmm0,0xa0(%rdi)
    8df6:   0f 10 42 20             movups 0x20(%rdx),%xmm0
    8dfa:   0f 11 87 c0 00 00 00    movups %xmm0,0xc0(%rdi)
    8e01:   0f 10 42 30             movups 0x30(%rdx),%xmm0
    8e05:   0f 11 87 e0 00 00 00    movups %xmm0,0xe0(%rdi)
    8e0c:   0f 10 01                movups (%rcx),%xmm0
    8e0f:   0f 11 47 10             movups %xmm0,0x10(%rdi)
    8e13:   0f 10 41 10             movups 0x10(%rcx),%xmm0
    8e17:   0f 11 47 30             movups %xmm0,0x30(%rdi)
    8e1b:   0f 10 41 20             movups 0x20(%rcx),%xmm0
    8e1f:   0f 11 47 50             movups %xmm0,0x50(%rdi)
    8e23:   0f 10 41 30             movups 0x30(%rcx),%xmm0
    8e27:   0f 11 47 70             movups %xmm0,0x70(%rdi)
    8e2b:   41 0f 10 00             movups (%r8),%xmm0
    8e2f:   0f 11 87 90 00 00 00    movups %xmm0,0x90(%rdi)
    8e36:   41 0f 10 40 10          movups 0x10(%r8),%xmm0
    8e3b:   0f 11 87 b0 00 00 00    movups %xmm0,0xb0(%rdi)
    8e42:   41 0f 10 40 20          movups 0x20(%r8),%xmm0
    8e47:   0f 11 87 d0 00 00 00    movups %xmm0,0xd0(%rdi)
    8e4e:   41 0f 10 40 30          movups 0x30(%r8),%xmm0
    8e53:   0f 11 87 f0 00 00 00    movups %xmm0,0xf0(%rdi)
    8e5a:   c3                      ret
    8e5b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

@aweinstock314 (Contributor Author)

One other possible advantage of the trait-based approach is that, since it has access to the types (which, if I'm understanding correctly, proc macros/syn don't), it can provide better type inference.

The current implementation only propagates sizes outwards (i.e. it determines the size of the output matrix when the sizes of the input matrices are known), but it should be possible to support inferring at most one unknown-width matrix per `hstack` (and correspondingly, one unknown-height matrix per `vstack`) when the output width is known, by requiring, for each tuple entry, that the sum of the preceding entries' widths plus the current entry's width equals the output's width minus the sum of the succeeding entries' widths. For example, stacking a 2×2 block, an unknown-width block, and a 2×1 block into an output annotated as 2×6 would force the unknown block's width to be 3.

Could you elaborate on what makes you suspect that the trait approach is less maintainable/extensible than the proc macro approach? I feel that, while it is more verbose when expressed as trait bounds, the underlying logic (i.e. walking through the matrices to be concatenated, keeping track of the current position in the output matrix, and copying into slices of the output matrix) is essentially the same.

@Andlon (Collaborator) commented Apr 14, 2023

I just completed a new round of reviews for #1080. My personal position is still that we should only have one way of stacking/concatenating matrices in nalgebra, and I think the `stack!`/`cat!` macro is more powerful/generic and ergonomic, and works in a manner consistent with our current `matrix!` macro. Again, I want to stress that I am impressed by what you have achieved here, and I hate to see your efforts go unmerged, but I am not swayed by the arguments for additionally merging the function-based approach.

> Could you elaborate on what makes you suspect that the trait approach is less maintainable/extensible than the proc macro approach? I feel that, while it is more verbose when expressed as trait bounds, the underlying logic (i.e. walking through the matrices to be concatenated, keeping track of the current position in the output matrix, and copying into slices of the output matrix) is essentially the same.

I think there are several facets here. Generally, working with non-trivial `macro_rules!` macros is much more difficult than with proc macros. The API surface of `hstack` and `vstack` is also more complicated from a user perspective, in part because the trait bounds are quite difficult to relate to, whereas `stack!` basically just writes code more or less like you would write it by hand (start with a zero matrix and copy over block by block), although of course writing proc macros is also no walk in the park. However, the majority of the logic of collecting the syntax of [a, b; c, d] into a matrix of expressions is already shared with the existing `matrix!` macro. My experience with `macro_rules!` + trait combinations is that you easily run into restrictions of `macro_rules!` and the type system itself, whereas with proc macros there are basically no limitations: you get tokens in and you produce whatever tokens you want out.

I think it would be useful to have @sebcrozet's opinion here, if he has the time to chime in.
