Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add functions to support interleaved complex num #27

Merged
merged 27 commits into from
Jul 17, 2024

Conversation

smu160
Copy link
Member

@smu160 smu160 commented Apr 30, 2024

@calebzulawski @Shnatsel

A draft PR to address issue #22.

@smu160 smu160 added the enhancement New feature or request label Apr 30, 2024
@smu160 smu160 self-assigned this May 1, 2024
@codecov-commenter
Copy link

codecov-commenter commented May 1, 2024

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.16%. Comparing base (20fafba) to head (713b3a6).

Additional details and impacted files
@@           Coverage Diff           @@
##             main      #27   +/-   ##
=======================================
  Coverage   99.16%   99.16%           
=======================================
  Files           8        8           
  Lines         841      841           
=======================================
  Hits          834      834           
  Misses          7        7           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@smu160 smu160 marked this pull request as ready for review May 1, 2024 18:08
@smu160
Copy link
Member Author

smu160 commented May 1, 2024

Hi @calebzulawski,

We now have the basics of what we'd need to support signals that are stored as an array of complex nums. Looking forward to hearing your thoughts!

Best,
Saveliy

@smu160 smu160 requested a review from Shnatsel May 1, 2024 18:11
Cargo.toml Outdated
@@ -13,6 +13,7 @@ exclude = ["assets", "scripts", "benches"]
[dependencies]
num-traits = "0.2.18"
multiversion = "0.7"
num-complex = "0.4.5"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be an optional feature, maybe even a non-default one but enabled on docs.rs. The associated functions then need to be annotated as available with this feature only; there is a rustdoc feature for that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made the following changes:

  • num-complex is now a optional and non-default feature
  • associated functions and tests are annotated as available with this feature only
  • doc will not creates docs for all features (not sure if that's the desired outcome here?)
  • update the CI pipeline to run tests for all features (since I had to annotate tests that use Complex)

src/lib.rs Outdated Show resolved Hide resolved
@calebzulawski
Copy link
Contributor

Something like this looks good to me! Are you open to possibly using a more trait-oriented design in the future? I wouldn't want to hold up this functionality, but I think various input and output types could be combined into one generic function.

@smu160
Copy link
Member Author

smu160 commented May 3, 2024

Hi @calebzulawski,

Thank you for the review! I think your trait-oriented solution would be great. Just to clarify, do we want to wait on merging this PR? Or do we want this PR as an interim solution?

Best,
Saveliy

@calebzulawski
Copy link
Contributor

I would probably merge it to have something to start from--just something to consider before a 1.0 release

@Shnatsel
Copy link
Collaborator

Shnatsel commented May 3, 2024

I've taken the liberty of rewriting separate_re_im and pushing to your branch.

I couldn't get it to vectorize on godbolt, but at least it should have the right public API, and we can always optimize it later.

@Shnatsel
Copy link
Collaborator

Shnatsel commented May 3, 2024

I tried a bunch more things and couldn't get it to autovectorize: https://godbolt.org/z/7Kd8v4vPG

We might need something like bytemuck here to view a Complex as a &[T; 2] or even better &[Complex<T>] as &[T] with double the length

@calebzulawski
Copy link
Contributor

I think the approach I would take is iterate by chunks, bytemuck to [T; N * 2], Simd::deinterleave, wrapped in #[multiversion]

@Shnatsel
Copy link
Collaborator

Shnatsel commented May 3, 2024

Yeah that sounds good to me

@calebzulawski
Copy link
Contributor

I am working on a simd-complex crate--this seems like an important enough algorithm to include there as an AoSoA <-> AoS conversion

smu160 added 5 commits May 8, 2024 12:57
- Make num-complex optional and non-default, and all functions that take
  and test interleaved FFT are now under the same feature

- Bump num-complex version 0.4.5 --> 0.4.6

- Enable num-complex for docs/docs.rs

- Update github action workflow to run tests for all features

- Fix formatting in docs to fix links
@smu160
Copy link
Member Author

smu160 commented May 15, 2024

@Shnatsel

I have a rough version of bytemuck + Simd deinterleave, on godbolt.

Let me know your thoughts. Thank you!

@Shnatsel
Copy link
Collaborator

I've taken a look at the Godbolt version. Wow, that is some aggressive loop unrolling! It looks good to me though.

One more thing to try: we need to benchmark if iterating over vec![0; n] is faster or slower than Vec::with_capacity(n) plus Vec::extend_from_slice(deinterleaved.to_array()) in the main loop. The latter incurs slightly more instructions but avoids a memset to zero out both vectors.

Use bytemuck + SIMD::deinterleave to rearrange input data from a slice
of Complex values into 2 slices of f32 or f64 values
src/lib.rs Outdated Show resolved Hide resolved
@smu160
Copy link
Member Author

smu160 commented May 22, 2024

@Shnatsel I added the implementation using bytemuck + portable SIMD deinterleave. The solution in godbolt had a bounds check, so that was removed as well. You can see the difference in godbolt.

@smu160
Copy link
Member Author

smu160 commented May 22, 2024

@Shnatsel

One more thing to try: we need to benchmark if iterating over vec![0; n] is faster or slower than Vec::with_capacity(n) plus Vec::extend_from_slice(deinterleaved.to_array()) in the main loop. The latter incurs slightly more instructions but avoids a memset to zero out both vectors.

I added a benchmark, using criterion, to test this idea. The benchmark is probably not leveraging criterion to its full potential, so let me know.

Running benches/bench.rs (target/release/deps/bench-3ef076ccef6e83be)
Gnuplot not found, using plotters backend
separate_re_im/with_zeroed_vecs/1024
                        time:   [405.64 ns 408.02 ns 411.05 ns]
                        thrpt:  [2.4912 Gelem/s 2.5097 Gelem/s 2.5244 Gelem/s]
Found 16 outliers among 100 measurements (16.00%)
  3 (3.00%) high mild
  13 (13.00%) high severe
separate_re_im/with_capacity/1024
                        time:   [422.60 ns 425.43 ns 428.32 ns]
                        thrpt:  [2.3907 Gelem/s 2.4070 Gelem/s 2.4231 Gelem/s]
Found 18 outliers among 100 measurements (18.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  15 (15.00%) high severe
separate_re_im/with_zeroed_vecs/4096
                        time:   [1.9314 µs 1.9407 µs 1.9508 µs]
                        thrpt:  [2.0996 Gelem/s 2.1106 Gelem/s 2.1207 Gelem/s]
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) low mild
  5 (5.00%) high mild
  1 (1.00%) high severe
separate_re_im/with_capacity/4096
                        time:   [1.7073 µs 1.7213 µs 1.7380 µs]
                        thrpt:  [2.3567 Gelem/s 2.3795 Gelem/s 2.3991 Gelem/s]
Found 19 outliers among 100 measurements (19.00%)
  17 (17.00%) low mild
  2 (2.00%) high severe
separate_re_im/with_zeroed_vecs/16384
                        time:   [8.1350 µs 8.4114 µs 8.7453 µs]
                        thrpt:  [1.8735 Gelem/s 1.9478 Gelem/s 2.0140 Gelem/s]
separate_re_im/with_capacity/16384
                        time:   [8.5694 µs 8.6258 µs 8.7400 µs]
                        thrpt:  [1.8746 Gelem/s 1.8994 Gelem/s 1.9119 Gelem/s]
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe
separate_re_im/with_zeroed_vecs/65536
                        time:   [32.902 µs 33.322 µs 33.821 µs]
                        thrpt:  [1.9377 Gelem/s 1.9668 Gelem/s 1.9918 Gelem/s]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
separate_re_im/with_capacity/65536
                        time:   [39.719 µs 39.987 µs 40.303 µs]
                        thrpt:  [1.6261 Gelem/s 1.6389 Gelem/s 1.6500 Gelem/s]
Found 27 outliers among 100 measurements (27.00%)
  3 (3.00%) low severe
  11 (11.00%) low mild
  3 (3.00%) high mild
  10 (10.00%) high severe
separate_re_im/with_zeroed_vecs/262144
                        time:   [243.93 µs 247.17 µs 251.19 µs]
                        thrpt:  [1.0436 Gelem/s 1.0606 Gelem/s 1.0747 Gelem/s]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
separate_re_im/with_capacity/262144
                        time:   [276.91 µs 279.97 µs 283.36 µs]
                        thrpt:  [925.12 Melem/s 936.31 Melem/s 946.67 Melem/s]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
Benchmarking separate_re_im/with_zeroed_vecs/1048576: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.0s, enable flat sampling, or reduce sample count to 60.
separate_re_im/with_zeroed_vecs/1048576
                        time:   [1.2126 ms 1.2465 ms 1.2828 ms]
                        thrpt:  [817.42 Melem/s 841.24 Melem/s 864.72 Melem/s]
Found 16 outliers among 100 measurements (16.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  13 (13.00%) high severe
Benchmarking separate_re_im/with_capacity/1048576: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.4s, enable flat sampling, or reduce sample count to 50.
separate_re_im/with_capacity/1048576
                        time:   [1.4784 ms 1.4848 ms 1.4942 ms]
                        thrpt:  [701.76 Melem/s 706.22 Melem/s 709.26 Melem/s]
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

Edit: Added in results

Copy link
Collaborator

@Shnatsel Shnatsel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, let's get this merged!

@smu160 smu160 merged commit 08eedcb into main Jul 17, 2024
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants