Rewrite reshaping functions to improve performance #285

Merged on Oct 10, 2022 (29 commits)

etiennebacher (Member)

data_to_wide() and data_to_long() are rewritten to use stack() and unstack() instead of reshape() (as suggested in nathaneastwood/poorman#48). This significantly improves the performance of both functions without changing any existing use cases or tests.
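For readers unfamiliar with base R's stack() and unstack(), here is a minimal sketch of the idea on a toy data frame. It is not the code from this PR: the data and the column names (id, t1, t2, time, value) are made up for illustration, and it only assumes a single id column with complete data.

```r
# Toy wide data (hypothetical): one row per subject, one column per occasion.
wide <- data.frame(
  id = 1:3,
  t1 = c(10, 20, 30),
  t2 = c(11, 21, 31)
)

# Wide -> long: stack() concatenates the selected columns into a single
# `values` column and records the source column name in a factor `ind`.
stacked <- stack(wide[c("t1", "t2")])
long <- data.frame(
  id = rep(wide$id, times = 2), # repeat the ids once per stacked column
  time = as.character(stacked$ind),
  value = stacked$values
)

# Long -> wide: unstack() splits `value` by the groups in `time`,
# returning one column per group when all groups have the same length.
wide_again <- cbind(
  id = wide$id,
  unstack(long, value ~ time)
)
```

In this sketch, `long` has one row per id and occasion, and `wide_again` has the same columns and values as `wide`. Both steps work on whole columns at once, which is one plausible reason they are cheaper than reshape().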

Benchmarks

I only show the results here; the code to reproduce the benchmarks is in WIP/_BENCHMARKS_RESHAPE.R:

### DATA_TO_LONG ==========================================

ex1_l
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 old           56.8s    56.8s    0.0176    4.05GB   0.0528
#> 2 new            9.2s     9.2s    0.109      2.1GB   0.109 
#> 3 tidyr         774ms    774ms    1.29      1.14GB   1.29
ex2_l
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 old          3.05ms   3.79ms      247.    87.4KB     0   
#> 2 new           1.8ms   2.07ms      462.    83.7KB     4.66
#> 3 tidyr        2.39ms   3.22ms      278.    98.9KB     0
ex3_l
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 old        246.58ms 276.52ms      3.63   42.36MB   0.151 
#> 2 new        200.69ms 233.01ms      4.28    5.02MB   0.0873
#> 3 tidyr        3.31ms   4.02ms    233.      1.79MB   4.75
ex4_l
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 old         656.3ms  713.5ms      1.32   325.5MB    0.147
#> 2 new         203.1ms  229.4ms      4.22   125.7MB    0.468
#> 3 tidyr        20.3ms   23.2ms     42.2     33.5MB    0
ex5_l
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 old         863.8ms  879.7ms      1.12     353MB    0.280
#> 2 new         326.1ms  338.8ms      2.92   111.1MB    0    
#> 3 tidyr        18.3ms   18.6ms     53.5     32.6MB    0

### DATA_TO_WIDE ==========================================

ex1_w
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 old          3.68ms   4.41ms      212.     468KB        0
#> 2 new          3.94ms   4.42ms      212.     943KB        0
#> 3 tidyr         4.3ms    4.9ms      193.     436KB        0
ex2_w
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 old         522.3ms    1.37s     0.765  445.75MB    0.191
#> 2 new          56.2ms  57.81ms    17.2     11.59MB    0    
#> 3 tidyr        11.1ms  12.63ms    80.9      1.78MB    0
ex3_w
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 old           653ms    2.02s     0.545  446.11MB    0.136
#> 2 new          64.9ms  77.16ms    13.2     11.67MB    0    
#> 3 tidyr        12.3ms   16.7ms    61.2      1.81MB    0
ex4_w
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 old           1.23s    1.25s     0.784    82.6MB    0.196
#> 2 new        367.93ms 505.28ms     2.10     42.1MB    0.234
#> 3 tidyr       36.44ms  42.25ms    23.0      16.6MB    0
ex5_w
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 old           9.22s    9.22s     0.108   557.8MB    0.108
#> 2 new        797.88ms 797.88ms     1.25    287.7MB    0    
#> 3 tidyr      135.69ms 135.69ms     7.37     97.1MB    0
ex6_w
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 old           2.15m    2.15m   0.00776    6.03GB    0.271
#> 2 new             11s      11s   0.0909     3.25GB    0.182
#> 3 tidyr          2.4s     2.4s   0.416    930.38MB    0.416

Co-authored-by: Indrajeet Patil

codecov-commenter commented Oct 7, 2022

Codecov Report

Merging #285 (bf87604) into main (8698506) will increase coverage by 0.14%.
The diff coverage is 90.00%.

@@            Coverage Diff             @@
##             main     #285      +/-   ##
==========================================
+ Coverage   85.60%   85.74%   +0.14%     
==========================================
  Files          53       54       +1     
  Lines        3695     3753      +58     
==========================================
+ Hits         3163     3218      +55     
- Misses        532      535       +3     

Impacted Files      Coverage Δ
R/data_to_long.R    89.13% <89.13%> (ø)
R/data_to_wide.R    90.62% <90.62%> (ø)

IndrajeetPatil (Member) left a comment

Thanks for the amazing work, @etiennebacher!

I didn't even know about stack() and unstack() 👀

Feel free to merge when you think this is ready. Also, all code snippets relevant for development, like benchmarking, can live in the /dev folder (https://github.com/easystats/datawizard/tree/main/dev).

strengejacke (Member) commented Oct 9, 2022

Sorry, I'm working on a paper revision, and this PR included too much code for a timely review ;-)
But the benchmarks are impressive!

nathaneastwood commented

These are really impressive performance improvements! Well done @etiennebacher!
