[Feature] Multithread memmap #592

Merged
merged 20 commits into from
Dec 12, 2023

Conversation

vmoens (Contributor) commented Dec 8, 2023

TODO:

  • Doc
  • Tests
  • Hide the executor logic under a private _memmap_
  • Make a memmap version that does not change the tensordict in place

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 8, 2023

github-actions bot commented Dec 8, 2023

$\color{#D29922}\textsf{\Large⚠\kern{0.2cm}\normalsize Warning}$ Result of CPU Benchmark Tests

Total Benchmarks: 113. Improved: $\large\color{#35bf28}4$. Worsened: $\large\color{#d91a1a}16$.

Name Max Mean Ops Ops on Repo HEAD Change
test_plain_set_nested 32.6910μs 16.4405μs 60.8253 KOps/s 62.0754 KOps/s $\color{#d91a1a}-2.01\%$
test_plain_set_stack_nested 0.2049ms 0.1433ms 6.9772 KOps/s 7.0160 KOps/s $\color{#d91a1a}-0.55\%$
test_plain_set_nested_inplace 68.1970μs 18.9244μs 52.8419 KOps/s 54.8609 KOps/s $\color{#d91a1a}-3.68\%$
test_plain_set_stack_nested_inplace 0.2440ms 0.1788ms 5.5936 KOps/s 5.6648 KOps/s $\color{#d91a1a}-1.26\%$
test_items 25.3770μs 2.5491μs 392.2952 KOps/s 403.8522 KOps/s $\color{#d91a1a}-2.86\%$
test_items_nested 0.3398ms 0.2698ms 3.7062 KOps/s 3.6800 KOps/s $\color{#35bf28}+0.71\%$
test_items_nested_locked 0.9370ms 0.2706ms 3.6951 KOps/s 3.6821 KOps/s $\color{#35bf28}+0.35\%$
test_items_nested_leaf 0.2417ms 0.1668ms 5.9948 KOps/s 5.9776 KOps/s $\color{#35bf28}+0.29\%$
test_items_stack_nested 2.3760ms 1.4973ms 667.8644 Ops/s 673.2859 Ops/s $\color{#d91a1a}-0.81\%$
test_items_stack_nested_leaf 2.1307ms 1.3408ms 745.8178 Ops/s 742.7414 Ops/s $\color{#35bf28}+0.41\%$
test_items_stack_nested_locked 2.9134ms 0.7817ms 1.2792 KOps/s 1.2996 KOps/s $\color{#d91a1a}-1.56\%$
test_keys 33.8530μs 3.8593μs 259.1157 KOps/s 256.0052 KOps/s $\color{#35bf28}+1.22\%$
test_keys_nested 0.5508ms 0.1413ms 7.0757 KOps/s 6.6715 KOps/s $\textbf{\color{#35bf28}+6.06\%}$
test_keys_nested_locked 0.3052ms 0.1399ms 7.1480 KOps/s 7.1026 KOps/s $\color{#35bf28}+0.64\%$
test_keys_nested_leaf 0.2866ms 0.1401ms 7.1390 KOps/s 7.0905 KOps/s $\color{#35bf28}+0.68\%$
test_keys_stack_nested 1.5466ms 1.4040ms 712.2384 Ops/s 710.7292 Ops/s $\color{#35bf28}+0.21\%$
test_keys_stack_nested_leaf 2.2087ms 1.4017ms 713.4438 Ops/s 711.5908 Ops/s $\color{#35bf28}+0.26\%$
test_keys_stack_nested_locked 0.7901ms 0.6819ms 1.4665 KOps/s 1.4492 KOps/s $\color{#35bf28}+1.20\%$
test_values 8.5310μs 1.1789μs 848.2589 KOps/s 864.3089 KOps/s $\color{#d91a1a}-1.86\%$
test_values_nested 91.9210μs 49.5734μs 20.1721 KOps/s 20.0002 KOps/s $\color{#35bf28}+0.86\%$
test_values_nested_locked 98.4130μs 49.6480μs 20.1418 KOps/s 20.1219 KOps/s $\color{#35bf28}+0.10\%$
test_values_nested_leaf 0.1033ms 44.5761μs 22.4335 KOps/s 22.5617 KOps/s $\color{#d91a1a}-0.57\%$
test_values_stack_nested 1.8326ms 1.1991ms 833.9643 Ops/s 839.2398 Ops/s $\color{#d91a1a}-0.63\%$
test_values_stack_nested_leaf 1.3569ms 1.1875ms 842.1101 Ops/s 847.3135 Ops/s $\color{#d91a1a}-0.61\%$
test_values_stack_nested_locked 1.0054ms 0.5093ms 1.9633 KOps/s 1.9099 KOps/s $\color{#35bf28}+2.80\%$
test_membership 15.8600μs 1.3630μs 733.6513 KOps/s 746.2782 KOps/s $\color{#d91a1a}-1.69\%$
test_membership_nested 21.8210μs 2.7867μs 358.8453 KOps/s 359.7178 KOps/s $\color{#d91a1a}-0.24\%$
test_membership_nested_leaf 19.6860μs 2.7989μs 357.2769 KOps/s 358.2138 KOps/s $\color{#d91a1a}-0.26\%$
test_membership_stacked_nested 41.6270μs 11.8173μs 84.6217 KOps/s 80.1730 KOps/s $\textbf{\color{#35bf28}+5.55\%}$
test_membership_stacked_nested_leaf 34.5240μs 11.8211μs 84.5947 KOps/s 84.6542 KOps/s $\color{#d91a1a}-0.07\%$
test_membership_nested_last 25.8580μs 5.8348μs 171.3842 KOps/s 172.3043 KOps/s $\color{#d91a1a}-0.53\%$
test_membership_nested_leaf_last 38.3110μs 5.9580μs 167.8419 KOps/s 171.2949 KOps/s $\color{#d91a1a}-2.02\%$
test_membership_stacked_nested_last 0.2283ms 0.1660ms 6.0243 KOps/s 6.0194 KOps/s $\color{#35bf28}+0.08\%$
test_membership_stacked_nested_leaf_last 43.6020μs 13.8338μs 72.2868 KOps/s 72.3531 KOps/s $\color{#d91a1a}-0.09\%$
test_nested_getleaf 37.0990μs 10.5126μs 95.1243 KOps/s 93.2589 KOps/s $\color{#35bf28}+2.00\%$
test_nested_get 30.0760μs 10.0211μs 99.7895 KOps/s 98.7054 KOps/s $\color{#35bf28}+1.10\%$
test_stacked_getleaf 0.7322ms 0.6335ms 1.5784 KOps/s 1.5520 KOps/s $\color{#35bf28}+1.70\%$
test_stacked_get 1.1473ms 0.6059ms 1.6505 KOps/s 1.6523 KOps/s $\color{#d91a1a}-0.11\%$
test_nested_getitemleaf 63.3880μs 10.5863μs 94.4615 KOps/s 93.9827 KOps/s $\color{#35bf28}+0.51\%$
test_nested_getitem 31.0480μs 10.1386μs 98.6327 KOps/s 99.6718 KOps/s $\color{#d91a1a}-1.04\%$
test_stacked_getitemleaf 0.7318ms 0.6391ms 1.5646 KOps/s 1.5556 KOps/s $\color{#35bf28}+0.58\%$
test_stacked_getitem 0.7759ms 0.6077ms 1.6457 KOps/s 1.6351 KOps/s $\color{#35bf28}+0.65\%$
test_lock_nested 56.2178ms 0.4762ms 2.1000 KOps/s 2.4052 KOps/s $\textbf{\color{#d91a1a}-12.69\%}$
test_lock_stack_nested 72.2923ms 6.3834ms 156.6569 Ops/s 152.4783 Ops/s $\color{#35bf28}+2.74\%$
test_unlock_nested 1.0170ms 0.4255ms 2.3503 KOps/s 2.0695 KOps/s $\textbf{\color{#35bf28}+13.57\%}$
test_unlock_stack_nested 70.2606ms 6.1245ms 163.2792 Ops/s 163.4588 Ops/s $\color{#d91a1a}-0.11\%$
test_flatten_speed 0.4588ms 0.2657ms 3.7630 KOps/s 3.7561 KOps/s $\color{#35bf28}+0.18\%$
test_unflatten_speed 0.7782ms 0.4459ms 2.2427 KOps/s 2.2160 KOps/s $\color{#35bf28}+1.20\%$
test_common_ops 5.5184ms 0.7094ms 1.4097 KOps/s 1.5658 KOps/s $\textbf{\color{#d91a1a}-9.97\%}$
test_creation 15.1580μs 2.0118μs 497.0572 KOps/s 500.8651 KOps/s $\color{#d91a1a}-0.76\%$
test_creation_empty 27.9920μs 9.3741μs 106.6769 KOps/s 122.5054 KOps/s $\textbf{\color{#d91a1a}-12.92\%}$
test_creation_nested_1 28.9840μs 12.1371μs 82.3919 KOps/s 90.7723 KOps/s $\textbf{\color{#d91a1a}-9.23\%}$
test_creation_nested_2 41.4070μs 17.4693μs 57.2432 KOps/s 68.6027 KOps/s $\textbf{\color{#d91a1a}-16.56\%}$
test_clone 0.2575ms 12.4902μs 80.0626 KOps/s 80.7736 KOps/s $\color{#d91a1a}-0.88\%$
test_getitem[int] 62.5770μs 12.0051μs 83.2982 KOps/s 82.8571 KOps/s $\color{#35bf28}+0.53\%$
test_getitem[slice_int] 74.9890μs 23.7434μs 42.1170 KOps/s 42.4021 KOps/s $\color{#d91a1a}-0.67\%$
test_getitem[range] 0.1223ms 42.8248μs 23.3510 KOps/s 23.4048 KOps/s $\color{#d91a1a}-0.23\%$
test_getitem[tuple] 57.5770μs 19.4986μs 51.2857 KOps/s 51.5905 KOps/s $\color{#d91a1a}-0.59\%$
test_getitem[list] 0.1004ms 37.4006μs 26.7375 KOps/s 27.1882 KOps/s $\color{#d91a1a}-1.66\%$
test_setitem_dim[int] 51.9570μs 30.1005μs 33.2220 KOps/s 34.4650 KOps/s $\color{#d91a1a}-3.61\%$
test_setitem_dim[slice_int] 96.7610μs 53.8245μs 18.5789 KOps/s 18.9882 KOps/s $\color{#d91a1a}-2.16\%$
test_setitem_dim[range] 0.1399ms 73.3625μs 13.6309 KOps/s 14.0054 KOps/s $\color{#d91a1a}-2.67\%$
test_setitem_dim[tuple] 72.2450μs 43.4562μs 23.0117 KOps/s 24.1594 KOps/s $\color{#d91a1a}-4.75\%$
test_setitem 0.2168ms 18.2514μs 54.7903 KOps/s 58.4206 KOps/s $\textbf{\color{#d91a1a}-6.21\%}$
test_set 0.2227ms 17.8169μs 56.1266 KOps/s 60.5281 KOps/s $\textbf{\color{#d91a1a}-7.27\%}$
test_set_shared 4.9355ms 0.1397ms 7.1591 KOps/s 7.1091 KOps/s $\color{#35bf28}+0.70\%$
test_update 0.1533ms 20.2069μs 49.4880 KOps/s 54.5509 KOps/s $\textbf{\color{#d91a1a}-9.28\%}$
test_update_nested 0.1465ms 27.8067μs 35.9626 KOps/s 39.4380 KOps/s $\textbf{\color{#d91a1a}-8.81\%}$
test_set_nested 0.1462ms 19.6319μs 50.9375 KOps/s 55.2200 KOps/s $\textbf{\color{#d91a1a}-7.76\%}$
test_set_nested_new 0.1690ms 23.8536μs 41.9224 KOps/s 45.0039 KOps/s $\textbf{\color{#d91a1a}-6.85\%}$
test_select 97.5220μs 48.4772μs 20.6282 KOps/s 22.0468 KOps/s $\textbf{\color{#d91a1a}-6.43\%}$
test_unbind_speed 0.4019ms 0.3446ms 2.9017 KOps/s 2.9253 KOps/s $\color{#d91a1a}-0.81\%$
test_unbind_speed_stack0 62.1491ms 4.1986ms 238.1741 Ops/s 226.7453 Ops/s $\textbf{\color{#35bf28}+5.04\%}$
test_unbind_speed_stack1 2.0839μs 0.6296μs 1.5884 MOps/s 1.5556 MOps/s $\color{#35bf28}+2.11\%$
test_split 59.2533ms 1.6703ms 598.7016 Ops/s 599.3525 Ops/s $\color{#d91a1a}-0.11\%$
test_chunk 56.9097ms 1.6505ms 605.8761 Ops/s 604.7532 Ops/s $\color{#35bf28}+0.19\%$
test_creation[device0] 0.7547ms 0.2944ms 3.3969 KOps/s 3.3701 KOps/s $\color{#35bf28}+0.79\%$
test_creation_from_tensor 58.7447ms 0.3653ms 2.7376 KOps/s 3.0586 KOps/s $\textbf{\color{#d91a1a}-10.50\%}$
test_add_one[memmap_tensor0] 0.2885ms 25.4628μs 39.2730 KOps/s 38.8741 KOps/s $\color{#35bf28}+1.03\%$
test_contiguous[memmap_tensor0] 30.4360μs 5.7998μs 172.4190 KOps/s 172.6585 KOps/s $\color{#d91a1a}-0.14\%$
test_stack[memmap_tensor0] 0.1179ms 19.5108μs 51.2537 KOps/s 52.4564 KOps/s $\color{#d91a1a}-2.29\%$
test_memmaptd_index 0.4094ms 0.2011ms 4.9715 KOps/s 4.9746 KOps/s $\color{#d91a1a}-0.06\%$
test_memmaptd_index_astensor 0.3559ms 0.2586ms 3.8665 KOps/s 3.8916 KOps/s $\color{#d91a1a}-0.65\%$
test_memmaptd_index_op 1.0442ms 0.5186ms 1.9282 KOps/s 1.9958 KOps/s $\color{#d91a1a}-3.39\%$
test_reshape_pytree 76.5430μs 22.8308μs 43.8006 KOps/s 42.3029 KOps/s $\color{#35bf28}+3.54\%$
test_reshape_td 69.8100μs 31.0751μs 32.1801 KOps/s 33.0390 KOps/s $\color{#d91a1a}-2.60\%$
test_view_pytree 57.1270μs 22.8623μs 43.7401 KOps/s 42.1337 KOps/s $\color{#35bf28}+3.81\%$
test_view_td 22.7730μs 4.9168μs 203.3824 KOps/s 203.5568 KOps/s $\color{#d91a1a}-0.09\%$
test_unbind_pytree 57.1660μs 26.6390μs 37.5389 KOps/s 37.3591 KOps/s $\color{#35bf28}+0.48\%$
test_unbind_td 99.3150μs 55.0312μs 18.1715 KOps/s 18.2722 KOps/s $\color{#d91a1a}-0.55\%$
test_split_pytree 90.1960μs 25.9858μs 38.4825 KOps/s 36.9721 KOps/s $\color{#35bf28}+4.09\%$
test_split_td 95.0460μs 43.7170μs 22.8744 KOps/s 23.1311 KOps/s $\color{#d91a1a}-1.11\%$
test_add_pytree 71.6130μs 31.6149μs 31.6306 KOps/s 30.6951 KOps/s $\color{#35bf28}+3.05\%$
test_add_td 0.1544ms 46.5804μs 21.4683 KOps/s 22.0936 KOps/s $\color{#d91a1a}-2.83\%$
test_distributed 22.2010μs 6.1994μs 161.3062 KOps/s 167.1240 KOps/s $\color{#d91a1a}-3.48\%$
test_tdmodule 0.9687ms 23.6401μs 42.3010 KOps/s 47.2084 KOps/s $\textbf{\color{#d91a1a}-10.40\%}$
test_tdmodule_dispatch 0.1915ms 41.2759μs 24.2272 KOps/s 25.5508 KOps/s $\textbf{\color{#d91a1a}-5.18\%}$
test_tdseq 44.6930μs 25.8464μs 38.6901 KOps/s 39.3421 KOps/s $\color{#d91a1a}-1.66\%$
test_tdseq_dispatch 0.3840ms 45.3841μs 22.0341 KOps/s 23.1165 KOps/s $\color{#d91a1a}-4.68\%$
test_instantiation_functorch 1.9069ms 1.3335ms 749.9312 Ops/s 767.5280 Ops/s $\color{#d91a1a}-2.29\%$
test_instantiation_td 1.5532ms 1.0172ms 983.0669 Ops/s 991.5911 Ops/s $\color{#d91a1a}-0.86\%$
test_exec_functorch 0.2177ms 0.1607ms 6.2221 KOps/s 6.2603 KOps/s $\color{#d91a1a}-0.61\%$
test_exec_functional_call 0.3592ms 0.1477ms 6.7683 KOps/s 6.7604 KOps/s $\color{#35bf28}+0.12\%$
test_exec_td 0.2564ms 0.1494ms 6.6916 KOps/s 7.1193 KOps/s $\textbf{\color{#d91a1a}-6.01\%}$
test_exec_td_decorator 0.8052ms 0.1769ms 5.6537 KOps/s 5.8255 KOps/s $\color{#d91a1a}-2.95\%$
test_vmap_mlp_speed[True-True] 1.0294ms 0.8972ms 1.1146 KOps/s 1.1201 KOps/s $\color{#d91a1a}-0.50\%$
test_vmap_mlp_speed[True-False] 0.5834ms 0.4687ms 2.1333 KOps/s 2.1316 KOps/s $\color{#35bf28}+0.08\%$
test_vmap_mlp_speed[False-True] 1.1721ms 0.7810ms 1.2804 KOps/s 1.2804 KOps/s $-0.00\%$
test_vmap_mlp_speed[False-False] 0.6177ms 0.3850ms 2.5976 KOps/s 2.5528 KOps/s $\color{#35bf28}+1.76\%$
test_vmap_mlp_speed_decorator[True-True] 2.6808ms 1.7925ms 557.8855 Ops/s 567.7900 Ops/s $\color{#d91a1a}-1.74\%$
test_vmap_mlp_speed_decorator[True-False] 1.1523ms 0.5216ms 1.9171 KOps/s 1.9531 KOps/s $\color{#d91a1a}-1.84\%$
test_vmap_mlp_speed_decorator[False-True] 1.8863ms 1.4939ms 669.3800 Ops/s 663.4180 Ops/s $\color{#35bf28}+0.90\%$
test_vmap_mlp_speed_decorator[False-False] 0.8070ms 0.3932ms 2.5431 KOps/s 2.5259 KOps/s $\color{#35bf28}+0.68\%$
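
The Change column appears to be the relative difference between Ops and Ops on Repo HEAD, and the Improved/Worsened totals appear to count only changes beyond roughly ±5% (the bolded rows). Both the formula and the 5% threshold are inferred from the numbers in this table, not taken from the benchmark bot's source, and the helper names are hypothetical. A minimal sketch under those assumptions:

```python
def relative_change(ops: float, ops_head: float) -> float:
    """Percent change of the PR's throughput relative to repo HEAD."""
    return (ops - ops_head) / ops_head * 100.0


def classify(change_pct: float, threshold: float = 5.0) -> str:
    # Assumed rule: only changes beyond +/- threshold count as
    # improved or worsened; everything else is treated as noise.
    if change_pct >= threshold:
        return "improved"
    if change_pct <= -threshold:
        return "worsened"
    return "within noise"


# Rows copied from the CPU table above: (Ops, Ops on Repo HEAD), in KOps/s.
rows = {
    "test_plain_set_nested": (60.8253, 62.0754),
    "test_keys_nested": (7.0757, 6.6715),
    "test_creation_nested_2": (57.2432, 68.6027),
}

for name, (ops, ops_head) in rows.items():
    pct = relative_change(ops, ops_head)
    print(f"{name}: {pct:+.2f}% ({classify(pct)})")
```

Read this way, most of the small red and green entries fall inside the noise band, and only the bolded rows are counted in the summary line.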

github-actions bot commented Dec 8, 2023

$\color{#D29922}\textsf{\Large⚠\kern{0.2cm}\normalsize Warning}$ Result of GPU Benchmark Tests

Total Benchmarks: 127. Improved: $\large\color{#35bf28}2$. Worsened: $\large\color{#d91a1a}3$.

Name Max Mean Ops Ops on Repo HEAD Change
test_plain_set_nested 77.6440μs 12.4817μs 80.1174 KOps/s 78.7275 KOps/s $\color{#35bf28}+1.77\%$
test_plain_set_stack_nested 0.1467ms 0.1139ms 8.7804 KOps/s 8.3753 KOps/s $\color{#35bf28}+4.84\%$
test_plain_set_nested_inplace 40.4620μs 13.7963μs 72.4831 KOps/s 71.0111 KOps/s $\color{#35bf28}+2.07\%$
test_plain_set_stack_nested_inplace 0.1764ms 0.1418ms 7.0541 KOps/s 7.0112 KOps/s $\color{#35bf28}+0.61\%$
test_items 28.2220μs 4.6421μs 215.4197 KOps/s 214.1616 KOps/s $\color{#35bf28}+0.59\%$
test_items_nested 0.3984ms 0.3425ms 2.9197 KOps/s 2.8959 KOps/s $\color{#35bf28}+0.82\%$
test_items_nested_locked 0.3877ms 0.3447ms 2.9010 KOps/s 2.8641 KOps/s $\color{#35bf28}+1.29\%$
test_items_nested_leaf 0.2217ms 0.2004ms 4.9902 KOps/s 4.9036 KOps/s $\color{#35bf28}+1.77\%$
test_items_stack_nested 1.5319ms 1.4707ms 679.9390 Ops/s 688.0857 Ops/s $\color{#d91a1a}-1.18\%$
test_items_stack_nested_leaf 1.3768ms 1.2824ms 779.8028 Ops/s 777.3654 Ops/s $\color{#35bf28}+0.31\%$
test_items_stack_nested_locked 2.3611ms 0.8154ms 1.2265 KOps/s 1.2194 KOps/s $\color{#35bf28}+0.58\%$
test_keys 15.1610μs 4.5538μs 219.5957 KOps/s 219.0237 KOps/s $\color{#35bf28}+0.26\%$
test_keys_nested 0.4602ms 90.4366μs 11.0575 KOps/s 10.9615 KOps/s $\color{#35bf28}+0.88\%$
test_keys_nested_locked 0.1339ms 89.4325μs 11.1816 KOps/s 11.0821 KOps/s $\color{#35bf28}+0.90\%$
test_keys_nested_leaf 42.3907ms 86.2191μs 11.5984 KOps/s 12.1600 KOps/s $\color{#d91a1a}-4.62\%$
test_keys_stack_nested 1.3235ms 1.2530ms 798.0933 Ops/s 789.6106 Ops/s $\color{#35bf28}+1.07\%$
test_keys_stack_nested_leaf 1.3268ms 1.2536ms 797.7321 Ops/s 803.0791 Ops/s $\color{#d91a1a}-0.67\%$
test_keys_stack_nested_locked 0.6823ms 0.6102ms 1.6388 KOps/s 1.6334 KOps/s $\color{#35bf28}+0.33\%$
test_values 8.8307μs 1.8896μs 529.2050 KOps/s 527.2241 KOps/s $\color{#35bf28}+0.38\%$
test_values_nested 63.2940μs 42.8338μs 23.3461 KOps/s 23.3856 KOps/s $\color{#d91a1a}-0.17\%$
test_values_nested_locked 65.0330μs 45.0455μs 22.1998 KOps/s 22.1791 KOps/s $\color{#35bf28}+0.09\%$
test_values_nested_leaf 56.7930μs 37.2403μs 26.8526 KOps/s 26.8809 KOps/s $\color{#d91a1a}-0.11\%$
test_values_stack_nested 1.1663ms 1.1004ms 908.7643 Ops/s 908.0421 Ops/s $\color{#35bf28}+0.08\%$
test_values_stack_nested_leaf 1.1699ms 1.0854ms 921.3085 Ops/s 918.0863 Ops/s $\color{#35bf28}+0.35\%$
test_values_stack_nested_locked 0.6822ms 0.4836ms 2.0680 KOps/s 2.0210 KOps/s $\color{#35bf28}+2.32\%$
test_membership 4.8162μs 0.9255μs 1.0805 MOps/s 1.0721 MOps/s $\color{#35bf28}+0.79\%$
test_membership_nested 14.1610μs 2.0514μs 487.4800 KOps/s 457.9140 KOps/s $\textbf{\color{#35bf28}+6.46\%}$
test_membership_nested_leaf 16.4210μs 2.0580μs 485.9077 KOps/s 473.5376 KOps/s $\color{#35bf28}+2.61\%$
test_membership_stacked_nested 31.6010μs 10.7905μs 92.6739 KOps/s 94.0408 KOps/s $\color{#d91a1a}-1.45\%$
test_membership_stacked_nested_leaf 44.4120μs 10.8011μs 92.5830 KOps/s 94.2439 KOps/s $\color{#d91a1a}-1.76\%$
test_membership_nested_last 20.3210μs 4.5185μs 221.3137 KOps/s 221.8566 KOps/s $\color{#d91a1a}-0.24\%$
test_membership_nested_leaf_last 39.8320μs 4.5326μs 220.6243 KOps/s 222.3804 KOps/s $\color{#d91a1a}-0.79\%$
test_membership_stacked_nested_last 0.2108ms 0.1337ms 7.4809 KOps/s 7.4223 KOps/s $\color{#35bf28}+0.79\%$
test_membership_stacked_nested_leaf_last 36.3120μs 12.6675μs 78.9422 KOps/s 79.7596 KOps/s $\color{#d91a1a}-1.02\%$
test_nested_getleaf 18.6410μs 8.3566μs 119.6657 KOps/s 118.3269 KOps/s $\color{#35bf28}+1.13\%$
test_nested_get 31.1320μs 7.8982μs 126.6116 KOps/s 125.4415 KOps/s $\color{#35bf28}+0.93\%$
test_stacked_getleaf 0.6295ms 0.5612ms 1.7820 KOps/s 1.8019 KOps/s $\color{#d91a1a}-1.10\%$
test_stacked_get 0.5566ms 0.5243ms 1.9074 KOps/s 1.9326 KOps/s $\color{#d91a1a}-1.30\%$
test_nested_getitemleaf 70.5520μs 8.4318μs 118.5990 KOps/s 118.3853 KOps/s $\color{#35bf28}+0.18\%$
test_nested_getitem 30.7500μs 7.9550μs 125.7077 KOps/s 125.0268 KOps/s $\color{#35bf28}+0.54\%$
test_stacked_getitemleaf 0.5943ms 0.5596ms 1.7870 KOps/s 1.7972 KOps/s $\color{#d91a1a}-0.57\%$
test_stacked_getitem 0.6096ms 0.5424ms 1.8435 KOps/s 1.9145 KOps/s $\color{#d91a1a}-3.71\%$
test_lock_nested 1.5012ms 0.4069ms 2.4575 KOps/s 2.4302 KOps/s $\color{#35bf28}+1.13\%$
test_lock_stack_nested 62.7289ms 5.8282ms 171.5792 Ops/s 167.3592 Ops/s $\color{#35bf28}+2.52\%$
test_unlock_nested 0.9317ms 0.4036ms 2.4777 KOps/s 2.4492 KOps/s $\color{#35bf28}+1.17\%$
test_unlock_stack_nested 61.8630ms 5.9549ms 167.9285 Ops/s 167.2287 Ops/s $\color{#35bf28}+0.42\%$
test_flatten_speed 0.4500ms 0.1876ms 5.3316 KOps/s 5.3043 KOps/s $\color{#35bf28}+0.52\%$
test_unflatten_speed 0.4046ms 0.3539ms 2.8258 KOps/s 2.8457 KOps/s $\color{#d91a1a}-0.70\%$
test_common_ops 1.0575ms 0.5656ms 1.7681 KOps/s 1.8104 KOps/s $\color{#d91a1a}-2.34\%$
test_creation 13.9510μs 1.5913μs 628.4023 KOps/s 632.7778 KOps/s $\color{#d91a1a}-0.69\%$
test_creation_empty 36.1020μs 6.2766μs 159.3208 KOps/s 152.9666 KOps/s $\color{#35bf28}+4.15\%$
test_creation_nested_1 25.1810μs 8.1939μs 122.0417 KOps/s 118.9544 KOps/s $\color{#35bf28}+2.60\%$
test_creation_nested_2 38.5720μs 12.5508μs 79.6763 KOps/s 90.9810 KOps/s $\textbf{\color{#d91a1a}-12.43\%}$
test_clone 76.3930μs 12.5559μs 79.6440 KOps/s 79.1519 KOps/s $\color{#35bf28}+0.62\%$
test_getitem[int] 34.8220μs 10.8676μs 92.0169 KOps/s 90.6512 KOps/s $\color{#35bf28}+1.51\%$
test_getitem[slice_int] 43.2820μs 20.2533μs 49.3746 KOps/s 48.7174 KOps/s $\color{#35bf28}+1.35\%$
test_getitem[range] 67.9030μs 36.0313μs 27.7537 KOps/s 27.9547 KOps/s $\color{#d91a1a}-0.72\%$
test_getitem[tuple] 38.4420μs 18.3812μs 54.4035 KOps/s 53.7479 KOps/s $\color{#35bf28}+1.22\%$
test_getitem[list] 0.2933ms 32.1920μs 31.0636 KOps/s 30.2395 KOps/s $\color{#35bf28}+2.73\%$
test_setitem_dim[int] 41.3520μs 23.3936μs 42.7467 KOps/s 41.2910 KOps/s $\color{#35bf28}+3.53\%$
test_setitem_dim[slice_int] 58.7430μs 41.4552μs 24.1224 KOps/s 23.2084 KOps/s $\color{#35bf28}+3.94\%$
test_setitem_dim[range] 80.8540μs 58.9638μs 16.9596 KOps/s 16.7180 KOps/s $\color{#35bf28}+1.45\%$
test_setitem_dim[tuple] 66.4930μs 37.1156μs 26.9428 KOps/s 26.5970 KOps/s $\color{#35bf28}+1.30\%$
test_setitem 83.3350μs 16.0252μs 62.4019 KOps/s 59.5563 KOps/s $\color{#35bf28}+4.78\%$
test_set 85.0250μs 15.5542μs 64.2914 KOps/s 64.0821 KOps/s $\color{#35bf28}+0.33\%$
test_set_shared 3.0273ms 99.9083μs 10.0092 KOps/s 10.0757 KOps/s $\color{#d91a1a}-0.66\%$
test_update 94.5650μs 17.2443μs 57.9902 KOps/s 57.5475 KOps/s $\color{#35bf28}+0.77\%$
test_update_nested 0.1170ms 23.1561μs 43.1852 KOps/s 43.4081 KOps/s $\color{#d91a1a}-0.51\%$
test_set_nested 94.3150μs 16.4407μs 60.8247 KOps/s 60.3421 KOps/s $\color{#35bf28}+0.80\%$
test_set_nested_new 97.2450μs 19.5751μs 51.0854 KOps/s 50.6414 KOps/s $\color{#35bf28}+0.88\%$
test_select 0.1059ms 40.5126μs 24.6837 KOps/s 24.2847 KOps/s $\color{#35bf28}+1.64\%$
test_to 70.0430μs 49.1901μs 20.3293 KOps/s 19.6753 KOps/s $\color{#35bf28}+3.32\%$
test_to_nonblocking 66.3040μs 30.9163μs 32.3454 KOps/s 31.3330 KOps/s $\color{#35bf28}+3.23\%$
test_unbind_speed 0.3613ms 0.3277ms 3.0518 KOps/s 3.0688 KOps/s $\color{#d91a1a}-0.55\%$
test_unbind_speed_stack0 60.7399ms 3.8979ms 256.5460 Ops/s 239.4895 Ops/s $\textbf{\color{#35bf28}+7.12\%}$
test_unbind_speed_stack1 1.8771μs 0.5229μs 1.9125 MOps/s 1.9121 MOps/s $\color{#35bf28}+0.02\%$
test_split 54.3458ms 1.6303ms 613.3768 Ops/s 605.3905 Ops/s $\color{#35bf28}+1.32\%$
test_chunk 53.8432ms 1.6158ms 618.8695 Ops/s 610.7712 Ops/s $\color{#35bf28}+1.33\%$
test_creation[device0] 0.3780ms 0.3054ms 3.2739 KOps/s 3.2652 KOps/s $\color{#35bf28}+0.27\%$
test_creation[device1] 0.7026ms 0.3127ms 3.1978 KOps/s 3.2344 KOps/s $\color{#d91a1a}-1.13\%$
test_creation_from_tensor 59.7632ms 0.3694ms 2.7069 KOps/s 2.9891 KOps/s $\textbf{\color{#d91a1a}-9.44\%}$
test_add_one[memmap_tensor0] 0.1384ms 23.3121μs 42.8962 KOps/s 43.3167 KOps/s $\color{#d91a1a}-0.97\%$
test_add_one[memmap_tensor1] 0.1908ms 70.6277μs 14.1588 KOps/s 14.1322 KOps/s $\color{#35bf28}+0.19\%$
test_contiguous[memmap_tensor0] 25.7710μs 5.8550μs 170.7939 KOps/s 170.5869 KOps/s $\color{#35bf28}+0.12\%$
test_contiguous[memmap_tensor1] 50.7430μs 20.7591μs 48.1716 KOps/s 46.9792 KOps/s $\color{#35bf28}+2.54\%$
test_stack[memmap_tensor0] 48.4330μs 18.9893μs 52.6612 KOps/s 52.7061 KOps/s $\color{#d91a1a}-0.09\%$
test_stack[memmap_tensor1] 0.1199ms 70.1977μs 14.2455 KOps/s 14.2003 KOps/s $\color{#35bf28}+0.32\%$
test_memmaptd_index 0.2814ms 0.2383ms 4.1969 KOps/s 4.2452 KOps/s $\color{#d91a1a}-1.14\%$
test_memmaptd_index_astensor 0.3553ms 0.2938ms 3.4039 KOps/s 3.4865 KOps/s $\color{#d91a1a}-2.37\%$
test_memmaptd_index_op 0.6194ms 0.5335ms 1.8745 KOps/s 1.8713 KOps/s $\color{#35bf28}+0.17\%$
test_reshape_pytree 44.3820μs 20.6634μs 48.3947 KOps/s 48.3375 KOps/s $\color{#35bf28}+0.12\%$
test_reshape_td 54.2330μs 28.9875μs 34.4976 KOps/s 35.6441 KOps/s $\color{#d91a1a}-3.22\%$
test_view_pytree 35.3020μs 20.3417μs 49.1600 KOps/s 49.0640 KOps/s $\color{#35bf28}+0.20\%$
test_view_td 19.2110μs 3.9900μs 250.6240 KOps/s 250.7046 KOps/s $\color{#d91a1a}-0.03\%$
test_unbind_pytree 49.0620μs 25.2068μs 39.6719 KOps/s 39.2588 KOps/s $\color{#35bf28}+1.05\%$
test_unbind_td 76.0540μs 51.0447μs 19.5907 KOps/s 19.7212 KOps/s $\color{#d91a1a}-0.66\%$
test_split_pytree 46.6330μs 23.2825μs 42.9507 KOps/s 42.7756 KOps/s $\color{#35bf28}+0.41\%$
test_split_td 67.4030μs 39.5805μs 25.2649 KOps/s 25.8405 KOps/s $\color{#d91a1a}-2.23\%$
test_add_pytree 65.2530μs 30.1745μs 33.1406 KOps/s 32.9230 KOps/s $\color{#35bf28}+0.66\%$
test_add_td 59.3130μs 40.0119μs 24.9926 KOps/s 24.7951 KOps/s $\color{#35bf28}+0.80\%$
test_distributed 25.5410μs 5.5325μs 180.7490 KOps/s 183.7950 KOps/s $\color{#d91a1a}-1.66\%$
test_tdmodule 30.1820μs 16.0433μs 62.3314 KOps/s 61.1235 KOps/s $\color{#35bf28}+1.98\%$
test_tdmodule_dispatch 0.1917ms 31.2264μs 32.0242 KOps/s 31.6915 KOps/s $\color{#35bf28}+1.05\%$
test_tdseq 38.4220μs 19.0708μs 52.4363 KOps/s 51.4407 KOps/s $\color{#35bf28}+1.94\%$
test_tdseq_dispatch 52.6120μs 34.2138μs 29.2280 KOps/s 29.1254 KOps/s $\color{#35bf28}+0.35\%$
test_instantiation_functorch 1.9302ms 1.6668ms 599.9675 Ops/s 605.8343 Ops/s $\color{#d91a1a}-0.97\%$
test_instantiation_td 1.7322ms 1.1543ms 866.3127 Ops/s 876.3498 Ops/s $\color{#d91a1a}-1.15\%$
test_exec_functorch 0.2150ms 0.1518ms 6.5893 KOps/s 6.5393 KOps/s $\color{#35bf28}+0.76\%$
test_exec_functional_call 0.2071ms 0.1479ms 6.7607 KOps/s 6.7650 KOps/s $\color{#d91a1a}-0.06\%$
test_exec_td 0.1727ms 0.1392ms 7.1865 KOps/s 7.2394 KOps/s $\color{#d91a1a}-0.73\%$
test_exec_td_decorator 0.6784ms 0.1723ms 5.8029 KOps/s 5.7536 KOps/s $\color{#35bf28}+0.86\%$
test_vmap_mlp_speed[True-True] 1.5129ms 1.0126ms 987.5767 Ops/s 971.8905 Ops/s $\color{#35bf28}+1.61\%$
test_vmap_mlp_speed[True-False] 0.6509ms 0.5862ms 1.7060 KOps/s 1.6906 KOps/s $\color{#35bf28}+0.91\%$
test_vmap_mlp_speed[False-True] 1.0162ms 0.9284ms 1.0772 KOps/s 1.0546 KOps/s $\color{#35bf28}+2.14\%$
test_vmap_mlp_speed[False-False] 0.5586ms 0.5166ms 1.9356 KOps/s 1.8864 KOps/s $\color{#35bf28}+2.61\%$
test_vmap_mlp_speed_decorator[True-True] 2.4623ms 1.9245ms 519.6022 Ops/s 510.0448 Ops/s $\color{#35bf28}+1.87\%$
test_vmap_mlp_speed_decorator[True-False] 1.0093ms 0.6260ms 1.5975 KOps/s 1.5852 KOps/s $\color{#35bf28}+0.78\%$
test_vmap_mlp_speed_decorator[False-True] 2.0483ms 1.6772ms 596.2231 Ops/s 589.2840 Ops/s $\color{#35bf28}+1.18\%$
test_vmap_mlp_speed_decorator[False-False] 0.8051ms 0.5323ms 1.8787 KOps/s 1.8687 KOps/s $\color{#35bf28}+0.54\%$
test_vmap_transformer_speed[True-True] 12.1137ms 11.8623ms 84.3007 Ops/s 83.2874 Ops/s $\color{#35bf28}+1.22\%$
test_vmap_transformer_speed[True-False] 7.8972ms 7.8155ms 127.9502 Ops/s 126.7045 Ops/s $\color{#35bf28}+0.98\%$
test_vmap_transformer_speed[False-True] 11.9941ms 11.7674ms 84.9804 Ops/s 84.3543 Ops/s $\color{#35bf28}+0.74\%$
test_vmap_transformer_speed[False-False] 7.8000ms 7.7220ms 129.4994 Ops/s 127.6952 Ops/s $\color{#35bf28}+1.41\%$
test_vmap_transformer_speed_decorator[True-True] 61.4318ms 60.5925ms 16.5037 Ops/s 16.1076 Ops/s $\color{#35bf28}+2.46\%$
test_vmap_transformer_speed_decorator[True-False] 20.6922ms 18.9282ms 52.8312 Ops/s 52.2232 Ops/s $\color{#35bf28}+1.16\%$
test_vmap_transformer_speed_decorator[False-True] 0.1341s 59.0991ms 16.9207 Ops/s 17.9130 Ops/s $\textbf{\color{#d91a1a}-5.54\%}$
test_vmap_transformer_speed_decorator[False-False] 20.2969ms 18.5478ms 53.9148 Ops/s 53.3061 Ops/s $\color{#35bf28}+1.14\%$

vmoens commented Dec 9, 2023

@laurencer thanks for the review.
I created the futures (execution time is roughly the same as before). I thought that executor.__exit__() would wait for submitted jobs to complete, but after testing it, that doesn't seem to be the case. Do we want to optionally let users proceed with the main script without waiting for the futures to complete (and return the futures)?

I also prevented executor and num_threads from being passed together. Eventually, though, I think executor (along with futures) should only be an argument of a sub-call to a new private _memmap function, so that the arguments of the public method are restricted to what they should be.
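
The two options discussed here (block until all writes finish, or hand the futures back to the caller) can be sketched with the standard library alone. This is not the code from this PR: `write_blob`, `save_all`, and the `return_early` flag are hypothetical names, and plain file writes stand in for writing tensors to memory-mapped files. The sketch waits on the futures explicitly rather than relying on how the executor behaves on exit.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor, wait


def write_blob(path, payload):
    # Hypothetical stand-in for writing one tensor to its memory-mapped file.
    with open(path, "wb") as f:
        f.write(payload)
    return path


def save_all(blobs, directory, num_threads=4, return_early=False):
    """Submit one write job per entry and return the list of futures."""
    executor = ThreadPoolExecutor(max_workers=num_threads)
    futures = [
        executor.submit(write_blob, os.path.join(directory, name), payload)
        for name, payload in blobs.items()
    ]
    if return_early:
        # Do not block: already-submitted jobs keep running, and the
        # caller is responsible for waiting on the returned futures.
        executor.shutdown(wait=False)
        return futures
    # Block until every write has finished, then surface any exception.
    wait(futures)
    for fut in futures:
        fut.result()
    executor.shutdown()
    return futures


with tempfile.TemporaryDirectory() as tmp:
    blobs = {"a.bin": b"\x00" * 16, "b.bin": b"\x01" * 8}
    futures = save_all(blobs, tmp, return_early=True)
    # The main script could do other work here before synchronizing.
    wait(futures)
    sizes = sorted(os.path.getsize(f.result()) for f in futures)
```

Keeping `executor` and the futures inside a private helper, as suggested above, would leave only something like `num_threads` on the public method.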

@vmoens vmoens added the enhancement New feature or request label Dec 11, 2023
@vmoens vmoens marked this pull request as ready for review December 11, 2023 14:40

vmoens commented Dec 11, 2023

I implemented some benchmarks too, incl. torch.save for comparison.
On my machine, saving memmap tensors is 10x faster than torch.save. If we add the overhead of constructing the tensordict, it's still >3x faster.

------------------------------------------------------------------------------------------- benchmark: 4 tests ------------------------------------------------------------------------------------------
Name (time in ms)                      Min                 Max                Mean              StdDev              Median                 IQR            Outliers      OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_serialize_weights             49.8471 (1.0)       81.7734 (1.0)       60.9847 (1.0)       12.8616 (5.14)      57.4084 (1.0)       18.2674 (4.61)          1;0  16.3975 (1.0)           6           1
test_serialize_model              144.1813 (2.89)     149.7437 (1.83)     147.8376 (2.42)       2.5030 (1.0)      149.2933 (2.60)       3.9661 (1.0)           1;0   6.7642 (0.41)          5           1
test_serialize_model_pickle       328.4964 (6.59)     706.6764 (8.64)     516.0788 (8.46)     147.0942 (58.77)    482.3225 (8.40)     217.4180 (54.82)         2;0   1.9377 (0.12)          5           1
test_serialize_weights_pickle     389.3548 (7.81)     602.8630 (7.37)     507.1754 (8.32)      86.2705 (34.47)    483.8566 (8.43)     129.5374 (32.66)         2;0   1.9717 (0.12)          5           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
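
For readers unfamiliar with pytest-benchmark output: the number in parentheses is the ratio to the best value in that column. The sketch below recomputes the Mean column's ratios and the speedups they imply. It assumes the `_pickle` tests are the torch.save runs and the other two are the memmap runs, which is inferred from the test names rather than stated in the table.

```python
# Mean times in ms, copied from the table above.
mean_ms = {
    "test_serialize_weights": 60.9847,
    "test_serialize_model": 147.8376,
    "test_serialize_model_pickle": 516.0788,
    "test_serialize_weights_pickle": 507.1754,
}

# Ratio of each mean to the fastest mean, as shown in parentheses.
best = min(mean_ms.values())
relative = {name: round(value / best, 2) for name, value in mean_ms.items()}

# Speedups implied by the means (assumed pairing of memmap vs. pickle tests).
weights_speedup = round(
    mean_ms["test_serialize_weights_pickle"] / mean_ms["test_serialize_weights"], 2
)
model_speedup = round(
    mean_ms["test_serialize_model_pickle"] / mean_ms["test_serialize_model"], 2
)

for name, ratio in relative.items():
    print(f"{name}: {ratio}x the best mean")
print(f"weights: {weights_speedup}x, model: {model_speedup}x")
```

By the mean, the weights-only comparison comes out at a bit over 8x and the full-model comparison at about 3.5x on this run.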

cc @janeyx99 @fegin

fegin commented Dec 11, 2023

@vmoens Thanks for the information. Just out of curiosity, have you tried comparing the performance when the target storage is an in-memory file system, like tmpfs?

vmoens commented Dec 11, 2023

That would be memmap_() (without an associated file). I can check that!

EDIT
@fegin I tested this. Here are the results with a model on CUDA:

----------------------------------------------------------------------------------------------- benchmark: 6 tests -----------------------------------------------------------------------------------------------
Name (time in ms)                          Min                   Max                  Mean              StdDev                Median                 IQR            Outliers     OPS            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_serialize_model                  121.6663 (1.0)        130.4117 (1.0)        124.9532 (1.0)        3.5488 (3.30)       123.2064 (1.0)        4.9186 (5.18)          1;0  8.0030 (1.0)           5           1
test_serialize_weights                121.9595 (1.00)       179.9572 (1.38)       136.3691 (1.09)      24.6168 (22.88)      128.7176 (1.04)      19.9220 (20.97)         1;1  7.3330 (0.92)          5           1
test_serialize_weights_filesystem     180.8761 (1.49)       184.9871 (1.42)       182.4715 (1.46)       1.6260 (1.51)       181.9611 (1.48)       2.7244 (2.87)          1;0  5.4803 (0.68)          6           1
test_serialize_model_filesystem       185.4803 (1.52)       188.0399 (1.44)       186.1431 (1.49)       1.0760 (1.0)        185.7062 (1.51)       0.9500 (1.0)           1;1  5.3722 (0.67)          5           1
test_serialize_model_pickle           543.6802 (4.47)     1,378.4349 (10.57)    1,193.0845 (9.55)     363.2853 (337.62)   1,350.2534 (10.96)    214.7851 (226.08)        1;1  0.8382 (0.10)          5           1
test_serialize_weights_pickle         553.2777 (4.55)     1,378.2485 (10.57)    1,196.6860 (9.58)     360.4329 (334.97)   1,354.8406 (11.00)    248.4223 (261.48)        1;1  0.8356 (0.10)          5           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

The difference between the filesystem and non-filesystem variants seems to be quite machine- and distro-dependent, though.

With LLAMA2 7B serialization on disk, we get:

saving llama2 7b with torch.save: 30 sec
saving llama2 7b with tensordict: 11 sec
loading llama2 7b with torch.load: 24 sec
loading llama2 7b with tensordict: 8 sec

@vmoens vmoens merged commit e3353f1 into main Dec 12, 2023
41 of 45 checks passed
@vmoens vmoens deleted the multithread-memmap branch December 12, 2023 15:49

4 participants