[core][compiled graphs] Fix and re-enable shared memory channel buffering support #49044

Closed
ruisearch42 opened this issue Dec 3, 2024 · 3 comments · Fixed by #49755
Labels: beta, bug, compiled-graphs, core

Comments

@ruisearch42
Contributor

Description

The feature requested in #43826 was implemented in #47272. However, due to performance issues, it was turned off (buffer_size is set to 1). We need to fix and re-enable this feature.
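
To illustrate what a buffer size of 1 means for a writer, here is a minimal sketch that uses Python's standard `queue.Queue` as a stand-in for a bounded channel. It is not Ray's shared memory channel implementation; the function name `writes_before_blocking` is hypothetical and exists only for this illustration.

```python
import queue


def writes_before_blocking(buffer_size):
    """Count how many writes succeed before the writer would have to wait.

    `queue.Queue` is only a stand-in for a bounded channel here; this is an
    illustration of buffering in general, not Ray's channel code.
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    channel = queue.Queue(maxsize=buffer_size)
    writes = 0
    while True:
        try:
            # put_nowait raises queue.Full instead of blocking.
            channel.put_nowait(writes)
        except queue.Full:
            return writes
        writes += 1
```

With no reader draining the channel, a buffer size of 1 lets the writer get only one value ahead, while a larger buffer lets it run further ahead before it has to wait.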


ruisearch42 added the bug, triage, beta, and compiled-graphs labels on Dec 3, 2024
jcotant1 added the core label and removed the triage label on Dec 3, 2024
stephanie-wang self-assigned this on Dec 3, 2024
@stephanie-wang
Contributor

Hmm, I'm actually not able to reproduce this. It seems like we're okay to just increase the default number of buffers.

#49050:

[unstable]_local_put_local_get,_single_channel_calls = [70313.8544074326, 240.64879571503113]
[unstable]_local_put_1_remote_get,_single_channel_calls = [92760.90457755627, 11623.231349900625]
[unstable]_local_put_n_remote_get,_single_channel_calls = [2839.039489909441, 53.10805558081656]
[unstable]_local_put_1_remote_get,_n_channels_calls = [4464.876632409017, 76.55683991749665]
[unstable]_local_put_n_remote_get,_n_channels_calls = [4301.232907415234, 913.633835251876]
[unstable]_single_actor_DAG_calls = [979.0540716128761, 17.866187268887646]
[unstable]_compiled_single_actor_DAG_calls = [17865.870643156726, 105.48774878710981]
[unstable]_compiled_single_actor_asyncio_DAG_calls = [3375.009809857613, 11.352355873641569]
[unstable]_scatter_gather_DAG_calls,_n=32_actors = [73.81770738505236, 0.7801573092408468]
[unstable]_compiled_scatter_gather_DAG_calls,_n=32_actors = [1159.7161601121559, 229.68575927743714]
[unstable]_compiled_scatter_gather_asyncio_DAG_calls,_n=32_actors = [483.66645840998495, 23.83833318078703]
[unstable]_chain_DAG_calls,_n=32_actors = [32.587348727022395, 0.7988237889024955]
[unstable]_compiled_chain_DAG_calls,_n=32_actors = [821.6376677447137, 15.212133890804099]
[unstable]_compiled_chain_asyncio_DAG_calls,_n=32_actors = [548.9419764009706, 1.9705289055835995]
[unstable]_multiple_args_with_small_payloads_DAG_calls,_n=8_actors = [141.71332907201105, 6.627092839023008]
[unstable]_compiled_multiple_args_with_small_payloads_DAG_calls,_n=8_actors = [3904.0943570581685, 17.788989829272015]
[unstable]_multiple_args_with_medium_payloads_DAG_calls,_n=8_actors = [33.47706490100181, 1.487347936494732]
[unstable]_compiled_multiple_args_with_medium_payloads_DAG_calls,_n=8_actors = [310.9364473852743, 9.810230358945574]
[unstable]_multiple_args_with_large_payloads_DAG_calls,_n=8_actors = [1.3361992753495433, 0.056027178367644315]
[unstable]_compiled_multiple_args_with_large_payloads_DAG_calls,_n=8_actors = [9.318240601981515, 0.17500882513012356]
[unstable]_single_actor_with_all_args_with_small_payloads_DAG_calls,_n=1_actors = [5208.854175023359, 30.493025936061184]

vs master:

[unstable]_local_put_local_get,_single_channel_calls = [56376.21936863016, 141.55781318554565]
[unstable]_local_put_1_remote_get,_single_channel_calls = [111328.76600519221, 4185.934012627729]
[unstable]_local_put_n_remote_get,_single_channel_calls = [3155.2366270270422, 14.059119590563787]
[unstable]_local_put_1_remote_get,_n_channels_calls = [3837.6699611922604, 2.423009669688208]
[unstable]_local_put_n_remote_get,_n_channels_calls = [4224.514150338189, 748.5842191167661]
[unstable]_single_actor_DAG_calls = [958.5571059971033, 11.188539718845824]
[unstable]_compiled_single_actor_DAG_calls = [17649.423661347464, 90.43910209512293]
[unstable]_compiled_single_actor_asyncio_DAG_calls = [3306.997631023034, 24.56699986581915]
[unstable]_scatter_gather_DAG_calls,_n=32_actors = [71.75884698907845, 1.5869585406082631]
[unstable]_compiled_scatter_gather_DAG_calls,_n=32_actors = [1110.2836068867148, 141.794087838474]
[unstable]_compiled_scatter_gather_asyncio_DAG_calls,_n=32_actors = [450.8476402487272, 17.296731789531698]
[unstable]_chain_DAG_calls,_n=32_actors = [31.096264410393182, 1.0671578547507894]
[unstable]_compiled_chain_DAG_calls,_n=32_actors = [817.7977870848244, 13.435156094597975]
[unstable]_compiled_chain_asyncio_DAG_calls,_n=32_actors = [528.963245880861, 7.601556180525647]
[unstable]_multiple_args_with_small_payloads_DAG_calls,_n=8_actors = [136.98971423914048, 4.204565809422393]
[unstable]_compiled_multiple_args_with_small_payloads_DAG_calls,_n=8_actors = [3673.5775670093462, 222.74022209505685]
[unstable]_multiple_args_with_medium_payloads_DAG_calls,_n=8_actors = [33.41456367287914, 0.6121419351325709]
[unstable]_compiled_multiple_args_with_medium_payloads_DAG_calls,_n=8_actors = [295.98139220301175, 4.396832730052045]
[unstable]_multiple_args_with_large_payloads_DAG_calls,_n=8_actors = [1.3140666469064939, 0.056535648383938174]
[unstable]_compiled_multiple_args_with_large_payloads_DAG_calls,_n=8_actors = [9.469645093260244, 0.10319720230902002]
[unstable]_single_actor_with_all_args_with_small_payloads_DAG_calls,_n=1_actors = [5144.458454239577, 5.309447432258948]
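
For anyone comparing the two runs above, a small helper like the following could compute the per-benchmark change relative to master. This is only a sketch: `parse_results` and `percent_change` are hypothetical names, and it assumes each line has the form `name = [a, b]` with the first number being the throughput figure to compare (the thread does not say what the two numbers mean).

```python
import ast


def parse_results(text):
    """Parse lines of the form `name = [a, b]` into {name: (a, b)}.

    Leading table residue such as "| " is tolerated; other lines are skipped.
    """
    results = {}
    for line in text.splitlines():
        line = line.strip().lstrip("|").strip()
        if " = [" not in line:
            continue
        name, _, values = line.partition(" = ")
        first, second = ast.literal_eval(values)
        results[name] = (first, second)
    return results


def percent_change(baseline, candidate):
    """Percent change of the first value versus baseline, for shared names.

    Assumption: the first value in each pair is the number worth comparing.
    """
    return {
        name: 100.0 * (candidate[name][0] - baseline[name][0]) / baseline[name][0]
        for name in baseline
        if name in candidate
    }
```

Passing the master output as `baseline` and the #49050 output as `candidate` gives a positive value where the PR run is faster and a negative value where it is slower. Given the size of the second number in some pairs, small differences may well be noise.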

@ruisearch42
Contributor Author

I think it is mainly a multi-node perf regression:
[image]

See the rightmost part of the graph (on the 10th).

stephanie-wang removed their assignment on Jan 7, 2025
@stephanie-wang
Contributor

Since the regression only occurs in multi-node, one option is to only enable buffering for intra-node channels.
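
One way to picture this option is a per-channel decision on the number of buffers based on where the writer and readers live. The sketch below is an assumption-laden illustration, not Ray code: `Endpoint`, `choose_num_buffers`, and the default of 10 buffers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical description of a channel endpoint (not a Ray class)."""

    actor_name: str
    node_id: str


def choose_num_buffers(writer, readers, buffered=10):
    """Return `buffered` only when every reader is on the writer's node.

    Any cross-node reader falls back to a single buffer, which mirrors the
    current behavior of buffering being turned off.
    """
    same_node = all(reader.node_id == writer.node_id for reader in readers)
    return buffered if same_node else 1
```

For example, a writer on node `"A"` with all readers on node `"A"` would get the buffered setting, while adding a single reader on node `"B"` would drop the channel back to one buffer.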
