Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CELEBORN-1792][FOLLOWUP] Keep resume for a while after resumeByPinnedMemory #3099

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

TheodoreLx
Copy link

What changes were proposed in this pull request?

In the switchServingState after resumeByPinnedMemory, keep the resume channel to prevent the channel from frequently resuming and pausing before memoryUsage decreases to pausePushDataThreshold.

Why are the changes needed?

Frequent channel resume and pause will result in slow data reception and failure to quickly reduce memoryUsage to below pausePushDataThreshold.

Does this PR introduce any user-facing change?

no

How was this patch tested?

ut

@SteNicholas
Copy link
Member

@TheodoreLx, please use dev/reformat to format code style.

@TheodoreLx
Copy link
Author

@SteNicholas The code has been reformatted

@RexXiong
Copy link
Contributor

RexXiong commented Feb 17, 2025

Could you fix the ut [CELEBORN-1792] Test MemoryManager resume by pinned memory? @TheodoreLx

@TheodoreLx
Copy link
Author

Here are some test results about #3018 I want to share:

After triggering resumeByPinnedMemory, it will enter pause immediately. The duration of resumeByPinnedMemory is less than 10ms, and the worker will soon fall into the pause deadlock state again until resumeByPinnedMemory is triggered again in 10 seconds. In other words, worker wastes about 10 seconds staying in the pause state.
image
image

2025-02-19 16:53:11,053 resumeByPinnedMemory
2025-02-19 16:53:11,353 pause immediately
2025-02-19 16:53:15,749 stuck in pause deadlock again, diskBuffer=0 and all channels pause
2025-02-19 16:53:21,067 resumeByPinnedMemory again after 10s

Copy link
Contributor

@RexXiong RexXiong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

return false;
}
if (System.currentTimeMillis() - pinnedMemoryLastCheckTime >= pinnedMemoryCheckInterval) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

introduce a variable for System.currentTimeMillis() ?

Copy link

codecov bot commented Feb 24, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 32.68%. Comparing base (fdf1883) to head (24a94fc).
Report is 22 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3099      +/-   ##
==========================================
+ Coverage   32.62%   32.68%   +0.06%     
==========================================
  Files         336      338       +2     
  Lines       20083    20273     +190     
  Branches     1798     1812      +14     
==========================================
+ Hits         6551     6625      +74     
- Misses      13165    13279     +114     
- Partials      367      369       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

resumeReplicate();
} else {
logger.info("Trigger action: PAUSE PUSH");
pausePushDataStartTime = System.currentTimeMillis();
resumingByPinnedMemory = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resumingByPinnedMemory need also change to false when servingState changes to NONE_PAUSED state

Copy link
Contributor

@RexXiong RexXiong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@FMX
Copy link
Contributor

FMX commented Feb 25, 2025

@TheodoreLx There is an issue with this PR. The trimCounter should be moved into the block of the if(!tryResumeByPinnedMemory(servingState, lastState)).

The current implementation will cause the metric of PAUSE_PUSH_DATA_TIME and PAUSE_PUSH_DATA_AND_REPLICATE_TIME to be larger than the time spent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants