Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix O(n^2) behavior in the log_monitor #5569

Merged
merged 12 commits into from
Aug 29, 2019

Conversation

pcmoritz
Copy link
Contributor

@pcmoritz pcmoritz commented Aug 28, 2019

Why are these changes needed?

Currently, the log monitor can get choked if there are too many logfiles around. If the log directory is too large, the glob.glob call in

log_file_paths = glob.glob("{}/worker*[.out|.err]".format(
can take a large amount of time, and it will be called in each iteration of the log monitor loop. Typically this happens if there are many log files from dead actors. In this PR, we move log files from dead actors to a subdirectory, so glob.glob keeps being fast.

Related issue number

Fixes #5320

Linter

  • I've run scripts/format.sh to lint the changes in this PR.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16613/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16612/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16615/
Test FAILed.

del a

while True:
log_file_paths = glob.glob("{}/worker*.out".format(logs_dir))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be checking logs/old instead of logs?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And we should also check it before the del a call to make sure it is empty.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you are right, the test was wrong. I updated it and it is now passing the right test.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you see the other part of my comment, which was

And we should also check it before the del a call to make sure it is empty.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe even check specifically that function f finished appears in the log file? Up to you.

with open(log_file_path, "r") as f:
if "function f finished\n" in f.readlines():
done = True
time.sleep(0.2)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0.2 is awfully long, if you're going to sleep, then how about 0.01 or something like that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part is now removed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16617/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16619/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16627/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16634/
Test FAILed.

@pcmoritz
Copy link
Contributor Author

Jenkins retest this please

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16632/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16633/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16635/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16636/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16637/
Test PASSed.

# The process is not alive any more, so move the log file
# out of the log directory so glob.glob will not be slowed
# by it.
target = os.path.join(self.logs_dir, "old",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of moving it how about we add it to a blacklist of files to ignore? I don't think the glob is the issue just the opening the files.

Copy link
Contributor Author

@pcmoritz pcmoritz Aug 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running the glob on ~200.000 logfiles takes about 15s, so I assumed it was a problem but not super sure. I hate moving the logfiles too.

@pcmoritz pcmoritz merged commit 04b8696 into ray-project:master Aug 29, 2019
@pcmoritz pcmoritz deleted the fix-too-many-logfiles branch August 29, 2019 21:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

many_drivers, node_failures stress test gets stuck after a while with 100% CPU in log monitor
4 participants