Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

restart worker during tests depending on resident memory size #13577

Merged
merged 1 commit into from
Oct 15, 2015

Conversation

amitmurthy
Copy link
Contributor

This is an alternative take on #13567 (for fixing #11553)

It restarts workers once the resident memory size of a worker is above 200MB (configurable via an ENV variable JULIA_TEST_MAXRSS_MB)

One other difference from the existing setup is that all tests are always run (even if some fail)

@amitmurthy amitmurthy changed the title restart worker during tests depending on resident memory size RFC: restart worker during tests depending on resident memory size Oct 13, 2015
@amitmurthy
Copy link
Contributor Author

We ought to be able to go back upto 4 or 8 workers with this PR.

@amitmurthy
Copy link
Contributor Author

Oops! On Travis Linux the RSS size reported seems to be wrong. Was correct on my local OSX machine. Will work on that.

@amitmurthy amitmurthy changed the title RFC: restart worker during tests depending on resident memory size RFC/WIP: restart worker during tests depending on resident memory size Oct 13, 2015
@printf(" in %6.2f seconds\n", tt)
nothing
end
@printf(" in %6.2f seconds, rss %7.2f MB\n", tt, Sys.get_rusage().ru_maxrss / 2^20)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rusage rss is in kb so this is off by a factor of 1024 http://linux.die.net/man/2/getrusage

edit: but apparently it's right on osx, and useless on windows. wonder if recent libuv is any better.

@yuyichao
Copy link
Contributor

We ought to be able to go back upto 4 or 8 workers with this PR.

There isn't too much benefit for doing that since AFAIK there are only 2 dedicated CPUs for each worker.

@amitmurthy
Copy link
Contributor Author

There isn't too much benefit for doing that since AFAIK there are only 2 dedicated CPUs for each worker.

Given that there is some IO testing involved, it may be worth to test if there is any improvement with 4 workers.

@@ -742,6 +742,30 @@ DLLEXPORT jl_sym_t* jl_get_ARCH()
return ARCH;
}

DLLEXPORT size_t jl_maxrss()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it worth using the libuv function directly from julia instead of this? (https://github.com/nodejs/node-v0.x-archive/blob/master/deps/uv/include/uv.h#L1020-L1039)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had that in the first commit, but Windows is not supported in the libuv interface - it returns 0 for maxrss size. Hence decided to just have a simple solution that works on all 3 test platforms.

@tkelman
Copy link
Contributor

tkelman commented Oct 13, 2015

I feel like this shouldn't be on by default, it should be a ci only env var. Especially in make testall1 it can identify bugs or memory leaks to intentionally run all in one process, so should preserve the ability to do that manually.

@amitmurthy
Copy link
Contributor Author

I agree. Will make it configurable.

@kshyatt kshyatt added parallelism Parallel or distributed computation test This change adds or pertains to unit tests labels Oct 13, 2015

#elif defined(_OS_LINUX_) || defined(_OS_DARWIN_)
struct rusage rusage;
getrusage( RUSAGE_SELF, &rusage );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any idea about bsd?

@amitmurthy
Copy link
Contributor Author

Have updated the following:

  • Single process test will run as before. Will stop upon any error or when max RSS value is greater than env var JULIA_TEST_MAXRSS_MB. Unlike the multi process case where all tests are always executed.
  • This commit has default value of JULIA_TEST_MAXRSS_MB set to 500MB - just for testing on all 3 CI platforms. After this PR is merged, we can increase the default value to a large number, say 100,000 MB and control it only for required CI platforms like Linux 64-bit. This would ensure that local runs are not subject to worker restart unless the developer explicitly specifies it in his/her local environment.
  • Have incorporated OS_FREEBSD for returning maxrss - same code as for DARWIN.

@amitmurthy amitmurthy changed the title RFC/WIP: restart worker during tests depending on resident memory size restart worker during tests depending on resident memory size Oct 15, 2015
@amitmurthy
Copy link
Contributor Author

Have squashed and rebased. Does #13569 render this PR moot?

@tkelman
Copy link
Contributor

tkelman commented Oct 15, 2015

No, not yet anyway. Docker experiments are showing something that would work for master, but would be a little more complicated to get working for PR's.

This doesn't look like it's yet defaulting to a large out-of-the-way number for local builds.

@amitmurthy
Copy link
Contributor Author

That will have to be a different PR which ups the default value of JULIA_TEST_MAXRSS_MB in runtests.jl and also sets it in the environment only for Travis Linux 64-bit (I don't know where that has to be specified).

It is low (500MB) in this PR so that the worker restart stuff can be tested for all current CI platforms.

@tkelman
Copy link
Contributor

tkelman commented Oct 15, 2015

Okay, I guess that works. If you want to set an env var only for 64 bit linux Travis, add an export roughly here:

julia/.travis.yml

Lines 39 to 40 in 8a86270

else
sudo apt-get install patchelf gfortran llvm-3.3-dev libsuitesparse-dev libopenblas-dev liblapack-dev libarpack2-dev libfftw3-dev libgmp-dev libpcre3-dev libunwind7-dev libopenlibm-dev libmpfr-dev -y;
(and don't forget the end-of-line semicolon since it's bash-in-yml)

@amitmurthy
Copy link
Contributor Author

CI is green. Will merge in a couple of hours if there are no objections.

amitmurthy added a commit that referenced this pull request Oct 15, 2015
restart worker during tests depending on resident memory size
@amitmurthy amitmurthy merged commit 379e9c2 into master Oct 15, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parallelism Parallel or distributed computation test This change adds or pertains to unit tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants