DArray : memory not fully recovered upon gc() #8912

Closed
amitmurthy opened this issue Nov 6, 2014 · 15 comments
Labels
parallelism Parallel or distributed computation

Comments

@amitmurthy
Contributor

DArrays do not seem to be fully garbage collected.

context: https://groups.google.com/d/msg/julia-users/O8-Axv7wVZE/Xe-_8LIGKhAJ

@amitmurthy amitmurthy added the parallelism Parallel or distributed computation label Nov 6, 2014
@amitmurthy
Contributor Author

Starting with one Julia worker and executing the following multiple times:

a=remotecall(2, ()->ones(10^8));
a=1;gc(); Base.flush_gc_msgs();remotecall(2,gc);

I notice a leak the first time around, but not on subsequent invocations of the remotecall.

However, if I change the remotecall to remotecall_fetch and execute

a=remotecall_fetch(2, ()->ones(10^8));
a=1;gc(); Base.flush_gc_msgs();remotecall(2,gc);

there is additional memory retained in the worker every time around. There is also an increase in the memory size of the master, but that is only for the first time.

I suspect there are two issues here:

I am pretty sure one is #6597, since both remotecall and remotecall_fetch result in a new task on the worker.

The other is probably #4508 / #6876

@nowozin

nowozin commented Nov 11, 2014

I noticed another thing which may be related or a separate issue:

On Windows 8.1, 64 bit, Julia 0.4.0-dev+1318 7a7110b, when using distribute() to create a DArray over 12 cores from an array of size 100, each Julia subprocess appears to use the full amount of memory.

That is, the process that calls distribute(A) uses 2.6GB, of which around 1.8GB is the array A, and afterwards all 11 worker processes (created with addprocs(11)) use the same amount of memory, 2.6GB each. This was not the case three weeks ago on 0.4.0-dev.

Another symptom is that the distribute() call is much slower than before, and during the several minutes it takes, only a few Julia processes are active. For example, initially the main Julia process uses 8% CPU (1/12) and one other Julia process also uses 8% CPU (1/12), while the remaining 10 processes show 0% CPU activity. After a minute three processes are each using 8%, after another minute four, and so on until all processes are active and 100% CPU is utilized. Then the distribute() call returns.

This is very different from what I observed when originally developing the code a few weeks ago and I have not changed it since.

@denizyuret
Contributor

Here are a couple of more problem reports for reference:
https://groups.google.com/d/topic/julia-users/q39vyGQF4Fs/discussion
https://groups.google.com/d/topic/julia-users/zsT2qfwDuHA/discussion

As a workaround, removing all the workers (rmprocs(workers())) and restarting them (addprocs(ncpu)) every iteration seems to work.
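A minimal sketch of the rmprocs/addprocs workaround described above, assuming a current Julia release where these functions live in the Distributed standard library (in 0.4 they were in Base). The worker count NCPU, the function name run_iterations and the per-iteration work (rand(10^6) on each worker) are hypothetical stand-ins. The idea is only that every worker is torn down and replaced between iterations, so whatever memory a worker failed to release disappears with its process.

```julia
using Distributed  # assumes Julia >= 0.7; in 0.4 addprocs/rmprocs/pmap were in Base

const NCPU = 2  # hypothetical worker count; set to the number of cores you want

# hypothetical driver illustrating the restart-workers-every-iteration workaround
function run_iterations(niter)
    addprocs(NCPU)
    totals = Float64[]
    for iter in 1:niter
        # stand-in for the real work: each worker returns a large array
        results = pmap(_ -> rand(10^6), workers())
        push!(totals, sum(sum, results))
        results = nothing
        # workaround: discard every worker, along with any memory it retained...
        rmprocs(workers())
        # ...and start a fresh set of workers for the next iteration
        addprocs(NCPU)
    end
    rmprocs(workers())  # clean up the last set of workers
    return totals
end

totals = run_iterations(3)
```

Restarting workers only reclaims memory held by the worker processes; it does nothing for memory retained by the master, which is where the growth shows up in the pmap example that follows.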

I suspect the problem may go deeper than distributed arrays: In my case the distributed array is not that big, but the result I fetch from the workers (via pmap) is. The memory usage is consistent with those fetched values not being properly garbage collected. I will post a simple example to replicate if I can come up with one.

@denizyuret
Contributor

OK, here is the example:

mypid = ccall((:getpid, "libc"), Int32, ())
addprocs(10)
for i=1:10
    p=pmap(workers()) do x
        rand(1<<27)
    end
    p=nothing
    @everywhere gc()
    run(pipe(`ps auxww`,`grep $mypid`))
end

and here is the output (roughly 10GB of resident memory is added to the master on every iteration except the last):

dyuret   31594  4.3  4.0 19759168 10685080 pts/2 Rl+ 21:19   0:12 julia
dyuret   31594  6.1  8.0 30292596 21218468 pts/2 Sl+ 21:19   0:17 julia
dyuret   31594  7.8 11.9 40782492 31708344 pts/2 Sl+ 21:19   0:23 julia
dyuret   31594  9.4 15.9 51276484 42202348 pts/2 Sl+ 21:19   0:28 julia
dyuret   31594 10.9 19.9 61762284 52688156 pts/2 Sl+ 21:19   0:33 julia
dyuret   31594 12.3 23.8 72248084 63173968 pts/2 Sl+ 21:19   0:39 julia
dyuret   31594 13.7 27.8 82733884 73659792 pts/2 Sl+ 21:19   0:44 julia
dyuret   31594 15.1 31.7 93219684 84145604 pts/2 Sl+ 21:19   0:49 julia
dyuret   31594 16.4 35.7 103705484 94631416 pts/2 Sl+ 21:19   0:54 julia
dyuret   31594 17.7 35.7 103705484 94631424 pts/2 Sl+ 21:19   0:59 julia

@samuela
Contributor

samuela commented Apr 20, 2015

What's the status on this? Until we have a fix, this issue seems significant enough to warrant labeling DArrays as an experimental feature. At the very least I think something should be mentioned about this in the docs.

@jiahao
Member

jiahao commented Apr 20, 2015

@samuela DArrays no longer exist in base.

@samuela
Contributor

samuela commented Apr 20, 2015

Oh, cool beans! Should this issue be closed then? What should be used in place of DArrays now?

@pao
Member

pao commented Apr 20, 2015

https://github.com/JuliaParallel/DistributedArrays.jl took over the DArray code.

I'm not sure where the actual issue lies, but since it's been cross-referenced by folks who should have some idea, I'll leave this open.

@amitmurthy
Contributor Author

After 3bbc5fc

mypid = getpid()
addprocs(10)
for i=1:200
    p=pmap(workers()) do x
        ones(10^7)
    end
    p=nothing
    @everywhere gc()
    run(pipe(`ps auxww`,`grep $mypid`))
end

I find that the master slowly grows to around 10GB of resident memory over the first 20 iterations and holds steady for the remaining 180. This 10GB is not released even after all the iterations complete.

@carnaval, @vtjnash, any explanation for this?

@vtjnash
Member

vtjnash commented Apr 23, 2015

It seems possible that gc() is just not being called often enough locally to propagate the gc messages, but I'm not sure that fully explains it.

@jakebolewski
Copy link
Member

Why would that impact memory usage on the master node?

@amitmurthy
Copy link
Contributor Author

With

julia> using DistributedArrays

julia> for i in 1:100
         d=dones(2*10^8)
         a=convert(Array,d)
         @everywhere gc()
       end

top shows

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 4459 amitm     20   0 19.787g 5.408g   2160 R  91.1 34.7   1:07.21 julia
 4467 amitm     20   0 14.397g 4.622g  10132 S  19.0 29.7   0:14.08 julia
 4468 amitm     20   0 13.001g 2.461g  10124 S  18.3 15.8   0:13.14 julia

with the master process varying between 30% and 40% of system memory and the workers between 15% and 30%.

It is no longer a leak, since the loop runs to completion, but at the end the memory is not released. Does libuv or the malloc implementation cache memory buffers in anticipation of future use?

I'll try to test this on OS X later in the day to see if the behavior is limited to Linux.

@vtjnash
Member

vtjnash commented Apr 23, 2015

libuv tries really hard not to allocate anything, but malloc and the Julia GC will hold onto some amount of memory. 10GB sounds a bit high, although I guess if there was something on every page, it would have trouble releasing the memory fully.

@amitmurthy
Copy link
Contributor Author

@vtjnash, can you take a look at #10960 sometime? While debugging parallel code, it is difficult to proceed when there is no guarantee that finalizers on RemoteRefs are called in all circumstances.

@amitmurthy
Copy link
Contributor Author

closed by 6b94780


8 participants