
Memory leak? #66

Closed
mirestrepo opened this issue Mar 20, 2017 · 12 comments

Comments

@mirestrepo

mirestrepo commented Mar 20, 2017

Hi,

I'm computing pairwise distances for a large matrix (500 x 100,000), and the memory footprint keeps growing indefinitely. I am pre-allocating the output matrix, so I don't think that should be happening. In fact, the process eventually gets killed by the kernel (after using 60 GB+ of memory)... I suspect a memory leak. The code I'm running looks something like:

using Base.Threads
using Distances
using JLD


function main(data)

    nvectors = size(data,2)
    js = Matrix{Float64}(nvectors, nvectors)

    pairwise!(js, JSDivergence(), data, data)

    println("Done computing distances")

    writedlm("./mallet_composition_500_JS.txt", js)
end


@time const data = jldopen("./mallet_composition_500.jld", "r") do file
    read(file, "data")
end
data_small = data[:, 1:100]
@time main(data_small)
@time main(data)

Any insights? I haven't profiled for memory leaks in Julia before, but I'll try to see if I can help with more specifics.

Thanks!

@KristofferC
Member

Aren't you trying to allocate a 74 GB matrix (1e5 x 1e5 Float64s)?
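The arithmetic behind that number is easy to check (a quick sanity check, not part of the original thread):

```julia
# Size of a dense 100_000 x 100_000 Float64 distance matrix.
n = 100_000
bytes = n^2 * sizeof(Float64)   # 8 bytes per Float64 => 8e10 bytes
gib = bytes / 2^30              # convert to GiB
println(gib)                    # roughly 74.5 GiB
```

Switching the element type to Float32 would halve this, but the matrix would still be far too large for most machines.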

@StefanKarpinski
Contributor

StefanKarpinski commented Mar 20, 2017

The main "mystery" is why the output allocation would succeed initially but then continue to consume memory as the array is used. That happens because the kernel overcommits: it doesn't actually back pages with physical memory at allocation time, only when each page is first touched, at which point the kernel gives your process a real RAM-backed page for that memory.
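A small sketch of this behavior (my illustration, in current Julia syntax, not from the thread; `Sys.maxrss` reports the process's peak resident set size, and the exact numbers are OS-dependent):

```julia
# An `undef` allocation reserves address space; on Linux the kernel only
# backs each page with physical RAM once that page is first written.
n = 10_000
rss_before = Sys.maxrss()             # peak resident set size so far, in bytes
a = Matrix{Float64}(undef, n, n)      # ~800 MB of address space reserved
rss_after_alloc = Sys.maxrss()        # typically barely moves
fill!(a, 0.0)                         # touching every page forces real allocation
rss_after_touch = Sys.maxrss()        # now the big jump shows up
println((rss_after_alloc - rss_before, rss_after_touch - rss_before))
```

This is why the allocation "succeeds" and the process only dies later, once enough of the matrix has been written.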

@mirestrepo
Author

@KristofferC, yes, you are correct. My desktop has a lot of RAM, so I thought I could manage it, and since that initial allocation was working I was confused, but @StefanKarpinski's explanation makes sense. Do any of you have recommendations for handling a problem of this size? Splitting it up, map-reduce style?

@nalimilan
Member

To be clear, do you have 100,000 variables with 500 observations, or 500 variables with 100,000 observations? IIRC this package uses a different convention than much other software and stores variables as rows.

@mirestrepo
Author

500 variables, 100,000 observations

@nalimilan
Member

OK, so try with the transposed matrix. :-)

@mirestrepo
Author

mirestrepo commented Mar 20, 2017

I'm not sure I follow that. I need the pairwise distance between observations... and I believe this package uses column-to-column computations.

If the distance is symmetric, I could try computing only the triangular part of the matrix.
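That idea can be sketched like this (my sketch, with a generic `dist` function standing in for the JS divergence; note that Distances.jl's single-matrix `pairwise` methods may already exploit symmetry, and this halves the computation but not the memory for the output):

```julia
# For a symmetric distance, compute only the upper triangle and mirror it,
# roughly halving the number of distance evaluations.
function pairwise_symmetric(dist, data::AbstractMatrix)
    n = size(data, 2)
    D = Matrix{Float64}(undef, n, n)
    for j in 1:n, i in 1:j
        d = dist(view(data, :, i), view(data, :, j))
        D[i, j] = d
        D[j, i] = d    # symmetry: dist(i, j) == dist(j, i)
    end
    return D
end
```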

@nalimilan
Member

nalimilan commented Mar 20, 2017

Yes, but what I mean is that maybe your data contains variables as columns, while pairwise expects them as rows (#35)? That might explain why so much memory is used. I haven't been able to understand the format of data just by looking at the code.

EDIT: Sorry, I misread, so it's not a rows vs. columns issue. The full matrix really needs 100_000^2 * 8 bytes, i.e. about 75 GiB. You could use Float32 and halve that requirement. Not sure whether it's possible to use a triangular (symmetric) storage instead.

@mirestrepo
Author

Float32 is an idea, but before sacrificing precision... is it possible to work with memory-mapped matrices or something like that?

@mirestrepo
Author

Thanks, all, for the answers. I'll close this issue, since it's not a leak problem. If anyone has further recommendations, they're greatly appreciated.

@nalimilan
Member

See ?Mmap.mmap for mmapped arrays.
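A minimal sketch of a file-backed output matrix (my illustration; the filename is hypothetical). The OS pages the data to disk, so the full ~75 GiB never has to be resident at once, though access patterns that bounce around the file can be slow:

```julia
using Mmap

# Back the output matrix with a file instead of RAM.
n = 100_000
io = open("js_distances.bin", "w+")
js = Mmap.mmap(io, Matrix{Float64}, (n, n))  # file-backed n×n matrix
# ... fill `js`, e.g. with pairwise! as in the original script ...
Mmap.sync!(js)   # flush dirty pages to disk
close(io)
```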

@StefanKarpinski
Contributor

For this sort of computation, you generally need to figure out a clever way to avoid doing all of the pairwise distance computations. It's hard to imagine that you need every pairwise distance, so there may be some way to avoid most of the computation. One probabilistic approach is to use locality-sensitive hashing to decide which pairs to look at (only the ones that land in the same hash bucket). There are various other exact and approximate approaches (see this Wikipedia page for starters).
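A rough sketch of the random-projection flavor of locality-sensitive hashing (my illustration, not from the thread; the bucket width `w` and number of projections are tuning knobs, and strictly speaking this scheme targets Euclidean distance, so for JS divergence it is only a heuristic for pruning candidate pairs):

```julia
using Random

# Hash each observation (column) to a bucket via quantized random
# projections; only pairs sharing a bucket are then compared, avoiding
# most of the O(n^2) distance computations.
function lsh_buckets(data::AbstractMatrix, nproj::Int, w::Float64;
                     rng = Random.default_rng())
    d, n = size(data)
    P = randn(rng, nproj, d)              # random projection directions
    keys = floor.(Int, (P * data) ./ w)   # quantized projections, nproj × n
    buckets = Dict{Vector{Int}, Vector{Int}}()
    for i in 1:n
        push!(get!(buckets, keys[:, i], Int[]), i)
    end
    return buckets                        # bucket key => column indices
end
```

Distances would then be computed only within each bucket (possibly over several independent hash tables to reduce missed neighbors).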
