
Memory leak? #66

Closed
mirestrepo opened this issue Mar 20, 2017 · 12 comments

Comments

@mirestrepo

mirestrepo commented Mar 20, 2017

Hi,

I'm computing pairwise distances for a large matrix (500 x 100,000), and the memory footprint keeps growing indefinitely. I am pre-allocating the output matrix, so I don't think that should be happening. In fact, the process eventually gets killed by the kernel (after using 60 GB+ of memory)... I suspect a memory leak. The code I'm running looks something like:

using Base.Threads
using Distances
using JLD


function main(data)

    nvectors = size(data,2)
    js = Matrix{Float64}(nvectors, nvectors)

    pairwise!(js, JSDivergence(), data, data)

    println("Done computing distances")

    writedlm("./mallet_composition_500_JS.txt", js)
end


@time const data = jldopen("./mallet_composition_500.jld", "r") do file
    read(file, "data")
end
data_small = data[:, 1:100]
@time main(data_small)
@time main(data)

Any insights? I haven't profiled for memory leaks in Julia before, but I'll try to see if I can help with more specifics.

Thanks!

@KristofferC
Member

Aren't you trying to allocate a 74 GB matrix (1e5 x 1e5 Float64s)?
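The arithmetic behind that number is easy to check (a quick sanity check, not part of the original thread):

```julia
# Size of a dense 100_000 x 100_000 Float64 distance matrix.
n = 100_000
bytes = n^2 * sizeof(Float64)   # 8 bytes per Float64 => 8e10 bytes
gib = bytes / 2^30              # convert to GiB
println(gib)                    # roughly 74.5 GiB
```

Switching the element type to Float32 would halve this, but the matrix would still be far too large for most machines.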

@StefanKarpinski
Contributor

StefanKarpinski commented Mar 20, 2017

The main "mystery" is why the output allocation would succeed initially but then continue to consume memory as the array is used. That happens because the kernel overcommits: it doesn't actually back pages with physical memory at allocation time, only when each page is first touched, at which point the kernel gives your process a real RAM-backed page for that memory.
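A small sketch of this behavior (my illustration, in current Julia syntax, not from the thread; `Sys.maxrss` reports the process's peak resident set size, and the exact numbers are OS-dependent):

```julia
# An `undef` allocation reserves address space; on Linux the kernel only
# backs each page with physical RAM once that page is first written.
n = 10_000
rss_before = Sys.maxrss()             # peak resident set size so far, in bytes
a = Matrix{Float64}(undef, n, n)      # ~800 MB of address space reserved
rss_after_alloc = Sys.maxrss()        # typically barely moves
fill!(a, 0.0)                         # touching every page forces real allocation
rss_after_touch = Sys.maxrss()        # now the big jump shows up
println((rss_after_alloc - rss_before, rss_after_touch - rss_before))
```

This is why the allocation "succeeds" and the process only dies later, once enough of the matrix has been written.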

@mirestrepo
Author

@KristofferC, yes, you are correct. My desktop has a lot of RAM, so I thought I could manage it, and since that initial allocation was working I was confused, but @StefanKarpinski's explanation makes sense. Do any of you have recommendations for handling a problem of this size? Splitting it up, map-reduce style?

@nalimilan
Member

To be clear, do you have 100,000 variables with 500 observations, or 500 variables with 100,000 observations? IIRC this package uses a different convention than much other software and stores variables as rows.

@mirestrepo
Author

500 variables, 100,000 observations

@nalimilan
Member

OK, so try with the transposed matrix. :-)

@mirestrepo
Author

mirestrepo commented Mar 20, 2017

I'm not sure I follow that. I need the pairwise distance between observations... and I believe this package uses column-to-column computations.

If the distance is symmetric, I could try computing only the triangular part of the matrix.
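That idea can be sketched like this (my sketch, with a generic `dist` function standing in for the JS divergence; note that Distances.jl's single-matrix `pairwise` methods may already exploit symmetry, and this halves the computation but not the memory for the output):

```julia
# For a symmetric distance, compute only the upper triangle and mirror it,
# roughly halving the number of distance evaluations.
function pairwise_symmetric(dist, data::AbstractMatrix)
    n = size(data, 2)
    D = Matrix{Float64}(undef, n, n)
    for j in 1:n, i in 1:j
        d = dist(view(data, :, i), view(data, :, j))
        D[i, j] = d
        D[j, i] = d    # symmetry: dist(i, j) == dist(j, i)
    end
    return D
end
```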

@nalimilan
Member

nalimilan commented Mar 20, 2017

Yes, but what I mean is that maybe your data contains variables as columns, while pairwise expects them as rows (#35)? That might explain why so much memory is used. I haven't been able to understand the format of data just by looking at the code.

EDIT: Sorry, I misread, so it's not a rows vs. columns issue. The full matrix really needs 100_000^2 * 8 bytes, i.e. about 75 GiB. You could use Float32 and halve that requirement. Not sure whether it's possible to use a triangular (symmetric) storage instead.

@mirestrepo
Author

Float32 is an idea, but before sacrificing precision... is it possible to work with memory-mapped matrices or something like that?

@mirestrepo
Author

Thanks, all, for the answers. I'll close this issue, since it's not a leak problem. If anyone has further recommendations, they're greatly appreciated.

@nalimilan
Member

See ?Mmap.mmap for mmapped arrays.
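A minimal sketch of a file-backed output matrix (my illustration; the filename is hypothetical). The OS pages the data to disk, so the full ~75 GiB never has to be resident at once, though access patterns that bounce around the file can be slow:

```julia
using Mmap

# Back the output matrix with a file instead of RAM.
n = 100_000
io = open("js_distances.bin", "w+")
js = Mmap.mmap(io, Matrix{Float64}, (n, n))  # file-backed n×n matrix
# ... fill `js`, e.g. with pairwise! as in the original script ...
Mmap.sync!(js)   # flush dirty pages to disk
close(io)
```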

@StefanKarpinski
Contributor

For this sort of computation, you generally need to figure out a clever way to avoid doing all of the pairwise distance computations. It's hard to imagine that you need every pairwise distance, so there may be some way to avoid most of the computation. One probabilistic approach is to use locality-sensitive hashing to decide which pairs to look at (only the ones that land in the same hash bucket). There are various other exact and approximate approaches (see this Wikipedia page for starters).
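A rough sketch of the random-projection flavor of locality-sensitive hashing (my illustration, not from the thread; the bucket width `w` and number of projections are tuning knobs, and strictly speaking this scheme targets Euclidean distance, so for JS divergence it is only a heuristic for pruning candidate pairs):

```julia
using Random

# Hash each observation (column) to a bucket via quantized random
# projections; only pairs sharing a bucket are then compared, avoiding
# most of the O(n^2) distance computations.
function lsh_buckets(data::AbstractMatrix, nproj::Int, w::Float64;
                     rng = Random.default_rng())
    d, n = size(data)
    P = randn(rng, nproj, d)              # random projection directions
    keys = floor.(Int, (P * data) ./ w)   # quantized projections, nproj × n
    buckets = Dict{Vector{Int}, Vector{Int}}()
    for i in 1:n
        push!(get!(buckets, keys[:, i], Int[]), i)
    end
    return buckets                        # bucket key => column indices
end
```

Distances would then be computed only within each bucket (possibly over several independent hash tables to reduce missed neighbors).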
