
ones(2^33) is slow #1795

Closed
ViralBShah opened this issue Dec 20, 2012 · 17 comments
Labels
performance Must go faster

Comments

@ViralBShah
Member

I am not sure what the expected time for something like this should be, but it takes almost a minute to do ones(2^33).

julia> @time a = ones(2^33)
elapsed time: 58.07947397232056 seconds
@ViralBShah
Member Author

This C program took almost the same amount of time. I wonder if one can do better.

$ time ./bigmalloc
1.000000

real    0m52.652s
user    0m9.349s
sys 0m40.359s
// gcc bigmalloc.c -o bigmalloc -std=c99 -O2
#include <stdlib.h>
#include <stdio.h>

int main()
{
  size_t n = (size_t) 8589934592;   /* 2^33 doubles = 64 GB */
  double *p = (double *) malloc(n * sizeof(double));
  if (p == NULL)
    return 1;
  for (size_t i = 0; i < n; ++i)
    p[i] = 1.0;

  printf("%f\n", p[n - 1]);

  free(p);
  return 0;
}

@pao
Member

pao commented Dec 20, 2012

That touches 64 GB of memory, doesn't it? How much RAM does the system have? Which malloc() is this (glibc, macOS, or some other malloc)? It seems like a lot needs to go right for this allocation to go smoothly.

@StefanKarpinski
Member

For zeros we could do the trick of mmapping /dev/zero, which should be much faster. Doesn't address the ones case though. I don't think there's a great way to do that unless we want to get into the business of representing constant arrays specially.

@ViralBShah
Member Author

This is on julia.mit.edu with 1TB of RAM running a recent ubuntu (12.04).

@pao
Member

pao commented Dec 20, 2012

That would give you some room to work.

@ViralBShah
Member Author

Almost all the time goes in the for loop in the C code. The malloc takes less than a second.

@StefanKarpinski
Member

The filling in of data could be done in parallel using some of the 80 cores on that machine. That would make it go faster. I would generally like to have easy ways to express simple data-parallel code better.

@timholy
Member

timholy commented Dec 20, 2012

How about using memfill? For a manual Julia version you could have a for loop that "seeds" with a few 1.0's, and then calls memcpy on two-fold-increasing chunks:

a = Array(T, 2^33)
n = 16
a[1:n] = one(T)
copy_to(a, n+1, a, 1, n); n *= 2
copy_to(a, n+1, a, 1, n); n *= 2
copy_to(a, n+1, a, 1, n); n *= 2
...

@ViralBShah
Member Author

Same thing is touched upon in #1790. Although I would prefer not having multiple models of parallelism, it certainly would be nice to be able to have a data parallel model to use multiple cores without launching 80 julias.

@ViralBShah
Member Author

@timholy Does memfill work faster? If so, we could even use it inside fill.

@timholy
Member

timholy commented Dec 20, 2012

I don't know. It's not part of the C standard library; it's an add-on, and I've never tested it. If it just loops (and doesn't use that two-fold growing trick), the version I wrote out might be faster.

@StefanKarpinski
Member

I would be rather surprised if the copying trick is faster. Once n gets bigger than your cache, that's going to cause totally spurious memory fetches that you don't need.

@ViralBShah
Member Author

I tried memset in my C code to set a constant integer instead of using the for loop, and that finishes in 18 seconds.

@timholy
Member

timholy commented Dec 20, 2012

Yes, I just looked and noticed that fill!, which is being used by ones, is using memset. So you're already doing whatever is most efficient.

I tested the copy trick. It's not bad, but it's not quite as fast as memset.

@ViralBShah
Member Author

I guess there really isn't much that can be done about this for sequential julia.

@JeffBezanson
Member

This is more than 1 GB/second, which doesn't strike me as too bad.
It might be worth considering arrays stored compressed in memory. See https://github.com/FrancescAlted/blosc

@ViralBShah
Member Author

Of course it's quite good. rand(2^33) also takes the same time. It just feels slow. Parallelism is the answer here. Closing this issue, as it is really not a julia performance problem.
