match() is slow #7066

Closed
swadey opened this issue Jun 1, 2014 · 8 comments

Labels: io (Involving the I/O subsystem: libuv, read, write, etc.) · performance (Must go faster)

Comments

@swadey (Contributor) commented Jun 1, 2014

I believe this is related to #3719, but this is affecting 75f7732.

I'm reading strings from a file, matching them against a simple regex, and pulling out the capture. The Julia code is consistently 15-45x slower than equivalent Python code. Here's the code in Julia and Python:

Julia:

using GZip

global const pattern = r"^.*\.(.*)$"

function name2tld(name::String)
  m = match(pattern, name)
  return string(m == nothing ? name : m.captures[1])
end

function test(limit)
  names = String[]
  @time gzopen("pld-index.gz") do f
    for (i, line) in enumerate(eachline(f))
      if i == limit
        break
      end
      l = split(strip(line), '\t')
      push!(names, l[1])
    end
  end

  # test
  @time tnames = [ name2tld(names[i]) for i = 1:length(names) ]
end

test(4_000_000)

Python:

import re, gzip, string, time

pattern = re.compile(r'^.*\.(.*)$')

def name2tld(s):
    m = pattern.match(s)
    if bool(m):
        return m.group(1)
    else:
        return s

def test(limit):
    names   = []
    t0      = time.time()
    with gzip.open("pld-index.gz") as f:
        for i, line in enumerate(f):
            if i == limit:
                break
            l = string.split(line.strip(), "\t")
            names.append(l[0])

    t1      = time.time()
    print "elapsed time: ", t1-t0
    tnames  = [ name2tld(x) for x in names ]
    t2      = time.time()
    print "elapsed time: ", t2-t1

# main
test(4000000)

Here's a comparison of the results:

[swade@mt9 webgraph-example]$ julia test.jl
elapsed time: 64.840658492 seconds (3503564904 bytes allocated)
elapsed time: 183.969596645 seconds (3390129376 bytes allocated)

[swade@mt9 webgraph-example]$ python test.py
elapsed time:  15.9351358414
elapsed time:  4.97129011154
@dcjones (Contributor) commented Jun 1, 2014

Is it definitely match that's slow, or could it be GZip, split, strip, or eachline?
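
A quick way to separate those (a minimal sketch in current Julia syntax, not from the thread): build the names in memory first, then time only the regex phase, so GZip, eachline, split, and strip are all excluded from the measurement.

const pat = r"^.*\.(.*)$"

tld(name) = (m = match(pat, name); m === nothing ? name : String(m.captures[1]))

# synthetic in-memory data: no file I/O below this point
names = ["host$(i).example.com" for i in 1:1_000_000]

tld(names[1])          # warm up so compilation isn't included in the timing
@time map(tld, names)  # times match() and capture extraction only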

@dcjones (Contributor) commented Jun 1, 2014

I ask because I've noticed in the past that just reading lines is a lot slower in Julia than in Python.

@swadey (Contributor, author) commented Jun 1, 2014

The timing results show that 183 seconds are spent on this:

@time tnames = [ name2tld(names[i]) for i = 1:length(names) ]

The reading portion is also slower than Python, but only by about 4x, as opposed to 15-45x.

I'm profiling this, and here's what I see (though the tree is a bit hard to read):

26881 ...te/partition1/localdata/swade/webgraph-example/test.jl; name2tld; line: 6
9768 regex.jl; match; line: 111
3742 pcre.jl; exec; line: 118 
1881 pcre.jl; info; line: 52
1859 /home/swade/projects/julia/usr/bin/../lib/libjulia.so; jl_alloc_array_1d; line: 67 
1839 array.jl; reinterpret; line: 82
965 /home/swade/projects/julia/usr/bin/../lib/julia/sys.so; jlcall_indexed_next3611; line: 28
864 /home/swade/projects/julia/usr/bin/../lib/libjulia.so; allocobj; line: 373
81 /home/swade/projects/julia/usr/bin/../lib/libjulia.so; jl_gc_collect; line: 2041
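
For reference, a tree like this comes from Julia's built-in sampling profiler; a minimal sketch in current syntax (using the Profile stdlib, with name2tld and names as defined by the script above):

using Profile

@profile map(name2tld, names)  # collect samples while the hot loop runs
Profile.print()                # print the call tree with sample counts per line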

@dcjones (Contributor) commented Jun 1, 2014

That's interesting. There's at least one obvious optimization: a lot of time is spent calling PCRE.info, which seems like it could be called once per regex rather than on every match.

Is there an easy way for me to reproduce your benchmark? Is the data public, or is there something similar I can use?
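
The hoisting dcjones describes looks roughly like this (an illustrative sketch in current Julia syntax, not the actual regex.jl change; count_captures is a hypothetical, deliberately crude stand-in for the PCRE.info query):

# metadata that depends only on the compiled pattern is computed once
# at construction instead of on every match
struct PreparedPattern
    re::Regex
    ncaptures::Int                 # cached once per pattern
end

# hypothetical stand-in: a real implementation would ask PCRE once here
count_captures(re::Regex) = count(==('('), re.pattern)

PreparedPattern(re::Regex) = PreparedPattern(re, count_captures(re))

# the per-match path no longer re-queries pattern metadata
fast_match(p::PreparedPattern, s::AbstractString) = match(p.re, s)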

@swadey (Contributor, author) commented Jun 1, 2014

Yes, the data is available here: http://data.dws.informatik.uni-mannheim.de/hyperlinkgraph/pld-index.gz.

Warning: the file is 297M.

JeffBezanson added a commit that referenced this issue Jun 1, 2014
ref #7066
ref JuliaData/DataFrames.jl#609 (comment)

This doesn't get us all the way there but every bit helps. Note that IOStream
is still a bit faster than AsyncStream+IOBuffer. The former is used for files,
the latter for pipes.
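
The two paths the commit message contrasts, in current syntax (a sketch; it assumes an uncompressed pld-index file and a gzcat binary, neither of which is part of the commit):

# file path: open() on a filename yields a buffered IOStream
open("pld-index") do f
    for line in eachline(f)
        # ... process line ...
    end
end

# pipe path: the same data read from a subprocess goes through the
# libuv pipe machinery (AsyncStream in 2014 Julia)
open(`gzcat pld-index.gz`) do p
    for line in eachline(p)
        # ... process line ...
    end
end
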
@JeffBezanson (Member) commented

I am (happy?) to report that this spends all its time in GC:

elapsed time: 46.684817909 seconds (3503590704 bytes allocated)
elapsed time: 6.452577233 seconds (1374146344 bytes allocated)
Arena 0:
system bytes     = 1482522624
in use bytes     = 1482242224
Total (incl. mmap):
system bytes     = 1560702976
in use bytes     = 1560422576
max mmap regions =         10
max mmap bytes   =   78180352
exec time   55.08895 sec
gc time     45.87060 sec (83.3%)
malloc size 1413 MB
total freed 3348948687 b
free rate   69.6 MB/sec

This is after @dcjones' awesome optimization.

By increasing the default collection interval 10x, I can get:
elapsed time: 8.246907294 seconds (3503590704 bytes allocated)
elapsed time: 4.994826031 seconds (1374146344 bytes allocated)
Arena 0:
system bytes     = 1411899392
in use bytes     = 1411558768
Total (incl. mmap):
system bytes     = 1490079744
in use bytes     = 1489739120
max mmap regions =         10
max mmap bytes   =   78180352
exec time   14.53636 sec
gc time     5.69104 sec (39.2%)
malloc size 1346 MB
total freed 3422815175 b
free rate   573.6 MB/sec

The current default is 25MB, which is pretty small. I will bump it a bit for 64-bit machines, and the upcoming GC improvements should take care of the rest.
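
For anyone who wants to reproduce this kind of breakdown without a custom build, a sketch in current Julia syntax (it assumes names is the vector built by test() above):

# @timed reports GC time alongside wall time, so the GC share quoted
# above can be computed for any expression
stats = @timed map(name2tld, names)
println("exec time  ", round(stats.time, digits=2), " sec")
println("gc time    ", round(stats.gctime, digits=2), " sec (",
        round(100 * stats.gctime / stats.time, digits=1), "%)")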

@swadey (Contributor, author) commented Jun 3, 2014

@dcjones @JeffBezanson wow that's great. I just ran my large benchmark and I'm seeing a 30-40x speedup.

@JeffBezanson (Member) commented

I think we've tackled the specific issues here, and the rest will be general GC work.
