match() is slow #7066

Closed
swadey opened this issue Jun 1, 2014 · 8 comments

Labels: io (Involving the I/O subsystem: libuv, read, write, etc.) · performance (Must go faster)

Comments

@swadey (Contributor) commented Jun 1, 2014

I believe this is related to #3719, but this is affecting 75f7732.

I'm reading strings from a file, matching them against a simple regex, and pulling out the capture. The Julia code is consistently 15-45x slower than equivalent Python code. Here's the code in Julia and Python:

Julia:

using GZip

global const pattern = r"^.*\.(.*)$"

function name2tld(name::String)
  m = match(pattern, name)
  return string(m == nothing ? name : m.captures[1])
end

function test(limit)
  names = String[]
  @time gzopen("pld-index.gz") do f
    for (i, line) in enumerate(eachline(f))
      if i == limit
        break
      end
      l = split(strip(line), '\t')
      push!(names, l[1])
    end
  end

  # test
  @time tnames = [ name2tld(names[i]) for i = 1:length(names) ]
end

test(4_000_000)

Python:

import re, gzip, string, time

pattern = re.compile(r'^.*\.(.*)$')

def name2tld(s):
    m = pattern.match(s)
    if bool(m):
        return m.group(1)
    else:
        return s

def test(limit):
    names   = []
    t0      = time.time()
    with gzip.open("pld-index.gz") as f:
        for i, line in enumerate(f):
            if i == limit:
                break
            l = string.split(line.strip(), "\t")
            names.append(l[0])

    t1      = time.time()
    print "elapsed time: ", t1-t0
    tnames  = [ name2tld(x) for x in names ]
    t2      = time.time()
    print "elapsed time: ", t2-t1

# main
test(4000000)

Here's a comparison of the results:

[swade@mt9 webgraph-example]$ julia test.jl
elapsed time: 64.840658492 seconds (3503564904 bytes allocated)
elapsed time: 183.969596645 seconds (3390129376 bytes allocated)

[swade@mt9 webgraph-example]$ python test.py
elapsed time:  15.9351358414
elapsed time:  4.97129011154
@dcjones (Contributor) commented Jun 1, 2014

Is it definitely match that's slow, or could it be GZip, split, strip, or eachline?
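
A quick way to separate those (a minimal sketch in current Julia syntax, not from the thread): build the names in memory first, then time only the regex phase, so GZip, eachline, split, and strip are all excluded from the measurement.

const pat = r"^.*\.(.*)$"

tld(name) = (m = match(pat, name); m === nothing ? name : String(m.captures[1]))

# synthetic in-memory data: no file I/O below this point
names = ["host$(i).example.com" for i in 1:1_000_000]

tld(names[1])          # warm up so compilation isn't included in the timing
@time map(tld, names)  # times match() and capture extraction only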

@dcjones (Contributor) commented Jun 1, 2014

I ask because I've noticed in the past that just reading lines is a lot slower in Julia than in Python.

@swadey (Contributor, author) commented Jun 1, 2014

The timing results show that 183 seconds are spent on this:

@time tnames = [ name2tld(names[i]) for i = 1:length(names) ]

The reading portion is also slower than Python, but only by about 4x, as opposed to 15-45x.

I'm profiling this, and here's what I see (though the tree is a bit hard to read):

26881 ...te/partition1/localdata/swade/webgraph-example/test.jl; name2tld; line: 6
9768 regex.jl; match; line: 111
3742 pcre.jl; exec; line: 118 
1881 pcre.jl; info; line: 52
1859 /home/swade/projects/julia/usr/bin/../lib/libjulia.so; jl_alloc_array_1d; line: 67 
1839 array.jl; reinterpret; line: 82
965 /home/swade/projects/julia/usr/bin/../lib/julia/sys.so; jlcall_indexed_next3611; line: 28
864 /home/swade/projects/julia/usr/bin/../lib/libjulia.so; allocobj; line: 373
81 /home/swade/projects/julia/usr/bin/../lib/libjulia.so; jl_gc_collect; line: 2041
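
For reference, a tree like this comes from Julia's built-in sampling profiler; a minimal sketch in current syntax (using the Profile stdlib, with name2tld and names as defined by the script above):

using Profile

@profile map(name2tld, names)  # collect samples while the hot loop runs
Profile.print()                # print the call tree with sample counts per line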

@dcjones (Contributor) commented Jun 1, 2014

That's interesting. There's at least one obvious optimization: a lot of time is spent calling PCRE.info, which seems like it could be called once per regex rather than on every match.

Is there an easy way for me to reproduce your benchmark? Is the data public, or is there something similar I can use?
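
The hoisting dcjones describes looks roughly like this (an illustrative sketch in current Julia syntax, not the actual regex.jl change; count_captures is a hypothetical, deliberately crude stand-in for the PCRE.info query):

# metadata that depends only on the compiled pattern is computed once
# at construction instead of on every match
struct PreparedPattern
    re::Regex
    ncaptures::Int                 # cached once per pattern
end

# hypothetical stand-in: a real implementation would ask PCRE once here
count_captures(re::Regex) = count(==('('), re.pattern)

PreparedPattern(re::Regex) = PreparedPattern(re, count_captures(re))

# the per-match path no longer re-queries pattern metadata
fast_match(p::PreparedPattern, s::AbstractString) = match(p.re, s)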

@swadey (Contributor, author) commented Jun 1, 2014

Yes, the data is available here: http://data.dws.informatik.uni-mannheim.de/hyperlinkgraph/pld-index.gz.

Warning: the file is 297M.

JeffBezanson added a commit that referenced this issue Jun 1, 2014
ref #7066
ref JuliaData/DataFrames.jl#609 (comment)

This doesn't get us all the way there but every bit helps. Note that IOStream
is still a bit faster than AsyncStream+IOBuffer. The former is used for files,
the latter for pipes.
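
The two paths the commit message contrasts, in current syntax (a sketch; it assumes an uncompressed pld-index file and a gzcat binary, neither of which is part of the commit):

# file path: open() on a filename yields a buffered IOStream
open("pld-index") do f
    for line in eachline(f)
        # ... process line ...
    end
end

# pipe path: the same data read from a subprocess goes through the
# libuv pipe machinery (AsyncStream in 2014 Julia)
open(`gzcat pld-index.gz`) do p
    for line in eachline(p)
        # ... process line ...
    end
end
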
@JeffBezanson (Member) commented

I am (happy?) to report that this spends all its time in GC:

elapsed time: 46.684817909 seconds (3503590704 bytes allocated)
elapsed time: 6.452577233 seconds (1374146344 bytes allocated)
Arena 0:
system bytes     = 1482522624
in use bytes     = 1482242224
Total (incl. mmap):
system bytes     = 1560702976
in use bytes     = 1560422576
max mmap regions =         10
max mmap bytes   =   78180352
exec time   55.08895 sec
gc time     45.87060 sec (83.3%)
malloc size 1413 MB
total freed 3348948687 b
free rate   69.6 MB/sec

This is after @dcjones' awesome optimization.

By increasing the default collection interval 10x, I can get:
elapsed time: 8.246907294 seconds (3503590704 bytes allocated)
elapsed time: 4.994826031 seconds (1374146344 bytes allocated)
Arena 0:
system bytes     = 1411899392
in use bytes     = 1411558768
Total (incl. mmap):
system bytes     = 1490079744
in use bytes     = 1489739120
max mmap regions =         10
max mmap bytes   =   78180352
exec time   14.53636 sec
gc time     5.69104 sec (39.2%)
malloc size 1346 MB
total freed 3422815175 b
free rate   573.6 MB/sec

The current default is 25MB, which is pretty small. I will bump it a bit for 64-bit machines, and the upcoming GC improvements should take care of the rest.
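
For anyone who wants to reproduce this kind of breakdown without a custom build, a sketch in current Julia syntax (it assumes names is the vector built by test() above):

# @timed reports GC time alongside wall time, so the GC share quoted
# above can be computed for any expression
stats = @timed map(name2tld, names)
println("exec time  ", round(stats.time, digits=2), " sec")
println("gc time    ", round(stats.gctime, digits=2), " sec (",
        round(100 * stats.gctime / stats.time, digits=1), "%)")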

@swadey (Contributor, author) commented Jun 3, 2014

@dcjones @JeffBezanson wow that's great. I just ran my large benchmark and I'm seeing a 30-40x speedup.

@JeffBezanson (Member) commented

I think we've tackled the specific issues here, and the rest will be general GC work.
