
String comparison/loop performance slow #3733

Closed
randyzwitch opened this issue Jul 16, 2013 · 23 comments
Labels
performance Must go faster

Comments

@randyzwitch
Contributor

Continuation of thread on the julia-users list:
https://groups.google.com/d/msg/julia-users/hVxBJkSiPtE/3YgkZQFRyD4J

Looks similar to @johnmyleswhite issue #3719

from collections import Counter

#Import english dictionary
english_dict = [word.lower().rstrip("\n") for word in open("/usr/share/dict/words")]

#Import urls
urls = [word.lower().rstrip("\n") for word in open("/Users/randyzwitch/Downloads/urls_10000.wsv")]

counts = Counter()
for word in english_dict:
    for url in urls:
        if word in url:
            counts[word] += 1

Timing just the loop in Python takes 437s (~7 minutes)

#Import english dictionary
english_dict = readdlm("/usr/share/dict/words", ' ', String)

#Lowercase every word in english_dict
map!((x) -> lowercase(x), english_dict)

#Read in URLs file
urls = readdlm("/Users/randyzwitch/Downloads/urls_10000.wsv", '\t', String)

#Lowercase each url
map!((x) -> lowercase(x), urls)

#Look for word inside url string, add to dictionary and +1
counts=Dict{String, Int64}()
for word in english_dict
    for line in urls
        if search(line, word) != (0:-1)
            counts[word]=get(counts,word,0)+1
        end
    end
end

Timing just the loop in Julia took 1860s (~31 minutes), about 4x-5x longer than Python. Wrapping the Julia loop in a function provides a minor speedup (~10%), but not enough to close the gap with Python.

@johnmyleswhite
Member

What happens if you replace search with ismatch?

I also wonder whether our hashing strategy for strings is more demanding than Python's. I have a vague memory of this being true.

@ViralBShah
Member

It is certainly not 4-5x slower.

@randyzwitch
Contributor Author

@ViralBShah I'm consistently getting ~400s for Python and 1600-1800s for Julia.

IPython 0.13.2 -- An enhanced Interactive Python.

In [1]: from collections import Counter

In [2]: english_dict = [word.lower().rstrip("\n") for word in open("/usr/share/dict/words")]

In [3]: #Import urls

In [4]: urls = [word.lower().rstrip("\n") for word in open("/Users/randyzwitch/Downloads/urls_10000.wsv")]

In [7]: %time %paste
counts = Counter()
for word in english_dict:
    for url in urls:
        if word in url:
            counts[word] += 1

## -- End pasted text --

CPU times: user 416.93 s, sys: 0.19 s, total: 417.12 s
Wall time: 417.14 s


In [8]: len(counts)
Out[8]: 2232

In [9]: counts["a"]
Out[9]: 19898

This code incorporates Tim's suggestion of wrapping the loop in a function and making String-typed arrays for both the word list and the URL list.

julia> english_dict = readdlm("/usr/share/dict/words", ' ', String);

julia> map!((x) -> lowercase(x), english_dict);

julia> urls = readdlm("/Users/randyzwitch/Downloads/urls_10000.wsv", '\t', String);

julia> size(urls)
(10000,1)

julia> map!((x) -> lowercase(x), urls);

julia> function word_count()
       counts=Dict{String, Int64}()
       for word in english_dict
           for line in urls
               if search(line, word) != (0:-1)
                   counts[word]=get(counts,word,0)+1
               end
           end  
       end
       return counts
       end
# methods for generic function word_count
word_count() at none:2

julia> @time test = word_count()
elapsed time: 1659.545788768 seconds

julia> length(test)
2232

julia> test["a"]
19898

julia> @time test = word_count();
elapsed time: 1665.029234354 seconds

julia> length(test)
2232

julia> test["a"]
19898

julia> @time test = word_count();
elapsed time: 1689.584302743 seconds

@johnmyleswhite I'll try swapping out search with ismatch a bit later. I'm hoping to get Profile installed as well.

@JeffBezanson
Member

Not related to performance, but map!(lowercase, urls).

@randyzwitch
Contributor Author

Thanks @JeffBezanson.

@johnmyleswhite, how can I interpolate the loop variable into a regex to use ismatch? Do I have to use @sprintf to build it on the fly?

@johnmyleswhite
Member

Reading through this again, I suspect ismatch will be slower than search.

@dcjones
Contributor

dcjones commented Jul 16, 2013

I ran this code through the profiler (which I'm loving, by the way).

I think the crux of this is just that a ton of effort has been put into optimizing Python's str.find function. It's described in the source (Objects/stringlib/fastsearch.h) as "a mix between boyer-moore and horspool, with a few more bells and whistles on the top". There's a little about that here.

Reimplementing the same approach in Julia might be a fun project...
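For readers unfamiliar with the skip-table idea behind that family of algorithms, here is a simplified Python sketch of a Horspool-style substring search. This is an illustration of the general technique only, not CPython's actual fastsearch (which adds a Bloom-filter-like bitmask and other refinements):

```python
def horspool_find(haystack: str, needle: str) -> int:
    """Return the index of the first occurrence of needle in haystack, or -1.

    Simplified Boyer-Moore-Horspool: on a mismatch, shift the search window
    based on the character under its last position, instead of advancing by 1.
    """
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    if m > n:
        return -1
    # Skip table: how far we may safely shift when the character under the
    # window's last position is c. Characters not in the needle allow a
    # full shift of m.
    skip = {c: m - i - 1 for i, c in enumerate(needle[:-1])}
    i = 0
    while i <= n - m:
        if haystack[i:i + m] == needle:
            return i
        i += skip.get(haystack[i + m - 1], m)
    return -1
```

For long needles and haystacks where mismatching characters are common, the window frequently jumps by the full needle length, which is where the speedup over naive scanning comes from.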

@StefanKarpinski
Member

Using search and taking substrings using the index range returned by search should be fast and correct. Try that.
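For comparison, the Python counterpart of that "search, then take the substring by index" pattern uses str.find and slicing (strings and indices here are made up for illustration):

```python
# Hypothetical example data, standing in for one URL and one dictionary word.
line = "http://example.com/apple-pie"
word = "apple"

i = line.find(word)               # -1 if not found, else the start index
if i != -1:
    match = line[i:i + len(word)]  # substring recovered from the index range
```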

@JeffBezanson
Member

search is just being used as a predicate here.

@timholy
Member

timholy commented Jul 16, 2013

profiler (which I'm loving, by the way).

You're also using it rather well, given the 30x performance improvement you obtained in #3719.

@dcjones
Contributor

dcjones commented Jul 18, 2013

Some more information:

I implemented python's search function, which is pretty clever and not very complicated. Here are timings using the first 10000 words in the dictionary.

  • Baseline: 49.92
  • Nop: 20.06
  • Fastsearch: 30.92

"Nop" is a search function that always returns no match, so no string matching is performed and nothing is inserted in the dictionary.

Python is: 13.59

So it's still more than twice as slow, but judging from the nop time, the bottleneck is no longer actual string matching. In fact it's still 13.77 seconds with no function call at all, just empty nested loops.
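That "empty nested loop" baseline is easy to reproduce as a sanity check. A rough Python version (with placeholder lists standing in for the 10,000-word and 10,000-URL data) looks like this:

```python
import time

# Placeholder data; the real benchmark used ~10,000 words and ~10,000 URLs.
words = ["word%d" % i for i in range(1000)]
urls = ["http://example.com/%d" % i for i in range(1000)]

def nop_loop(words, urls):
    """Empty nested loop: measures pure iteration overhead, no matching."""
    for word in words:
        for url in urls:
            pass

start = time.perf_counter()
nop_loop(words, urls)
elapsed = time.perf_counter() - start
print("empty nested loop over %d pairs: %.4fs" % (len(words) * len(urls), elapsed))
```

Whatever this floor is on a given machine, no amount of string-search optimization can push the full benchmark below it.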

@ViralBShah
Member

What does profiling say?

@JeffBezanson
Member

The search() == (0:-1) idiom is relatively expensive when a boolean is all that's needed. The compiler could do better, but it's still doing unnecessary operations --- constructing a range is non-trivial, and you have to compare both parts (start and len).

@dcjones
Contributor

dcjones commented Jul 18, 2013

I had actually already taken out the search() == (0:-1) test since that did pop up in the profiling data.

Here's something that takes 16 seconds and is a pretty close to the test I was running.

english_dict = readdlm("words10000", ' ', String)
map!(lowercase, english_dict)

urls = readdlm("urls_10000.wsv", '\t', String)
map!(lowercase, urls)

function nopsearch(t::ASCIIString, p::ASCIIString)
    0
end

function count()
    counts=Dict{String, Int64}()
    for word in english_dict
        for line in urls
            if nopsearch(line, word) != 0
            end
        end
    end
end

@printf("Nop: %0.2f\n", @elapsed count())

Here's the profiler output for this, which I can't really make much sense of: https://gist.github.com/dcjones/10404bf2b6bc35a600d7

@Keno
Member

Keno commented Jul 18, 2013

I think @vtjnash said there was a 3x speedup to be had by implementing callsite-splitting of Union types in the compiler.

@JeffBezanson
Member

Well, yes, the abstract String type doesn't help. But there are no union types here. Changing the type to ASCIIString helps though. We should probably do what we did for Array, and rename String to AbstractString, and make String an alias for UTF8String.

Also this is not how I would have implemented the "wrap it in a function" suggestion. I would do function count(english_dict, urls), but there's a limit to how many benchmark rewrites one can request.
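A milder version of the same effect exists in Python, which is one reason benchmark loops should take their data as arguments there too: names passed as arguments are fast locals, while module-level names are re-looked-up in the globals dict on every access. (In Julia the gap is far larger, because the compiler can infer concrete types for function arguments but not for non-constant globals.) A small, self-contained illustration:

```python
import timeit

data = list(range(100_000))

def sum_globals():
    total = 0
    for x in data:        # 'data' is fetched from the globals dict each loop
        total += x
    return total

def sum_args(data):
    total = 0
    for x in data:        # 'data' is a fast local variable here
        total += x
    return total

t_global = timeit.timeit(sum_globals, number=20)
t_local = timeit.timeit(lambda: sum_args(data), number=20)
print("globals: %.3fs  args: %.3fs" % (t_global, t_local))
```

Both functions compute the same sum; only the name-lookup cost differs.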

@dcjones
Contributor

dcjones commented Jul 18, 2013

Also this is not how I would have implemented the "wrap it in a function" suggestion. I would do function count(english_dict, urls), but there's a limit to how many benchmark rewrites one can request.

Wow, I didn't think of that. It does make a huge difference.

Baseline: 34.53
Nop: 6.60
Fastsearch: 16.72

@JeffBezanson
Member

I wonder if we should be calling strstr for ByteStrings. It'd be interesting to compare it to your code. Could you put up a gist?
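As a point of reference for what "just call strstr" looks like, libc's strstr can be exercised from Python via ctypes (this assumes a Unix-like system where the C library is loadable; the helper name is made up for illustration):

```python
import ctypes
import ctypes.util

# Locate and load the C library; falls back to the common Linux soname.
libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6")
libc.strstr.restype = ctypes.c_char_p
libc.strstr.argtypes = [ctypes.c_char_p, ctypes.c_char_p]

def c_contains(haystack: bytes, needle: bytes) -> bool:
    """True if needle occurs in haystack, delegating to libc's strstr."""
    return libc.strstr(haystack, needle) is not None
```

Note that strstr only works on NUL-terminated byte strings, so it cannot serve as a general search for strings containing embedded NULs, which is one reason a language runtime may prefer its own length-based search.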

@dcjones
Contributor

dcjones commented Jul 18, 2013

Ok, here's what I was working with: https://gist.github.com/dcjones/99ec36932d7eea4ec7c1

I added strstr to that and it clocks in at 24 seconds, though I'm sure that varies by implementation.

@timholy
Member

timholy commented Jul 18, 2013

Here's the profiler output for this, which I can't really make much sense of

That profiler output has compilation in it (compilation is such a heavily-recursive process that its output dominates profiles). Run the code twice and look at the profile the second time. You can use sprofile_clear() if you forget and run it for a compile.

@JeffBezanson
Member

That is cool. We should use fastsearch.

@dcjones
Contributor

dcjones commented Jul 19, 2013

I'll do some cleanup and testing and send a pull request.

@randyzwitch
Contributor Author

Didn't realize this was still open
