
String comparison/loop performance slow #3733

Closed
randyzwitch opened this issue Jul 16, 2013 · 23 comments
Labels
performance Must go faster

Comments

@randyzwitch
Contributor

Continuation of thread on the julia-users list:
https://groups.google.com/d/msg/julia-users/hVxBJkSiPtE/3YgkZQFRyD4J

Looks similar to @johnmyleswhite issue #3719

from collections import Counter

#Import english dictionary
english_dict = [word.lower().rstrip("\n") for word in open("/usr/share/dict/words")]

#Import urls
urls = [word.lower().rstrip("\n") for word in open("/Users/randyzwitch/Downloads/urls_10000.wsv")]

counts = Counter()
for word in english_dict:
    for url in urls:
        if word in url:
            counts[word] += 1

Timing just the loop in Python takes 437s (~7 minutes)

#Import english dictionary
english_dict = readdlm("/usr/share/dict/words", ' ', String)

#Lowercase every word in english_dict
map!((x) -> lowercase(x), english_dict)

#Read in URLs file
urls = readdlm("/Users/randyzwitch/Downloads/urls_10000.wsv", '\t', String)

#Lowercase each url
map!((x) -> lowercase(x), urls)

#Look for word inside url string, add to dictionary and +1
counts=Dict{String, Int64}()
for word in english_dict
    for line in urls
        if search(line, word) != (0:-1)
            counts[word]=get(counts,word,0)+1
        end
    end
end

Timing just the loop in Julia took 1860s (~31 minutes), about 4x-5x longer than Python. Wrapping the Julia loop in a function provides a minor speedup (~10%), but not enough to close the gap with Python.

@johnmyleswhite
Member

What happens if you replace search with ismatch?

I also wonder whether our hashing strategy for strings is more demanding than Python's. I have a vague memory of this being true.

@ViralBShah
Member

It is certainly not 4-5x slower.

@randyzwitch
Contributor Author

@ViralBShah I'm consistently getting ~400s for Python and 1600-1800s for Julia.

IPython 0.13.2 -- An enhanced Interactive Python.

In [1]: from collections import Counter

In [2]: english_dict = [word.lower().rstrip("\n") for word in open("/usr/share/dict/words")]

In [3]: #Import urls

In [4]: urls = [word.lower().rstrip("\n") for word in open("/Users/randyzwitch/Downloads/urls_10000.wsv")]

In [7]: %time %paste
counts = Counter()
for word in english_dict:
    for url in urls:
        if word in url:
            counts[word] += 1

## -- End pasted text --

CPU times: user 416.93 s, sys: 0.19 s, total: 417.12 s
Wall time: 417.14 s


In [8]: len(counts)
Out[8]: 2232

In [9]: counts["a"]
Out[9]: 19898

This code incorporates Tim's suggestion of wrapping the loop in a function and making String-typed arrays for both the word list and the URL list.

julia> english_dict = readdlm("/usr/share/dict/words", ' ', String);

julia> map!((x) -> lowercase(x), english_dict);

julia> urls = readdlm("/Users/randyzwitch/Downloads/urls_10000.wsv", '\t', String);

julia> size(urls)
(10000,1)

julia> map!((x) -> lowercase(x), urls);

julia> function word_count()
       counts=Dict{String, Int64}()
       for word in english_dict
           for line in urls
               if search(line, word) != (0:-1)
                   counts[word]=get(counts,word,0)+1
               end
           end  
       end
       return counts
       end
# methods for generic function word_count
word_count() at none:2

julia> @time test = word_count()
elapsed time: 1659.545788768 seconds

julia> length(test)
2232

julia> test["a"]
19898

julia> @time test = word_count();
elapsed time: 1665.029234354 seconds

julia> length(test)
2232

julia> test["a"]
19898

julia> @time test = word_count();
elapsed time: 1689.584302743 seconds

@johnmyleswhite I'll try swapping out search with ismatch a bit later. I'm hoping to get Profile installed as well.

@JeffBezanson
Member

Not related to performance, but map!(lowercase, urls).

@randyzwitch
Contributor Author

Thanks @JeffBezanson.

@johnmyleswhite, how can I interpolate the loop variable into a regex to use ismatch? Do I have to use @sprintf to build it on the fly?

@johnmyleswhite
Member

Reading through this again, I suspect ismatch will be slower than search.

@dcjones
Contributor

dcjones commented Jul 16, 2013

I ran this code through the profiler (which I'm loving, by the way).

I think the crux of this is just that a ton of effort has been put into optimizing Python's str.find function. It's described in the source (Objects/stringlib/fastsearch.h) as "a mix between boyer-moore and horspool, with a few more bells and whistles on the top". There's a little about that here.

Reimplementing the same approach in Julia might be a fun project...
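For readers unfamiliar with the skip-table idea behind that family of algorithms, here is a simplified Python sketch of a Horspool-style substring search. This is an illustration of the general technique only, not CPython's actual fastsearch (which adds a Bloom-filter-like bitmask and other refinements):

```python
def horspool_find(haystack: str, needle: str) -> int:
    """Return the index of the first occurrence of needle in haystack, or -1.

    Simplified Boyer-Moore-Horspool: on a mismatch, shift the search window
    based on the character under its last position, instead of advancing by 1.
    """
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    if m > n:
        return -1
    # Skip table: how far we may safely shift when the character under the
    # window's last position is c. Characters not in the needle allow a
    # full shift of m.
    skip = {c: m - i - 1 for i, c in enumerate(needle[:-1])}
    i = 0
    while i <= n - m:
        if haystack[i:i + m] == needle:
            return i
        i += skip.get(haystack[i + m - 1], m)
    return -1
```

For long needles and haystacks where mismatching characters are common, the window frequently jumps by the full needle length, which is where the speedup over naive scanning comes from.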

@StefanKarpinski
Member

Using search and taking substrings using the index range returned by search should be fast and correct. Try that.
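For comparison, the Python counterpart of that "search, then take the substring by index" pattern uses str.find and slicing (strings and indices here are made up for illustration):

```python
# Hypothetical example data, standing in for one URL and one dictionary word.
line = "http://example.com/apple-pie"
word = "apple"

i = line.find(word)               # -1 if not found, else the start index
if i != -1:
    match = line[i:i + len(word)]  # substring recovered from the index range
```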

@JeffBezanson
Member

search is just being used as a predicate here.

@timholy
Member

timholy commented Jul 16, 2013

profiler (which I'm loving, by the way).

You're also using it rather well, given the 30x performance improvement you obtained in #3719.

@dcjones
Contributor

dcjones commented Jul 18, 2013

Some more information:

I implemented python's search function, which is pretty clever and not very complicated. Here are timings using the first 10000 words in the dictionary.

  • Baseline: 49.92
  • Nop: 20.06
  • Fastsearch: 30.92

"Nop" is a search function that always returns no match, so no string matching is performed and nothing is inserted in the dictionary.

Python is: 13.59

So it's still more than twice as slow, but judging from the nop time, the bottleneck is no longer actual string matching. In fact it's still 13.77 seconds with no function call at all, just empty nested loops.
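That "empty nested loop" baseline is easy to reproduce as a sanity check. A rough Python version (with placeholder lists standing in for the 10,000-word and 10,000-URL data) looks like this:

```python
import time

# Placeholder data; the real benchmark used ~10,000 words and ~10,000 URLs.
words = ["word%d" % i for i in range(1000)]
urls = ["http://example.com/%d" % i for i in range(1000)]

def nop_loop(words, urls):
    """Empty nested loop: measures pure iteration overhead, no matching."""
    for word in words:
        for url in urls:
            pass

start = time.perf_counter()
nop_loop(words, urls)
elapsed = time.perf_counter() - start
print("empty nested loop over %d pairs: %.4fs" % (len(words) * len(urls), elapsed))
```

Whatever this floor is on a given machine, no amount of string-search optimization can push the full benchmark below it.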

@ViralBShah
Member

What does profiling say?

@JeffBezanson
Member

The search() == (0:-1) idiom is relatively expensive when a boolean is all that's needed. The compiler could do better, but it's still doing unnecessary operations --- constructing a range is non-trivial, and you have to compare both parts (start and len).

@dcjones
Contributor

dcjones commented Jul 18, 2013

I had actually already taken out the search() == (0:-1) test since that did pop up in the profiling data.

Here's something that takes 16 seconds and is a pretty close to the test I was running.

english_dict = readdlm("words10000", ' ', String)
map!(lowercase, english_dict)

urls = readdlm("urls_10000.wsv", '\t', String)
map!(lowercase, urls)

function nopsearch(t::ASCIIString, p::ASCIIString)
    0
end

function count()
    counts=Dict{String, Int64}()
    for word in english_dict
        for line in urls
            if nopsearch(line, word) != 0
            end
        end
    end
end

@printf("Nop: %0.2f\n", @elapsed count())

Here's the profiler output for this, which I can't really make much sense of: https://gist.github.com/dcjones/10404bf2b6bc35a600d7

@Keno
Member

Keno commented Jul 18, 2013

I think @vtjnash said there was a 3x speedup to be had by implementing callsite-splitting of Union types in the compiler.

@JeffBezanson
Member

Well, yes, the abstract String type doesn't help. But there are no union types here. Changing the type to ASCIIString helps though. We should probably do what we did for Array, and rename String to AbstractString, and make String an alias for UTF8String.

Also this is not how I would have implemented the "wrap it in a function" suggestion. I would do function count(english_dict, urls), but there's a limit to how many benchmark rewrites one can request.
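A milder version of the same effect exists in Python, which is one reason benchmark loops should take their data as arguments there too: names passed as arguments are fast locals, while module-level names are re-looked-up in the globals dict on every access. (In Julia the gap is far larger, because the compiler can infer concrete types for function arguments but not for non-constant globals.) A small, self-contained illustration:

```python
import timeit

data = list(range(100_000))

def sum_globals():
    total = 0
    for x in data:        # 'data' is fetched from the globals dict each loop
        total += x
    return total

def sum_args(data):
    total = 0
    for x in data:        # 'data' is a fast local variable here
        total += x
    return total

t_global = timeit.timeit(sum_globals, number=20)
t_local = timeit.timeit(lambda: sum_args(data), number=20)
print("globals: %.3fs  args: %.3fs" % (t_global, t_local))
```

Both functions compute the same sum; only the name-lookup cost differs.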

@dcjones
Contributor

dcjones commented Jul 18, 2013

Also this is not how I would have implemented the "wrap it in a function" suggestion. I would do function count(english_dict, urls), but there's a limit to how many benchmark rewrites one can request.

Wow, I didn't think of that. It does make a huge difference.

Baseline: 34.53
Nop: 6.60
Fastsearch: 16.72

@JeffBezanson
Member

I wonder if we should be calling strstr for ByteStrings. It'd be interesting to compare it to your code. Could you put up a gist?
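As a point of reference for what "just call strstr" looks like, libc's strstr can be exercised from Python via ctypes (this assumes a Unix-like system where the C library is loadable; the helper name is made up for illustration):

```python
import ctypes
import ctypes.util

# Locate and load the C library; falls back to the common Linux soname.
libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6")
libc.strstr.restype = ctypes.c_char_p
libc.strstr.argtypes = [ctypes.c_char_p, ctypes.c_char_p]

def c_contains(haystack: bytes, needle: bytes) -> bool:
    """True if needle occurs in haystack, delegating to libc's strstr."""
    return libc.strstr(haystack, needle) is not None
```

Note that strstr only works on NUL-terminated byte strings, so it cannot serve as a general search for strings containing embedded NULs, which is one reason a language runtime may prefer its own length-based search.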

@dcjones
Contributor

dcjones commented Jul 18, 2013

Ok, here's what I was working with: https://gist.github.com/dcjones/99ec36932d7eea4ec7c1

I added strstr to that and it clocks in at 24 seconds, though I'm sure that varies by implementation.

@timholy
Member

timholy commented Jul 18, 2013

Here's the profiler output for this, which I can't really make much sense of

That profiler output has compilation in it (compilation is such a heavily-recursive process that its output dominates profiles). Run the code twice and look at the profile the second time. You can use sprofile_clear() if you forget and run it for a compile.

@JeffBezanson
Member

That is cool. We should use fastsearch.

@dcjones
Contributor

dcjones commented Jul 19, 2013

I'll do some cleanup and testing and send a pull request.

@randyzwitch
Contributor Author

Didn't realize this was still open
