-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
String comparison/loop performance slow #3733
Comments
What happens if you replace I also wonder whether our hashing strategy for strings is more demanding than Python's. I have a vague memory of this being true. |
It is certainly not 4-5x slower. |
@ViralBShah I'm consistently getting 400s for Python and 1600-1800s for Julia IPython 0.13.2 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: from collections import Counter
In [2]: english_dict = [word.lower().rstrip("\n") for word in open("/usr/share/dict/words")]
In [3]: #Import urls
In [4]: urls = [word.lower().rstrip("\n") for word in open("/Users/randyzwitch/Downloads/urls_10000.wsv")]
In [7]: %time %paste
counts = Counter()
for word in english_dict:
for url in urls:
if word in url:
counts[word] += 1
## -- End pasted text --
CPU times: user 416.93 s, sys: 0.19 s, total: 417.12 s
Wall time: 417.14 s
In [8]:
In [8]: len(counts)
Out[8]: 2232
In [9]: counts["a"]
Out[9]: 19898 This code is incorporates Tim's suggestion of wrapping the loop in a function and making string type arrays for both the word list and url list. julia> english_dict = readdlm("/usr/share/dict/words", ' ', String);
julia> map!((x) -> lowercase(x), english_dict);
julia> urls = readdlm("/Users/randyzwitch/Downloads/urls_10000.wsv", '\t', String);
julia> size(urls)
(10000,1)
julia> map!((x) -> lowercase(x), urls);
julia> function word_count()
counts=Dict{String, Int64}()
for word in english_dict
for line in urls
if search(line, word) != (0:-1)
counts[word]=get(counts,word,0)+1
end
end
end
return counts
end
# methods for generic function word_count
word_count() at none:2
julia> @time test = word_count()
elapsed time: 1659.545788768 seconds
julia> length(test)
2232
julia> test["a"]
19898
julia> @time test = word_count();
elapsed time: 1665.029234354 seconds
julia> length(test)
2232
julia> test["a"]
19898
julia> @time test = word_count();
elapsed time: 1689.584302743 seconds
@johnmyleswhite I'll try swapping out |
Not related to performance, but |
Thanks @JeffBezanson. @johnmyleswhite, how can I loop into a regex to use |
Reading through this again, I suspect |
I ran this code through the profiler (which I'm loving, by the way). I think the crux of this is just that a ton of effort has been put into optimizing python's Reimplementing the same approach in Julia might be a fun project... |
Using search and taking substrings using the index range returned by search should be fast and correct. Try that. |
search is just being used as a predicate here. |
You're also using it rather well, given the 30x performance improvement you obtained in #3719. |
Some more information: I implemented python's search function, which is pretty clever and not very complicated. Here are timings using the first 10000 words in the dictionary.
"Nop" is a search function that always returns no match, so no string matching is performed and nothing is inserted in the dictionary. Python is: 13.59 So it's still more than twice as slow, but judging from the nop time, the bottleneck is no longer actual string matching. In fact it's still 13.77 seconds without any function call and just empty nested loops. |
What does profiling say? |
The |
I had actually already taken out the Here's something that takes 16 seconds and is a pretty close to the test I was running. english_dict = readdlm("words10000", ' ', String)
map!(lowercase, english_dict)
urls = readdlm("urls_10000.wsv", '\t')
map!(lowercase, urls)
function nopsearch(t::ASCIIString, p::ASCIIString)
0
end
function count()
counts=Dict{String, Int64}()
for word in english_dict
for line in urls
if nopsearch(line, word) != 0
end
end
end
end
@printf("Nop: %0.2f\n", @elapsed count()) Here's the profiler output for this, which I can't really make much sense of: https://gist.github.com/dcjones/10404bf2b6bc35a600d7 |
I think @vtjnash said there was a 3x speedup to be had by implementing callsite-splitting of Union types in the compiler. |
Well, yes, the abstract Also this is not how I would have implemented the "wrap it in a function" suggestion. I would do |
Wow, I didn't think of that. It does make a huge difference. Baseline: 34.53 |
I wonder if we should be calling |
Ok, here's what I was working with: https://gist.github.com/dcjones/99ec36932d7eea4ec7c1 I added strstr to that and it clocks in at 24 seconds, though I'm sure that varies by implementation. |
That profiler output has compilation in it (compilation is such a heavily-recursive process that its output dominates profiles). Run the code twice and look at the profile the second time. You can use |
That is cool. We should use |
I'll do some cleanup and testing and send a pull request. |
Didn't realize this was still open |
Stdlib: Pkg URL: https://github.com/JuliaLang/Pkg.jl.git Stdlib branch: master Julia branch: master Old commit: 85f1e5564 New commit: 3c86ba27e Julia version: 1.11.0-DEV Pkg version: 1.11.0 Bump invoked by: @IanButterworth Powered by: [BumpStdlibs.jl](https://github.com/JuliaLang/BumpStdlibs.jl) Diff: JuliaLang/Pkg.jl@85f1e55...3c86ba2 ``` $ git log --oneline 85f1e5564..3c86ba27e 3c86ba27e add `add --weak/extra Foo` to add to [weakdeps] or [extras] (#3708) 2e640f92f respect --color=no in Pkg.precompile (#3740) cbd5d08ad Automatically add compat entries when adding deps to a package (#3732) 03de920b3 rm old manual handling of `--compiled-modules` (#3738) 314d5497b Use realpaths for temp dirs during tests. Fix SparseArrays `why` breakage (#3734) a6531d4be environments.md: update Julia version (#3715) a509bc062 Revise the API of is_manifest_current. (#3701) 60b7b7995 rm incorrect kwargs in add docstring (#3733) ``` Co-authored-by: Dilum Aluthge <[email protected]>
Continuation of thread on the julia-users list:
https://groups.google.com/d/msg/julia-users/hVxBJkSiPtE/3YgkZQFRyD4J
Looks similar to @johnmyleswhite issue #3719
Timing just the loop in Python takes 437s (~7 minutes)
Timing just the loop in Julia took 1860s (~31 minutes), about 4x-5x longer than Python. Wrapping the Julia loop in a function provides minor speedup (~10%), but not enough to close Python gap
The text was updated successfully, but these errors were encountered: