match() is slow #7066
Is it definitely
I ask because I've noticed in the past that just reading lines is a lot slower in Julia than in Python.
The timing results show that 183 seconds are spent on this:
`@time tnames = [ name2tld(names[i]) for i = 1:length(names) ]`
The reading portion is also slower than Python, but it's about 4x as opposed to 15-45x. I'm profiling this and here's what I see (though it's a bit hard to read the tree):
That's interesting. There's at least one obvious optimization: a lot of is spent calling
Is there an easy way for me to reproduce your benchmark? Is the data public, or is there something similar I can use?
Yes, the data is available here: http://data.dws.informatik.uni-mannheim.de/hyperlinkgraph/pld-index.gz. Warning: the file is 297M.
ref #7066, ref JuliaData/DataFrames.jl#609 (comment)
This doesn't get us all the way there, but every bit helps. Note that IOStream is still a bit faster than AsyncStream+IOBuffer. The former is used for files, the latter for pipes.
I am (happy?) to report that this spends all its time in GC:
This is after @dcjones' awesome optimization.
The current default is 25MB, which is pretty small. I will bump it a bit for 64-bit machines, and the upcoming GC improvements should take care of the rest.
@dcjones @JeffBezanson Wow, that's great. I just ran my large benchmark and I'm seeing a 30-40x speedup.
I think we've tackled the specific issues here, and the rest will be general GC work.
I believe this is related to #3719, but this is affecting 75f7732.
I'm reading strings from a file, then matching them against a simple regex and pulling out the capture. The Julia code is consistently 15-45x slower than similar Python code. Here's the code in Julia and Python:
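For readers who want to reproduce this kind of measurement, here is a minimal Python sketch of the workload described: read names, match each one against a precompiled regex, and pull out the capture, timing the read and match phases separately. The regex, the "name in the first whitespace-separated field" line layout, and the helpers `read_names` and `timed` are assumptions made for illustration, not the code from this issue; only the name `name2tld` is taken from the thread.

```python
import re
import time

# Hypothetical pattern (an assumption, not the regex from the issue):
# capture the last dot-separated label of a host name.
TLD_RE = re.compile(r"\.([^.\s]+)$")


def name2tld(name):
    """Return the captured last label, or None when the name has no dot."""
    m = TLD_RE.search(name)
    return m.group(1) if m else None


def read_names(lines):
    """Take the first whitespace-separated field of each non-empty line.

    The line layout is assumed for illustration.
    """
    return [line.split()[0] for line in lines if line.strip()]


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Tiny in-memory stand-in for the real input file.
sample = ["example.com 0\n", "foo.co.uk 1\n", "localhost 2\n"]

names, read_seconds = timed(read_names, sample)
tlds, match_seconds = timed(lambda ns: [name2tld(n) for n in ns], names)

print(tlds)
print(f"read: {read_seconds:.6f}s  match: {match_seconds:.6f}s")
```

To run it against a real gzipped file, the `sample` list could be replaced by a handle from `gzip.open(path, "rt")`, which yields text lines. Keeping the two phases in separate timers makes it easier to tell whether the slowdown comes from reading or from matching.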
Here's a comparison of the results: