very slow on big files compared to grep #864

Closed
arpi79 opened this issue Mar 20, 2018 · 3 comments

arpi79 commented Mar 20, 2018

What version of ripgrep are you using?

$ rg --version
ripgrep 0.8.1 (rev c8e9f25)
-SIMD -AVX

What operating system are you using ripgrep on?

CYGWIN_NT-6.1 spdm1247 2.10.0(0.325/5/3) 2018-02-02 15:16 x86_64 Cygwin

Describe your question, feature request, or bug.

rg is roughly 4x slower than grep for the same search.
I have an 11GB log file and just want to search for a simple string and count the matches.

$ time rg -j 4 -a 'ScheduledTopUp.TimeStamp=1518' server.log.2018-02-14 -c
37930

real 0m46.510s
user 0m0.000s
sys 0m0.000s

$ time grep 'ScheduledTopUp.TimeStamp=1518' server.log.2018-02-14 -c
37930

real 0m13.145s
user 0m9.282s
sys 0m3.806s

I tried other settings too, but saw no improvement:

$ time rg -j 4 'ScheduledTopUp.TimeStamp=1518' server.log.2018-02-14 --mmap -c
37930

real 2m21.926s
user 0m0.156s
sys 0m0.452s

$ time rg 'ScheduledTopUp.TimeStamp=1518' server.log.2018-02-14 -c
37930

real 3m10.727s
user 0m0.234s
sys 0m0.405s

BurntSushi (Owner) commented

This isn't enough information to act on. Fixing a performance bug requires that it can be reproduced. Please find a way to reproduce the problem on an open dataset (or find a way to get me your dataset).

Also, I see that you are on Windows. Almost all of my benchmarking has been done on Linux. The very little benchmarking I've done on Windows suggests that performance can be greatly impacted by active virus scanners.

The high variability between your runs is also quite suspicious. Are you sure you aren't just measuring disk bandwidth?
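
One way to check whether the timings are dominated by disk reads rather than by the search itself is to time a plain sequential read of the same file a few times. The sketch below is only an illustration: `time_read` is a hypothetical helper name, and it assumes a `date` that supports `+%s` (GNU coreutils, BSD, and Cygwin do). Resolution is whole seconds, which should be enough for a multi-gigabyte file.

```shell
#!/bin/sh
# Sketch: time a plain sequential read of FILE, RUNS times (default 3).
# Prints one "run N: S s" line per run. No searching is done, so the
# numbers reflect only how fast the file can be read.
time_read() {
    file=$1
    runs=${2:-3}
    i=1
    while [ "$i" -le "$runs" ]; do
        start=$(date +%s)          # assumption: date supports +%s
        cat "$file" > /dev/null    # read the whole file, discard it
        end=$(date +%s)
        echo "run $i: $((end - start)) s"
        i=$((i + 1))
    done
}
```

For example, `time_read server.log.2018-02-14 3`. If the first run is much slower than the later ones, the page cache is probably shaping the numbers, and the grep and rg runs should be compared only after the same warm-up.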

arpi79 commented Mar 21, 2018

Hi,

I will try to find such a data set.
As you can see, each run above was given different flags.
When a run is repeated with the same flags, the timing is the same.
"-j 4 -a" gives the best result: ~50 sec
No flags gives the worst result: ~3 min

By the way, I installed rg on a Debian machine:
4.14.0-2-amd64 #1 SMP Debian 4.14.7-1 (2017-12-22) x86_64 GNU/Linux

Now the times are comparable:

$ time grep 'ScheduledTopUp.TimeStamp=1518' server.log.2018-02-14 -c
37930

real 2m20.756s
user 0m19.277s
sys 0m7.317s

$ time rg -j 4 -a 'ScheduledTopUp.TimeStamp=1518' server.log.2018-02-14 -c
37930

real 2m30.704s
user 0m8.266s
sys 0m6.611s

So the issue seems to be Windows-related.
Thank you for your time.
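
To compare the runs in this thread on a common scale, the `real` times can be converted to approximate read throughput. This is a rough sketch, not part of the original report: `to_seconds` and `throughput_mb_s` are hypothetical helper names, and any file size passed in is an assumption about what "11GB" means (for instance, roughly 11000 MB).

```shell
#!/bin/sh
# to_seconds TIME: convert the "XmY.YYYs" format printed by `time`
# into plain seconds with three decimals.
to_seconds() {
    printf '%s\n' "$1" |
        awk -F'm' '{ sub(/s$/, "", $2); printf "%.3f\n", $1 * 60 + $2 }'
}

# throughput_mb_s SIZE_MB SECONDS: print SIZE_MB / SECONDS with one
# decimal, or "n/a" when the elapsed time is zero or negative.
throughput_mb_s() {
    awk -v mb="$1" -v s="$2" 'BEGIN {
        if (s <= 0) { print "n/a"; exit }
        printf "%.1f\n", mb / s
    }'
}
```

For example, `throughput_mb_s 11000 "$(to_seconds 0m13.145s)"` gives a figure for the grep run on Cygwin, and `throughput_mb_s 11000 "$(to_seconds 2m20.756s)"` for the grep run on Debian. A large gap between machines for the same tool would point at storage or caching rather than at the search tool.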

arpi79 closed this as completed Mar 21, 2018

BurntSushi (Owner) commented

> "-j 4 -a" gives best result : ~ 50 sec
> no flag the worst result: ~ 3 min

Something is very clearly amiss. You're searching a single file. ripgrep doesn't benefit from parallelism when searching a single file, so the fact that it's faster suggests something is going wrong. One possible explanation is that ripgrep is actually searching your entire CWD, even though that would definitely be a bug given the command you're running.
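
A quick way to test the "searching the entire CWD" hypothesis is to ask ripgrep which files it would search for the same path argument. The sketch below assumes ripgrep's `--files` flag, which prints the files that would be searched without searching them. `list_search_targets`, `count_search_targets`, and the `RG` override variable are hypothetical names used only for illustration.

```shell
#!/bin/sh
# list_search_targets PATH...: print every file ripgrep would search
# for the given paths. Set RG to use a different ripgrep binary
# (hypothetical override, defaults to "rg").
list_search_targets() {
    "${RG:-rg}" --files "$@"
}

# count_search_targets PATH...: print how many files would be searched.
count_search_targets() {
    list_search_targets "$@" | wc -l | tr -d ' '
}
```

For example, `count_search_targets server.log.2018-02-14` should report a single file if only the log is being searched. A larger count would support the idea that the directory is being walked as well.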
