Slow performance with large amount of groups #138
Comments
There seems to be more than one issue here. The slow compile time looks like a Nim VM bottleneck; it should not take that long, even for a regex 17K characters long. The runtime version below takes less than a second to compile the regex. With capturing disabled, PCRE indeed takes a few seconds to run (~8 s on my machine), while nim-regex takes ~80 s. 10x slower is far too slow.

import std/strutils
import ../src/regex
#import re

proc main() =
  # Build one big alternation from the crawler list, compiled at runtime.
  let crawlers = readFile("./tests/crawlers.txt")
  let crawlerRegex = re2("(?:" & join(splitLines(crawlers), "|") & ')') # note the (?:
  echo "compiled"
  let useragents = readFile("./tests/useragents.txt")
  let uas = splitLines(useragents)
  # Count the user agents that match any crawler pattern.
  var i = 0
  for ua in uas:
    i += int(contains(ua, crawlerRegex))
  echo i

main()
echo "ok"
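To separate the two costs discussed here (building the regex versus matching with it), a small timing sketch along the lines of the runtime version above may help. It is only an illustration: the crawler names and user agents are hypothetical stand-ins for the attached files, `buildPattern` is a helper introduced for this example, and it assumes the nim-regex package is importable as `regex`.

```nim
import std/[strutils, monotimes, times]
import regex  # assumption: nim-regex is installed (e.g. via nimble)

proc buildPattern(alternatives: openArray[string]): string =
  ## Joins the alternatives into one non-capturing group: (?:a|b|c).
  ## (hypothetical helper, not part of nim-regex)
  "(?:" & alternatives.join("|") & ")"

proc main() =
  # Hypothetical stand-ins for the lines of crawlers.txt and useragents.txt.
  let crawlers = ["Googlebot", "bingbot", "DuckDuckBot"]
  let userAgents = [
    "Mozilla/5.0 (compatible; Googlebot/2.1)",
    "Mozilla/5.0 (X11; Linux x86_64) Firefox/124.0",
    "Mozilla/5.0 (compatible; bingbot/2.0)"
  ]

  let t0 = getMonoTime()
  # The pattern is a runtime string, so the regex is built at runtime.
  let crawlerRegex = re2(buildPattern(crawlers))
  let t1 = getMonoTime()

  var hits = 0
  for ua in userAgents:
    if ua.contains(crawlerRegex):
      inc hits
  let t2 = getMonoTime()

  echo "regex build: ", (t1 - t0).inMicroseconds, " us"
  echo "matching:    ", (t2 - t1).inMicroseconds, " us"
  echo "matched ", hits, " of ", userAgents.len

main()
```

Swapping `(?:` for `(` in `buildPattern` gives the capturing variant, which makes it easy to compare both forms on the real data set.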
The slow compile time seems to be a C/C++ backend issue. If you compile for JS, it only takes 15 s. I tried adding a debugEcho after the regex compilation and it is printed right away, so it looks like most of the time is spent emitting the regex data structure in C/C++.

const crawlers = staticRead("./crawlers.txt")
echo crawlers.len
const reg = "(?:" & join(splitLines(crawlers), "|") & ')'
echo reg.len
const crawlerRegex = re2(reg)  # compiled in the Nim VM at compile time
static: echo "COMPILE DONE"    # printed right away during compilation
static: echo crawlerRegex.isInitialized
echo crawlerRegex.isInitialized
Reported upstream: nim-lang/Nim#23480
Essentially, I'm reworking a PHP bot detection library in Nim. It has a list of crawlers and combines them with groups into one big regex, like (bot1|bot2|bot3|....). There are around a thousand crawlers in the list. It takes around half an hour to compile with the release flag, and on a dataset of around 40,000 user agents it took a few minutes to check them all. With std/re it compiles more or less instantly and checks that dataset in a few seconds.
Here's some example code; the files are attached:
crawlers.txt
useragents.txt