Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow performance with large amount of groups #138

Open
bobbbob98 opened this issue Apr 1, 2024 · 3 comments
Open

Slow performance with large amount of groups #138

bobbbob98 opened this issue Apr 1, 2024 · 3 comments

Comments

@bobbbob98
Copy link

Essentially I'm reworking a php bot detection library in nim. It has a list of crawlers and combines them with groups into one big regex, like (bot1|bot2|bot3|....). There's around a thousand crawlers in the list. It takes around half an hour to compile with the release flag and on a dataset of around 40000 user agents it took a few minutes to check them all. With std/re it compiles more or less instantly and checks that dataset in a few seconds.
Here's an example code, files are attached:
crawlers.txt
useragents.txt

import regex, strutils
const crawlerRegex = re2('(' & join(splitLines(staticRead("crawlers.txt")), "|") & ')')
let uas = splitLines(readFile("useragents.txt"))
for ua in uas:
    echo ua, ", ", contains(ua, crawlerRegex)
@nitely
Copy link
Owner

nitely commented Apr 2, 2024

there is more than one issue it seems.

the slow compile time seems to be some Nim VM bottleneck. It should not take that long, even for a 17K length regex. The runtime version below takes less than a second to compile the regex.

disabling the capturing (?:...) makes the matching much faster for some reason.

PCRE indeed takes a few seconds to run (~8 in my machine), while nim-regex takes ~80s. 10x slower is far too slow.

import std/strutils
import ../src/regex
#import re

proc main() =
  let crawlers = readFile("./tests/crawlers.txt")
  let crawlerRegex = re2("(?:" & join(splitLines(crawlers), "|") & ')')   # note the (?:
  echo "compiled"
  var useragents = readFile("./tests/useragents.txt")
  let uas = splitLines(useragents)
  var i = 0
  for ua in uas:
    i += int(contains(ua, crawlerRegex))
  echo i
main()
echo "ok"
$ time nim c -r -d:release tests/testbug.nim
[...]
Hint: mm: orc; threads: on; opt: speed; options: -d:release
61922 lines; 1.360s; 110.156MiB peakmem; [...]
compiled
20902
ok

real    1m9.897s
user    1m9.735s
sys     0m0.157s

@nitely
Copy link
Owner

nitely commented Apr 6, 2024

The slow compile time seems to be a C/C++ backend issue. If you compile for js it only takes 15s. I've tried adding a debugEcho after the regex compilation and it's printed right away, so it looks like most time is spent emitting the regex data structure in C/C++.

const crawlers = staticread("./crawlers.txt")
echo crawlers.len
const reg = "(?:" & join(splitLines(crawlers), "|") & ')'
echo reg.len
const crawlerRegex = re2(reg)
static: echo "COMPILE DONE"
static: echo crawlerRegex.isInitialized
echo crawlerRegex.isInitialized

@nitely
Copy link
Owner

nitely commented Apr 6, 2024

reported upstream nim-lang/Nim#23480

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants