Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Levenshtein slow and high on CPU usage #11335

Open
mishushakov opened this issue Oct 18, 2021 · 34 comments · May be fixed by #11370
Open

Levenshtein slow and high on CPU usage #11335

mishushakov opened this issue Oct 18, 2021 · 34 comments · May be fixed by #11370
Labels
help wanted This issue is generally accepted and needs someone to pick it up kind:feature performance topic:stdlib:text

Comments

@mishushakov
Copy link

Bug Report

Crystal 1.2.0 (2021-10-13)

LLVM: 11.1.0
Default target: x86_64-apple-macosx

Levenshtein is very slow and uses too much CPU when trying to match a large sample over a large text corpus

Screenshot 2021-10-18 at 16 06 58

Screen.Recording.2021-10-18.at.16.24.04.mov

it takes up to 5 minutes to find a match

Case: software license classifier

license_file = File.open("src/licenses.csv")
sample = File.read("LICENSE")
license_db = CSV.parse(license_file)
license_db.each do |entry|
  puts Levenshtein.distance(sample, entry[1])
end

licenses.csv

@mishushakov mishushakov added the kind:bug A bug in the code. Does not apply to documentation, specs, etc. label Oct 18, 2021
@Blacksmoke16
Copy link
Member

Blacksmoke16 commented Oct 18, 2021

crystal run --release src/test.cr Otherwise you're running a non-optmized binary and this test is useless.

@straight-shoota straight-shoota added kind:question and removed kind:bug A bug in the code. Does not apply to documentation, specs, etc. labels Oct 18, 2021
@mishushakov
Copy link
Author

thanks for very swift response
release mode = better speed, but not quick enough!

Screenshot 2021-10-18 at 16 44 48

here's identical code, but in node.js

Screenshot 2021-10-18 at 16 45 46

4 times difference

there's probably a better algorithm around for classifying large texts, but still...

@Blacksmoke16
Copy link
Member

Out of curiosity, do you get the same results if were to just create an array of the strings and iterate over that versus going thru a file/CSV? That would rule out the bottleneck being in CSV or something.

@asterite
Copy link
Member

@mishushakov Could you share the node.js code you are using?

@mishushakov
Copy link
Author

@Blacksmoke16 yes, the speed is about the same

Screenshot 2021-10-18 at 17 37 21

@mishushakov
Copy link
Author

@asterite sure

const fs = require("fs")
const csv = require("csv-parser")
const levenshtein = require("fast-levenshtein")

const sample = fs.readFileSync("./LICENSE").toString()

fs.createReadStream('./licenses.csv')
  .pipe(csv())
  .on('data', (row) => {
    console.log(levenshtein.get(sample, row.text))
  })

i assume that the fast-levenshtein is faster than Crystal's levenshtein module

@mishushakov
Copy link
Author

here's another example, more similar to the node.js one

CSV.each_row(@license_file) do |row|
  puts Levenshtein.distance(sample, row[1])
end

Screenshot 2021-10-18 at 17 45 46

@mishushakov
Copy link
Author

note that all Crystal examples above i execute after building with the --release tag

@asterite
Copy link
Member

It seems the related node/typescript code is this:

https://github.com/ka-weihe/fastest-levenshtein/blob/master/mod.ts

I'm almost sure that library is very well optimized but ours isn't. It might just be a matter of optimizing ours in a similar way, but I didn't dig too much (actually, at all) at the code.

<joke> Another reason could be that we didn't use the fast- prefix in our library </joke>

@j8r
Copy link
Contributor

j8r commented Oct 18, 2021

In the JS example each row is iterated. Instead of parsing the whole CSV, you can do the same in Crystal with CSV.each_row

@mishushakov

This comment has been minimized.

@mishushakov
Copy link
Author

mishushakov commented Oct 18, 2021

@j8r yes, i've corrected that here #11335 (comment)

also the issue does not seem related to CSV

@j8r
Copy link
Contributor

j8r commented Oct 18, 2021

The slow performance is likely due to the implementation indeed.
For now, you can use multi-threading to speed it up.

@mishushakov
Copy link
Author

doesn't work, i guess?

@license_db.flatten.each do |entry|
  spawn puts Levenshtein.distance(sample, entry)
end

Fiber.yield

Screenshot 2021-10-18 at 19 22 05

@j8r
Copy link
Contributor

j8r commented Oct 18, 2021

Multi-threading is not enabled by default, -Dpreview_mt has to be put. And also creating so much fibers can be counter-productive. You can create, say 4 fibers, then dispatch each row to each of them through to each of the 4 channels.

@asterite
Copy link
Member

@j8r I don't think multi-threading has anything to do here. The algorithm used by the node-js library is better, and we should improve ours too.

@j8r
Copy link
Contributor

j8r commented Oct 18, 2021

Sure I agree, as I previously said. Still, in any case, it will increase the performance, even after being optimized.
@mishushakov can then have a faster code today, and afterwards gain additional speed after the optimizations.

@asterite
Copy link
Member

@mishushakov you mention identical code in this comment #11335 (comment) but the output seems to be different. Do you know if there's a bug in either the Crystal library or the node one?

Also, what's the content of the file named "LICENSE"? Otherwise I can't exactly reproduce your code.

@mishushakov
Copy link
Author

thanks for following up, @asterite
the license file contains a MIT license template

the output was different, because the sample was a little bit different

@j8r
Copy link
Contributor

j8r commented Oct 19, 2021

it takes up to 5 minutes to find a match

To clarify, did you meant 5 seconds, or it was on non-release mode?

I've reproduced locally, got similar timings (~4 seconds too). And for multi-threading, I've only seen improvements if there are simultaneous computations to be done in parallel.

I hope this is not blocking for you then.

@mishushakov
Copy link
Author

correct, the first test was not on release mode
license checker is not a requirement for the idea i'm building and even the results of Node.JS were suboptimal

i need a better algorithm
thanks all for your attention :)

@jzakiya
Copy link

jzakiya commented Oct 19, 2021

See if these make any difference.
https://rosettacode.org/wiki/Levenshtein_distance#Crystal

@vlazar
Copy link
Contributor

vlazar commented Oct 20, 2021

Tried Crystal implementations from rosetta code https://rosettacode.org/wiki/Levenshtein_distance#Crystal

First one is terribly slow.
The second one is almost 7x slower than the one in standard library.

@vlazar
Copy link
Contributor

vlazar commented Oct 20, 2021

@asterite I've tried JS and Crystal versions and the output matches for me. I've put Crystal's Apache license to LICENSE file.

Code

// lev_node.js
const fs = require("fs")
const csv = require("csv-parser")
const levenshtein = require("fastest-levenshtein")

const sample = fs.readFileSync("./LICENSE").toString()

fs.createReadStream("./licenses.csv")
  .pipe(csv())
  .on('data', (row) => {
    console.log(levenshtein.distance(sample, row.text))
  })
# lev_builtin.cr
require "csv"
require "levenshtein"

sample = File.read("./LICENSE")

CSV.new(File.open("./licenses.csv"), headers: true).each do |row|
  puts Levenshtein.distance(sample, row["text"])
end

Timings

node lev_node.js  7.22s user 0.31s system 98% cpu 7.674 total
./lev_builtin  63.19s user 0.21s system 99% cpu 1:03.86 total

So Crystal is more than 8x slower for me with these input data. Crystal version compiled with --release flag of course.

@vlazar
Copy link
Contributor

vlazar commented Oct 20, 2021

It seems the related node/typescript code is this:

https://github.com/ka-weihe/fastest-levenshtein/blob/master/mod.ts

Which is much less readable compared to https://github.com/crystal-lang/crystal/blob/master/src/levenshtein.cr

It also always allocates 65536 UInt32 elements https://github.com/ka-weihe/fastest-levenshtein/blob/master/mod.ts#L1 no matter the size of input strings.

@asterite
Copy link
Member

Yes, it uses a different algorithm. If you search for levensthein Myers you'll find it.

It actually only allocated that array once (js doesn't have threads). We could allocate that array on the stack, or once per thread.

We should definitely switch to the algorithm. I don't have time to do it, but it's a nice task for someone who liked these things. It doesn't matter if it's less readable than the current implementation, performance is more important.

@straight-shoota straight-shoota added help wanted This issue is generally accepted and needs someone to pick it up kind:feature performance topic:stdlib:text labels Oct 20, 2021
@asterite
Copy link
Member

@darkstego
Copy link
Contributor

@mishushakov Did you want to measure the Levenshtein distance between a bunch of very long strings, or were you trying to find a match?

There are algorithms that include cutoffs so if you aren't interested in the the distance if it is over a certain cutoff, the algorithm can end the search early.

@mishushakov
Copy link
Author

mishushakov commented Oct 28, 2021

@darkstego the initial idea was to match license in real time

right now i'm thinking of building an indexer, which will collect the matches in a database
then the application will just query the database

in my case the response time is critical

@mishushakov
Copy link
Author

this will mean slower startup/sync time, but faster query, which i think is a good tradeoff

@darkstego
Copy link
Contributor

@mishushakov sorry, I'm a bit confused. When you say matching do you mean an exact match? Because string matching is far faster than measuring the levenshtein distance.

@mishushakov
Copy link
Author

mishushakov commented Oct 28, 2021

@darkstego the program should be able to recognise a license, based on a snippet in licenses.csv file (attached to the first comment)

so, it's not exact match

@darkstego
Copy link
Contributor

@mishushakov I don't know if this is helpful, but here is an implementation of the search using Levenshtein method that is around 5x faster than the NodeJS on my system. Hope it helps.

@mishushakov
Copy link
Author

@darkstego that's great news, i'll definitely give it a try!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted This issue is generally accepted and needs someone to pick it up kind:feature performance topic:stdlib:text
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants