Levenshtein slow and high on CPU usage #11335
Comments
|
Out of curiosity, do you get the same results if you just create an array of the strings and iterate over that, versus going through a file/CSV? That would rule out the bottleneck being in CSV or something. |
@mishushakov Could you share the node.js code you are using? |
@Blacksmoke16 yes, the speed is about the same |
@asterite sure

const fs = require("fs")
const csv = require("csv-parser")
const levenshtein = require("fast-levenshtein")
const sample = fs.readFileSync("./LICENSE").toString()
fs.createReadStream('./licenses.csv')
  .pipe(csv())
  .on('data', (row) => {
    console.log(levenshtein.get(sample, row.text))
  })

i assume that the |
note that all Crystal examples above i execute after building with the |
It seems the related node/typescript code is this: https://github.com/ka-weihe/fastest-levenshtein/blob/master/mod.ts I'm almost sure that library is very well optimized but ours isn't. It might just be a matter of optimizing ours in a similar way, but I didn't dig much (actually, at all) into the code.
|
In the JS example each row is iterated. Instead of parsing the whole CSV, you can do the same in Crystal with CSV.each_row |
@j8r yes, i've corrected that here #11335 (comment) also the issue does not seem related to CSV |
The slow performance is likely due to the implementation indeed. |
Multi-threading is not enabled by default, |
@j8r I don't think multi-threading has anything to do with this. The algorithm used by the Node.js library is better, and we should improve ours too. |
Sure I agree, as I previously said. Still, in any case, it will increase the performance, even after being optimized. |
@mishushakov you mention identical code in this comment #11335 (comment) but the output seems to be different. Do you know if there's a bug in either the Crystal library or the node one? Also, what's the content of the file named "LICENSE"? Otherwise I can't exactly reproduce your code. |
thanks for following up, @asterite the output was different, because the sample was a little bit different |
To clarify, did you mean 5 seconds, or was that in non-release mode? I've reproduced it locally and got similar timings (~4 seconds too). As for multi-threading, I've only seen improvements when there are simultaneous computations to be done in parallel. I hope this is not blocking for you then. |
correct, the first test was not in release mode. i need a better algorithm |
See if these make any difference. |
Tried Crystal implementations from rosetta code https://rosettacode.org/wiki/Levenshtein_distance#Crystal First one is terribly slow. |
@asterite I've tried JS and Crystal versions and the output matches for me. I've put Crystal's Apache license to

Code

// lev_node.js
const fs = require("fs")
const csv = require("csv-parser")
const levenshtein = require("fastest-levenshtein")
const sample = fs.readFileSync("./LICENSE").toString()
fs.createReadStream("./licenses.csv")
  .pipe(csv())
  .on('data', (row) => {
    console.log(levenshtein.distance(sample, row.text))
  })

# lev_builtin.cr
require "csv"
require "levenshtein"
sample = File.read("./LICENSE")
CSV.new(File.open("./licenses.csv"), headers: true).each do |row|
  puts Levenshtein.distance(sample, row["text"])
end

Timings
So Crystal is more than 8x slower for me with these input data. Crystal version compiled with |
Which is much less readable compared to https://github.com/crystal-lang/crystal/blob/master/src/levenshtein.cr It also always allocates 65536 UInt32 elements https://github.com/ka-weihe/fastest-levenshtein/blob/master/mod.ts#L1 no matter the size of input strings. |
Yes, it uses a different algorithm. If you search for Levenshtein Myers you'll find it. It actually only allocates that array once (JS doesn't have threads). We could allocate that array on the stack, or once per thread. We should definitely switch to that algorithm. I don't have time to do it, but it's a nice task for someone who likes these things. It doesn't matter if it's less readable than the current implementation, performance is more important. |
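For anyone picking this up, here is a minimal sketch of the bit-parallel idea behind the Myers-style algorithm mentioned above, in the single-word form described by Hyyrö. It assumes the pattern fits in one 32-bit word (at most 32 UTF-16 code units); longer inputs need the blocked variant, which is not shown. The name `myers32` is hypothetical and is not an API of either library discussed here.

```javascript
// Sketch only: bit-parallel edit distance for patterns of at most 32 code units.
// Each bit of pv/mv records whether the vertical delta in the DP column is +1/-1.
function myers32(pattern, text) {
  const m = pattern.length;
  if (m === 0) return text.length;
  if (m > 32) throw new RangeError("pattern longer than 32 code units");

  // peq maps a character code to a bitmask of the positions where it occurs in the pattern.
  const peq = new Map();
  for (let i = 0; i < m; i++) {
    const c = pattern.charCodeAt(i);
    peq.set(c, (peq.get(c) || 0) | (1 << i));
  }

  const last = 1 << (m - 1); // bit of the last pattern row
  let pv = -1;               // all vertical deltas start at +1
  let mv = 0;
  let score = m;             // distance between the pattern and the empty text

  for (let j = 0; j < text.length; j++) {
    let eq = peq.get(text.charCodeAt(j)) || 0;
    const xv = eq | mv;
    eq |= ((eq & pv) + pv) ^ pv;
    let ph = mv | ~(eq | pv); // horizontal delta +1
    let mh = pv & eq;         // horizontal delta -1
    if (ph & last) score++;
    if (mh & last) score--;
    ph = (ph << 1) | 1;       // row 0 always grows by 1 for a global distance
    mh = mh << 1;
    pv = mh | ~(xv | ph);
    mv = ph & xv;
  }
  return score;
}

console.log(myers32("kitten", "sitting"));
```

The inner loop does a constant number of word operations per text character instead of one operation per table cell, which is where the speedup over the classic dynamic-programming table comes from.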
It seems it's explained here: https://www.google.com/url?sa=t&source=web&rct=j&url=https://citeseerx.ist.psu.edu/viewdoc/download%3Fdoi%3D10.1.1.142.1245%26rep%3Drep1%26type%3Dpdf&ved=2ahUKEwi2g9Wq6djzAhVVqJUCHTDDBY0QFnoECAYQAQ&sqi=2&usg=AOvVaw0PHsNv91SBeHQg8REvhp2_ I might read it and implement it if I have time, but I can't promise anything. |
@mishushakov Did you want to measure the Levenshtein distance between a bunch of very long strings, or were you trying to find a match? There are algorithms that include cutoffs, so if you aren't interested in the distance once it is over a certain cutoff, the algorithm can end the search early. |
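To illustrate the cutoff idea, here is a rough sketch of a two-row Levenshtein that gives up as soon as the result can no longer be within the cutoff. The function name `boundedLevenshtein` and the choice to return `Infinity` for "over the cutoff" are assumptions made for this example, not part of any library in this thread.

```javascript
// Sketch only: classic two-row DP with an early exit once every cell in a row exceeds maxDistance.
function boundedLevenshtein(a, b, maxDistance) {
  // The length difference is a lower bound on the distance.
  if (Math.abs(a.length - b.length) > maxDistance) return Infinity;

  let prev = new Array(b.length + 1);
  let curr = new Array(b.length + 1);
  for (let j = 0; j <= b.length; j++) prev[j] = j;

  for (let i = 1; i <= a.length; i++) {
    curr[0] = i;
    let rowMin = curr[0];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charCodeAt(i - 1) === b.charCodeAt(j - 1) ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
      if (curr[j] < rowMin) rowMin = curr[j];
    }
    // Row minima never decrease, so the final distance cannot drop below rowMin.
    if (rowMin > maxDistance) return Infinity;
    [prev, curr] = [curr, prev];
  }
  return prev[b.length] <= maxDistance ? prev[b.length] : Infinity;
}

console.log(boundedLevenshtein("kitten", "sitting", 3));
console.log(boundedLevenshtein("kitten", "sitting", 2));
```

This version still fills whole rows; restricting each row to a diagonal band of width `2 * maxDistance + 1` would cut the work further, at the cost of more index bookkeeping.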
@darkstego the initial idea was to match a license in real time. right now i'm thinking of building an indexer, which will collect the matches in a database. in my case the response time is critical |
this will mean slower startup/sync time, but faster query, which i think is a good tradeoff |
@mishushakov sorry, I'm a bit confused. When you say matching do you mean an exact match? Because string matching is far faster than measuring the levenshtein distance. |
@darkstego the program should be able to recognise a license, based on a snippet in licenses.csv file (attached to the first comment) so, it's not exact match |
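A small sketch of that kind of non-exact matching: pick the row with the smallest distance to the snippet, and skip rows whose length alone already rules them out. The `{ name, text }` row shape and the function names are assumptions for illustration (the thread only shows a `text` column), and the plain `levenshtein` helper stands in for whichever faster implementation ends up being used.

```javascript
// Sketch only: plain two-row Levenshtein used as the reference metric.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  let curr = new Array(b.length + 1);
  for (let i = 1; i <= a.length; i++) {
    curr[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    [prev, curr] = [curr, prev];
  }
  return prev[b.length];
}

// `licenses` is a hypothetical array of { name, text } rows, e.g. parsed from licenses.csv.
function closestLicense(sample, licenses) {
  let best = { name: null, distance: Infinity };
  for (const { name, text } of licenses) {
    // |len(sample) - len(text)| is a lower bound on the distance,
    // so a row that cannot beat the current best is skipped without running the DP.
    if (Math.abs(sample.length - text.length) >= best.distance) continue;
    const distance = levenshtein(sample, text);
    if (distance < best.distance) best = { name, distance };
  }
  return best;
}

const rows = [
  { name: "MIT", text: "permission is hereby granted" },
  { name: "Apache", text: "licensed under the apache license" },
];
console.log(closestLicense("permission is hereby grantd", rows));
```

The length check only helps when candidate texts vary a lot in size; for similarly sized license texts most of the time still goes into the distance computation itself.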
@mishushakov I don't know if this is helpful, but here is an implementation of the search using Levenshtein method that is around 5x faster than the NodeJS on my system. Hope it helps. |
@darkstego that's great news, i'll definitely give it a try! |
Bug Report
Crystal 1.2.0 (2021-10-13)
LLVM: 11.1.0
Default target: x86_64-apple-macosx
Levenshtein is very slow and uses too much CPU when trying to match a large sample over a large text corpus
Screen.Recording.2021-10-18.at.16.24.04.mov
it takes up to 5 minutes to find a match
Case: software license classifier
licenses.csv