Levenshtein slow and high on CPU usage #11335

mishushakov · 2021-10-18T14:29:47Z

Bug Report

Crystal 1.2.0 (2021-10-13)

LLVM: 11.1.0
Default target: x86_64-apple-macosx

Levenshtein is very slow and uses too much CPU when trying to match a large sample against a large text corpus

Screen.Recording.2021-10-18.at.16.24.04.mov

it takes up to 5 minutes to find a match

Case: software license classifier

require "csv"
require "levenshtein"

license_file = File.open("src/licenses.csv")
sample = File.read("LICENSE")
license_db = CSV.parse(license_file)
# compare the sample against the second column of every row
license_db.each do |entry|
  puts Levenshtein.distance(sample, entry[1])
end

licenses.csv

Blacksmoke16 · 2021-10-18T14:33:42Z

crystal run --release src/test.cr

Otherwise you're running a non-optimized binary and this test is useless.

mishushakov · 2021-10-18T14:47:22Z

thanks for very swift response
release mode = better speed, but not quick enough!

here's identical code, but in node.js

4 times difference

there's probably a better algorithm around for classifying large texts, but still...

Blacksmoke16 · 2021-10-18T14:49:41Z

Out of curiosity, do you get the same results if you were to just create an array of the strings and iterate over that, versus going through a file/CSV? That would rule out the bottleneck being in CSV or something.

asterite · 2021-10-18T15:34:50Z

@mishushakov Could you share the node.js code you are using?

mishushakov · 2021-10-18T15:38:03Z

@Blacksmoke16 yes, the speed is about the same

mishushakov · 2021-10-18T15:39:49Z

@asterite sure

const fs = require("fs")
const csv = require("csv-parser")
const levenshtein = require("fast-levenshtein")

const sample = fs.readFileSync("./LICENSE").toString()

fs.createReadStream('./licenses.csv')
  .pipe(csv())
  .on('data', (row) => {
    console.log(levenshtein.get(sample, row.text))
  })

i assume that the fast-levenshtein is faster than Crystal's levenshtein module

mishushakov · 2021-10-18T15:47:00Z

here's another example, more similar to the node.js one

CSV.each_row(@license_file) do |row|
  puts Levenshtein.distance(sample, row[1])
end

mishushakov · 2021-10-18T15:48:07Z

note that i execute all Crystal examples above after building with the --release flag

asterite · 2021-10-18T15:48:44Z

It seems the related node/typescript code is this:

https://github.com/ka-weihe/fastest-levenshtein/blob/master/mod.ts

I'm almost sure that library is very well optimized but ours isn't. It might just be a matter of optimizing ours in a similar way, but I didn't dig too much (actually, at all) into the code.

<joke> Another reason could be that we didn't use the fast- prefix in our library </joke>

j8r · 2021-10-18T16:17:58Z

In the JS example each row is iterated. Instead of parsing the whole CSV, you can do the same in Crystal with CSV.each_row

mishushakov · 2021-10-18T16:19:01Z

@j8r yes, i've corrected that here #11335 (comment)

also the issue does not seem related to CSV

j8r · 2021-10-18T16:32:39Z

The slow performance is likely due to the implementation indeed.
For now, you can use multi-threading to speed it up.

mishushakov · 2021-10-18T17:25:06Z

doesn't work, i guess?

@license_db.flatten.each do |entry|
  spawn puts Levenshtein.distance(sample, entry)
end

Fiber.yield

j8r · 2021-10-18T18:13:44Z

Multi-threading is not enabled by default; the -Dpreview_mt flag has to be passed. Also, creating that many fibers can be counter-productive. You can create, say, 4 fibers, then dispatch the rows to them through 4 channels, one per fiber.

asterite · 2021-10-18T18:21:57Z

@j8r I don't think multi-threading has anything to do with this. The algorithm used by the node-js library is better, and we should improve ours too.

j8r · 2021-10-18T18:34:10Z

Sure, I agree, as I previously said. Still, in any case, it will increase the performance, even after the implementation is optimized.
@mishushakov can then have faster code today, and afterwards gain additional speed from the optimizations.

asterite · 2021-10-19T13:48:37Z

@mishushakov you mention identical code in this comment #11335 (comment) but the output seems to be different. Do you know if there's a bug in either the Crystal library or the node one?

Also, what's the content of the file named "LICENSE"? Otherwise I can't exactly reproduce your code.

mishushakov · 2021-10-19T14:15:49Z

thanks for following up, @asterite
the license file contains a MIT license template

the output was different, because the sample was a little bit different

j8r · 2021-10-19T16:21:00Z

it takes up to 5 minutes to find a match

To clarify, did you mean 5 seconds, or was it in non-release mode?

I've reproduced locally, got similar timings (~4 seconds too). And for multi-threading, I've only seen improvements if there are simultaneous computations to be done in parallel.

I hope this is not blocking for you then.

mishushakov · 2021-10-19T17:14:09Z

correct, the first test was not on release mode
license checker is not a requirement for the idea i'm building and even the results of Node.JS were suboptimal

i need a better algorithm
thanks all for your attention :)

jzakiya · 2021-10-19T22:41:37Z

See if these make any difference.
https://rosettacode.org/wiki/Levenshtein_distance#Crystal

vlazar · 2021-10-20T06:51:07Z

Tried Crystal implementations from rosetta code https://rosettacode.org/wiki/Levenshtein_distance#Crystal

First one is terribly slow.
The second one is almost 7x slower than the one in standard library.

vlazar · 2021-10-20T07:42:15Z

@asterite I've tried JS and Crystal versions and the output matches for me. I've put Crystal's Apache license into the LICENSE file.

Code

// lev_node.js
const fs = require("fs")
const csv = require("csv-parser")
const levenshtein = require("fastest-levenshtein")

const sample = fs.readFileSync("./LICENSE").toString()

fs.createReadStream("./licenses.csv")
  .pipe(csv())
  .on('data', (row) => {
    console.log(levenshtein.distance(sample, row.text))
  })

# lev_builtin.cr
require "csv"
require "levenshtein"

sample = File.read("./LICENSE")

CSV.new(File.open("./licenses.csv"), headers: true).each do |row|
  puts Levenshtein.distance(sample, row["text"])
end

Timings

node lev_node.js  7.22s user 0.31s system 98% cpu 7.674 total
./lev_builtin  63.19s user 0.21s system 99% cpu 1:03.86 total

So Crystal is more than 8x slower for me with this input data. The Crystal version was compiled with the --release flag, of course.

vlazar · 2021-10-20T07:47:54Z

It seems the related node/typescript code is this:

https://github.com/ka-weihe/fastest-levenshtein/blob/master/mod.ts

Which is much less readable compared to https://github.com/crystal-lang/crystal/blob/master/src/levenshtein.cr

It also always allocates 65536 UInt32 elements https://github.com/ka-weihe/fastest-levenshtein/blob/master/mod.ts#L1 no matter the size of input strings.

asterite · 2021-10-20T10:40:30Z

Yes, it uses a different algorithm. If you search for Levenshtein Myers you'll find it.

It actually only allocates that array once (js doesn't have threads). We could allocate that array on the stack, or once per thread.

We should definitely switch to that algorithm. I don't have time to do it, but it's a nice task for someone who likes these things. It doesn't matter if it's less readable than the current implementation, performance is more important.
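
For readers unfamiliar with the bit-parallel approach mentioned here, below is a minimal JavaScript sketch in the Myers/Hyyrö style. It is not the code from fastest-levenshtein or from Crystal's standard library. The names levenshteinDP and myers32 are made up for this illustration, and the sketch assumes the pattern is at most 32 UTF-16 code units long (longer inputs need a blocked variant, which is left out).

```javascript
// Classic O(m*n) dynamic programming, kept as a reference for comparison.
function levenshteinDP(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const cur = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charCodeAt(i - 1) === b.charCodeAt(j - 1) ? 0 : 1;
      cur[j] = Math.min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = cur;
  }
  return prev[b.length];
}

// Bit-parallel variant: one 32-bit word encodes a whole DP column as
// +1 / -1 differences, so each text character costs a handful of word ops.
function myers32(pattern, text) {
  const m = pattern.length;
  if (m === 0) return text.length;
  if (m > 32) throw new RangeError("pattern longer than 32 code units");

  // peq.get(c) has bit i set when pattern[i] is the code unit c.
  const peq = new Map();
  for (let i = 0; i < m; i++) {
    const c = pattern.charCodeAt(i);
    peq.set(c, (peq.get(c) || 0) | (1 << i));
  }

  const top = 1 << (m - 1); // bit of the last pattern row
  let pv = -1;              // vertical +1 differences (all set initially)
  let mv = 0;               // vertical -1 differences
  let score = m;            // distance between pattern and empty text

  for (let j = 0; j < text.length; j++) {
    const eq = peq.get(text.charCodeAt(j)) || 0;
    const xv = eq | mv;
    const xh = (((eq & pv) + pv) ^ pv) | eq;
    let ph = mv | ~(xh | pv); // horizontal +1 differences
    let mh = pv & xh;         // horizontal -1 differences
    if (ph & top) score++;
    if (mh & top) score--;
    ph = (ph << 1) | 1;       // the "| 1" models the growing first row
    mh = mh << 1;
    pv = mh | ~(xv | ph);
    mv = ph & xv;
  }
  return score;
}
```

With short inputs both functions should agree, for example on the pair "kitten" / "sitting". The speedup comes from replacing the inner loop over the pattern with a constant number of bitwise operations per text character.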

asterite · 2021-10-20T10:54:34Z

It seems it's explained here: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.142.1245&rep=rep1&type=pdf

I might read it and implement it if I have time, but I can't promise anything.

darkstego · 2021-10-28T17:32:55Z

@mishushakov Did you want to measure the Levenshtein distance between a bunch of very long strings, or were you trying to find a match?

There are algorithms that include cutoffs, so if you aren't interested in the distance once it is over a certain cutoff, the algorithm can end the search early.
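
A minimal JavaScript sketch of the cutoff idea, assuming a plain dynamic-programming implementation rather than any particular library. The name boundedLevenshtein and the choice of returning null when the cutoff is exceeded are made up for this illustration. It relies on the fact that the smallest value in a DP row is a lower bound for the final distance, so the loop can stop as soon as a whole row is above the cutoff.

```javascript
// Returns the Levenshtein distance if it is <= maxDist, otherwise null.
function boundedLevenshtein(a, b, maxDist) {
  // The length difference alone is a lower bound on the distance.
  if (Math.abs(a.length - b.length) > maxDist) return null;

  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const cur = [i];
    let rowMin = i;
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charCodeAt(i - 1) === b.charCodeAt(j - 1) ? 0 : 1;
      cur[j] = Math.min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost);
      if (cur[j] < rowMin) rowMin = cur[j];
    }
    // Every later row can only be >= this row's minimum, so give up early.
    if (rowMin > maxDist) return null;
    prev = cur;
  }
  return prev[b.length] <= maxDist ? prev[b.length] : null;
}
```

For a classifier that only cares about close matches, a call such as boundedLevenshtein(sample, candidate, 50) would skip most of the work for candidates that are clearly different. The cutoff value 50 is arbitrary here and would need tuning against real data.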

mishushakov · 2021-10-28T18:20:03Z

@darkstego the initial idea was to match license in real time

right now i'm thinking of building an indexer, which will collect the matches in a database
then the application will just query the database

in my case the response time is critical

mishushakov · 2021-10-28T18:22:15Z

this will mean slower startup/sync time, but faster query, which i think is a good tradeoff

darkstego · 2021-10-28T18:29:37Z

@mishushakov sorry, I'm a bit confused. When you say matching do you mean an exact match? Because string matching is far faster than measuring the levenshtein distance.

mishushakov · 2021-10-28T19:17:57Z

@darkstego the program should be able to recognise a license, based on a snippet in licenses.csv file (attached to the first comment)

so, it's not exact match

darkstego · 2021-10-29T04:55:59Z

@mishushakov I don't know if this is helpful, but here is an implementation of the search using the Levenshtein method that is around 5x faster than the NodeJS one on my system. Hope it helps.

mishushakov · 2021-10-29T13:09:23Z

@darkstego that's great news, i'll definitely give it a try!

mishushakov added the kind:bug label Oct 18, 2021

straight-shoota added kind:question and removed kind:bug labels Oct 18, 2021

straight-shoota removed the kind:question label Oct 20, 2021

straight-shoota added help wanted, kind:feature, performance, topic:stdlib:text labels Oct 20, 2021

darkstego linked a pull request Oct 26, 2021 that will close this issue

Implement Myers algorithm for Levenshtein distance calculation #11370

Open
