Implement Myers algorithm for Levenshtein distance calculation #11370
base: master
Conversation
Could you please run […]? BTW, this looks incredible!
I got these results against the benchmark from #8324
The implementation could be faster still with some changes. Without a separate ASCII method, the submitted implementation is about 1.5x slower than it could be, but a split isn't a very DRY approach and I went back and forth about implementing it; the only difference would be that the character bit-vector store is not conditionally defined and there is no case statement to clear it. The real value, though, is for long strings, because this algorithm is O([m/w]n) whereas the existing one is O(mn). I am sure there are further optimizations that can be done. w is the word size and is set to 32, but it should be 64 on 64-bit systems, which should make it even faster on long strings.
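To make the idea concrete, here is a minimal single-word sketch of Myers' bit-parallel core (illustration only, not the PR's code): it assumes an ASCII pattern of at most 32 characters, whereas the PR processes longer strings in 32/64-bit blocks and handles Unicode.

# Illustrative sketch of the single-word Myers algorithm (not the PR's code).
# Assumes an ASCII pattern with pattern.size <= 32.
def myers_distance32(pattern : String, text : String) : Int32
  m = pattern.size
  return text.size if m == 0
  raise ArgumentError.new("pattern longer than 32 chars") if m > 32

  # Bit-vector dictionary: bit i of peq[c] is set when pattern[i] == c.
  peq = StaticArray(UInt32, 128).new(0_u32)
  pattern.each_char_with_index { |ch, i| peq[ch.ord] |= 1_u32 << i }

  pv = UInt32::MAX # positive vertical deltas
  mv = 0_u32       # negative vertical deltas
  score = m
  last = 1_u32 << (m - 1)

  text.each_char do |ch|
    eq = ch.ord < 128 ? peq[ch.ord] : 0_u32 # non-ASCII text chars never match
    xv = eq | mv
    xh = (((eq & pv) &+ pv) ^ pv) | eq
    ph = mv | ~(xh | pv)
    mh = pv & xh
    score += 1 if (ph & last) != 0
    score -= 1 if (mh & last) != 0
    ph = (ph << 1) | 1_u32
    mh = mh << 1
    pv = mh | ~(xv | ph)
    mv = ph & xv
  end
  score
end

myers_distance32("algorithm", "altruistic") # => 6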
I've tried benchmarking a new implementation using this code #11335 (comment)
Now it's just a tiny bit slower than presumably the fastest Levenshtein distance implementation for JavaScript (Crystal 1.2.1 vs. NodeJS 16.13.0).
src/levenshtein.cr
Outdated
score = m
ascii = string1.ascii_only? && string2.ascii_only?

pmr = ascii ? StaticArray(UInt32, 128).new(0) : Hash(Int32, UInt32).new(w) { 0.to_u32 }
Because this makes pmr be a union type, any use of it will result in a multidispatch. I suggest splitting the rest of this method in two based on ascii or not. Something like:
if ascii
  pmr = StaticArray(UInt32, 128).new(0)
  rest_of_the_code(lpos, score, pmr, etc.)
else
  pmr = Hash(Int32, UInt32).new(w) { 0.to_u32 }
  rest_of_the_code(lpos, score, pmr, etc.)
end
That will make rest_of_the_code be instantiated once per type of pmr, hopefully resulting in faster code, and code that LLVM will be able to inline better.
Why use a Hash when it's not ascii, and not a StaticArray?
I do get a 50% uptick if I split it into ASCII and non-ASCII methods and just check in the main function. Not very DRY, but it is faster, so I will commit it shortly.
> Why use a Hash when it's not ascii, and not a StaticArray?
This is used as a dictionary. On every loop I traverse a chunk of string1 (size 32 in this case) and note in a bit vector where each character occurs in the chunk. So I need a dictionary that maps a string character to a 32-bit int (my vector). This algorithm traverses string1 exactly once and reuses the created dictionary on every column (string2), hence the speedup.
One way to build this dictionary is a Hash of size 32 (the chunk size). But with ASCII I am guaranteed that the char will be less than 128, so I make a small StaticArray of that size and use the codepoint as an index to speed up the lookup.
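As a tiny illustration of that dictionary (hypothetical names, not the PR's exact code), here is how one 32-char chunk could be encoded with the general Hash variant:

w = 32
string1 = "the quick brown fox jumps over the lazy dog" # example input
chunk = string1[0, w]                                   # the real code loops over all chunks
pmr = Hash(Int32, UInt32).new { 0_u32 }                 # char codepoint => position bit-vector
chunk.each_char_with_index do |ch, i|
  pmr[ch.ord] |= 1_u32 << i
end
pmr['o'.ord] # bit i is set exactly when the chunk's i-th char is 'o'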
Ideally I would instantiate pmr and then pass it into my method. But the only problem is that clearing the pmr at the end of each loop is done differently depending on whether it is a StaticArray or a Hash.
That said, since ASCII also allows the use of pointers, the methods have diverged slightly. There is still way more code duplication than I would like, and I feel like there is a way to better organize it.
> Ideally I would instantiate pmr and then pass it into my method. But the only problem is that clearing the pmr at the end of each loop is done differently depending on whether it is a StaticArray or a Hash.
It's true. In that case you can still keep the case you had before: because the method will be instantiated with two different types separately, the compiler will optimize that case statement so there will be no check at the end.
That said, if the code paths diverged a bit then it's fine to keep them duplicated.
If they only differ by one thing, you can consider using a third method that receives a block. Because blocks are inlined, it will be the same as writing two different versions with just one difference (let me know if this is not clear, I can provide a small code example).
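For reference, a small hypothetical sketch of that block approach (the names are made up): the shared loop lives in one method and yields whenever the dictionary has to be cleared, so each caller supplies the StaticArray- or Hash-specific step. Since blocks are inlined, it compiles to the same code as two hand-written methods.

# Hypothetical sketch: shared loop, caller-specific dictionary reset via a block.
private def levenshtein_core(string1, string2, pmr, w)
  chunks = (string1.size + w - 1) // w
  chunks.times do |r|
    # ... build the bit-vector dictionary for chunk r and run the inner loop ...
    yield pmr # let the caller clear the dictionary in its own way
  end
end

# ASCII path: StaticArray dictionary, cleared by filling with zeros.
# levenshtein_core(string1, string2, StaticArray(UInt32, 128).new(0_u32), 32) { |d| d.fill(0_u32) }

# Unicode path: Hash dictionary, cleared with Hash#clear.
# levenshtein_core(string1, string2, Hash(Int32, UInt32).new { 0_u32 }, 32) { |d| d.clear }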
I thought about the block method, but the number of variables that needed to be passed was too large, and some need to be passed by reference, so I would need to encapsulate those. It quickly became more of a mess.
I am trying to use macros to create the two methods (ASCII, Unicode) and reduce code duplication. I did implement that and it passed the specs, but when benchmarking it doesn't terminate, so I still have a bit of debugging to do. Also, macros do open up the potential for a 64-bit implementation, which does appear to be faster, but will require some debugging.
So I decided to really optimize for performance. This latest commit should get a 1.8x speedup over my old commit on long ASCII texts. One potential avenue for further speedup is making an Int64 version of this that would be compiled on 64-bit systems.
Sorry about the constant formatting fixes. I don't know why my pre-commit hooks aren't working.
Co-authored-by: Sijawusz Pur Rahnama <[email protected]>
src/levenshtein.cr
Outdated
vn = 0

# prepare char bit vector
s = string1[r*w, w]
This allocates a new string. Maybe using Char::Reader there's a way to avoid this allocation and further speed things up. It could be done in a follow-up PR, though!
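Something along these lines, perhaps (a rough sketch with made-up names, not the PR's code): walk string1 once with Char::Reader and fill the chunk dictionary in place instead of allocating a substring per chunk.

w = 32
string1 = "example input string"        # stands in for the real argument
pmr = Hash(Int32, UInt32).new { 0_u32 } # chunk dictionary
reader = Char::Reader.new(string1)
i = 0
while reader.has_next?
  ch = reader.current_char
  pmr[ch.ord] |= 1_u32 << (i % w) # position of ch inside the current chunk
  i += 1
  if i % w == 0
    # ... run the inner loop for the finished chunk, then reset the dictionary ...
    pmr.clear
  end
  reader.next_char
end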
Yeah, this is getting quite messy and I am a couple of branches in with optimization implementations. I will close this and try to clean things up before setting up another PR.
Sure!
Also, it's totally fine to get something slightly better working (though in this case it's already a huge performance improvement), and then apply the suggestions or further improvements in other PRs.
Finally, a bit of code duplication is also fine! That's usually better than introducing macros and making the code harder to read.
And thank you for your patience! 🙏
So I finally have what I feel is a complete implementation. I implemented a 64-bit version of the algorithm when compiled on 64-bit systems. I implemented the ASCII and Unicode versions through macros, mainly because I kept forgetting to implement changes in both codebases. They diverge in a few spots, mainly to do with the dictionary implementation, and those spots are commented, so it isn't bad at all. The performance is really good now: currently clocking in at 30x over the original at string length 1500 on 64-bit systems.
Now at over 3x the speed of the original PR. I did run a test to compare against the NodeJS implementation and it was over twice as fast on my system. I was using the method described in this #11335 comment.
🚀 Wow, with the latest changes this implementation is now 2.4x faster than the NodeJS one on this benchmark for me: #11335 (comment). Great job @darkstego!
One thing to note is that memory consumption has grown in the new implementation. I've run the benchmark already mentioned here: #11370 (comment). My numbers:
It uses cases with `"a" * 100000` strings. If I remove these 2 cases, the memory consumption is still much higher, but it's not in MBs anymore. But notice the results: for some reason it's now slower than the original? 😕
Code:

# From https://github.com/crystal-lang/crystal/pull/11370
# LevenshteinNew is the PR's implementation, loaded separately.
require "benchmark"

Benchmark.ips do |bm|
  bm.report "old" do
    Levenshtein.distance("algorithm", "altruistic")
    Levenshtein.distance("hello", "hallo")
    Levenshtein.distance("こんにちは", "こんちは")
    Levenshtein.distance("hey", "hey")
    Levenshtein.distance("hippo", "zzzzzzzz")
    # Levenshtein.distance("a" * 100000, "hello")
    # Levenshtein.distance("hello", "a" * 100000)
  end
  bm.report "new" do
    LevenshteinNew.distance("algorithm", "altruistic")
    LevenshteinNew.distance("hello", "hallo")
    LevenshteinNew.distance("こんにちは", "こんちは")
    LevenshteinNew.distance("hey", "hey")
    LevenshteinNew.distance("hippo", "zzzzzzzz")
    # LevenshteinNew.distance("a" * 100000, "hello")
    # LevenshteinNew.distance("hello", "a" * 100000)
  end
end
I've run the cases one by one and added results near each one that's slower. The most slowdown seems to be for short strings with small distance, especially for Unicode strings.

Benchmark.ips do |bm|
  bm.report "old" do
    Levenshtein.distance("algorithm", "altruistic") # 1.15× slower
    Levenshtein.distance("hello", "hallo")
    Levenshtein.distance("こんにちは", "こんちは")
    Levenshtein.distance("hey", "hey")
    Levenshtein.distance("hippo", "zzzzzzzz")
    Levenshtein.distance("a" * 100000, "hello") # 1.75× slower
    Levenshtein.distance("hello", "a" * 100000) # 1.80× slower
  end
  bm.report "new" do
    LevenshteinNew.distance("algorithm", "altruistic")
    LevenshteinNew.distance("hello", "hallo") # 2.06× slower
    LevenshteinNew.distance("こんにちは", "こんちは") # 4.65× slower
    LevenshteinNew.distance("hey", "hey")
    LevenshteinNew.distance("hippo", "zzzzzzzz") # 1.47× slower
    LevenshteinNew.distance("a" * 100000, "hello")
    LevenshteinNew.distance("hello", "a" * 100000)
  end
end

I wonder what cases would be used most frequently in the real world 🙂 It would be nice to optimize most for those.
I'm pretty sure that short strings are a very common use case for Levenshtein distance (think search query strings, for example).
I know why this is happening. For short Unicode strings the dictionary used is a Hash. That is created and populated before traversing the columns of the matrix, but that step takes time, and if the string is short the improved big-O time of the algorithm doesn't get a chance to offset it. In my early testing I realized that the cutoff point for Unicode was about 30 chars or so. The fix is pretty easy: I would just call the old algorithm if the strings were below the cutoff length. The StaticArray (ASCII) method was always faster for me, so there is a regression somewhere, and I have a good idea where that might be. If anyone has any suggestion on an efficient way to store and retrieve bit information in Crystal, that would help a lot. The algorithm needs 2 things:
I just realized there is a BitArray in Crystal and I might look at that, but if I am not mistaken it stores every bit as a UInt32.
I know the reason for this. The algorithm stores a bit array of length equal to the longest string, and I encoded each bit as a UInt64 on 64-bit systems. There is potential to reduce the memory consumption by quite a bit.
@darkstego No worries at all! I was just sharing my numbers for the record (for perspective). The current code is much more balanced and optimized for more realistic use cases.
Another improvement just added: I realized that if a cutoff is given to the algorithm, it can abort as soon as it knows the distance will be larger than the cutoff. This makes for a huge improvement when searching for the best match among long strings. In the NodeJS example you can now find the best match with a 5x speedup over NodeJS by using a cutoff. The distance method does end up with an optional cutoff parameter. Sorry the commits weren't squashed; I thought I did that before uploading.
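As an illustration of why this helps (a hedged sketch; it assumes the cutoff is simply passed as a third argument to Levenshtein.distance, as added in this PR): when scanning many candidates for the closest match, the best distance found so far can be passed as the cutoff so that hopeless candidates abort early.

require "levenshtein"

# Hedged sketch: find the closest candidate, passing the best distance so far
# as the cutoff so each subsequent call can abort early.
def closest_match(query : String, candidates : Array(String)) : String?
  best = candidates.first?
  return nil unless best
  best_distance = Levenshtein.distance(query, best)
  candidates.each do |candidate|
    d = Levenshtein.distance(query, candidate, best_distance) # assumed cutoff argument
    if d < best_distance
      best = candidate
      best_distance = d
    end
  end
  best
end

The standard library's Levenshtein.find does this kind of best-match search, so it could benefit from the same trick.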
@darkstego I've tried running the benchmark you mentioned here #11370 (comment) and ran into this error:
I've updated the benchmark to the latest code here and it still gives me an error:
@vlazar Seems the testing code was generating illegal chars. I unfortunately didn't realize there was a gap in the Unicode space that needed to be accounted for. Try revision 3 of the gist.
@darkstego I've tried the 3rd revision and the results look suspicious now 😆
# If *cutoff* is given then the method is allowed to end once the lowest
# possible bound is greater than *cutoff* and return that lower bound.
Could a doc/code example for this be given?
Alternatively, I think it's totally fine to add the cutoff logic in a separate PR. Given that it's a new feature, it could spark more discussion, eventually making this PR less likely to be merged, or slower to be merged (just a guess! I'm not actually requesting this).
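For the doc comment, something like this might do (hypothetical example; it assumes *cutoff* is the third positional argument and that, per the doc wording above, the method may return a lower bound rather than the exact distance once the cutoff is exceeded):

Levenshtein.distance("algorithm", "altruistic")    # => 6
# With a cutoff of 3 the method may stop early, returning some lower bound
# greater than 3 instead of the exact distance:
Levenshtein.distance("algorithm", "altruistic", 3) # => value between 4 and 6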
Would a pull request onto my feature branch show up here? I haven't really made a pull request onto a pull request before.
Good point. I don't think so, but also, the other PR can come/appear after this one is merged.
I could also make a new pull request and first squash all the features together. Whatever is easier for you guys.
@vlazar Sorry about that. It was benchmarking two identical strings all the time. Fixed and tested. I also have a test to compare the distance results between the old and new methods to make sure the data is correct. Is there a way to import a module from a file under a different name? There is a lot of copying and pasting that needs to be done to test and benchmark, and I would ideally like to test the module that is in my git repo without having to move it around.
@darkstego Looks great! New results from me. All cases are now faster, except a negligible slowdown on short Unicode strings, from your benchmarks for different string lengths.
Crystal is now 1.6x faster than NodeJS (from previously being 8x slower).
Crystal with the cutoff from https://gist.github.com/darkstego/c41ea58505186542e4f9fdc5079fb1f4 is 6.6x faster than NodeJS.
# Myers Algorithm for ASCII and Unicode
#
# The algorithm uses a dictionary to store string char locations as bits.
# The ASCII implementation uses a StaticArray, while for full Unicode a Hash is used.
The explanation is quite good, but why use a Hash for Unicode?
Briefly, the dictionary maps a char to a bit-vector (UInt32 or UInt64), so a Hash is used in the general case. For ASCII chars the codepoint is between 0 and 127, so for a performance boost we can use a StaticArray of size 128 and use the char codepoint as the index of the slot containing the appropriate bit-vector. If we used arrays and indices for the entire Unicode range, it would have to be an array of size 1,114,112.
When I started I tried making a StaticArray of that size, but it was just not working (I guess trying to put something that large on the stack was causing issues). I did try using a regular Array, but from early testing the Hash not only used much less memory but also had better performance, so I stuck with a Hash.
Thanks! The issue with StaticArray is related to #2485. It may not make any major difference, though. I was mainly wondering about not using Array.
Levenshtein distance upper bound is the length of the longer string
Anything holding up this PR? It's so optimized and balanced. Excellent job from @darkstego!
Sorry for the very long delay @darkstego, and thanks for the awesome work! It would be great to add a few test cases for longer strings, to make sure we are covering all the different algorithms that are implemented. I can take care of that if you prefer. 🙇
I was actually thinking about moving this into a shard a while back, since I didn't know if it would be mainlined. The algorithm is much more efficient at the cost of more complex code. @beta-ziliani I will see if I can track down some of the test code I had. Since the algorithm has a window size (32 or 64 depending on the architecture), it was important to test strings that were multiple window sizes in length, as well as the edge case where the string is exactly equal to the window size (64 in my case). I believe I had a program that generated strings of random lengths and compared the results from my code to the base implementation just to be sure.
I vote to mainline the code.
In terms of tests, the idea is not to have anything fancy, just a couple of examples with different sizes will do. |
Added a longer Unicode test, as well as ASCII tests of lengths 32, 64, and >64.
Added tests for the Levenshtein implementation: long Unicode, and ASCII of lengths 32, 64, and 70, to cover different sizes and all edge cases.
Thanks a lot, @darkstego! I hope your PR gets merged soon by the Crystal team. Excellent contribution!
With maximum respect to the Crystal team, it seems that this PR fulfills all the requirements presented, with praise. The PR's author addressed the last requested changes more than one month ago.
last_cost = i + 1

s_size.times do |j|
  sub_cost = l[i] == s[j] ? 0 : 1
single_byte_optimizable? means bytes higher than 0x7F should all compare equal, since they behave like the replacement character:
Levenshtein.distance("\x81", "\x82").should eq(0)
Levenshtein.distance("\x82", "\x81").should eq(0)
Levenshtein.distance("\x81" * 33, String.new(Bytes.new(33) { |i| 0x80_u8 + i })).should eq(0)
# okay, as these use the `Char::Reader` overload instead
Levenshtein.distance("\x81", "\uFFFD").should eq(0)
Levenshtein.distance("\uFFFD", "\x81").should eq(0)
Suggested change:
- sub_cost = l[i] == s[j] ? 0 : 1
+ l_i = {l[i], 0x80_u8}.min
+ s_j = {s[j], 0x80_u8}.min
+ sub_cost = l_i == s_j ? 0 : 1
This also means the ascii_only? branches could be relaxed to single_byte_optimizable?, operating over an alphabet of 129 "code points" where again bytes larger than 0x7F are mapped to 0x80.
Of course, most strings are already valid UTF-8, so this is a rather low priority pre-existing issue. Feel free to ignore in this PR.
You bring up an interesting point. I am not sure I ever really thought about what the Levenshtein distance of invalid strings would look like. But having said that, I don't even know what the point of this single_byte_optimizable? path is, because if I am not mistaken no valid string will ever end up there: if it is ASCII it will call a different function, and if it isn't then it won't be single-byte-optimizable.
I think past me, in an effort to optimize for speed, looked into using unsafe pointers in non-ASCII strings to bring a speed boost to that specific scenario. And correct me if I am wrong, isn't this segment of code just providing a faster way to measure the Levenshtein distance of invalid strings? If that is the case, then why even have it?
Maybe getting rid of that if block would be the cleanest solution.
A string that consists only of ASCII characters plus invalid UTF-8 byte sequences, but not valid ones of 2 bytes or more (code point 0x80 or above), is single_byte_optimizable? but not ascii_only?.
I guess my question is how often these occur in the wild. Is speeding up the calculation of the Levenshtein distance of strings that are single_byte_optimizable? but not ascii_only? even of value to anyone? Especially since the Char::Reader path is plenty fast to begin with, and handles the invalid bytes properly.
This implements a faster algorithm for calculating Levenshtein distance. Fixes #11335.
The algorithm is 3x faster than the original one for short strings and 10x for long ones.