
Implement Myers algorithm for Levenshtein distance calculation #11370

Open · darkstego wants to merge 21 commits into master

Conversation

@darkstego (Contributor)

This implements a faster algorithm for calculating Levenshtein distance. Fixes #11335.

The algorithm is 3x faster than the original one for short strings and 10x for long ones.

@caspiano (Contributor)

Could you please run crystal tool format?

BTW, this looks incredible!

@caspiano (Contributor) commented Oct 27, 2021

I got these results against the benchmark from #8324
Note: the benchmark does not test the algorithm against longer strings.

gist here.

old 404.67  (  2.47ms) (± 3.54%)   195kB/op   1.72× slower
new 696.13  (  1.44ms) (± 6.50%)  1.72MB/op        fastest

@darkstego (Contributor, Author)

The implementation could still be faster with some changes. Compared against a separate ASCII-only method, this submitted implementation is 1.5x slower; but that isn't a very DRY approach, and I went back and forth about implementing it. The only differences are that the character bit-vector store is not conditionally defined and there is no case statement to clear it.

The real value, though, is for long strings, because this algorithm is O(⌈m/w⌉·n) whereas the existing one is O(m·n). I am sure there are further optimizations that can be done. Here w is the word size, currently set to 32; it should be 64 on 64-bit systems, which would make it even faster on long strings.
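For reference, here is a minimal single-word sketch (pattern length m ≤ w = 32, ASCII input assumed) of the Myers bit-parallel recurrence this PR builds on. The names and single-chunk structure are illustrative only, not the PR's actual code:

# Single-word Myers sketch: pattern length m <= 32, ASCII-only input assumed.
def myers32(pattern : String, text : String) : Int32
  return text.size if pattern.empty?
  m = pattern.size
  raise ArgumentError.new("pattern longer than word size") if m > 32

  # Bit-vector dictionary: bit i of peq[c] is set iff pattern[i] == c.
  peq = StaticArray(UInt32, 128).new(0_u32)
  pattern.each_char_with_index { |c, i| peq[c.ord] |= 1_u32 << i }

  pv = UInt32::MAX # vertical +1 deltas
  mv = 0_u32       # vertical -1 deltas
  score = m
  last = 1_u32 << (m - 1)

  text.each_char do |c|
    eq = peq[c.ord]
    xv = eq | mv
    xh = (((eq & pv) &+ pv) ^ pv) | eq
    ph = mv | ~(xh | pv) # horizontal +1 deltas
    mh = pv & xh         # horizontal -1 deltas
    score += 1 if ph & last != 0
    score -= 1 if mh & last != 0
    ph = (ph << 1) | 1_u32
    pv = (mh << 1) | ~(xv | ph)
    mv = ph & xv
  end
  score
end

puts myers32("algorithm", "altruistic") # => 6

The O(⌈m/w⌉·n) version processes string1 in word-sized chunks and carries the horizontal deltas from one chunk to the next.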

@vlazar (Contributor) commented Oct 27, 2021

I've tried benchmarking the new implementation using this code: #11335 (comment)

  • MacOS Catalina 10.15.7
  • 2.3 GHz Quad-Core Intel Core i7
  • NodeJS 16.13.0
  • Crystal 1.2.1 (LLVM 11.1.0)

Now it's just a tiny bit slower than what is presumably the fastest Levenshtein distance implementation for JavaScript, from the fastest-levenshtein npm package:

Crystal 1.2.1

% hyperfine ./lev_myers
Benchmark #1: ./lev_builtin_myers
  Time (mean ± σ):      7.616 s ±  0.146 s    [User: 7.585 s, System: 0.031 s]
  Range (min … max):    7.355 s …  7.805 s    10 runs

NodeJS 16.13.0

% hyperfine 'node lev_node.js'
Benchmark #1: node lev_node.js
  Time (mean ± σ):      7.194 s ±  0.066 s    [User: 7.053 s, System: 0.189 s]
  Range (min … max):    7.093 s …  7.331 s    10 runs

score = m
ascii = string1.ascii_only? && string2.ascii_only?

pmr = ascii ? StaticArray(UInt32, 128).new(0) : Hash(Int32, UInt32).new(w) { 0.to_u32 }
Member:

Because this makes pmr a union type, any use of it will result in a multidispatch.

I suggest splitting the rest of this method in two based on ascii or not. Something like:

if ascii
  pmr = StaticArray(UInt32, 128).new(0)
  rest_of_the_code(lpos, score, pmr, etc.)
else
  pmr = Hash(Int32, UInt32).new(w) { 0.to_u32 }
  rest_of_the_code(lpos, score, pmr, etc.)
end

That will cause rest_of_the_code to be instantiated once per type of pmr, hopefully resulting in faster code, and code that LLVM will be able to inline better.
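For illustration, a hypothetical snippet (not from the PR) showing the union type that triggers the multidispatch:

# The ternary gives pmr a union type, so every call on pmr dispatches at runtime.
ascii = ARGV.empty?
pmr = ascii ? StaticArray(UInt32, 128).new(0_u32) : Hash(Int32, UInt32).new { 0_u32 }
puts typeof(pmr) # a union of StaticArray(UInt32, 128) and Hash(Int32, UInt32)
pmr[0] = 1_u32   # multidispatch: both StaticArray#[]= and Hash#[]= are candidates

Splitting the method as suggested makes each branch see a single concrete type instead.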

Member:

Why use a Hash, and not a StaticArray, when it's not ASCII?

Contributor (Author):

I do get a 50% uptick if I split it into ASCII and non-ASCII methods and just check in the main function. Not very DRY, but it is faster, so I will commit it shortly.

Contributor (Author):

Why use a Hash, and not a StaticArray, when it's not ASCII?

This is used as a dictionary. On every loop I traverse a chunk of string1 (size 32 in this case) and note in a bit vector where each character occurs in the chunk. So I need a dictionary that maps a string character to a 32-bit int (my vector). The algorithm traverses string1 exactly once and uses the created dictionary on every column (string2), hence the speedup.

One way to build this dictionary is a hash of size 32 (the chunk size). But with ASCII I am guaranteed that the codepoint will be less than 128, so I make a small StaticArray of that size and use the codepoint as an index, to speed up the lookup.
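A small sketch of that dictionary for a single chunk (names hypothetical):

# Bit i of pm[c] records that character c occurs at position i of the chunk.
chunk = "hello"
pm = Hash(Char, UInt32).new(0_u32)
chunk.each_char_with_index { |c, i| pm[c] |= 1_u32 << i }
puts pm['l'].to_s(2) # => "1100" ('l' occurs at chunk positions 2 and 3)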

Contributor (Author):

Ideally I would instantiate pmr and then pass it into my method. The only problem is that clearing pmr at the end of each loop is done differently depending on whether it is a StaticArray or a Hash.

That said, since ASCII also allows the use of pointers, the methods have diverged slightly. There is still more code duplication than I would like, and I feel there is a way to organize it better.

Member:

Ideally I would instantiate pmr and then pass it into my method. The only problem is that clearing pmr at the end of each loop is done differently depending on whether it is a StaticArray or a Hash.

It's true. In that case you can still leave the case statement you had before. Because the method will be instantiated separately for the two types, the compiler will optimize that case statement away, so there will be no check at the end.

That said, if the code has diverged a bit, then it's fine to keep the duplication.

If they differ by only one thing, you can consider using a third method that receives a block. Because blocks are inlined, it will be the same as writing two different versions with just one difference (let me know if this is not clear; I can provide a small code example).
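A minimal sketch of that block-based factoring, under hypothetical names — the shared body is written once and each call site supplies the one divergent clearing step:

def scan_chunks(dict)
  3.times do
    # ... shared per-chunk work using dict ...
    yield # the one divergent step: clear the dictionary between chunks
  end
end

arr = StaticArray(UInt32, 128).new(0_u32)
scan_chunks(arr) { arr.size.times { |i| arr[i] = 0_u32 } }

hash = Hash(Char, UInt32).new(0_u32)
scan_chunks(hash) { hash.clear }

Since scan_chunks is instantiated once per dictionary type and the block is inlined, this should compile to the same code as two hand-written variants.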

Contributor (Author):

I thought about the block method, but the number of variables that needed to be passed was too large, and some would need to be passed by reference, so I would have to encapsulate them. It quickly became more of a mess.

I am trying to use macros to create the two methods (ASCII, Unicode) and reduce code duplication. I did implement that and it passed the specs, but when benchmarking it doesn't terminate, so I still have a bit of debugging to do. Macros also open up the potential for a 64-bit implementation, which does appear to be faster, but that will require some debugging as well.

@darkstego (Contributor, Author)

So I decided to really optimize for performance. This latest commit should get a 1.8x speedup over my old commit on long ASCII texts.

One potential avenue for further speedup is making a 64-bit version of this that would be compiled on 64-bit systems.

@darkstego (Contributor, Author)

Sorry about the constant formatting fixes. I don't know why my pre-commit hooks aren't working.

Co-authored-by: Sijawusz Pur Rahnama <[email protected]>
vn = 0

# prepare char bit vector
s = string1[r*w, w]
Member:

This allocates a new string. Maybe by using Char::Reader there's a way to avoid this allocation and speed things up further.

It could be done in a follow up PR, though!
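A hedged sketch of that idea — Char::Reader walks the string in place, so no per-chunk substring is allocated:

string1 = "some long input"
reader = Char::Reader.new(string1)
while reader.has_next?
  ch = reader.current_char
  # ... feed ch into the current chunk's bit-vector dictionary ...
  reader.next_char
end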

Contributor (Author):

Yeah, this is getting quite messy, and I am a couple of branches deep in optimization experiments. I will close this and try to clean things up before setting up another PR.

Member:

Sure!

Also, it's totally fine to get something slightly better working (though in this case it's already a huge performance improvement), and then apply the suggestions or further improvements in other PRs.

Finally, a bit of code duplication is also fine! That's usually better than introducing macros and making the code harder to read.

And thank you for your patience! 🙏

@darkstego darkstego closed this Oct 27, 2021
@darkstego darkstego reopened this Oct 27, 2021
@darkstego (Contributor, Author)

So I finally have what I feel is a complete implementation. I implemented a 64-bit version of the algorithm for when it is compiled on 64-bit systems. I implemented the ASCII and Unicode versions through macros, mainly because I kept forgetting to apply changes to both codebases. They diverge in a few spots, mostly to do with the dictionary implementation, and those spots are commented, so it isn't bad at all.

The performance is really good now: currently clocking in at 30x over the original at a string length of 1500 on 64-bit systems.

Default 210.39  (  4.75ms) (± 0.49%)  5.83kB/op  30.73× slower
Orig PR   1.86k (538.91µs) (± 3.07%)  13.9kB/op   3.48× slower
Latest   6.46k (154.69µs) (± 1.57%)  23.3kB/op        fastest

That is now over 3x the speed of the original PR.

I did run a test to compare against the NodeJS implementation, and it was over twice as fast on my system. I used the method described in this #11335 comment.

  • Linux 5.14.7-2-MANJARO
  • AMD Ryzen Threadripper 2950X 16-Core Processor
  • Node 16.11.0

Node JS

real    0m4.905s
user    0m4.868s
sys     0m0.020s

Crystal

real    0m1.993s
user    0m2.011s
sys     0m0.071s

@vlazar (Contributor) commented Oct 28, 2021

🚀 Wow, with the latest changes this implementation is now 2.4x faster than NodeJS on this benchmark for me: #11335 (comment)

Great job @darkstego !

@vlazar (Contributor) commented Oct 28, 2021

One thing to note is that memory consumption has grown in the new implementation.

I've run the benchmark already mentioned here #11370 (comment)

My numbers:

old 305.94  (  3.27ms) (± 3.08%)   195kB/op   1.59× slower
new 486.96  (  2.05ms) (± 1.34%)  3.24MB/op        fastest

It uses a case with 100,000-character strings, though. Not sure how real-world this use case is.

If I remove these 2 cases

    Levenshtein.new_distance("a" * 100000, "hello")
    Levenshtein.new_distance("hello", "a" * 100000)

The memory consumption is still much higher, but it's not in MBs anymore.

But notice the results: for some reason it's now slower than the original? 😕

old   1.23M (812.94ns) (± 0.82%)    208B/op        fastest
new 511.08k (  1.96µs) (± 0.90%)  2.37kB/op   2.41× slower

Code:

# From https://github.com/crystal-lang/crystal/pull/11370
Benchmark.ips do |bm|
  bm.report "old" do
    Levenshtein.distance("algorithm", "altruistic")
    Levenshtein.distance("hello", "hallo")
    Levenshtein.distance("こんにちは", "こんちは")
    Levenshtein.distance("hey", "hey")
    Levenshtein.distance("hippo", "zzzzzzzz")
    # Levenshtein.distance("a" * 100000, "hello")
    # Levenshtein.distance("hello", "a" * 100000)
  end

  bm.report "new" do
    LevenshteinNew.distance("algorithm", "altruistic")
    LevenshteinNew.distance("hello", "hallo")
    LevenshteinNew.distance("こんにちは", "こんちは")
    LevenshteinNew.distance("hey", "hey")
    LevenshteinNew.distance("hippo", "zzzzzzzz")
    # LevenshteinNew.distance("a" * 100000, "hello")
    # LevenshteinNew.distance("hello", "a" * 100000)
  end
end

@vlazar (Contributor) commented Oct 28, 2021

I've run the cases one by one and added results next to each one that's slower. The biggest slowdown seems to be for short strings with a small distance, especially for Unicode strings.

Benchmark.ips do |bm|
  bm.report "old" do
    Levenshtein.distance("algorithm", "altruistic") # 1.15× slower
    Levenshtein.distance("hello", "hallo")
    Levenshtein.distance("こんにちは", "こんちは")
    Levenshtein.distance("hey", "hey")
    Levenshtein.distance("hippo", "zzzzzzzz")
    Levenshtein.distance("a" * 100000, "hello") # 1.75× slower
    Levenshtein.distance("hello", "a" * 100000) # 1.80× slower
  end

  bm.report "new" do
    LevenshteinNew.distance("algorithm", "altruistic")
    LevenshteinNew.distance("hello", "hallo") # 2.06× slower
    LevenshteinNew.distance("こんにちは", "こんちは") # 4.65× slower
    LevenshteinNew.distance("hey", "hey")
    LevenshteinNew.distance("hippo", "zzzzzzzz") # 1.47× slower
    LevenshteinNew.distance("a" * 100000, "hello")
    LevenshteinNew.distance("hello", "a" * 100000)
  end
end

I wonder which cases are used most frequently in the real world 🙂 It would be nice to optimize for those the most.

@straight-shoota (Member)

I'm pretty sure that short strings are a very common use case for Levenshtein distance (think search query strings, for example).
So we should make sure that performance does at least not deteriorate for short strings. There should be individual benchmarks for different string lengths.

@darkstego (Contributor, Author)

I'm pretty sure that short strings are a very common use case for Levenshtein distance (think search query strings, for example).
So we should make sure that performance does at least not deteriorate for short strings. There should be individual benchmarks for different string lengths.

I know why this is happening. For short Unicode strings, the dictionary used is a hash. It is created and populated before the columns of the matrix are traversed, but that step takes time, and if the string is short the improved big-O of the algorithm doesn't get a chance to offset it. In my early testing I found the cutoff point for Unicode was about 30 chars or so.

The fix is pretty easy: just call the old algorithm when the strings are below the cutoff length.

The StaticArray (ascii) method was always faster for me, so there is a regression somewhere, and I have a good idea where that might be.

If anyone has any suggestion on an efficient way to store and retrieve bit information in Crystal that would help a lot. The algorithm needs 2 things:

  • A dictionary that maps Char to an int of size (word width). The dictionary only needs to store (word width) entries.
  • An array of bits of size equal to the longer string.

I just realized there is a BitArray in Crystal and might look at that; if I am not mistaken it is backed by UInt32 words.

One thing to note is that memory consumption has grown in the new implementation.

I know the reason for this. The algorithm stores a bit array of length equal to the longest string, and I encoded each bit as a UInt64 on 64-bit systems. There is potential to reduce the memory consumption by quite a bit.
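For scale, a sketch contrasting the two layouts mentioned above — BitArray packs its bits into UInt32 words, versus one full word per flag (sizes approximate):

require "bit_array"

flags = BitArray.new(100_000)             # ~12.5 KB of bit storage
words = Array(UInt64).new(100_000, 0_u64) # ~800 KB for the same flags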

@vlazar (Contributor) commented Oct 28, 2021

@darkstego No worries at all! I was just sharing my numbers for the record (for perspective). The current code is much more balanced and optimized for more realistic use cases.

@darkstego (Contributor, Author)

Another improvement just added: I realized that if a cutoff is given to the algorithm, it can abort as soon as it knows the distance will be larger than the cutoff. This makes for a huge improvement when searching for the best match among long strings. You can now find the best match in the NodeJS example with a 5x speedup over NodeJS by using a cutoff. distance ends up with an optional cutoff parameter.

Sorry the commits weren't squashed. I thought I did that before uploading.
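A hedged usage sketch of the cutoff; the parameter position is assumed from this discussion, not taken from the final API:

require "levenshtein"

Levenshtein.distance("algorithm", "altruistic") # full distance: 6

# With a cutoff, the method may stop early and return a lower bound greater
# than the cutoff once the true distance provably exceeds it:
Levenshtein.distance("a" * 100_000, "hello", 5)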

@vlazar (Contributor) commented Oct 29, 2021

@darkstego I've tried running the benchmark you mentioned here #11370 (comment) and ran into this error:

% ./levenshtein_benchmark
Original: ASCII-Size:3  14.23M ( 70.29ns) (± 1.19%)  32.0B/op   1.59× slower
     New: ASCII-Size:3  22.69M ( 44.08ns) (± 4.48%)   0.0B/op        fastest
Original: ASCII-Size:6   5.73M (174.59ns) (± 4.42%)  32.0B/op   2.68× slower
     New: ASCII-Size:6  15.33M ( 65.21ns) (± 4.50%)   0.0B/op        fastest
Original: ASCII-Size:12   1.83M (545.93ns) (± 3.55%)  64.0B/op   4.99× slower
     New: ASCII-Size:12   9.13M (109.50ns) (± 3.84%)   0.0B/op        fastest
Original: ASCII-Size:24 527.97k (  1.89µs) (± 3.47%)  112B/op   9.59× slower
     New: ASCII-Size:24   5.06M (197.50ns) (± 3.36%)  0.0B/op        fastest
Original: ASCII-Size:50 128.05k (  7.81µs) (± 3.68%)   208B/op  13.12× slower
     New: ASCII-Size:50   1.68M (595.30ns) (± 3.34%)  32.0B/op        fastest
Original: ASCII-Size:100  32.28k ( 30.97µs) (± 3.48%)   448B/op  16.99× slower
     New: ASCII-Size:100 548.43k (  1.82µs) (± 3.61%)  64.0B/op        fastest
Original: ASCII-Size:500   1.29k (773.83µs) (± 2.95%)  2.0kB/op  25.50× slower
     New: ASCII-Size:500  32.95k ( 30.35µs) (± 3.55%)   160B/op        fastest
Original: ASCII-Size:1000 322.31  (  3.10ms) (± 2.83%)  3.93kB/op  26.05× slower
     New: ASCII-Size:1000   8.40k (119.12µs) (± 3.26%)    288B/op        fastest
Original: Unicode-Size:3   5.64M (177.19ns) (± 1.32%)  80.0B/op        fastest
     New: Unicode-Size:3   5.35M (186.76ns) (± 0.94%)  80.0B/op   1.05× slower
Original: Unicode-Size:6   3.27M (305.84ns) (± 1.02%)  96.0B/op        fastest
     New: Unicode-Size:6   3.17M (315.20ns) (± 0.89%)  96.0B/op   1.03× slower
Unhandled exception: 0xd9dc out of char range (ArgumentError)
  from /usr/local/Cellar/crystal/1.2.1/src/pointer.cr:437:13 in '__crystal_main'
  from /usr/local/Cellar/crystal/1.2.1/src/crystal/main.cr:110:5 in 'main'

I've updated the benchmark to the latest code here and it still gives me an error:

% ./levenshtein_benchmark
Original: ASCII-Size:3  14.13M ( 70.78ns) (± 1.45%)  32.0B/op   1.59× slower
     New: ASCII-Size:3  22.49M ( 44.47ns) (± 4.27%)   0.0B/op        fastest
Original: ASCII-Size:6   5.57M (179.47ns) (± 3.63%)  32.0B/op   2.78× slower
     New: ASCII-Size:6  15.52M ( 64.45ns) (± 4.01%)   0.0B/op        fastest
Original: ASCII-Size:12   1.90M (525.09ns) (± 4.73%)  64.0B/op   4.72× slower
     New: ASCII-Size:12   8.99M (111.22ns) (± 4.00%)   0.0B/op        fastest
Original: ASCII-Size:24 527.29k (  1.90µs) (± 3.69%)  112B/op   9.17× slower
     New: ASCII-Size:24   4.83M (206.84ns) (± 3.96%)  0.0B/op        fastest
Original: ASCII-Size:50 128.70k (  7.77µs) (± 3.14%)   208B/op  13.52× slower
     New: ASCII-Size:50   1.74M (574.90ns) (± 4.11%)  32.0B/op        fastest
Original: ASCII-Size:100  31.98k ( 31.27µs) (± 3.06%)   448B/op  17.40× slower
     New: ASCII-Size:100 556.38k (  1.80µs) (± 4.01%)  64.0B/op        fastest
Original: ASCII-Size:500   1.26k (795.30µs) (± 5.53%)  2.0kB/op  23.94× slower
     New: ASCII-Size:500  30.10k ( 33.22µs) (± 4.20%)   160B/op        fastest
Original: ASCII-Size:1000 320.76  (  3.12ms) (± 3.79%)  3.93kB/op  21.52× slower
     New: ASCII-Size:1000   6.90k (144.87µs) (± 4.14%)    288B/op        fastest
Original: Unicode-Size:3   6.05M (165.34ns) (± 1.33%)  80.0B/op        fastest
     New: Unicode-Size:3   5.78M (173.00ns) (± 1.21%)  80.0B/op   1.05× slower
Original: Unicode-Size:6   3.33M (300.39ns) (± 1.97%)  96.0B/op        fastest
     New: Unicode-Size:6   3.25M (307.39ns) (± 1.24%)  96.0B/op   1.02× slower
Original: Unicode-Size:12   1.21M (829.06ns) (± 4.56%)  160B/op        fastest
     New: Unicode-Size:12   1.17M (857.51ns) (± 4.81%)  160B/op   1.03× slower
Original: Unicode-Size:24 436.66k (  2.29µs) (± 4.00%)  256B/op   1.01× slower
     New: Unicode-Size:24 440.09k (  2.27µs) (± 3.40%)  256B/op        fastest
Original: Unicode-Size:50 120.21k (  8.32µs) (± 3.58%)  448B/op   1.00× slower
     New: Unicode-Size:50 120.39k (  8.31µs) (± 3.31%)  448B/op        fastest
Original: Unicode-Size:100  33.75k ( 29.63µs) (± 3.37%)    928B/op   3.55× slower
     New: Unicode-Size:100 119.78k (  8.35µs) (± 0.80%)  2.06kB/op        fastest
Unhandled exception: 0xdd53 out of char range (ArgumentError)
  from /usr/local/Cellar/crystal/1.2.1/src/pointer.cr:437:13 in '__crystal_main'
  from /usr/local/Cellar/crystal/1.2.1/src/crystal/main.cr:110:5 in 'main'
% crystal -v
Crystal 1.2.1 (2021-10-21)

LLVM: 11.1.0
Default target: x86_64-apple-macosx

% sysctl -n machdep.cpu.brand_string
Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz

@darkstego (Contributor, Author)

@vlazar It seems the testing code was generating illegal chars. I unfortunately didn't realize there is a gap in the Unicode space (the surrogate range) that needs to be accounted for. Try revision 3 of the gist.

@vlazar (Contributor) commented Oct 29, 2021

@darkstego I've tried the 3rd revision and the results look suspicious now 😆

Original: ASCII-Size:3  84.00M ( 11.90ns) (± 5.95%)  0.0B/op        fastest
     New: ASCII-Size:3  83.62M ( 11.96ns) (± 6.11%)  0.0B/op   1.00× slower
Original: ASCII-Size:6  56.30M ( 17.76ns) (± 5.75%)  0.0B/op        fastest
     New: ASCII-Size:6  54.48M ( 18.36ns) (± 6.83%)  0.0B/op   1.03× slower
Original: ASCII-Size:12  81.12M ( 12.33ns) (± 6.54%)  0.0B/op   1.00× slower
     New: ASCII-Size:12  81.46M ( 12.28ns) (± 6.00%)  0.0B/op        fastest
Original: ASCII-Size:24  66.20M ( 15.11ns) (± 5.56%)  0.0B/op        fastest
     New: ASCII-Size:24  65.12M ( 15.36ns) (± 6.52%)  0.0B/op   1.02× slower
Original: ASCII-Size:50  48.46M ( 20.64ns) (± 6.71%)  0.0B/op        fastest
     New: ASCII-Size:50  47.85M ( 20.90ns) (± 5.60%)  0.0B/op   1.01× slower
Original: ASCII-Size:100  34.63M ( 28.88ns) (± 4.48%)  0.0B/op        fastest
     New: ASCII-Size:100  34.11M ( 29.32ns) (± 4.90%)  0.0B/op   1.02× slower
Original: ASCII-Size:500  10.51M ( 95.18ns) (± 3.72%)  0.0B/op        fastest
     New: ASCII-Size:500  10.46M ( 95.60ns) (± 3.80%)  0.0B/op   1.00× slower
Original: ASCII-Size:1000   6.31M (158.47ns) (± 4.93%)  0.0B/op        fastest
     New: ASCII-Size:1000   6.31M (158.59ns) (± 5.67%)  0.0B/op   1.00× slower
Original: Unicode-Size:3  82.80M ( 12.08ns) (± 6.39%)  0.0B/op        fastest
     New: Unicode-Size:3  81.51M ( 12.27ns) (± 6.77%)  0.0B/op   1.02× slower
Original: Unicode-Size:6  55.49M ( 18.02ns) (± 6.11%)  0.0B/op        fastest
     New: Unicode-Size:6  54.78M ( 18.25ns) (± 6.07%)  0.0B/op   1.01× slower
...

Comment on lines +15 to +16
# If *cutoff* is given then the method is allowed to end once the lowest
# possible bound is greater than *cutoff* and return that lower bound.
Member:

Could a doc/code example for this be given?

Alternatively, I think it's totally fine to add the cutoff logic in a separate PR. Given that it's a new feature, it could spark more discussion, which could make this PR less likely, or slower, to be merged (just a guess! I'm not actually requesting this).

Contributor (Author):

Would a pull request onto my feature branch show up here? I haven't really made a pull request onto a pull request before.

Member:

Good point. I don't think so, but also, the other PR can come/appear after this one is merged.

Contributor (Author):

I could also make a new pull request and first squash all the features together. Whatever is easier for you guys.

@darkstego (Contributor, Author)

@darkstego I've tried the 3rd revision and the results look suspicious now 😆

@vlazar Sorry about that. It was benchmarking two identical strings all the time. Fixed and tested. I also have a test to compare the distance results between the old and new methods to make sure the data is correct.

Is there a way to import a module from a file under a different name? There is a lot of copying and pasting that needs to be done to test and benchmark, and I would ideally like to test the module that is in my git repo without having to move it around.

@vlazar (Contributor) commented Oct 30, 2021

@darkstego Looks great!

New results from me. All cases are now faster, except for a negligible slowdown on short Unicode strings in your benchmarks for different string lengths.

  1. Comparing to NodeJS on the benchmark from the original report, Levenshtein slow and high on CPU usage #11335

Crystal (now 1.6x faster than NodeJS from previously being 8x slower)

Time (mean ± σ):      4.404 s ±  0.084 s    [User: 4.364 s, System: 0.019 s]
Range (min … max):    4.227 s …  4.505 s    10 runs

Crystal with cutoff from https://gist.github.com/darkstego/c41ea58505186542e4f9fdc5079fb1f4 is 6.6x faster than NodeJS

Time (mean ± σ):      1.103 s ±  0.020 s    [User: 1.093 s, System: 0.008 s]
Range (min … max):    1.073 s …  1.139 s    10 runs

NodeJS

Time (mean ± σ):      7.269 s ±  0.062 s    [User: 7.133 s, System: 0.183 s]
Range (min … max):    7.163 s …  7.335 s    10 runs
  2. Benchmark from Implement Myers algorithm for Levenshtein distance calculation #11370 (comment) (short and long strings, small and big distances)
old 296.80  (  3.37ms) (± 2.34%)  195kB/op   2.63× slower
new 781.55  (  1.28ms) (± 3.19%)  244kB/op        fastest
  3. Your benchmark from https://gist.github.com/darkstego/b7f512780454c9088cc8b26a8a6af888
Original: ASCII-Size:3  13.98M ( 71.55ns) (± 1.22%)  32.0B/op   1.56× slower
     New: ASCII-Size:3  21.76M ( 45.95ns) (± 4.73%)   0.0B/op        fastest
Original: ASCII-Size:6   5.46M (183.25ns) (± 3.98%)  32.0B/op   2.66× slower
     New: ASCII-Size:6  14.53M ( 68.80ns) (± 4.53%)   0.0B/op        fastest
Original: ASCII-Size:12   1.81M (552.07ns) (± 3.60%)  64.0B/op   4.80× slower
     New: ASCII-Size:12   8.69M (115.12ns) (± 4.06%)   0.0B/op        fastest
Original: ASCII-Size:32 309.26k (  3.23µs) (± 2.91%)   144B/op   7.97× slower
     New: ASCII-Size:32   2.47M (405.65ns) (± 3.26%)  32.0B/op        fastest
Original: ASCII-Size:64  83.17k ( 12.02µs) (± 3.13%)   272B/op  16.74× slower
     New: ASCII-Size:64   1.39M (718.04ns) (± 3.50%)  32.0B/op        fastest
Original: ASCII-Size:100  36.12k ( 27.69µs) (± 3.39%)   448B/op  15.08× slower
     New: ASCII-Size:100 544.75k (  1.84µs) (± 2.76%)  64.0B/op        fastest
Original: ASCII-Size:500   1.60k (623.85µs) (± 5.44%)  2.0kB/op  20.85× slower
     New: ASCII-Size:500  33.43k ( 29.91µs) (± 5.14%)   160B/op        fastest
Original: ASCII-Size:1000 413.47  (  2.42ms) (± 3.95%)  3.93kB/op  21.37× slower
     New: ASCII-Size:1000   8.84k (113.18µs) (± 3.17%)    288B/op        fastest
Original: Unicode-Size:3   6.04M (165.61ns) (± 1.35%)  80.0B/op        fastest
     New: Unicode-Size:3   5.67M (176.31ns) (± 0.95%)  80.0B/op   1.06× slower
Original: Unicode-Size:6   3.39M (294.95ns) (± 1.40%)  96.0B/op        fastest
     New: Unicode-Size:6   3.25M (307.54ns) (± 0.93%)  96.0B/op   1.04× slower
Original: Unicode-Size:12   1.20M (830.25ns) (± 4.47%)  160B/op        fastest
     New: Unicode-Size:12   1.19M (837.91ns) (± 5.32%)  160B/op   1.01× slower
Original: Unicode-Size:32 269.84k (  3.71µs) (± 3.04%)  320B/op        fastest
     New: Unicode-Size:32 269.68k (  3.71µs) (± 3.21%)  320B/op   1.00× slower
Original: Unicode-Size:64  77.03k ( 12.98µs) (± 2.94%)    576B/op   3.31× slower
     New: Unicode-Size:64 254.94k (  3.92µs) (± 0.70%)  1.86kB/op        fastest
Original: Unicode-Size:100  33.37k ( 29.97µs) (± 3.51%)    928B/op   3.96× slower
     New: Unicode-Size:100 132.29k (  7.56µs) (± 0.67%)  2.06kB/op        fastest
Original: Unicode-Size:500   1.52k (655.94µs) (± 3.33%)  4.03kB/op   7.06× slower
     New: Unicode-Size:500  10.76k ( 92.96µs) (± 3.17%)  3.72kB/op        fastest
Original: Unicode-Size:1000 384.41  (  2.60ms) (± 5.13%)  7.87kB/op   7.94× slower
     New: Unicode-Size:1000   3.05k (327.82µs) (± 3.60%)  5.76kB/op        fastest
Total Errors = 0

# Myers Algorithm for ASCII and Unicode
#
# The algorithm uses a dictionary to store string char locations as bits.
# The ASCII implementation uses a StaticArray, while for full Unicode a Hash is used.
Contributor:

The explanation is quite good, but why use a Hash for Unicode?

Contributor (Author):

Briefly, the dictionary maps a char to a bit-vector (UInt32 or UInt64), so a hash is used in the general case. For ASCII chars the codepoint is between 0 and 127, so for a performance boost we can use a StaticArray of size 128 and use the char codepoint as the index of the entry containing the appropriate bit-vector. If we used arrays and indices for the entire Unicode range, it would have to be an array of size 1,114,112.

When I started I tried making a StaticArray of that size, but it just wasn't working (I guess trying to put something that large on the stack was causing issues). I did try using a regular Array, but in early testing the Hash not only used much less memory, it also had better performance, so I stuck with the Hash.
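Side by side, the two dictionary strategies described above (names hypothetical):

# ASCII: codepoint-indexed StaticArray, 128 slots on the stack, O(1) lookup.
pm_ascii = StaticArray(UInt32, 128).new(0_u32)
pm_ascii['h'.ord] |= 1_u32

# General Unicode: a Hash with a zero default; only the few distinct chars of
# the current chunk ever get entries, instead of 1,114,112 array slots.
pm_unicode = Hash(Char, UInt32).new(0_u32)
pm_unicode['こ'] |= 1_u32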

Contributor:

Thanks! The issue with StaticArray is related to #2485. May not make any major difference though. I was mainly wondering about not using Array.

Levenshtein distance upper bound is the length of the longer string
@paulocoghi commented Mar 24, 2022

Is anything holding this PR back? It's so well optimized and balanced. Excellent job from @darkstego!

@beta-ziliani (Member)

Sorry for the very long delay @darkstego and thanks for the awesome work! It would be great to add a few test cases for longer strings, to make sure we are covering all the different algorithms that are implemented. I can take care of that if you prefer. 🙇

@darkstego (Contributor, Author)

I was actually thinking about moving this into a shard a while back, since I didn't know if it would be mainlined. The algorithm is much more efficient, at the cost of more complex code.

@beta-ziliani I will see if I can track down some of the test code I had. Since the algorithm has a window size (32 or 64, depending on the architecture), it was important to test strings that are several window-sizes in length, as well as edge cases where the string is exactly the window size (64 in my case). I believe I had a program that generated strings of random lengths and compared the results from my code against the base implementation, just to be sure.
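A minimal version of such a differential test might look like this; the DP oracle and the size list are illustrative:

require "levenshtein"

# Textbook O(m*n) dynamic-programming oracle to check against.
def reference(a : String, b : String) : Int32
  prev = (0..b.size).to_a
  a.each_char_with_index do |ca, i|
    curr = [i + 1]
    b.each_char_with_index do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << {prev[j + 1] + 1, curr[j] + 1, prev[j] + cost}.min
    end
    prev = curr
  end
  prev.last
end

alphabet = ('a'..'z').to_a
[31, 32, 33, 63, 64, 65, 200].each do |len| # straddle the 32/64 window sizes
  a = String.build { |io| len.times { io << alphabet.sample } }
  b = String.build { |io| len.times { io << alphabet.sample } }
  raise "mismatch at length #{len}" unless Levenshtein.distance(a, b) == reference(a, b)
end
puts "old and new implementations agree"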

@paulocoghi

I was actually thinking about moving this into a shard a while back, since I didn't know if it would be mainlined. The algorithm is much more efficient, at the cost of more complex code.

I vote to mainline the code

@beta-ziliani (Member)

In terms of tests, the idea is not to have anything fancy; just a couple of examples with different sizes will do.

Added a longer Unicode test, as well as ASCII tests of lengths
32, 64, and >64.
@darkstego (Contributor, Author)

Added tests for the Levenshtein implementation: a long Unicode test, plus ASCII tests of lengths 32, 64, and 70, to cover different sizes and all edge cases.

@paulocoghi

Thanks a lot, @darkstego! I hope your PR gets merged soon by the Crystal team. Excellent contribution!

@paulocoghi

With maximum respect to the Crystal team: this PR seems to fulfill all the requirements that were raised, and to praise from reviewers.

The PR's author added the last requested changes more than one month ago.

last_cost = i + 1

s_size.times do |j|
sub_cost = l[i] == s[j] ? 0 : 1
@HertzDevil (Contributor) commented Jul 8, 2022:

single_byte_optimizable? means bytes higher than 0x7F should all compare equal, since they behave like the replacement character:

Levenshtein.distance("\x81", "\x82").should eq(0)
Levenshtein.distance("\x82", "\x81").should eq(0)
Levenshtein.distance("\x81" * 33, String.new(Bytes.new(33) { |i| 0x80_u8 + i })).should eq(0)

# okay, as these use the `Char::Reader` overload instead
Levenshtein.distance("\x81", "\uFFFD").should eq(0)
Levenshtein.distance("\uFFFD", "\x81").should eq(0)
Suggested change:
-sub_cost = l[i] == s[j] ? 0 : 1
+l_i = {l[i], 0x80_u8}.min
+s_j = {s[j], 0x80_u8}.min
+sub_cost = l_i == s_j ? 0 : 1

This also means the ascii_only? branches could be relaxed to single_byte_optimizable?, operating over an alphabet of 129 "code points" where again bytes larger than 0x7F are mapped to 0x80.

Of course, most strings are already valid UTF-8, so this is a rather low priority pre-existing issue. Feel free to ignore in this PR.

Contributor (Author):

You bring up an interesting point. I am not sure I ever really thought about what the Levenshtein distance of invalid strings should look like. Having said that, I don't even know what the point of this single_byte_optimizable? path is, because if I am not mistaken no valid string will ever end up there: if a string is ASCII it will call a different function, and if it isn't, then it wouldn't be single-byte optimizable.

I think past me, in an effort to optimize for speed, looked into using unsafe pointers for non-ASCII strings to bring a speed boost to that specific scenario. And correct me if I am wrong, but isn't this segment of code just providing a faster way to measure the Levenshtein distance of invalid strings? If that is the case, why even have it?

Maybe getting rid of that 'if block' would be the cleanest solution.

Contributor:

A string that consists only of ASCII characters plus invalid UTF-8 byte sequences, but not valid ones of 2 bytes or more (code point 0x80 or above), is single_byte_optimizable? but not ascii_only?.

Contributor (Author):

I guess my question is how often these occur in the wild. Is speeding up the Levenshtein distance calculation for strings that are single_byte_optimizable? but not ascii_only? even of value to anyone? Especially since the Char::Reader path is plenty fast to begin with and handles the invalid bytes properly.

Successfully merging this pull request may close these issues: Levenshtein slow and high on CPU usage (#11335)

10 participants