Implement Myers algorithm for Levenshtein distance calculation #11370
base: master
Conversation
Could you please run […]? BTW, this looks incredible!
I got these results against the benchmark from #8324
The implementation could be faster still with some changes. Without a separate ASCII method, the submitted implementation is about 1.5x slower than it could be, but a split isn't a very DRY approach and I went back and forth about implementing it; the only difference would be that the character bit-vector store is not conditionally defined and there is no case statement to clear it. The real value, though, is for long strings, because this algorithm is O([m/w]n) whereas the existing one is O(mn). I am sure there are further optimizations that can be done. w is the word size and is set to 32, but it should be 64 on 64-bit systems, which should make it even faster on long strings.
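To make the idea concrete, here is a minimal single-word sketch of Myers' bit-parallel core (illustration only, not the PR's code): it assumes an ASCII pattern of at most 32 characters, whereas the PR processes longer strings in 32/64-bit blocks and handles Unicode.

# Illustrative sketch of the single-word Myers algorithm (not the PR's code).
# Assumes an ASCII pattern with pattern.size <= 32.
def myers_distance32(pattern : String, text : String) : Int32
  m = pattern.size
  return text.size if m == 0
  raise ArgumentError.new("pattern longer than 32 chars") if m > 32

  # Bit-vector dictionary: bit i of peq[c] is set when pattern[i] == c.
  peq = StaticArray(UInt32, 128).new(0_u32)
  pattern.each_char_with_index { |ch, i| peq[ch.ord] |= 1_u32 << i }

  pv = UInt32::MAX # positive vertical deltas
  mv = 0_u32       # negative vertical deltas
  score = m
  last = 1_u32 << (m - 1)

  text.each_char do |ch|
    eq = ch.ord < 128 ? peq[ch.ord] : 0_u32 # non-ASCII text chars never match
    xv = eq | mv
    xh = (((eq & pv) &+ pv) ^ pv) | eq
    ph = mv | ~(xh | pv)
    mh = pv & xh
    score += 1 if (ph & last) != 0
    score -= 1 if (mh & last) != 0
    ph = (ph << 1) | 1_u32
    mh = mh << 1
    pv = mh | ~(xv | ph)
    mv = ph & xv
  end
  score
end

myers_distance32("algorithm", "altruistic") # => 6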
I've tried benchmarking a new implementation using this code #11335 (comment)
Now it's just a tiny bit slower than presumably the fastest Levenshtein distance implementation for JavaScript (Crystal 1.2.1 vs. NodeJS 16.13.0).
src/levenshtein.cr
Outdated
score = m
ascii = string1.ascii_only? && string2.ascii_only?

pmr = ascii ? StaticArray(UInt32, 128).new(0) : Hash(Int32, UInt32).new(w) { 0.to_u32 }
Because this makes pmr be a union type, any use of it will result in a multidispatch. I suggest splitting the rest of this method in two based on ascii or not. Something like:
if ascii
  pmr = StaticArray(UInt32, 128).new(0)
  rest_of_the_code(lpos, score, pmr, etc.)
else
  pmr = Hash(Int32, UInt32).new(w) { 0.to_u32 }
  rest_of_the_code(lpos, score, pmr, etc.)
end
That will make rest_of_the_code be instantiated once per type of pmr, hopefully resulting in faster code, and code that LLVM will be able to inline better.
Why use a Hash when it's not ascii, and not a StaticArray?
I do get a 50% uptick if I split it into ASCII and non-ASCII methods and just check in the main function. Not very DRY, but it is faster, so I will commit it shortly.
> Why use a Hash when it's not ascii, and not a StaticArray?
This is used as a dictionary. On every loop I traverse a chunk of string1 (size 32 in this case) and note in a bit vector where each character occurs in the chunk. So I need a dictionary that maps a string character to a 32-bit int (my vector). This algorithm traverses string1 exactly once and reuses the created dictionary on every column (string2), hence the speedup.
One way to build this dictionary is a Hash of size 32 (the chunk size). But with ASCII I am guaranteed that the char will be less than 128, so I make a small StaticArray of that size and use the codepoint as an index to speed up the lookup.
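As a tiny illustration of that dictionary (hypothetical names, not the PR's exact code), here is how one 32-char chunk could be encoded with the general Hash variant:

w = 32
string1 = "the quick brown fox jumps over the lazy dog" # example input
chunk = string1[0, w]                                   # the real code loops over all chunks
pmr = Hash(Int32, UInt32).new { 0_u32 }                 # char codepoint => position bit-vector
chunk.each_char_with_index do |ch, i|
  pmr[ch.ord] |= 1_u32 << i
end
pmr['o'.ord] # bit i is set exactly when the chunk's i-th char is 'o'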
Ideally I would instantiate pmr and then pass it into my method. But the only problem is that clearing the pmr at the end of each loop is done differently depending on whether it is a StaticArray or a Hash.
That said, since ASCII also allows the use of pointers, the methods have diverged slightly. There is still way more code duplication than I would like, and I feel like there is a way to better organize it.
> Ideally I would instantiate pmr and then pass it into my method. But the only problem is that clearing the pmr at the end of each loop is done differently depending on whether it is a StaticArray or a Hash.
It's true. In that case you can still keep the case you had before: because the method will be instantiated with two different types separately, the compiler will optimize that case statement so there will be no check at the end.
That said, if the code paths diverged a bit then it's fine to keep them duplicated.
If they only differ by one thing, you can consider using a third method that receives a block. Because blocks are inlined, it will be the same as writing two different versions with just one difference (let me know if this is not clear, I can provide a small code example).
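For reference, a small hypothetical sketch of that block approach (the names are made up): the shared loop lives in one method and yields whenever the dictionary has to be cleared, so each caller supplies the StaticArray- or Hash-specific step. Since blocks are inlined, it compiles to the same code as two hand-written methods.

# Hypothetical sketch: shared loop, caller-specific dictionary reset via a block.
private def levenshtein_core(string1, string2, pmr, w)
  chunks = (string1.size + w - 1) // w
  chunks.times do |r|
    # ... build the bit-vector dictionary for chunk r and run the inner loop ...
    yield pmr # let the caller clear the dictionary in its own way
  end
end

# ASCII path: StaticArray dictionary, cleared by filling with zeros.
# levenshtein_core(string1, string2, StaticArray(UInt32, 128).new(0_u32), 32) { |d| d.fill(0_u32) }

# Unicode path: Hash dictionary, cleared with Hash#clear.
# levenshtein_core(string1, string2, Hash(Int32, UInt32).new { 0_u32 }, 32) { |d| d.clear }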
I thought about the block method, but the number of variables that needed to be passed was too large, and some need to be passed by reference, so I would need to encapsulate those. It quickly became more of a mess.
I am trying to use macros to create the two methods (ASCII, Unicode) and reduce code duplication. I did implement that and it passed the specs, but when benchmarking it doesn't terminate, so I still have a bit of debugging to do. Also, macros do open up the potential for a 64-bit implementation, which does appear to be faster, but will require some debugging.
So I decided to really optimize for performance. This latest commit should get a 1.8x speedup over my old commit on long ASCII texts. One potential avenue for further speedup is making an Int64 version of this that would be compiled on 64-bit systems.
Sorry about the constant formatting fixes. I don't know why my pre-commit hooks aren't working.
Co-authored-by: Sijawusz Pur Rahnama <[email protected]>
src/levenshtein.cr
Outdated
vn = 0

# prepare char bit vector
s = string1[r*w, w]
This allocates a new string. Maybe using Char::Reader there's a way to avoid this allocation and further speed things up. It could be done in a follow-up PR, though!
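Something along these lines, perhaps (a rough sketch with made-up names, not the PR's code): walk string1 once with Char::Reader and fill the chunk dictionary in place instead of allocating a substring per chunk.

w = 32
string1 = "example input string"        # stands in for the real argument
pmr = Hash(Int32, UInt32).new { 0_u32 } # chunk dictionary
reader = Char::Reader.new(string1)
i = 0
while reader.has_next?
  ch = reader.current_char
  pmr[ch.ord] |= 1_u32 << (i % w) # position of ch inside the current chunk
  i += 1
  if i % w == 0
    # ... run the inner loop for the finished chunk, then reset the dictionary ...
    pmr.clear
  end
  reader.next_char
end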
Yeah, this is getting quite messy and I am a couple of branches in with optimization implementations. I will close this and try to clean things up before setting up another PR.
Sure!
Also, it's totally fine to get something slightly better working (though in this case it's already a huge performance improvement), and then apply the suggestions or further improvements in other PRs.
Finally, a bit of code duplication is also fine! That's usually better than introducing macros and making the code harder to read.
And thank you for your patience! 🙏
So I finally have what I feel is a complete implementation. I implemented a 64-bit version of the algorithm when compiled on 64-bit systems. I implemented the ASCII and Unicode versions through macros, mainly because I kept forgetting to implement changes in both codebases. They diverge in a few spots, mainly to do with the dictionary implementation, and those spots are commented, so it isn't bad at all. The performance is really good now: currently clocking in at 30x over the original at string length 1500 on 64-bit systems.
Now at over 3x the speed of the original PR. I did run a test to compare against the NodeJS implementation and it was over twice as fast on my system. I was using the method described in this #11335 comment.
🚀 Wow, with the latest changes this implementation is now 2.4x faster than the NodeJS one on this benchmark for me: #11335 (comment). Great job @darkstego!
One thing to note is that memory consumption has grown in the new implementation. I've run the benchmark already mentioned here: #11370 (comment). My numbers:
It uses cases with `"a" * 100000` strings. If I remove these 2 cases, the memory consumption is still much higher, but it's not in MBs anymore. But notice the results: for some reason it's now slower than the original? 😕
Code:

# From https://github.com/crystal-lang/crystal/pull/11370
# LevenshteinNew is the PR's implementation, loaded separately.
require "benchmark"

Benchmark.ips do |bm|
  bm.report "old" do
    Levenshtein.distance("algorithm", "altruistic")
    Levenshtein.distance("hello", "hallo")
    Levenshtein.distance("こんにちは", "こんちは")
    Levenshtein.distance("hey", "hey")
    Levenshtein.distance("hippo", "zzzzzzzz")
    # Levenshtein.distance("a" * 100000, "hello")
    # Levenshtein.distance("hello", "a" * 100000)
  end
  bm.report "new" do
    LevenshteinNew.distance("algorithm", "altruistic")
    LevenshteinNew.distance("hello", "hallo")
    LevenshteinNew.distance("こんにちは", "こんちは")
    LevenshteinNew.distance("hey", "hey")
    LevenshteinNew.distance("hippo", "zzzzzzzz")
    # LevenshteinNew.distance("a" * 100000, "hello")
    # LevenshteinNew.distance("hello", "a" * 100000)
  end
end
I've run the cases one by one and added results near each one that's slower. The most slowdown seems to be for short strings with small distance, especially for Unicode strings.

Benchmark.ips do |bm|
  bm.report "old" do
    Levenshtein.distance("algorithm", "altruistic") # 1.15× slower
    Levenshtein.distance("hello", "hallo")
    Levenshtein.distance("こんにちは", "こんちは")
    Levenshtein.distance("hey", "hey")
    Levenshtein.distance("hippo", "zzzzzzzz")
    Levenshtein.distance("a" * 100000, "hello") # 1.75× slower
    Levenshtein.distance("hello", "a" * 100000) # 1.80× slower
  end
  bm.report "new" do
    LevenshteinNew.distance("algorithm", "altruistic")
    LevenshteinNew.distance("hello", "hallo") # 2.06× slower
    LevenshteinNew.distance("こんにちは", "こんちは") # 4.65× slower
    LevenshteinNew.distance("hey", "hey")
    LevenshteinNew.distance("hippo", "zzzzzzzz") # 1.47× slower
    LevenshteinNew.distance("a" * 100000, "hello")
    LevenshteinNew.distance("hello", "a" * 100000)
  end
end

I wonder what cases would be used most frequently in the real world 🙂 It would be nice to optimize most for those.
I'm pretty sure that short strings are a very common use case for Levenshtein distance (think search query strings, for example).
I know why this is happening. For short Unicode strings the dictionary used is a Hash. That is created and populated before traversing the columns of the matrix, but that step takes time, and if the string is short the improved big-O time of the algorithm doesn't get a chance to offset it. In my early testing I realized that the cutoff point for Unicode was about 30 chars or so. The fix is pretty easy: I would just call the old algorithm if the strings were below the cutoff length. The StaticArray (ASCII) method was always faster for me, so there is a regression somewhere, and I have a good idea where that might be. If anyone has any suggestion on an efficient way to store and retrieve bit information in Crystal, that would help a lot. The algorithm needs 2 things:
I just realized there is a BitArray in Crystal and I might look at that, but if I am not mistaken it stores every bit as a UInt32.
I know the reason for this. The algorithm stores a bit array of length equal to the longest string, and I encoded each bit as a UInt64 on 64-bit systems. There is potential to reduce the memory consumption by quite a bit.
@darkstego No worries at all! I was just sharing my numbers for the record (for perspective). The current code is much more balanced and optimized for more realistic use cases.
Another improvement just added: I realized that if a cutoff is given to the algorithm, it can abort as soon as it knows the distance will be larger than the cutoff. This makes for a huge improvement when searching for the best match among long strings. In the NodeJS example you can now find the best match with a 5x speedup over NodeJS by using a cutoff. The distance method does end up with an optional cutoff parameter. Sorry the commits weren't squashed; I thought I did that before uploading.
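As an illustration of why this helps (a hedged sketch; it assumes the cutoff is simply passed as a third argument to Levenshtein.distance, as added in this PR): when scanning many candidates for the closest match, the best distance found so far can be passed as the cutoff so that hopeless candidates abort early.

require "levenshtein"

# Hedged sketch: find the closest candidate, passing the best distance so far
# as the cutoff so each subsequent call can abort early.
def closest_match(query : String, candidates : Array(String)) : String?
  best = candidates.first?
  return nil unless best
  best_distance = Levenshtein.distance(query, best)
  candidates.each do |candidate|
    d = Levenshtein.distance(query, candidate, best_distance) # assumed cutoff argument
    if d < best_distance
      best = candidate
      best_distance = d
    end
  end
  best
end

The standard library's Levenshtein.find does this kind of best-match search, so it could benefit from the same trick.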
@darkstego I've tried running the benchmark you mentioned here #11370 (comment) and ran into this error:
I've updated the benchmark to the latest code here and it still gives me an error:
@vlazar Seems the testing code was generating illegal chars. I unfortunately didn't realize there was a gap in the Unicode space that needed to be accounted for. Try revision 3 of the gist.
@darkstego I've tried the 3rd revision and the results look suspicious now 😆
# If *cutoff* is given then the method is allowed to end once the lowest
# possible bound is greater than *cutoff* and return that lower bound.
Could a doc/code example for this be given?
Alternatively, I think it's totally fine to add the cutoff logic in a separate PR. Given that it's a new feature, it could spark more discussion, eventually making this PR less likely to be merged, or slower to be merged (just a guess! I'm not actually requesting this).
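For the doc comment, something like this might do (hypothetical example; it assumes *cutoff* is the third positional argument and that, per the doc wording above, the method may return a lower bound rather than the exact distance once the cutoff is exceeded):

Levenshtein.distance("algorithm", "altruistic")    # => 6
# With a cutoff of 3 the method may stop early, returning some lower bound
# greater than 3 instead of the exact distance:
Levenshtein.distance("algorithm", "altruistic", 3) # => value between 4 and 6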
Would a pull request onto my feature branch show up here? I haven't really made a pull request onto a pull request before.
Good point. I don't think so, but also, the other PR can come/appear after this one is merged.
I could also make a new pull request and first squash all the features together. Whatever is easier for you guys.
@vlazar Sorry about that. It was benchmarking two identical strings all the time. Fixed and tested. I also have a test to compare the distance results between the old and new methods to make sure the data is correct. Is there a way to import a module from a file under a different name? There is a lot of copying and pasting that needs to be done to test and benchmark, and I would ideally like to test the module that is in my git repo without having to move it around.
@darkstego Looks great! New results from me. All cases are now faster, except a negligible slowdown on short Unicode strings, from your benchmarks for different string lengths.
Crystal is now 1.6x faster than NodeJS (from previously being 8x slower).
Crystal with the cutoff from https://gist.github.com/darkstego/c41ea58505186542e4f9fdc5079fb1f4 is 6.6x faster than NodeJS.
# Myers Algorithm for ASCII and Unicode
#
# The algorithm uses a dictionary to store string char locations as bits.
# The ASCII implementation uses a StaticArray, while for full Unicode a Hash is used.
The explanation is quite good, but why use a Hash for Unicode?
Briefly, the dictionary maps a char to a bit-vector (UInt32 or UInt64), so a Hash is used in the general case. For ASCII chars the codepoint is between 0 and 127, so for a performance boost we can use a StaticArray of size 128 and use the char codepoint as the index of the slot containing the appropriate bit-vector. If we used arrays and indices for the entire Unicode range, it would have to be an array of size 1,114,112.
When I started I tried making a StaticArray of that size, but it was just not working (I guess trying to put something that large on the stack was causing issues). I did try using a regular Array, but from early testing the Hash not only used much less memory but also had better performance, so I stuck with a Hash.
Thanks! The issue with StaticArray is related to #2485. It may not make any major difference, though. I was mainly wondering about not using Array.
Levenshtein distance upper bound is the length of the longer string
Anything holding up this PR? It's so optimized and balanced. Excellent job from @darkstego!
Sorry for the very long delay @darkstego, and thanks for the awesome work! It would be great to add a few test cases for longer strings, to make sure we are covering all the different algorithms that are implemented. I can take care of that if you prefer. 🙇
I was actually thinking about moving this into a shard a while back, since I didn't know if it would be mainlined. The algorithm is much more efficient at the cost of more complex code. @beta-ziliani I will see if I can track down some of the test code I had. Since the algorithm has a window size (32 or 64 depending on the architecture), it was important to test strings that were multiple window sizes in length, as well as the edge case where the string is exactly equal to the window size (64 in my case). I believe I had a program that generated strings of random lengths and compared the results from my code to the base implementation just to be sure.
I vote to mainline the code.
In terms of tests, the idea is not to have anything fancy, just a couple of examples with different sizes will do. |
Added a longer Unicode test, as well as ASCII tests of lengths 32, 64, and >64.
Added tests for the Levenshtein implementation: long Unicode, and ASCII of lengths 32, 64, and 70, to cover different sizes and all edge cases.
Thanks a lot, @darkstego! I hope your PR gets merged soon by the Crystal team. Excellent contribution!
With maximum respect to the Crystal team, it seems that this PR fulfills all the requirements presented, with praise. The PR's author addressed the last requested changes more than one month ago.
last_cost = i + 1

s_size.times do |j|
  sub_cost = l[i] == s[j] ? 0 : 1
single_byte_optimizable? means bytes higher than 0x7F should all compare equal, since they behave like the replacement character:
Levenshtein.distance("\x81", "\x82").should eq(0)
Levenshtein.distance("\x82", "\x81").should eq(0)
Levenshtein.distance("\x81" * 33, String.new(Bytes.new(33) { |i| 0x80_u8 + i })).should eq(0)
# okay, as these use the `Char::Reader` overload instead
Levenshtein.distance("\x81", "\uFFFD").should eq(0)
Levenshtein.distance("\uFFFD", "\x81").should eq(0)
Suggested change:
- sub_cost = l[i] == s[j] ? 0 : 1
+ l_i = {l[i], 0x80_u8}.min
+ s_j = {s[j], 0x80_u8}.min
+ sub_cost = l_i == s_j ? 0 : 1
This also means the ascii_only? branches could be relaxed to single_byte_optimizable?, operating over an alphabet of 129 "code points" where again bytes larger than 0x7F are mapped to 0x80.
Of course, most strings are already valid UTF-8, so this is a rather low priority pre-existing issue. Feel free to ignore in this PR.
You bring up an interesting point. I am not sure I ever really thought about what the Levenshtein distance of invalid strings would look like. But having said that, I don't even know what the point of this single_byte_optimizable? path is, because if I am not mistaken no valid string will ever end up there: if it is ASCII it will call a different function, and if it isn't then it won't be single-byte-optimizable.
I think past me, in an effort to optimize for speed, looked into using unsafe pointers in non-ASCII strings to bring a speed boost to that specific scenario. And correct me if I am wrong, isn't this segment of code just providing a faster way to measure the Levenshtein distance of invalid strings? If that is the case, then why even have it?
Maybe getting rid of that if block would be the cleanest solution.
A string that consists only of ASCII characters plus invalid UTF-8 byte sequences, but not valid ones of 2 bytes or more (code point 0x80 or above), is single_byte_optimizable? but not ascii_only?.
I guess my question is how often these occur in the wild. Is speeding up the calculation of the Levenshtein distance of strings that are single_byte_optimizable? but not ascii_only? even of value to anyone? Especially since the Char::Reader path is plenty fast to begin with, and handles the invalid bytes properly.
This implements a faster algorithm for calculating Levenshtein distance. Fixes #11335.
The algorithm is 3x faster than the original one for short strings and 10x for long ones.