Implement Myers algorithm for Levenshtein distance calculation #11370

Open
wants to merge 21 commits into
base: master
Changes from 2 commits
108 changes: 62 additions & 46 deletions src/levenshtein.cr
@@ -14,57 +14,16 @@ module Levenshtein
return 0 if string1 == string2

s_size = string1.size
t_size = string2.size
l_size = string2.size

return t_size if s_size == 0
return s_size if t_size == 0

# This is to allocate less memory
if t_size > s_size
if l_size < s_size
string1, string2 = string2, string1
t_size, s_size = s_size, t_size
l_size, s_size = s_size, l_size
end

costs = Slice(Int32).new(t_size + 1) { |i| i }
last_cost = 0

if string1.single_byte_optimizable? && string2.single_byte_optimizable?
s = string1.to_unsafe
t = string2.to_unsafe

s_size.times do |i|
last_cost = i + 1

t_size.times do |j|
sub_cost = s[i] == t[j] ? 0 : 1
cost = Math.min(Math.min(last_cost + 1, costs[j + 1] + 1), costs[j] + sub_cost)
costs[j] = last_cost
last_cost = cost
end
costs[t_size] = last_cost
end

last_cost
else
reader = Char::Reader.new(string1)

# Use an array instead of a reader to decode the second string only once
chars = string2.chars

reader.each_with_index do |char1, i|
last_cost = i + 1

chars.each_with_index do |char2, j|
sub_cost = char1 == char2 ? 0 : 1
cost = Math.min(Math.min(last_cost + 1, costs[j + 1] + 1), costs[j] + sub_cost)
costs[j] = last_cost
last_cost = cost
end
costs[t_size] = last_cost
end
return l_size if s_size == 0

last_cost
end
myers(string1, string2)
end

# Finds the closest string to a given string amongst many strings.
@@ -155,4 +114,61 @@ module Levenshtein
def self.find(name, all_names, tolerance = nil)
Finder.find(name, all_names, tolerance)
end

# Myers algorithm to solve Levenshtein distance
private def self.myers(string1 : String, string2 : String) : Int32
w = 32
m = string1.size
n = string2.size
rmax = (m / w).ceil.to_i
hna = Array(Int32).new(n, 0)
hpa = Array(Int32).new(n, 0)

lpos = 1 << ((m - 1) % w)
score = m
ascii = string1.ascii_only? && string2.ascii_only?

pmr = ascii ? StaticArray(UInt32, 128).new(0) : Hash(Int32, UInt32).new(w) { 0.to_u32 }
Member

Because this makes pmr be a union type, any use of it will result in a multidispatch.

I suggest splitting the rest of this method in two based on ascii or not. Something like:

if ascii
  pmr = StaticArray(UInt32, 128).new(0)
  rest_of_the_code(lpos, score, pmr, etc.)
else
  pmr = Hash(Int32, UInt32).new(w) { 0.to_u32 }
  rest_of_the_code(lpos, score, pmr, etc.)
end

That will cause rest_of_the_code to be instantiated once per type of pmr, hopefully resulting in faster code that LLVM can inline better.
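
A minimal sketch of that split, not the PR's code. `bits_for` is a hypothetical stand-in for rest_of_the_code and `positions_of` is a hypothetical caller. Because `pmr` carries no type restriction, each branch passes one concrete type, so the helper should be instantiated once per type instead of being called on a union:

```crystal
# Hypothetical stand-in for "rest_of_the_code". `pmr` is unrestricted, so the
# compiler instantiates this method separately for each concrete table type.
def bits_for(chunk : String, needle : Char, pmr) : UInt32
  # Record, per character, the positions where it occurs in the chunk.
  # Assumes the chunk holds at most 32 characters.
  chunk.each_char_with_index do |c, i|
    pmr[c.ord] |= 1_u32 << i
  end
  pmr[needle.ord]
end

# Hypothetical caller: pick the table type once, then hand it over.
def positions_of(chunk : String, needle : Char) : UInt32
  if chunk.ascii_only? && needle.ascii?
    bits_for(chunk, needle, StaticArray(UInt32, 128).new(0_u32))
  else
    bits_for(chunk, needle, Hash(Int32, UInt32).new { 0_u32 })
  end
end
```

Both branches return the same `UInt32`, so callers never see the union. Note that a StaticArray is passed by value, which is fine here only because the helper owns the table for its whole lifetime.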

Member

Why use a Hash when it's not ascii and not a StaticArray?

Contributor Author

I do get a 50% performance uptick if I split it into ASCII and non-ASCII methods and just check in the main function. Not very DRY, but it is faster, so I will commit it shortly.

Contributor Author

> Why use a Hash when it's not ascii and not a StaticArray?

This is used as a dictionary. On every loop iteration I traverse a chunk of string1 (size 32 in this case) and record in a bit vector where each character occurs in the chunk. So I need a dictionary that maps a string character to a 32-bit int (my vector). This algorithm traverses string1 exactly once and reuses the created dictionary on every column (string2), hence the speedup.

One way to build this dictionary is a hash of size 32 (the chunk size). But in ASCII I am guaranteed that the char will be less than 128, so I make a small StaticArray of that size and use the codepoint as an index to speed up the lookup.
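
To make the dictionary concrete, here is a small sketch of what it holds for one chunk. `char_bit_vectors` is a hypothetical helper written for illustration, not a method from the PR, and it assumes the chunk is at most 32 characters long:

```crystal
# Hypothetical helper: maps each codepoint in the chunk to a 32-bit vector
# whose bit i is set when the character occurs at position i of the chunk.
def char_bit_vectors(chunk : String) : Hash(Int32, UInt32)
  table = Hash(Int32, UInt32).new { 0_u32 } # missing characters read as 0
  chunk.each_char_with_index do |c, i|
    table[c.ord] |= 1_u32 << i
  end
  table
end

table = char_bit_vectors("abca")
puts table['a'.ord].to_s(2) # => 1001 (positions 0 and 3)
puts table['b'.ord].to_s(2) # => 10   (position 1)
puts table['z'.ord].to_s(2) # => 0    (not in the chunk)
```

The ASCII variant described above would swap the Hash for a `StaticArray(UInt32, 128)` indexed by the same codepoint.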

Contributor Author

Ideally I would instantiate pmr and then pass it into my method. The only problem is that clearing pmr at the end of each loop iteration is done differently depending on whether it is a StaticArray or a Hash.

That said, since ASCII also allows the use of pointers, the methods have diverged slightly. There is still way more code duplication than I would like, and I feel like there is a better way to organize it.

Member

> Ideally I would instantiate pmr and then pass it into my method. But the only problem is clearing the pmr at the end of each loop is executed differently depending on whether it is a StaticArray or Hash.

It's true. In that case you can still leave the case you had before. Because the method will be instantiated with two different types separately, the compiler will optimize that case statement so there will be no check at the end.

That said, if the two versions have diverged a bit then it's fine to keep them duplicated.

If they only differ by one thing, you can consider using a third method that receives a block. Because blocks are inlined, it will be the same as writing two different versions with just one difference (let me know if this is not clear, I can provide a small code example).
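
A rough sketch of the block idea, under assumptions: `chunks_containing` is a hypothetical method that only counts how many chunks contain a character, standing in for the shared Myers loop. The block receives the table and returns it cleared, which also works for a StaticArray even though it is passed by value:

```crystal
# Hypothetical shared method: everything is common except how the table is
# cleared, which the caller supplies as a block.
def chunks_containing(text : String, needle : Char, width : Int32, pmr, &) : Int32
  hits = 0
  offset = 0
  while offset < text.size
    # Build the per-chunk table (allocates a substring, as the PR does).
    text[offset, width].each_char_with_index do |c, i|
      pmr[c.ord] |= 1_u32 << i
    end
    hits += 1 if pmr[needle.ord] != 0
    pmr = yield pmr # the only step that differs per table type
    offset += width
  end
  hits
end

ascii_table = StaticArray(UInt32, 128).new(0_u32)
ascii_hits = chunks_containing("abcdefgh", 'a', 4, ascii_table) do |table|
  table.map! { 0_u32 }
end

hash_table = Hash(Int32, UInt32).new { 0_u32 }
hash_hits = chunks_containing("abcdefgh", 'a', 4, hash_table) do |table|
  table.clear
end
```

Only the first chunk ("abcd") contains 'a', so both calls should count one chunk. If the table were not cleared between chunks, stale bits would make the second chunk count as well.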

Contributor Author

I thought about the block method, but too many variables needed to be passed, and some need to be passed by reference, so I would have to encapsulate those. It quickly became more of a mess.

I am trying to use macros to create the two methods (ASCII, Unicode) and reduce code duplication. I did implement that and it passed the specs, but when benchmarking it doesn't terminate, so I still have a bit of debugging to do. Macros also open up the potential for a 64-bit implementation, which does appear to be faster but will require some debugging.


rmax.times do |r|
vp = UInt32::MAX
vn = 0

# prepare char bit vector
s = string1[r*w, w]
Member

This allocates a new string. Maybe using Char::Reader there's a way to avoid this allocation and further speed things up.

It could be done in a follow-up PR, though!
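
One possible shape for that, as a sketch only: `each_chunk_char` is a hypothetical helper that walks the string once with a `Char::Reader` and reports each character together with its chunk number and its bit position inside the chunk, so no per-chunk substring is created:

```crystal
# Hypothetical helper: yields (char, chunk index, position within chunk)
# without slicing the string into chunk-sized substrings.
def each_chunk_char(text : String, width : Int32, &)
  reader = Char::Reader.new(text)
  chunk = 0
  while reader.has_next?
    bit = 0
    # Consume up to `width` characters for the current chunk.
    while bit < width && reader.has_next?
      yield reader.current_char, chunk, bit
      reader.next_char
      bit += 1
    end
    chunk += 1
  end
end

each_chunk_char("héllo", 3) do |char, chunk, bit|
  puts "chunk #{chunk}, bit #{bit}: #{char}"
end
```

In the PR's loop the reader would have to be kept across iterations of `rmax.times`, so that each iteration resumes where the previous chunk ended.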

Contributor Author

Yeah, this is getting quite messy and I am a couple of branches in with optimization implementations. I will close this and try to clean things up before setting up another PR.

Member

Sure!

Also, it's totally fine to get something slightly better working (though in this case it's already a huge performance improvement), and then apply the suggestions or further improvements in other PRs.

Finally, a bit of code duplication is also fine! That's usually better than introducing macros and making the code harder to read.

And thank you for your patience! 🙏

s.each_char_with_index do |c, i|
pmr[c.ord] |= 1 << i
end

string2.each_char_with_index do |c, i|
hn0 = hna[i]
hp0 = hpa[i]
pm = pmr[c.ord] | hn0
d0 = (((pm & vp) &+ vp) ^ vp) | pm | vn
hp = vn | ~(d0 | vp)
hn = d0 & vp
if (r == rmax - 1) && ((hp & lpos) != 0)
score += 1
elsif (r == rmax - 1) && ((hn & lpos) != 0)
score -= 1
end
hnx = (hn << 1) | hn0
hpx = (hp << 1) | hp0
hna[i] = (hn >> (w - 1)).to_i32
hpa[i] = (hp >> (w - 1)).to_i32
nc = (r == 0) ? 1 : 0
vp = hnx | ~(d0 | hpx | nc)
vn = d0 & (hpx | nc)
end

# Clear char bit vector
case pmr
when StaticArray
pmr.map! { 0.to_u32 }
when Hash
pmr.clear
end
end
score
end
end