Investigate and implement commonSuffixLength binary search #54
Comments
@zimmski, I've tried it in a separate branch here: There is a benchmark method here: Benchmark results are as follows:
In brackets, I output the suffix length. A few runs demonstrated a slow-down for the linear search as the suffix length increases, and almost no change at all for the binary search, which produces excellent results overall. Here's how to run:
The benchmark looks amazing! 👍 Have you found out why this was commented out? Let's submit this as a PR and work in the PR on it. Some initial thoughts:
Sorry to be the bearer of bad news, but the benchmark is seriously flawed. Because of the way the rune slices are constructed, they are identical, not just in their contents but also in their backing array. (Observe that in …) So the suffix length is always the length of the slice. More importantly, reflect has special O(1) handling of identical slices: https://github.com/golang/go/blob/master/src/reflect/deepequal.go#L81

You can see this in action by modifying the … If you use a more realistic benchmark setup, like:

    sz := 10000
    s1 := randSeq(sz)
    s2 := make([]rune, sz+1)
    n := rand.Intn(len(s1))
    copy(s2, s1[:n])
    s2[n] = '\t'
    copy(s2[n+1:], s1[n:])

then you get results like this:
This is more like what I'd expect. Note that in the blog post, the author writes:
But Go is not a high-level language in the relevant sense. It doesn't always intern strings, and while there are optimized vectorized routines for string comparison, in the general case I very strongly suspect they're not going to be enough to overcome the O(n log n) factor. So, sadly, I think this is a non-starter.

However, I do have a simple patch that squeezes a couple of percent of performance out of commonPrefixLength. I'd be happy to send it (and more) as a PR, but it doesn't appear that this project is being actively maintained (?).

But the biggest bottleneck in both commonPrefixLength and commonSuffixLength is the string to []rune conversion, which can be addressed in two ways: (1) exporting a different API in this package to allow the user to circumvent this cost, or (2) making the conversion faster in the Go compiler and runtime (which is probably possible; much more time has been spent on string -> []byte conversion).
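To illustrate option (1), here is a minimal sketch of a suffix-length routine that works on strings directly and never allocates a []rune. The name commonSuffixLengthString is hypothetical: it is not part of this package's API and it is not the patch mentioned above. It only assumes the standard unicode/utf8 package and decodes one rune at a time from the end of each string.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// commonSuffixLengthString (hypothetical name) returns the number of runes
// in the common suffix of a and b. It decodes runes from the end of each
// string, so no []rune conversion or allocation is needed.
func commonSuffixLengthString(a, b string) int {
	n := 0
	for len(a) > 0 && len(b) > 0 {
		ra, sizeA := utf8.DecodeLastRuneInString(a)
		rb, sizeB := utf8.DecodeLastRuneInString(b)
		if ra != rb {
			break
		}
		// Drop the matched rune from the end of both strings.
		a = a[:len(a)-sizeA]
		b = b[:len(b)-sizeB]
		n++
	}
	return n
}

func main() {
	fmt.Println(commonSuffixLengthString("héllo wörld", "jello wörld"))
}
```

The result is counted in runes, matching what a []rune-based version would report. Whether this is actually faster than converting first would need to be measured on realistic inputs.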
Thanks @josharian! It's true I haven't been active myself, but I think you should still send your patches (especially if you already have them!)
There is a TODO marker in the code for a binary search version of commonSuffixLength.
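For reference, the two approaches under discussion can be sketched as follows. This is an illustrative reconstruction, not the code from the repository or from the branch mentioned in the comments: the function names, the runesEqual helper, and the sample input are assumptions. The binary variant only compares the part of the suffix it has not yet verified, and main builds the two inputs with separate backing arrays, as suggested in the thread.

```go
package main

import "fmt"

// commonSuffixLengthLinear walks backwards from the end of both slices
// and counts runes until the first mismatch.
func commonSuffixLengthLinear(a, b []rune) int {
	n := 0
	for i, j := len(a)-1, len(b)-1; i >= 0 && j >= 0; i, j = i-1, j-1 {
		if a[i] != b[j] {
			break
		}
		n++
	}
	return n
}

// runesEqual compares two rune slices element by element.
func runesEqual(a, b []rune) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// commonSuffixLengthBinary binary-searches the suffix length.
// Invariant: the last lo runes of a and b are known to match,
// and the true answer lies in [lo, hi].
func commonSuffixLengthBinary(a, b []rune) int {
	lo, hi := 0, len(a)
	if len(b) < hi {
		hi = len(b)
	}
	for lo < hi {
		mid := lo + (hi-lo+1)/2
		// Only the runes between suffix lengths lo and mid are unverified.
		if runesEqual(a[len(a)-mid:len(a)-lo], b[len(b)-mid:len(b)-lo]) {
			lo = mid
		} else {
			hi = mid - 1
		}
	}
	return lo
}

func main() {
	// s2 is s1 with a tab inserted at index n, in its own backing array.
	s1 := []rune("abcdefghij")
	n := 3
	s2 := make([]rune, len(s1)+1)
	copy(s2, s1[:n])
	s2[n] = '\t'
	copy(s2[n+1:], s1[n:])

	fmt.Println(commonSuffixLengthLinear(s1, s2), commonSuffixLengthBinary(s1, s2))
}
```

Even in this form the binary search still has to look at every rune of the common suffix at least once, so it cannot do fewer rune comparisons than the linear scan. Any speed-up would have to come from a faster bulk comparison routine.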