Investigate and implement commonSuffixLength binary search #54

zimmski · 2016-11-03T21:38:54Z

There is a TODO marker in the code for a binary search version of commonSuffixLength.

maksimov · 2017-04-09T16:22:41Z

@zimmski, I've tried it in a separate branch here:
https://github.com/maksimov/go-diff/blob/binary-suffix-tests/diffmatchpatch/diff.go#L481
You will note that I had to use reflect.DeepEqual for slice comparison.

There is a benchmark method here:
https://github.com/maksimov/go-diff/blob/binary-suffix-tests/diffmatchpatch/diff_test.go#L1440

Benchmark results are as follows:

BenchmarkCommonSuffixLength/Linear(454)-8         	 1000000	      2230 ns/op
BenchmarkCommonSuffixLength/Binary(454)-8         	10000000	       228 ns/op
...
BenchmarkCommonSuffixLength/Linear(303)-8         	  300000	      5136 ns/op
BenchmarkCommonSuffixLength/Binary(303)-8         	10000000	       228 ns/op
...
BenchmarkCommonSuffixLength/Linear(4023)-8         	  200000	     10828 ns/op
BenchmarkCommonSuffixLength/Binary(4023)-8         	10000000	       227 ns/op

In brackets, I output the suffix length. A few runs demonstrated a slow-down for the linear search as the suffix length increases, and almost no change at all for the binary search, which produces excellent results overall.

Here's how to run:

go test ./... -bench="BenchmarkCommonSuffixLength" .

zimmski · 2017-04-10T08:25:18Z

The benchmark looks amazing! 👍 Have you found out why this was commented out?

Let's submit this as a PR and work in the PR on it. Some initial thoughts:

Use "runesEqual" instead of the "DeepEqual" (its in stringsutil.go). I think this should give this another boost.
Let's find two or three constant strings for the benchmark instead of random strings. Otherwise there is no way to compare them with other versions. Some worst-cases would be good.
It would be interesting to test and see a comparison of the two functions with the rest of the implementation.
I have not looked into this, maybe we could use the same algorithm for the prefix?

josharian · 2017-11-09T00:03:02Z

Sorry to be the bearer of bad news, but the benchmark is seriously flawed. Due to the way the rune slices are constructed:

https://github.com/maksimov/go-diff/blob/f425388e4c0a50f84a09b809bab2d1e9f2d2f18d/diffmatchpatch/diff_test.go#L1442-L1445

the runes slices are identical, not just in their contents but also their backing array. (Observe that in s2 := append(s1[n:], 't'), s2 is set up to alias s1, and s1 is also modified in the process.)

So the suffix length is always the length of the slice. And more important, reflect has special O(1) handling of identical slices:

https://github.com/golang/go/blob/master/src/reflect/deepequal.go#L81

You can see this in action by modifying the Intn parameter and observing that the binary search benchmark doesn't change at all, even if size changes by orders of magnitudes. But note that in basically every real world case, the rune slices won't alias each other (particularly given the exported API).

If you use a more realistic benchmark setup, like:

	sz := 10000
	s1 := randSeq(sz)
	s2 := make([]rune, sz+1)
	n := rand.Intn(len(s1))
	copy(s2, s1[:n])
	s2[n] = '\t'
	copy(s2[n+1:], s1[n:])

then you get results like this:

BenchmarkCommonSuffixLength/Binary(9104,_10000)-8         	    5000	     78562 ns/op	    8184 B/op	    1836 allocs/op
BenchmarkCommonSuffixLength/Binary(9104,_10000)-8         	    5000	     80758 ns/op	    8184 B/op	    1836 allocs/op
BenchmarkCommonSuffixLength/Binary(9104,_10000)-8         	    5000	     79645 ns/op	    8184 B/op	    1836 allocs/op
BenchmarkCommonSuffixLength/Linear(9104,_10000)-8         	  300000	      1074 ns/op	       0 B/op	       0 allocs/op
BenchmarkCommonSuffixLength/Linear(9104,_10000)-8         	  300000	      1033 ns/op	       0 B/op	       0 allocs/op
BenchmarkCommonSuffixLength/Linear(9104,_10000)-8         	  300000	      1038 ns/op	       0 B/op	       0 allocs/op

This is more like what I'd expect.

Note that in the blog post, the author writes:

In the year since this was first published I've been taking a lot of heat from programmers who look at the algorithm and realize what the substring operation is doing at the processor level. They write back, call me up, or burst our laughing when they realize that my 'optimization' actually evaluates to O(n log n). I patiently explain that high-level languages don't work that way, ...

But Go is not a high-level language in the relevant sense. It doesn't always intern strings, and while there are optimized vectorized routines for string comparison, in the general case, I very strongly suspect they're not going to be enough to overcome the O(n log n) factor.

So--sadly--I think this is a non-starter.

However, I do have a simple patch that squeezes a couple of percent of performance out of commonPrefixLength. I'd be happy to send it (and more) as a PR, but it doesn't appear that this project is being actively maintained (?).

But the biggest bottleneck in both commonPrefixLength and commonSuffixLength is the string to []rune conversion, which can be addressed in two ways: (1) exporting different API in this package to allow the user to circumvent this cost, (2) making the conversion faster in the Go compiler and runtime (which is probably possible, much more time has been spend on string -> []byte conversion).

maksimov · 2017-11-09T12:14:36Z

Thanks @josharian! It's true I haven't been active myself, but I think you still should send your patches (especially if you already have them!)

zimmski added enhancement Help Wanted labels Nov 3, 2016

zimmski mentioned this issue Dec 3, 2016

Bring coverage up to 100% line coverage #55

Closed

sergi closed this as completed in 6cf2633 Jan 25, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Investigate and implement commonSuffixLength binary search #54

Investigate and implement commonSuffixLength binary search #54

zimmski commented Nov 3, 2016

maksimov commented Apr 9, 2017 •

edited

Loading

zimmski commented Apr 10, 2017

josharian commented Nov 9, 2017 •

edited

Loading

maksimov commented Nov 9, 2017

Investigate and implement commonSuffixLength binary search #54

Investigate and implement commonSuffixLength binary search #54

Comments

zimmski commented Nov 3, 2016

maksimov commented Apr 9, 2017 • edited Loading

zimmski commented Apr 10, 2017

josharian commented Nov 9, 2017 • edited Loading

maksimov commented Nov 9, 2017

maksimov commented Apr 9, 2017 •

edited

Loading

josharian commented Nov 9, 2017 •

edited

Loading