-
Notifications
You must be signed in to change notification settings - Fork 215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix line diff by using rune index without a separator #136
Conversation
Current implementation produces wrong result because it calls `DiffMain` on the following 2 arguments: * `1,2,3,4,5,6,7,8,9,10` * `1,2,3,4,5,6,7,8,9,11` This numbers represent indices into the lines array. The algorithm finds that equal part of those strings is `1,2,3,4,5,6,7,8,9,1` and which is followed by `Delete 0` and `Insert `1`.
a9d3bea
to
dfb3555
Compare
@@ -8,4 +8,4 @@ require ( | |||
gopkg.in/yaml.v2 v2.4.0 // indirect | |||
) | |||
|
|||
go 1.12 | |||
go 1.13 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upgrading to go 1.13 in order to use binary and hex integer literals.
[The suggested approach](https://github.com/google/diff-match-patch/wiki/Line-or-Word-Diffs#line-mode ) for doing line level diffing is the following set of steps: 1. `ti1, ti2, linesIdx = DiffLinesToChars(t1, t2)` 2. `diffs = DiffMain(ti1, ti2)` 3. `DiffCharsToLines(diff, linesIdx)` The original implementation in `google/diff-match-patch` uses unicode codepoints for storing indices in `ti1` and `ti2` joined by an empty string. Current implementation in this repo stores them as integers joined by a comma. While this implementation makes `ti1` and `ti2` more readable, it introduces bugs when trying to rely on it when doing line level diffing with `DiffMain`. The root cause of the issue is that an integer line index might span more than one character/rune, and `DiffMain` can assume that two different lines having the same index prefix match partially. For example, indices 123 and 129 will have partial match `12`. In that example, the diff will show lines 3 and 9 which is not correct. A simple failing test case demonstrating this issue is available at `TestDiffPartialLineIndex`. In this PR I am adjusting the algorithm to use the same approach as in [diff-match-patch](https://github.com/google/diff-match-patch/blob/62f2e689f498f9c92dbc588c58750addec9b1654/javascript/diff_match_patch_uncompressed.js#L508-L510 ) by storing each line index as a rune. While a rune in Golang is a type alias to uint32, not every uint32 can be a valid rune. During string to rune slice conversion invalid runes will be replaced with `utf.RuneError`. The integer to rune generation logic is based on the table in https://en.wikipedia.org/wiki/UTF-8#Encoding The first 127 lines will work the fastest as they are represented as a single bytes. Higher numbers are represented as 2-4 bytes. In addition to that, the range `U+D800 - U+DFFF` contains [invalid codepoints](https://en.wikipedia.org/wiki/UTF-8#Invalid_sequences_and_error_handling). and all codepoints higher or equal to `0xD800` are incremented by `0xDFFF - 0xD800`. The maximum representable integer using this approach is 1'112'060. This improves on Javascript implementation which currently [bails out](https://github.com/google/diff-match-patch/blob/62f2e689f498f9c92dbc588c58750addec9b1654/javascript/diff_match_patch_uncompressed.js#L503-L505 ) when files have more than 65535 lines.
Here is the benchmark comparison with the original code and the new code from this PR:
|
I did not realize that there is an existing PR #120 which implements somewhat similar solution. My PR implements a solution closer to the Javascript version |
LGTM |
@sergi could we publish a tag with this in it? |
Re-ping for @sergi |
- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136
- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136
- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136
- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136
- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136
- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136
- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136
- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136
- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136
- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136
- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136
- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136
- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136
- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136
- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136
- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136
- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136
- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136
- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136
- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136
The suggested approach for doing line level diffing is the following set of steps:
ti1, ti2, linesIdx = DiffLinesToChars(t1, t2)
diffs = DiffMain(ti1, ti2)
DiffCharsToLines(diff, linesIdx)
The original implementation in
google/diff-match-patch
usesunicode codepoints for storing indices in
ti1
andti2
joined by an empty string.Current implementation in this repo stores them as integers joined by a
comma. While this implementation makes
ti1
andti2
more readable, itintroduces bugs when trying to rely on it when doing line level diffing
with
DiffMain
. The root cause of the issue is that an integer lineindex might span more than one character/rune, and
DiffMain
can assumethat two different lines having the same index prefix match partially. For
example, indices 123 and 129 will have partial match
12
. In thatexample, the diff will show lines 3 and 9 which is not correct. A simple
failing test case demonstrating this issue is available at
TestDiffPartialLineIndex
.In this PR I am adjusting the algorithm to use the same approach as in
diff-match-patch by storing each line index as a rune.
While a rune in Golang is a type alias to uint32, not every uint32
can be a valid rune. During string to rune slice conversion invalid runes will
be replaced with
utf.RuneError
.The integer to rune generation logic is based on the table in https://en.wikipedia.org/wiki/UTF-8#Encoding
The first 127 lines will work the fastest as they are represented as a
single bytes. Higher numbers are represented as 2-4 bytes.
In addition to that, the range
U+D800 - U+DFFF
containsinvalid codepoints.
and all codepoints higher or equal to
0xD800
are incremented by0xDFFF - 0xD800
.The maximum representable integer using this approach is 1'112'060.
This improves on Javascript implementation which currently
bails out when files have more than 65535 lines.