Fix line diff by using rune index without a separator #136

kdarkhan · 2023-01-22T02:27:47Z

The suggested approach for line-level diffing
(https://github.com/google/diff-match-patch/wiki/Line-or-Word-Diffs#line-mode)
is the following set of steps:

ti1, ti2, linesIdx = DiffLinesToChars(t1, t2)
diffs = DiffMain(ti1, ti2)
DiffCharsToLines(diffs, linesIdx)

The original implementation in google/diff-match-patch stores the indices
in ti1 and ti2 as Unicode code points joined by an empty string. The
current implementation in this repo stores them as integers joined by a
comma. While this makes ti1 and ti2 more readable, it introduces bugs in
line-level diffing with DiffMain. The root cause is that an integer line
index may span more than one character/rune, so DiffMain can treat two
different line indices that share a prefix as a partial match. For
example, indices 123 and 129 have the partial match 12, and the diff then
reports lines 3 and 9, which is not correct. A simple failing test case
demonstrating this issue is available as TestDiffPartialLineIndex.
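
To make the prefix problem concrete, here is a minimal standalone sketch. It does not use the library; `commonPrefixLen`, `joinWithCommas` and `encodeAsRunes` are hypothetical helpers written only for this illustration, and the sample indices are arbitrary.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// commonPrefixLen returns the number of leading runes shared by a and b,
// which is the kind of comparison a character-level diff starts with.
func commonPrefixLen(a, b []rune) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

// joinWithCommas encodes line indices as decimal numbers separated by commas
// (hypothetical helper mirroring the comma-joined encoding described above).
func joinWithCommas(idx []int) string {
	parts := make([]string, len(idx))
	for i, n := range idx {
		parts[i] = strconv.Itoa(n)
	}
	return strings.Join(parts, ",")
}

// encodeAsRunes stores each index as exactly one rune. This is only safe here
// because the sample indices are small and outside the surrogate range.
func encodeAsRunes(idx []int) string {
	rs := make([]rune, len(idx))
	for i, n := range idx {
		rs[i] = rune(n)
	}
	return string(rs)
}

func main() {
	a, b := []int{7, 123}, []int{7, 129}

	// Comma-joined: the shared prefix cuts through the middle of an index.
	ca, cb := joinWithCommas(a), joinWithCommas(b)
	p := commonPrefixLen([]rune(ca), []rune(cb))
	fmt.Printf("comma-joined: %q vs %q share prefix %q\n", ca, cb, ca[:p])

	// Rune-encoded: one symbol per line, so only whole indices can match.
	ra, rb := []rune(encodeAsRunes(a)), []rune(encodeAsRunes(b))
	fmt.Println("rune-encoded: shared symbols =", commonPrefixLen(ra, rb))
}
```

With the comma-joined form the shared prefix is `7,12`, which ends inside an index; with one rune per index only the first symbol is shared.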

In this PR I am adjusting the algorithm to use the same approach as
diff-match-patch
(https://github.com/google/diff-match-patch/blob/62f2e689f498f9c92dbc588c58750addec9b1654/javascript/diff_match_patch_uncompressed.js#L508-L510)
by storing each line index as a rune.
While a rune in Go is an alias for int32, not every int32 value
is a valid rune. During string to rune slice conversion, invalid runes
are replaced with utf8.RuneError.

The integer to rune generation logic is based on the table in https://en.wikipedia.org/wiki/UTF-8#Encoding

The first 127 lines will work the fastest as each of them is represented
as a single byte. Higher numbers are represented as 2-4 bytes.
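
The byte widths can be checked with `utf8.RuneLen`. In this sketch the index is treated directly as a code point, which is a simplification: it ignores the surrogate adjustment. `encodedLen` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedLen reports how many bytes the UTF-8 encoding of index takes when
// the index is used directly as a code point (simplifying assumption).
func encodedLen(index int) int {
	return utf8.RuneLen(rune(index))
}

func main() {
	// Values on either side of each UTF-8 width boundary.
	for _, idx := range []int{1, 127, 128, 2047, 2048, 65535, 65536} {
		fmt.Printf("index %d -> %d byte(s)\n", idx, encodedLen(idx))
	}
}
```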

In addition to that, the range U+D800 - U+DFFF contains
invalid codepoints
(https://en.wikipedia.org/wiki/UTF-8#Invalid_sequences_and_error_handling),
so all codepoints higher than or equal to 0xD800 are incremented by
0xDFFF - 0xD800.

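The maximum representable integer using this approach is 1'112'060.
This improves on the JavaScript implementation, which currently
bails out
(https://github.com/google/diff-match-patch/blob/62f2e689f498f9c92dbc588c58750addec9b1654/javascript/diff_match_patch_uncompressed.js#L503-L505)
when files have more than 65535 lines.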

The current implementation produces a wrong result because it calls `DiffMain` on the following two arguments:

* `1,2,3,4,5,6,7,8,9,10`
* `1,2,3,4,5,6,7,8,9,11`

These numbers represent indices into the lines array. The algorithm finds that the equal part of those strings is `1,2,3,4,5,6,7,8,9,1`, which is followed by `Delete 0` and `Insert 1`.

diffmatchpatch/diff_test.go

kdarkhan · 2023-01-23T23:29:35Z

go.mod

@@ -8,4 +8,4 @@ require (
 	gopkg.in/yaml.v2 v2.4.0 // indirect
 )

-go 1.12
+go 1.13


Upgrading to go 1.13 in order to use binary integer literals (hex integer literals were already supported in earlier versions).
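
For context, this is the kind of code that binary literals make easier to read. It is a rough sketch, not an excerpt from the PR; `encodeTwoByte` is a hypothetical helper that builds a 2-byte UTF-8 sequence by hand, following the bit layout in the Wikipedia table.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// encodeTwoByte builds the 2-byte UTF-8 sequence for a code point in the
// range 0x80..0x7FF. Binary literals (0b...) require Go 1.13 or newer.
func encodeTwoByte(i uint32) []byte {
	return []byte{
		0b11000000 | byte(i>>6),         // lead byte: 110xxxxx
		0b10000000 | byte(i&0b00111111), // continuation byte: 10xxxxxx
	}
}

func main() {
	const cp = 0xE9 // U+00E9, chosen arbitrarily for the example

	manual := encodeTwoByte(cp)

	// Compare against the standard library encoder.
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, rune(cp))

	fmt.Printf("manual=% x stdlib=% x equal=%v\n", manual, buf[:n], bytes.Equal(manual, buf[:n]))
}
```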

kdarkhan · 2023-02-20T23:35:24Z

Here is the benchmark comparison with the original code and the new code from this PR:

## Original code
BenchmarkDiffCommonPrefix-16                     6884680               153.6 ns/op
BenchmarkDiffCommonSuffix-16                     7614050               152.6 ns/op
BenchmarkCommonLength/prefix/empty-16           1000000000               0.7482 ns/op
BenchmarkCommonLength/prefix/short-16           468785274                2.285 ns/op
BenchmarkCommonLength/prefix/long-16             2215281               522.5 ns/op
BenchmarkCommonLength/suffix/empty-16           935044214                1.257 ns/op
BenchmarkCommonLength/suffix/short-16           390328588                3.023 ns/op
BenchmarkCommonLength/suffix/long-16             1622370               763.2 ns/op
BenchmarkDiffHalfMatch-16                          11480            107185 ns/op
BenchmarkDiffCleanupSemantic-16                      571           2186640 ns/op
BenchmarkDiffMain-16                                   1        1166282802 ns/op
BenchmarkDiffMainLarge-16                             16          94201231 ns/op
BenchmarkDiffMainRunesLargeLines-16                  394           2833419 ns/op
BenchmarkDiffMainRunesLargeDiffLines-16               24          52753756 ns/op

## Code in this PR
BenchmarkDiffCommonPrefix-16                     7782129               151.3 ns/op
BenchmarkDiffCommonSuffix-16                     7802674               155.7 ns/op
BenchmarkCommonLength/prefix/empty-16           1000000000               0.7727 ns/op
BenchmarkCommonLength/prefix/short-16           461628601                2.292 ns/op
BenchmarkCommonLength/prefix/long-16             2258307               514.1 ns/op
BenchmarkCommonLength/suffix/empty-16           939004714                1.222 ns/op
BenchmarkCommonLength/suffix/short-16           353210576                3.125 ns/op
BenchmarkCommonLength/suffix/long-16             1621677               765.0 ns/op
BenchmarkDiffHalfMatch-16                          10000            109821 ns/op
BenchmarkDiffCleanupSemantic-16                      658           1978913 ns/op
BenchmarkDiffMain-16                                   1        1010640421 ns/op
BenchmarkDiffMainLarge-16                             10         139771776 ns/op
BenchmarkDiffMainRunesLargeLines-16                 1632            722348 ns/op
BenchmarkDiffMainRunesLargeDiffLines-16               27          43902502 ns/op

kdarkhan · 2023-02-20T23:56:15Z

I did not realize that there is an existing PR #120 which implements a somewhat similar solution.

My PR implements a solution closer to the JavaScript version in google/diff-match-patch.

sergi · 2023-08-02T21:04:16Z

LGTM

ffluk3 · 2023-08-24T21:05:35Z

@sergi could we publish a tag with this in it?

IOrlandoni · 2023-12-06T21:42:30Z

Re-ping for @sergi

kdarkhan force-pushed the master branch from 0603518 to f73eabe Compare January 22, 2023 02:38

kdarkhan commented Jan 22, 2023

kdarkhan force-pushed the master branch 2 times, most recently from a9d3bea to dfb3555 Compare January 23, 2023 23:27

kdarkhan commented Jan 23, 2023

kdarkhan force-pushed the master branch from dfb3555 to 289740a Compare January 23, 2023 23:35

kdarkhan force-pushed the master branch from 289740a to a674b30 Compare January 23, 2023 23:37

This was referenced Jan 23, 2023

Diff output is incorrect for some files twpayne/chezmoi#2706

Closed

[BUG] v1.2.0 seems to produce incorrect diff #123

Open

aymanbagabas mentioned this pull request Jan 30, 2023

DONTMERGE: feat(deps): bump github.com/sergi/go-diff from 1.1.0 to 1.3.1 charmbracelet/soft-serve#209

Closed

sergi merged commit 5b0b94c into sergi:master Aug 2, 2023

kdarkhan mentioned this pull request Aug 3, 2023

Fix the regressions introduced in the fix for #89 #120

Closed

ffluk3 mentioned this pull request Aug 24, 2023

Extinctions scan fails for large file diffs launchdarkly/ld-find-code-refs#351

Closed

jazanne mentioned this pull request Aug 24, 2023

fix: update go-diff package launchdarkly/ld-find-code-refs#386

Closed

Arakos mentioned this pull request Dec 28, 2023

Improved multiline string handling homeport/dyff#333

Merged

jajimajp mentioned this pull request May 22, 2024

Please tag a new version #144

Open

atiratree added a commit to atiratree/api that referenced this pull request Aug 2, 2024

[WIP] bump(k8s): tools: update k8s.io/* dependencies to v1.31.0

5760fba

- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136

atiratree added a commit to atiratree/api that referenced this pull request Aug 5, 2024

[WIP] bump(k8s): tools: update k8s.io/* dependencies to v1.31.0

1e56438

- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136

atiratree added a commit to atiratree/api that referenced this pull request Aug 6, 2024

[WIP] bump(k8s): tools: update k8s.io/* dependencies to v1.31.0

b00736e

- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136

atiratree added a commit to atiratree/api that referenced this pull request Aug 6, 2024

[WIP] bump(k8s): tools: update k8s.io/* dependencies to v1.31.0

6ddf770

- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136

atiratree added a commit to atiratree/api that referenced this pull request Aug 6, 2024

bump(k8s): tools: update k8s.io/* dependencies to v1.31.0

2ffbad2

- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136

atiratree added a commit to atiratree/api that referenced this pull request Aug 6, 2024

bump(k8s): tools: update k8s.io/* dependencies to v1.31.0

8d0f3c9

- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136

atiratree added a commit to atiratree/api that referenced this pull request Aug 15, 2024

bump(k8s): tools: update k8s.io/* dependencies to v1.31.0

de9a36b

- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136

atiratree added a commit to atiratree/api that referenced this pull request Aug 20, 2024

bump(k8s): tools: update k8s.io/* dependencies to v1.31.0

9430005

- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136

bertinatto pushed a commit to bertinatto/api that referenced this pull request Aug 22, 2024

bump(k8s): tools: update k8s.io/* dependencies to v1.31.0

52ba531

- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136

bertinatto pushed a commit to bertinatto/api that referenced this pull request Aug 22, 2024

bump(k8s): tools: update k8s.io/* dependencies to v1.31.0

064fae7

- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136

bertinatto pushed a commit to bertinatto/api that referenced this pull request Aug 23, 2024

bump(k8s): tools: update k8s.io/* dependencies to v1.31.0

f7c58ca

- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136

atiratree added a commit to atiratree/api that referenced this pull request Aug 30, 2024

bump(k8s): tools: update k8s.io/* dependencies to v1.31.0

0cc33a6

- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136

atiratree added a commit to atiratree/api that referenced this pull request Sep 11, 2024

bump(k8s): tools: update k8s.io/* dependencies to v1.31.0

c6131fd

- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136

atiratree added a commit to atiratree/api that referenced this pull request Sep 20, 2024

bump(k8s): tools: update k8s.io/* dependencies to v1.31.1

56058d0

- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136

atiratree added a commit to atiratree/api that referenced this pull request Sep 20, 2024

bump(k8s): tools: update k8s.io/* dependencies to v1.31.1

7fdc7c9

- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136

atiratree added a commit to atiratree/api that referenced this pull request Sep 24, 2024

bump(k8s): tools: update k8s.io/* dependencies to v1.31.1

0e0a351

- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136

atiratree added a commit to atiratree/api that referenced this pull request Sep 25, 2024

bump(k8s): tools: update k8s.io/* dependencies to v1.31.1

9d5eaf8

- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136

atiratree added a commit to atiratree/api that referenced this pull request Sep 26, 2024

bump(k8s): tools: update k8s.io/* dependencies to v1.31.1

86ccfa2

- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136

atiratree added a commit to atiratree/api that referenced this pull request Sep 27, 2024

bump(k8s): tools: update k8s.io/* dependencies to v1.31.1

b08e640

- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136

atiratree added a commit to atiratree/api that referenced this pull request Oct 1, 2024

bump(k8s): tools: update k8s.io/* dependencies to v1.31.1

4d42279

- removed exclude for github.com/sergi/go-diff as it should have been fixed by sergi/go-diff#136
