Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[contrib] Edit distance match finder #2036

Merged
merged 1 commit into from
Mar 23, 2020
Merged

Conversation

bimbashrestha
Copy link
Contributor

Turned out to not provide enough utility to make up for the additional CPU usage. Including this in contrib with some light docs in case this becomes interesting again.

The one place where this outperforms zstd-19 for compression ratio consistently is when compressing a small file (<10KB) with a small and similar in content dictionary (a use case which doesn't really exist as far as I can tell). There are very few overlapping matches in those cases so edit distance basically does as good of a job as can be done.

For large files, this is 5-10X slower than zstd-19 and 2-3X worse in terms of compression ratio. I Think most of the loss is in being unable to find overlapping matches.

References:

An O(ND) Difference Algorithm and Its Variations:
http://www.xmailserver.org/diff2.pdf

An Algorithm for Differential File Comparison:
https://www.cs.dartmouth.edu/~doug/diff.pdf

A File Comparison Program:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.189.70&rep=rep1&type=pdf

More information and references on diff algorithms on this page:
https://wiki.c2.com/?DiffAlgorithm

@bimbashrestha
Copy link
Contributor Author

Ping. Should we merge this or leave it out? Both seems reasonable to me. Might be useful if someone ever wants to experiment with edit distance based match finding in the future.

@Cyan4973
Copy link
Contributor

I think it's fine to merge it.
We could still remove it in some future if it feels superfluous, but for the time being, the learning seems interesting to archive considering our activity on this topic.

@bimbashrestha bimbashrestha merged commit 40414f9 into facebook:dev Mar 23, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants