JuliaIO/BSDiff.jl

BSDiff

The BSDiff package is a pure Julia implementation of the bsdiff tool for computing and applying binary diffs of files. It supports reading and writing both Colin Percival's classic bsdiff format and Matthew Endsley's modified format. The package offers the same API as the command-line tools:

bsdiff(old, new, [ patch ])
bspatch(old, [ new, ] patch)

The bsdiff command computes a patch file given old and new files. By default it generates patch files in the classic bsdiff format. This format emits control data, diff data and new data in three separately compressed sections and is typically more compact than the Endsley format. The Endsley format interleaves control, diff and new data in a single compressed section, which means that it can be written and applied in a fully streamed fashion, but the patch files tend to be slightly larger. The format can be selected by passing the format = :classic or format = :endsley option.
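As a minimal sketch of format selection, the snippet below writes one patch in each format for the same pair of files and prints the resulting sizes. The file names and contents are made up for illustration, and no particular size relationship should be expected for inputs this small.

```julia
using BSDiff

# Illustrative inputs in a scratch directory (names are arbitrary).
dir = mktempdir()
old_path = joinpath(dir, "old.txt")
new_path = joinpath(dir, "new.txt")
write(old_path, "Goodbye, world.\n")
write(new_path, "Hello, world!\n")

# Classic format: the default, so the keyword is optional here.
classic_patch = bsdiff(old_path, new_path, joinpath(dir, "classic.patch"); format = :classic)

# Endsley format: a single interleaved compressed section.
endsley_patch = bsdiff(old_path, new_path, joinpath(dir, "endsley.patch"); format = :endsley)

println("classic: ", filesize(classic_patch), " bytes")
println("endsley: ", filesize(endsley_patch), " bytes")
```

Both patches should reconstruct the same new file when applied to the old one; only the container layout differs.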

The bspatch command applies a patch file to an old file to produce a new file. It can auto-detect the patch file format from the magic string in the patch header, so it is generally not necessary to specify the format. If you want to accept only one specific patch format, you can pass the same format option and bspatch will error unless the patch has the expected format.

API

The public API for the BSDiff package consists of the following functions:

bsdiff

bsdiff(old, new, [ patch ]; format = [ :classic | :endsley ]) -> patch

Compute a binary patch that will transform the content of old into the content of new. All arguments can be strings or IO handles. If no patch argument is provided, the patch data is written to a temporary file whose path is returned.

The old argument can also be a 2-tuple of strings and/or IO handles, in which case the first is used as the old data and the second is used as a precomputed index of the old data, as computed by bsindex. Since indexing the old data is the slowest part of generating a diff, precomputing the index and reusing it can significantly speed up generating diffs from the same old file to multiple different new files.

The format keyword argument allows selecting a patch format to generate. The value must be one of the symbols :classic or :endsley indicating a bsdiff patch format. The classic patch format is generated by default, but the Endsley format can be selected with bsdiff(old, new, patch, format = :endsley).

bspatch

bspatch(old, [ new, ] patch; format = [ :classic | :endsley ]) -> new

Apply a binary patch given by the patch argument to the content of old to produce the content of new. All arguments can be strings or IO handles. If no new argument is provided, the new data is written to a temporary file whose path is returned.

Note that the optional argument is the middle argument, which is a bit unusual but makes the argument order when passing all three paths consistent with the bspatch command and with the bsdiff function.

By default bspatch auto-detects the patch format, so the format keyword argument is usually unnecessary. If you wish to restrict the format of patch that will be accepted, however, you can use this keyword argument: bspatch will raise an error unless the patch file has the indicated format.
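The following sketch illustrates both behaviors: applying a patch with auto-detection, and rejecting a patch whose format does not match the format keyword. The file names are illustrative, and since the exception type is not specified here, the example catches any exception rather than assuming a particular one.

```julia
using BSDiff

# Illustrative inputs in a scratch directory.
dir = mktempdir()
old_path = joinpath(dir, "old.txt")
new_path = joinpath(dir, "new.txt")
write(old_path, "Goodbye, world.\n")
write(new_path, "Hello, world!\n")

# Produce an Endsley-format patch in a temporary file.
patch = bsdiff(old_path, new_path; format = :endsley)

# No format keyword: the format is detected from the patch header.
restored = bspatch(old_path, patch)
matches = read(restored) == read(new_path)

# Restricting to :classic should make bspatch refuse the Endsley patch.
rejected = try
    bspatch(old_path, patch; format = :classic)
    false
catch
    true
end

println("round trip ok: ", matches, ", mismatched format rejected: ", rejected)
```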

bsindex

bsindex(old, [ index ]) -> index

Save index data (a sorted suffix array) for the content of old into index. All arguments can be strings or IO handles. If no index argument is provided, the index data is saved to a temporary file whose path is returned.

The index can be passed to bsdiff to speed up the diff computation by passing (old, index) as the first argument instead of just old. Since indexing the old data is the slowest part of generating a diff, precomputing the index and reusing it can significantly speed up generating diffs from the same old file to multiple different new files.
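A short sketch of index reuse follows: the old file is indexed once, and the same index is then used to diff against several new files. The file names, contents, and the number of new versions are invented for illustration.

```julia
using BSDiff

dir = mktempdir()
base = repeat("The quick brown fox jumps over the lazy dog.\n", 100)

# One old file and several hypothetical new versions of it.
old_path = joinpath(dir, "old.bin")
write(old_path, base)
new_paths = map(1:3) do i
    path = joinpath(dir, "new_$i.bin")
    write(path, base * "revision $i\n")
    path
end

# Index the old data once; with no second argument the index goes to a temporary file.
index_path = bsindex(old_path)

# Reuse the index for every diff by passing (old, index) as the first argument.
patches = [bsdiff((old_path, index_path), target) for target in new_paths]

# Each patch, applied to the old file, should reproduce its target.
all_ok = all(read(bspatch(old_path, p)) == read(t) for (p, t) in zip(patches, new_paths))
println("all patches round-trip: ", all_ok)
```

The benefit grows with the size of the old file, since the indexing cost is paid once instead of once per diff.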

Usage Example

julia> cd(mktempdir())

julia> open("goodbye.txt", write=true) do io
           println(io, "Goodbye, world.")
       end

julia> open("hello.txt", write=true) do io
           println(io, "Hello, world!")
       end

julia> using BSDiff

julia> patch = bsdiff("goodbye.txt", "hello.txt");

julia> bspatch("goodbye.txt", "hello_copy.txt", patch)
"hello_copy.txt"

julia> read(ans, String)
"Hello, world!\n"

Reproducibility

Even though this package produces patch files that are compatible with the classic and Endsley bsdiff tools, the patch files it generates may not be identical for a few reasons:

  1. The bzip2 compression used by the package and by the commands may have different settings and produce different results—in general compression libraries like bzip2 don't guarantee perfect reproducibility.

  2. The uncompressed patch produced by this package is sometimes better than the one produced by the command line tool due to a bug in the way the command uses memcmp to do string comparison. See this pull request for details.

The exact output produced by this library will also not necessarily remain identical in the future—there are many valid patches for the same old and new data. Improvements to the speed and quality of the patch generation algorithm may lead to different outputs in the future. However, the patch format is simple and stable: it is guaranteed that newer versions of the package will be able to apply patches produced by older versions and vice versa.
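Because patch bytes may differ between tools and versions, tests are more robust when they check what a patch produces rather than comparing patch files directly. The sketch below shows one way to do that; patch_roundtrips is a hypothetical helper written for this example and is not part of the package.

```julia
using BSDiff

# Hypothetical helper (not part of BSDiff): true if applying `patch`
# to `old_path` reproduces the bytes of `new_path` exactly.
function patch_roundtrips(old_path::AbstractString, new_path::AbstractString, patch::AbstractString)
    result = bspatch(old_path, patch)  # output goes to a temporary file
    try
        return read(result) == read(new_path)
    finally
        rm(result, force = true)       # clean up the temporary output
    end
end

# Illustrative inputs in a scratch directory.
dir = mktempdir()
old_path = joinpath(dir, "old.txt")
new_path = joinpath(dir, "new.txt")
write(old_path, "Goodbye, world.\n")
write(new_path, "Hello, world!\n")

patch = bsdiff(old_path, new_path)
roundtrip_ok = patch_roundtrips(old_path, new_path, patch)
println("patch reproduces new file: ", roundtrip_ok)
```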