INTRODUCTION
The files pinsketch.h, sketch.cpp, differ.cpp, io.cpp and bch.cpp contain an implementation of PinSketch, the BCH-based secure sketch from "Fuzzy Extractors: How to Generate Strong Keys from Biometrics and Other Noisy Data" by Dodis, Ostrovsky, Reyzin and Smith (SIAM Journal on Computing, volume 38, number 1, pages 97-139, 2008; on-line at http://eprint.iacr.org/2003/235 and http://arxiv.org/abs/cs/0602007). The implementation is by Kevin Harmon and Leonid Reyzin ([email protected]). It uses Victor Shoup's NTL (A Library for doing Number Theory), which must be installed before it will compile; see http://www.shoup.net.
Given an input set A of any number of (nonzero) m-bit strings and a parameter t, the program "sketch" will produce as output a sketch of A of size tm bits. Then, if the size of the symmetric difference between sets A and B is at most t, the program "differ" will find the symmetric difference between A and B given only B and the sketch of A (thus, in particular, recovering elements of A that are not in B without seeing A). In fact, "differ" can work given just the sketches of A and B (even if the sketches were computed with different values of t, as long as the number of differences is at most the smaller of the two).
These programs provide an efficient way to find differences between two sets without having to communicate the sets. They also allow for information-reconciliation with minimum information leakage. For more details and applications, see the aforementioned "Fuzzy Extractors" paper.
See also companion implementation of Improved Juels-Sudan (IJS) Secure Sketch, another secure sketch from the same paper, at http://www.cs.bu.edu/~reyzin/code/fuzzy.html. Note that the user interface for the two implementations is slightly different (in particular, because IJS, unlike this implementation, cannot produce symmetric differences from only two sketch files), that PinSketch is more efficient, and that PinSketch, unlike IJS, does not require the two sets of be of the same size.
The mathematical meat of the implementation is in bch.cpp, which is based on "Syndrome Encoding and Decoding of BCH Codes in Sublinear Time" (excerpted from the aforementioned "Fuzzy Extractors" paper). The code in bch.cpp is of greatest general interest; the rest of the code is specific to the formats chosen by this implementation.
WARNINGS AND LIMITATIONS
The programs will not behave correctly if any of the following occurs.
-
There are duplicate elements in A or duplicate elements in B (i.e., if A or B are multisets).
-
A or B contains a string of all zeroes.
-
t is greater than or equal to 2^{m-1}
-
The input or sketch files do not follow the prescribed format.
-
The size of the symmetric difference between the two sets is greater than t (then "differ" will output an error message unless there is a set C that has the same sketch as A and whose symmetric difference with B has no more than t elements, in which case C will be output).
Note that cases 1, 2, 3, and 4 can be taken care of at the input stage (but currently are not). Case 5 is impossible to fix completely, although there are techniques that will reduce the likelihood of an incorrect output at the expense of a larger sketch.
USAGE DETAILS
To compile, make sure you have NTL (A Library for doing Number Theory) installed (see http://www.shoup.net) and issue command
make
If the compilation fails, see the comments in the file called Makefile. Once everything compiles, invoking
./sketch A.set
will produce A.ss (note that file must end in .set). The format of A.set is
t= m= [ ... ]
where is a nonzero m-bit integer written in decimal.
For example,
t=2 m=10 [2 347 532 87 876 39]
(Note that the choice to have inputs be decimal integers was arbitrary, and the program can be easily modified to accept other inputs. Also note that no spaces around the equal signs are allowed; otherwise, spaces and linebreaks can be added anywhere.)
The sketch file A.ss will contain information on the input parameters and generating polynomial of GF(2^m), followed by the tm bits of the sketch in human-readable form as t elements of GF(2^m). Of course, in a system where the input parameters are fixed, only the tm bits are needed.
Invoking
./differ A.ss B.set
will output the symmetric difference between A and B if it has at most t elements. More generally, if a set C exists such that the symmetric difference between B and C is of size at most t and the sketch of C is equal to the sketch in A.ss, then the symmetric difference between B and C will be output (note that such a C is unique if it exists). If such a C does not exist, then an error message will be printed.
Note that A and B are sets, not sequences: the order of elements doesn't matter, and, therefore, the order of elements in the symmetric difference is arbitrary. The format of B.set is the same as of A.set; however, its t and m values will be ignored.
Invoking
./differ A.ss B.ss
(assuming B.ss was computed using sketch B.set) will also output the symmetric difference between A and B (with the same generalization). If A and B were computed with respect to different m (or the same m but different generating polynomials, which should not happen the way sketch is currently coded), an error message will be output. If A.ss was computed with respect to t1 and B.ss was computed with respect to t2, then the difference must have no more than min(t1, t2) elements; else an error message will be output, or possibly an incorrect answer.
Sample files X.set, Y.set X.ss, Y.ss are included with the implementation.