-
Notifications
You must be signed in to change notification settings - Fork 5
HashSplit4J/hashsplit-lib
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
This is a library for splitting files into chunks which are fairly stable with respect to file changes, which allows those chunks to be used as the basis for efficient file syncronisation. Latest build: 2.4.1 http://milton.io/maven/hashsplit4j/hashsplit4j-lib/2.4.1/hashsplit4j-lib-2.4.1.jar Maven: The hashsplit4j jar is published to the milton.io maven repo. To use it add the dependency as normal, and add the milton repo.. <dependencies> ... <dependency> <groupId>hashsplit4j</groupId> <artifactId>hashsplit4j-lib</artifactId> <version>2.4.1</version> </dependency> </dependencies> <repositories> <repository> <id>milton-repo</id> <url>http://milton.io/maven</url> </repository> </repositories> Efficiency: By "efficient" I mean: 1. Efficient with respect to network bandwidth because only changed data needs to be syncronized 2. Efficient with respect to storage, since file versions can be persisted with only changes across versions being stored 3. Efficient with respect to CPU usage, because files can be easily reconstituted This library is inspired by BUP - https://github.com/apenwarr/bup/blob/master/DESIGN One major design choice, different to BUP, is that instead of implementing a tree of fanouts (as per the BUP design documentation) I've opted for a single level of chunk grouping. I think this makes it much simpler to parse a file, and also much simpler to use the result of the parse. I also think that a single level of grouping will be sufficient for minimising netwok traffic, provided we are more selective then BUP with grouping... BUP looks for boundaries by matching the lowest 13 bits of the rolling checksum, and then groups those chunks when the lowest 17bits are set. This means that the grouping is on average every 15 chunks (4bits=15), which seems to be too granular. I think we would probably want to group sets of 256, ie matching 21 bits. Given that the chunks information ends up 0.25% of the total file size, this means that the chunk groups will be 1/256 * 0.25% = 0.001% of total file size, about 10k for a 1Gb file. So I don't think we need to be any more granular then that. The main parsing class is Parser, which you use like this: InputStream in = Scratch.class.getResourceAsStream(fname); //InputStream in = Scratch.class.getResourceAsStream("test1.txt"); MemoryHashStore hashStore = new MemoryHashStore(); MemoryBlobStore blobStore = new MemoryBlobStore(); Parser parser = new Parser(); List<Long> megaCrcs = parser.parse(in, blobStore, hashStore); Note that there are 2 abstraction layers for storing the parse result - HashStore: for storing the tree of blobs in groups which is the series of chunks or blobs in the file - BlobStore: for storing and retrieving the actual chunks/blobs public interface HashStore { public void onChunk(long crc, int start, int finish, byte[] bytes); public void onFanout(long crc, List<Long> childCrcs); public List<Long> getCrcsFromFanout(Long fanoutCrc); public byte[] getBlob(Long crc); } public interface BlobStore { void setBlob(String hash, byte[] bytes); byte[] getBlob(String hash); boolean hasBlob(String hash); } Files can be reconstituted with the Combiner class: Combiner combiner = new Combiner(); MemoryHashStore hashStore = new MemoryHashStore(); MemoryBlobStore blobStore = new MemoryBlobStore(); ByteArrayOutputStream bout = new ByteArrayOutputStream(); combiner.combine(megaCrcs, blobStore, hashStore, bout); There is now a HttpBlobStore which can store blobs to a HTTP server. There is also a module in the hashsplit4j project which provides such an HTTP server. Work is currently under way (as of 1/1/2014) to implement efficient syncronisation between these http servers (rsync is hell with terabytes of blobs)
About
Implementation of the BUP-like hashsplitting algorithm in hava
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published