Hadoop Snappy Reader

A small Go library that provides a reader for Hadoop Snappy encoded data. See the Go package documentation for more information and usage examples.

There are currently no plans to implement a writer, since the main purpose of this library is to read data already produced by the Hadoop ecosystem. However, we are open to extending the library to support a writer or other use cases if there is interest.

Developing

Prerequisites

Install Go

Run Tests

go test ./...

Creating Test Data

1. Install snzip
   - Mac: brew install snzip
   - Other: see the snzip installation instructions
2. Add the uncompressed file to testdata/
3. Create the compressed file with snzip -t hadoop-snappy -k testdata/{uncompressed file}

Release

Be sure to understand how Go Module publishing works, especially semantic versioning.

To release, create a new semantically versioned tag and push it.

# Create a new semantically versioned tag with release notes
git tag -a v1.0.0 -m "release notes"

# Push the tag to the remote repository
git push origin v1.0.0

Hadoop-Snappy Stream Encoding Format

The Hadoop format of snappy is similar to regular snappy block encoding, except that instead of compressing the input into one big block, Hadoop creates a stream of frames. Each frame contains one or more blocks, each of which can be decoded independently, and a stream contains one or more frames.

Each FRAME begins with a 4 byte header, which represents the total length of the frame after being DECOMPRESSED (i.e. once we're done decompressing the frame, this is how long the decompressed frame will be). This 4 byte header is a big endian encoded uint32. The header is not included in the total length of the frame.

Each BLOCK in the frame also begins with a 4 byte header that is the COMPRESSED length of the block (i.e. how many bytes we need to read from the stream to get the entire block before we can decompress it). This header is also a big endian encoded uint32. The header is not included in the total length of the block.
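
The two headers described above can be read with the standard library alone. The following is a minimal sketch, not this library's implementation: the function names readLength, readFirstBlock, and firstBlockFromBytes are hypothetical, and the sample bytes are hand-built on the assumption that a raw snappy block starts with a varint of its decompressed length followed by a literal element. It only extracts the lengths and the still-compressed payload of the first block; it does not decompress anything.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// readLength reads one 4-byte big-endian uint32 length header from r.
// Both frame headers and block headers use this encoding.
func readLength(r io.Reader) (uint32, error) {
	var hdr [4]byte
	if _, err := io.ReadFull(r, hdr[:]); err != nil {
		return 0, err
	}
	return binary.BigEndian.Uint32(hdr[:]), nil
}

// readFirstBlock reads a frame header, then the first block header and the
// block's compressed payload. frameLen is the DECOMPRESSED length of the
// whole frame; len(block) is the COMPRESSED length of the first block.
// Neither length includes the 4-byte header itself.
func readFirstBlock(r io.Reader) (frameLen uint32, block []byte, err error) {
	if frameLen, err = readLength(r); err != nil {
		return 0, nil, fmt.Errorf("frame header: %w", err)
	}
	blockLen, err := readLength(r)
	if err != nil {
		return 0, nil, fmt.Errorf("block header: %w", err)
	}
	// Note: a production reader would cap blockLen before allocating,
	// since the value comes straight from untrusted input.
	block = make([]byte, blockLen)
	if _, err = io.ReadFull(r, block); err != nil {
		return 0, nil, fmt.Errorf("block payload: %w", err)
	}
	return frameLen, block, nil
}

// firstBlockFromBytes is a convenience wrapper for in-memory data.
func firstBlockFromBytes(data []byte) (uint32, []byte, error) {
	return readFirstBlock(bytes.NewReader(data))
}

func main() {
	// Hand-built sample (an assumption for illustration): one frame that
	// decompresses to 5 bytes, holding one 7-byte block for "hello".
	data := []byte{
		0, 0, 0, 5, // frame header: decompressed length of the frame
		0, 0, 0, 7, // block header: compressed length of the block
		0x05, 0x10, 'h', 'e', 'l', 'l', 'o', // block payload
	}
	frameLen, block, err := firstBlockFromBytes(data)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("frame decompressed length: %d\n", frameLen)
	fmt.Printf("first block compressed length: %d\n", len(block))
}
```

Using io.ReadFull rather than a plain Read means a short read surfaces as an error instead of a silently truncated header or block.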

The stream structure is as follows:

'['   == start of stream
']'   == end of stream
'|'   == component separator (symbolic only as the actual data has no padding or separators)
'...' == abbreviated

[ frame 1 header | block 1 header | block 1 | block 2 header | block 2 | ... | frame 2 header | block 1 header | block 1 | ... ]
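
Because the stream has no separators, a reader has to work out where each frame ends from the lengths alone. The sketch below walks an in-memory stream and reports its layout without decompressing it. It is an illustration under stated assumptions, not this library's code: the frameInfo type and layout function are hypothetical, and it assumes each raw snappy block begins with a varint holding that block's decompressed length, so a frame is complete once the decompressed lengths of its blocks add up to the frame header.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// frameInfo summarizes one frame (hypothetical type for illustration).
type frameInfo struct {
	DecompressedLen uint32   // value of the 4-byte frame header
	BlockLens       []uint32 // compressed length of each block in the frame
}

// layout walks a Hadoop-snappy stream and returns the frames it contains.
func layout(data []byte) ([]frameInfo, error) {
	var frames []frameInfo
	for len(data) > 0 {
		if len(data) < 4 {
			return nil, errors.New("truncated frame header")
		}
		f := frameInfo{DecompressedLen: binary.BigEndian.Uint32(data)}
		data = data[4:]

		// Consume blocks until they account for the whole frame.
		var seen uint64
		for seen < uint64(f.DecompressedLen) {
			if len(data) < 4 {
				return nil, errors.New("truncated block header")
			}
			n := binary.BigEndian.Uint32(data)
			data = data[4:]
			if uint64(len(data)) < uint64(n) {
				return nil, errors.New("truncated block")
			}
			// Assumption: the block starts with a varint of its
			// decompressed length (the raw snappy block preamble).
			size, w := binary.Uvarint(data[:n])
			if w <= 0 {
				return nil, errors.New("bad snappy block preamble")
			}
			seen += size
			f.BlockLens = append(f.BlockLens, n)
			data = data[n:]
		}
		if seen != uint64(f.DecompressedLen) {
			return nil, errors.New("blocks do not match frame length")
		}
		frames = append(frames, f)
	}
	return frames, nil
}

func main() {
	// Hand-built sample: frame 1 holds "hello" and "abc" as two blocks,
	// frame 2 holds "hi" as one block. All blocks are single literals.
	stream := []byte{
		0, 0, 0, 8, // frame 1 header: 8 decompressed bytes
		0, 0, 0, 7, 0x05, 0x10, 'h', 'e', 'l', 'l', 'o', // block 1
		0, 0, 0, 5, 0x03, 0x08, 'a', 'b', 'c', // block 2
		0, 0, 0, 2, // frame 2 header: 2 decompressed bytes
		0, 0, 0, 4, 0x02, 0x04, 'h', 'i', // block 1
	}
	frames, err := layout(stream)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i, f := range frames {
		fmt.Printf("frame %d: decompressed=%d blocks=%v\n", i+1, f.DecompressedLen, f.BlockLens)
	}
}
```

A streaming reader would follow the same bookkeeping but decompress each block as it arrives, counting the bytes actually produced rather than trusting the preamble.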

The format of each individual snappy block is described in the Snappy format description (format_description.txt in the google/snappy repository).
