Bioinformatics_Project

Compression Algorithms based on BWT and Wavelet Tree for Large K-mers Storage Problem

Calculating k-mers, a substring of length k in DNA sequence data, is an important part of many methods in bioinformatics. Although this rule is simple to understand, computing k-mers in large modern sequential data sets can easily exceed the storage capacity of standard computers. When k is large, the amount of storage is too large and thus calculation speed becomes slower because the value is too large.

In our project, we first studied and applied the BWT transform algorithm and Huffman coding to pre-process k-mer data, and then implemented wavelet tree structure and statistically analyzed the distribution of continuous 0 or 1 vectors in wavelet tree nodes. We designed a suitable Hybrid coding storage structure, in which we applied direct coding, run length and Elias Gamma coding methods, and conducted experiments on different single coding methods regarding to compression rate. Furthermore, k-mers compression was realized by adopting the multi-threaded parallel thinking and compression algorithm principle, and also we adjusted parameters of our program through test data to achieve better compression performance.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
bin/main		bin/main
src/main		src/main
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Bioinformatics_Project

Compression Algorithms based on BWT and Wavelet Tree for Large K-mers Storage Problem

About

Releases

Packages

Languages

b1b2ttt/Bioinformatics_Project

Folders and files

Latest commit

History

Repository files navigation

Bioinformatics_Project

Compression Algorithms based on BWT and Wavelet Tree for Large K-mers Storage Problem

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages