Merge pull request #8 from KirillKryukov/develop
Version 1.2.0
KirillKryukov authored Sep 1, 2020
2 parents 4e737ce + 57c0235 commit 357c79f
Showing 78 changed files with 2,322 additions and 139 deletions.
7 changes: 7 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,12 @@
# NAF Changelog

## 1.2.0 - 2020-09-01
- Added `--sequences` option to _unnaf_.
- Added `--binary-stdout` option to _unnaf_.
- Added `--binary-stderr` option to both _ennaf_ and _unnaf_.
- Updated zstd to v1.4.5.
- Improved compatibility with MinGW.

## 1.1.0 - 2019-10-01
- Added support for RNA, protein and text sequences, enabled with `--rna`, `--protein` and `--text` switches.
- Added report for number of unknown characters at the end of compression.
4 changes: 3 additions & 1 deletion Compress.md
@@ -78,6 +78,8 @@ where they would be otherwise not created due to data being small.
**--no-mask** - Don't store sequence mask (lower/upper characters).
Converts the sequences to upper case before compression.

**--binary-stderr** - Set stderr stream to binary mode. Mainly useful for running test suite on Windows.

**-h**, **--help** - Show usage help.

**-V**, **--version** - Show version.
@@ -195,4 +197,4 @@ you have to switch to text mode (`--text`).

Since both `--dna` and `--text` modes can be used for DNA data, which is better?
Short answer: `--dna` is faster and has stronger compression.
For details, see <a href="http://kirill-kryukov.com/study/naf/benchmark-text-vs-dna-Spur.html">this benchmark page</a>.
For details, see [this benchmark page](http://kirill-kryukov.com/study/naf/benchmark-text-vs-dna-Spur.html).
10 changes: 8 additions & 2 deletions Decompress.md
@@ -23,7 +23,9 @@ Only one of these options should be specified:

**--fastq** - FASTQ format. Will fail if input has no qualities.

**--seq** - All sequences concatenated into one, without names or line breaks.
**--sequences** - One sequence per line, without names or qualities.

**--seq** - All sequences concatenated into one, without names, qualities, or line breaks.

**--number** - Number of sequences.

@@ -59,9 +61,13 @@ Only one of these options should be specified:
**--line-length N** - Divide sequences into lines of N bp, ignoring line length stored in the NAF file.
Effective only for `--fasta` output. Line length of 0 means unlimited lines, i.e., each sequence printed in single line.

**--no-mask** - Ignore mask, useful only for `--fasta` and `--seq` outputs.
**--no-mask** - Ignore mask, useful only for `--fasta`, `--sequences` and `--seq` outputs.
Supported only for DNA and RNA sequences.

**--binary-stderr** - Set stderr stream to binary mode. Mainly useful for running test suite on Windows.

**--binary-stdout** - Set stdout stream to binary mode. Useful for piping decompressed sequences to md5sum on Windows.

**-h**, **--help** - Show usage help.

**-V**, **--version** - Show version.
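
The difference between the `--sequences` and `--seq` outputs documented in this file can be illustrated with a short C sketch. It is not taken from _unnaf_: `render_sequences` and its parameters are hypothetical names, and the absence of a trailing newline in the concatenated form is an assumption based only on the wording "without ... line breaks".

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of unnaf: writes `count` sequences into `out`.
 * one_per_line == true  mimics the documented `--sequences` layout
 *                       (each sequence followed by '\n').
 * one_per_line == false mimics the documented `--seq` layout
 *                       (all sequences concatenated, no line breaks).
 * Returns the number of characters written, or 0 if `out` is too small. */
static size_t render_sequences(const char *const *seqs, size_t count,
                               bool one_per_line, char *out, size_t out_size)
{
    size_t used = 0;
    for (size_t i = 0; i < count; i++)
    {
        size_t len = strlen(seqs[i]);
        size_t extra = one_per_line ? 1 : 0;
        /* Keep one byte free for the terminating NUL. */
        if (used + len + extra + 1 > out_size) { return 0; }
        memcpy(out + used, seqs[i], len);
        used += len;
        if (one_per_line) { out[used++] = '\n'; }
    }
    if (used + 1 > out_size) { return 0; }
    out[used] = '\0';
    return used;
}
```

With the inputs `"ACGT"` and `"ttga"`, the first layout yields two lines and the second yields the single string `ACGTttga`. Names and qualities never appear in either form.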
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,4 +1,4 @@
Copyright (c) 2018-2019 Kirill Kryukov
Copyright (c) 2018-2020 Kirill Kryukov

This software is provided 'as-is', without any express or implied
warranty. In no event will the authors be held liable for any damages
13 changes: 12 additions & 1 deletion README.md
@@ -5,13 +5,19 @@ It's based on [zstd](http://www.zstd.net/), and features strong compression and
It can store DNA, RNA, protein or text sequences, with or without qualities.
It supports FASTA and FASTQ-formatted sequences, ambiguous IUPAC codes, masked sequence,
and has no limit on sequence length or number of sequences.
It supports Unix pipes which allows easy integration into pipelines.
See [NAF homepage](http://kirill-kryukov.com/study/naf/) for details.

| Example benchmark: SILVA 132 LSURef database (610 MB): |
|---------------------------------------------|
| <img src="http://kirill-kryukov.com/study/naf/images/SILVA-132-LSURef-ratio-vs-cd-speed-lin-log.svg" width="49%"> <img src="http://kirill-kryukov.com/study/naf/images/SILVA-132-LSURef-ratio-vs-d-speed-lin-log.svg" width="49%"> |
| From [Sequence Compression Benchmark](http://kirr.dyndns.org/sequence-compression-benchmark/) project - visit for details and more benchmarks. |

More examples:
* [Compactness on DNA data](http://kirr.dyndns.org/sequence-compression-benchmark/?d=Mitochondrion+%28245+MB%29&amp;d=Influenza+%281.22+GB%29&amp;d=Helicobacter+%282.76+GB%29&amp;doagg=1&amp;agg=average&amp;cs=1&amp;cg=1&amp;com=yes&amp;src=all&amp;nt=4&amp;only-best=1&amp;bn=1&amp;bm=ratio&amp;sm=same&amp;tn=10&amp;bs=100&amp;rr=gzip-9&amp;tm0=name&amp;tm1=size&amp;tm2=ratio&amp;tm3=ctime&amp;tm4=dtime&amp;tm5=cdtime&amp;tm6=tdtime&amp;tm7=empty&amp;gm=same&amp;cyl=lin&amp;ccw=1500&amp;cch=500&amp;sxm=ratio&amp;sxmin=0&amp;sxmax=0&amp;sxl=lin&amp;sym=dspeed&amp;symin=0&amp;symax=0&amp;syl=lin&amp;button=Show+column+chart)
* [Compactness vs decompression speed, on human genome](http://kirr.dyndns.org/sequence-compression-benchmark/?d=Homo+sapiens+GCA_000001405.28+(3.31+GB)&amp;doagg=1&amp;agg=sum&amp;cs=1&amp;cg=1&amp;com=yes&amp;src=all&amp;nt=4&amp;bn=1&amp;bm=tdspeed&amp;sm=same&amp;tn=10&amp;bs=100&amp;rr=gzip-9&amp;tm0=name&amp;tm1=size&amp;tm2=ratio&amp;tm3=ctime&amp;tm4=dtime&amp;tm5=cdtime&amp;tm6=tdtime&amp;tm7=empty&amp;gm=same&amp;cyl=lin&amp;ccw=1500&amp;cch=500&amp;sxm=ratio&amp;sxmin=0&amp;sxmax=0&amp;sxl=lin&amp;sym=dspeed&amp;symin=0&amp;symax=0&amp;syl=lin&amp;button=Show+scatterplot)


## Format specification

NAF specification is in public domain: [NAFv2.pdf](NAFv2.pdf)
@@ -78,4 +84,9 @@ If you use NAF, please cite:
[Bioinformatics, 35(19), 3826-3828](https://academic.oup.com/bioinformatics/article/35/19/3826/5364265),
doi: [10.1093/bioinformatics/btz144](https://doi.org/10.1093/bioinformatics/btz144).

Previous preprint: bioRxiv 501130; http://biorxiv.org/cgi/content/short/501130v2, doi: [10.1101/501130](https://doi.org/10.1101/501130).
For compressor benchmark, please cite:

* Kirill Kryukov, Mahoko Takahashi Ueda, So Nakagawa, Tadashi Imanishi (2020)
**"Sequence Compression Benchmark (SCB) database — A comprehensive evaluation of reference-free compressors for FASTA-formatted sequences"**
[GigaScience, 9(7), giaa072](https://academic.oup.com/gigascience/article/9/7/giaa072/5867695),
doi: [10.1093/gigascience/giaa072](https://doi.org/10.1093/gigascience/giaa072).
2 changes: 1 addition & 1 deletion ennaf/src/compressor.c
@@ -1,6 +1,6 @@
/*
* NAF compressor
* Copyright (c) 2018-2019 Kirill Kryukov
* Copyright (c) 2018-2020 Kirill Kryukov
* See README.md and LICENSE files of this repository
*/

2 changes: 1 addition & 1 deletion ennaf/src/encoders.c
@@ -1,6 +1,6 @@
/*
* NAF compressor
* Copyright (c) 2018-2019 Kirill Kryukov
* Copyright (c) 2018-2020 Kirill Kryukov
* See README.md and LICENSE files of this repository
*/

2 changes: 1 addition & 1 deletion ennaf/src/encoders.h
@@ -1,6 +1,6 @@
/*
* NAF compressor
* Copyright (c) 2018-2019 Kirill Kryukov
* Copyright (c) 2018-2020 Kirill Kryukov
* See README.md and LICENSE files of this repository
*/

14 changes: 8 additions & 6 deletions ennaf/src/ennaf.c
@@ -1,12 +1,12 @@
/*
* NAF compressor
* Copyright (c) 2018-2019 Kirill Kryukov
* Copyright (c) 2018-2020 Kirill Kryukov
* See README.md and LICENSE files of this repository
*/

#define VERSION "1.1.0"
#define DATE "2019-10-01"
#define COPYRIGHT_YEARS "2018-2019"
#define VERSION "1.2.0"
#define DATE "2020-09-01"
#define COPYRIGHT_YEARS "2018-2020"

#include "platform.h"
#include "encoders.h"
@@ -18,6 +18,7 @@
static const unsigned char naf_magic_number[3] = { 0x01u, 0xF9u, 0xECu };

static bool verbose = false;
static bool binary_stderr = false;
static bool keep_temp_files = false;
static bool no_mask = false;

@@ -351,6 +352,7 @@ static void parse_command_line(int argc, char **argv)
if (!strcmp(argv[i], "--help")) { show_help(); exit(0); }
if (!strcmp(argv[i], "--version")) { print_version = true; continue; }
if (!strcmp(argv[i], "--verbose")) { verbose = true; continue; }
if (!strcmp(argv[i], "--binary-stderr")) { if (!binary_stderr) { binary_stderr = true; change_stderr_to_binary(); } continue; }
if (!strcmp(argv[i], "--keep-temp-files")) { keep_temp_files = true; continue; }
if (!strcmp(argv[i], "--no-mask")) { no_mask = true; continue; }
if (!strcmp(argv[i], "--fasta")) { set_input_format_from_command_line("fasta"); continue; }
@@ -518,7 +520,7 @@ int main(int argc, char **argv)
fputc_or_die(' ', OUT);

unsigned long long out_line_length = line_length_is_specified ? requested_line_length : longest_line_length;
if (verbose) { msg("Output line length: %" PRINT_ULL "\n", out_line_length); }
if (verbose) { msg("Output line length: %llu\n", out_line_length); }
write_variable_length_encoded_number(OUT, out_line_length);
write_variable_length_encoded_number(OUT, n_sequences);

@@ -558,7 +560,7 @@ int main(int argc, char **argv)

if (!assume_well_formed_input) { report_unexpected_input_char_stats(); }

if (verbose) { msg("Processed %" PRINT_ULL " sequences\n", n_sequences); }
if (verbose) { msg("Processed %llu sequences\n", n_sequences); }
success = true;

return 0;
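
The new `--binary-stderr` branch in `parse_command_line` guards the stream switch so that repeating the flag performs it only once. A minimal sketch of that pattern follows. `struct options` and `parse_flags` are hypothetical names, and the actual stream switch is replaced by a counter so the example has no side effects.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical option set; the field names mirror flags shown in the diff. */
struct options
{
    bool verbose;
    bool binary_stderr;
    bool no_mask;
    int binary_stderr_switches;  /* how many times the stream switch ran */
};

/* Scans argv for a few long options. The stream switch itself is only
 * counted here (a stand-in for change_stderr_to_binary()), so the sketch
 * stays side-effect free. Returns false on the first unknown argument. */
static bool parse_flags(int argc, char **argv, struct options *opt)
{
    for (int i = 1; i < argc; i++)
    {
        if (!strcmp(argv[i], "--verbose")) { opt->verbose = true; continue; }
        if (!strcmp(argv[i], "--no-mask")) { opt->no_mask = true; continue; }
        if (!strcmp(argv[i], "--binary-stderr"))
        {
            /* Guard so that repeating the flag switches the stream once. */
            if (!opt->binary_stderr)
            {
                opt->binary_stderr = true;
                opt->binary_stderr_switches++;
            }
            continue;
        }
        return false;
    }
    return true;
}
```

Handling the flag as soon as it is parsed means any message printed later in argument parsing already goes through the binary-mode stream.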
12 changes: 11 additions & 1 deletion ennaf/src/files.c
@@ -1,6 +1,6 @@
/*
* NAF compressor
* Copyright (c) 2018-2019 Kirill Kryukov
* Copyright (c) 2018-2020 Kirill Kryukov
* See README.md and LICENSE files of this repository
*/

@@ -28,6 +28,16 @@ static void open_input_file(void)
}


static void change_stderr_to_binary(void)
{
#ifdef __MINGW32__
if (_setmode(_fileno(stderr), O_BINARY) == -1) { die("can't set error stream to binary mode\n"); }
#else
if (!freopen(NULL, "wb", stderr)) { die("can't set error stream to binary mode\n"); }
#endif
}


static void open_output_file(void)
{
assert(OUT == NULL);
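
For comparison with `change_stderr_to_binary` in this file, here is a generic sketch that works for any standard stream, which also covers the `--binary-stdout` case. `set_stream_binary` is a hypothetical name. Unlike the commit, which calls `freopen(NULL, "wb", stderr)` outside MinGW, this sketch does nothing on non-Windows systems, on the assumption that POSIX streams make no text/binary distinction.

```c
#include <stdio.h>

#if defined(_WIN32)
#include <fcntl.h>
#include <io.h>
#endif

/* Hypothetical helper: puts an already-open stream (stdout or stderr) into
 * binary mode. Returns 0 on success, -1 on failure. */
static int set_stream_binary(FILE *stream)
{
#if defined(_WIN32)
    /* On Windows, text mode translates "\n" to "\r\n"; _setmode() turns that off. */
    fflush(stream);
    return (_setmode(_fileno(stream), _O_BINARY) == -1) ? -1 : 0;
#else
    /* POSIX streams make no text/binary distinction, so nothing to change. */
    (void)stream;
    return 0;
#endif
}
```

A caller would typically run `set_stream_binary(stdout)` before writing any output, so that checksums of piped data match across platforms.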
19 changes: 8 additions & 11 deletions ennaf/src/platform.h
@@ -1,18 +1,25 @@
/*
* NAF compressor
* Copyright (c) 2018-2020 Kirill Kryukov
* See README.md and LICENSE files of this repository
*/

#ifndef ENNAF_PLATFORM_H
#define ENNAF_PLATFORM_H


#define NDEBUG

#define __USE_MINGW_ANSI_STDIO 1

#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
#include <stdarg.h>
#include <time.h>
#include <ctype.h>
#include <unistd.h>
#include <sys/stat.h>

@@ -25,16 +32,6 @@



#if defined(__MINGW32__) || defined(__MINGW64__) || defined(_WIN32) || defined(_WIN64) || defined(WIN32) || defined(WIN64)
#define PRINT_ULL "I64u"
#define PRINT_SIZE_T "Iu"
#else
#define PRINT_ULL "llu"
#define PRINT_SIZE_T "zu"
#endif



#if defined(__MINGW32__) || defined(__MINGW64__)
#define HAVE_NO_CHMOD
#define HAVE_NO_CHOWN
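
This header drops the `PRINT_ULL` and `PRINT_SIZE_T` macros and defines `__USE_MINGW_ANSI_STDIO` instead, so the rest of the code can use the standard `%llu` and `%zu` conversions directly. A small sketch of the resulting style is shown below. `format_counts` is a hypothetical helper, not a function from the repository.

```c
/* Must come before any standard header so MinGW picks its C99-conforming
 * printf family; on other platforms the macro is simply ignored. */
#define __USE_MINGW_ANSI_STDIO 1

#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: formats two counters with the standard C99
 * conversions, %llu for unsigned long long and %zu for size_t.
 * Returns the snprintf() result. */
static int format_counts(char *buf, size_t buf_size,
                         unsigned long long n_sequences, size_t n_bytes)
{
    return snprintf(buf, buf_size, "%llu sequences, %zu bytes",
                    n_sequences, n_bytes);
}
```

Keeping the conversions inline makes the format strings readable and lets one string literal serve all platforms.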
28 changes: 14 additions & 14 deletions ennaf/src/process.c
@@ -1,6 +1,6 @@
/*
* NAF compressor
* Copyright (c) 2018-2019 Kirill Kryukov
* Copyright (c) 2018-2020 Kirill Kryukov
* See README.md and LICENSE files of this repository
*
* The FASTA/Q parser was originally based on Heng Li's kseq.h.
@@ -78,11 +78,11 @@ static void report_unexpected_char_stats(unsigned long long *n, const char *seq_
for (unsigned i = 0; i < 257; i++) { total += n[i]; }
if (total > 0)
{
msg("input has %" PRINT_ULL " unexpected %s characters:\n", total, seq_type_name);
for (unsigned i = 0; i < 32; i++) { if (n[i] != 0) { msg(" '\\x%02X': %" PRINT_ULL "\n", i, n[i]); } }
for (unsigned i = 32; i < 127; i++) { if (n[i] != 0) { msg(" '%c': %" PRINT_ULL "\n", (unsigned char)i, n[i]); } }
for (unsigned i = 127; i < 256; i++) { if (n[i] != 0) { msg(" '\\x%02X': %" PRINT_ULL "\n", i, n[i]); } }
if (n[256] != 0) { msg(" EOF: %" PRINT_ULL "\n", n[256]); }
msg("input has %llu unexpected %s characters:\n", total, seq_type_name);
for (unsigned i = 0; i < 32; i++) { if (n[i] != 0) { msg(" '\\x%02X': %llu\n", i, n[i]); } }
for (unsigned i = 32; i < 127; i++) { if (n[i] != 0) { msg(" '%c': %llu\n", (unsigned char)i, n[i]); } }
for (unsigned i = 127; i < 256; i++) { if (n[i] != 0) { msg(" '\\x%02X': %llu\n", i, n[i]); } }
if (n[256] != 0) { msg(" EOF: %llu\n", n[256]); }
}
}

@@ -101,7 +101,7 @@ static void unexpected_id_char(unsigned c)
{
if (abort_on_unexpected_code)
{
die("unexpected character '%c' in ID of sequence %" PRINT_ULL "\n", (unsigned char)c, n_sequences + 1);
die("unexpected character '%c' in ID of sequence %llu\n", (unsigned char)c, n_sequences + 1);
}
else { n_unexpected_id_characters[c]++; }
}
@@ -112,7 +112,7 @@ static void unexpected_comment_char(unsigned c)
{
if (abort_on_unexpected_code)
{
die("unexpected character '%c' in comment of sequence %" PRINT_ULL "\n", (unsigned char)c, n_sequences + 1);
die("unexpected character '%c' in comment of sequence %llu\n", (unsigned char)c, n_sequences + 1);
}
else { n_unexpected_comment_characters[c]++; }
}
@@ -123,7 +123,7 @@ static void unexpected_input_char(unsigned c)
{
if (abort_on_unexpected_code)
{
die("unexpected %s code '%c' in sequence %" PRINT_ULL "\n", in_seq_type_name, (unsigned char)c, n_sequences + 1);
die("unexpected %s code '%c' in sequence %llu\n", in_seq_type_name, (unsigned char)c, n_sequences + 1);
}
else { n_unexpected_seq_characters[c]++; }
}
@@ -134,7 +134,7 @@ static void unexpected_quality_char(unsigned c)
{
if (abort_on_unexpected_code)
{
die("unexpected quality code '%c' in sequence %" PRINT_ULL "\n", (unsigned char)c, n_sequences + 1);
die("unexpected quality code '%c' in sequence %llu\n", (unsigned char)c, n_sequences + 1);
}
else { n_unexpected_qual_characters[c]++; }
}
@@ -438,7 +438,7 @@ static void process_well_formed_fastq(void)
c = in_get_until_specific_char('\n', &qual);
if (QUAL.uncompressed_size + qual.length - old_len != read_length)
{
die("quality length of sequence %" PRINT_ULL " doesn't match sequence length\n", n_sequences + 1);
die("quality length of sequence %llu doesn't match sequence length\n", n_sequences + 1);
}

add_length(read_length);
@@ -491,7 +491,7 @@ static void process_non_well_formed_fastq(void)

do { c = in_get_char(); } while (is_eol_arr[c]);
if (c == INEOF) { die("truncated FASTQ input: last sequence has no quality\n"); }
if (c != '+') { die("invalid FASTQ input: can't find '+' line of sequence %" PRINT_ULL "\n", n_sequences + 1); }
if (c != '+') { die("invalid FASTQ input: can't find '+' line of sequence %llu\n", n_sequences + 1); }

c = in_skip_until(is_eol_arr);
if (c == INEOF) { die("truncated FASTQ input: last sequence has no quality\n"); }
@@ -510,7 +510,7 @@ static void process_non_well_formed_fastq(void)
unsigned long long qual_length = QUAL.uncompressed_size + qual.length - old_len;
if (qual_length != read_length)
{
die("quality length of sequence %" PRINT_ULL " (%" PRINT_ULL ") doesn't match sequence length (%" PRINT_ULL ")\n",
die("quality length of sequence %llu (%llu) doesn't match sequence length (%llu)\n",
n_sequences + 1, qual_length, read_length);
}

@@ -519,7 +519,7 @@ static void process_non_well_formed_fastq(void)

do { c = in_get_char(); } while (is_eol_arr[c]);
if (c == INEOF) { break; }
if (c != '@') { die("invalid FASTQ input: Can't find '@' after sequence %" PRINT_ULL "\n", n_sequences); }
if (c != '@') { die("invalid FASTQ input: Can't find '@' after sequence %llu\n", n_sequences); }
}
}
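
The reporting loop in `report_unexpected_char_stats` prints each counter index in one of three forms: a hex escape for non-printable bytes, the character itself for printable ASCII, and `EOF` for index 256. The sketch below isolates that labelling rule. `describe_code` and `EOF_SLOT` are hypothetical names, and the helper assumes `code` is at most 256.

```c
#include <stdio.h>

/* Index 256 is used as the EOF slot, matching the 257-entry counters in the diff. */
enum { EOF_SLOT = 256 };

/* Hypothetical helper: writes a printable label for a counter index
 * (expects code <= 256). */
static void describe_code(unsigned code, char *buf, size_t buf_size)
{
    if (code == EOF_SLOT) { snprintf(buf, buf_size, "EOF"); }
    else if (code >= 32 && code < 127) { snprintf(buf, buf_size, "'%c'", (char)code); }
    else { snprintf(buf, buf_size, "'\\x%02X'", code); }
}
```

Escaping everything outside the printable ASCII range keeps the report readable even when the input contains control characters or arbitrary bytes.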

2 changes: 1 addition & 1 deletion ennaf/src/tables.c
@@ -1,6 +1,6 @@
/*
* NAF compressor
* Copyright (c) 2018-2019 Kirill Kryukov
* Copyright (c) 2018-2020 Kirill Kryukov
* See README.md and LICENSE files of this repository
*/

12 changes: 6 additions & 6 deletions ennaf/src/utils.c
@@ -1,11 +1,11 @@
/*
* NAF compressor
* Copyright (c) 2018-2019 Kirill Kryukov
* Copyright (c) 2018-2020 Kirill Kryukov
* See README.md and LICENSE files of this repository
*/


__attribute__ ((format (printf, 1, 2)))
//__attribute__ ((format (printf, 1, 2)))
static void msg(const char *format, ...)
{
va_list argptr;
@@ -16,7 +16,7 @@ static void msg(const char *format, ...)


__attribute__ ((cold))
__attribute__ ((format (printf, 1, 2)))
//__attribute__ ((format (printf, 1, 2)))
static void warn(const char *format, ...)
{
fputs("ennaf warning: ", stderr);
@@ -28,7 +28,7 @@ static void warn(const char *format, ...)


__attribute__ ((cold))
__attribute__ ((format (printf, 1, 2)))
//__attribute__ ((format (printf, 1, 2)))
static void err(const char *format, ...)
{
fputs("ennaf error: ", stderr);
@@ -40,7 +40,7 @@ static void err(const char *format, ...)


__attribute__ ((cold))
__attribute__ ((format (printf, 1, 2)))
//__attribute__ ((format (printf, 1, 2)))
__attribute__ ((noreturn))
static void die(const char *format, ...)
{
@@ -57,7 +57,7 @@ __attribute__ ((cold))
__attribute__ ((noreturn))
static void out_of_memory(const size_t size)
{
die("can't allocate %" PRINT_SIZE_T " bytes\n", size);
die("can't allocate %zu bytes\n", size);
}
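
The commit comments out the `format (printf, 1, 2)` attributes on `msg`, `warn`, `err` and `die` without stating why. One plausible reason, which is a guess, is that on MinGW the `printf` archetype can be checked against the Microsoft runtime rules and then warns about `%llu` and `%zu`. The sketch below shows one way to keep compile-time format checking under that assumption. `PRINTF_LIKE` and `format_message` are hypothetical names, and selecting `gnu_printf` on MinGW is this sketch's choice, not something the commit does.

```c
#include <stdarg.h>
#include <stdio.h>

/* Pick a format-checking attribute, if the compiler has one. On MinGW with
 * GCC, `gnu_printf` asks for checking against the GNU/C99 rules. */
#if defined(__MINGW32__) && defined(__GNUC__) && !defined(__clang__)
#define PRINTF_LIKE(fmt, args) __attribute__ ((format (gnu_printf, fmt, args)))
#elif defined(__GNUC__)
#define PRINTF_LIKE(fmt, args) __attribute__ ((format (printf, fmt, args)))
#else
#define PRINTF_LIKE(fmt, args)
#endif

/* Hypothetical helper: like msg(), but formats into a caller buffer.
 * The format string is parameter 3 and the variadic arguments start at 4. */
PRINTF_LIKE(3, 4)
static int format_message(char *buf, size_t buf_size, const char *format, ...)
{
    va_list argptr;
    va_start(argptr, format);
    int n = vsnprintf(buf, buf_size, format, argptr);
    va_end(argptr);
    return n;
}
```

With the attribute in place, a mismatch such as passing an `int` for `%zu` is reported by the compiler instead of surfacing at run time.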

