Skip to content

Commit

Permalink
First write-up read_id and read_filter
Browse files Browse the repository at this point in the history
  • Loading branch information
hcdenbakker authored Nov 12, 2018
1 parent 029f6a8 commit f43128b
Showing 1 changed file with 59 additions and 0 deletions.
59 changes: 59 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -134,6 +134,65 @@ OPTIONS:
-t, --threads <threads> number of threads to use, if not set the maximum available number threads will be
used
```
### 1. Create index

Use a index you created previously based on k-mers:
```./target/release/colorid build -r ref_file_example.txt -b test -k 31 -s 50000000 -n 4```
Or build a much smaller (and thus more compute efficient) index based on minimizers:
```./target/release/colorid build -r ref_file_example.txt -b test -k 27 -mv 21 -s 50000000 -n 4```

### 2. Classify reads

```./target/release/colorid read_id -b test.mxi -q your_reads_forward.fastq.gz your_reads_reverse.fastq.gz -n your_reads```

This will classify your reads using the minimizer index (indicated by the .mxi extension) and default parameters. You can speed up the classifier by using less kmers/minimizers per read as input using the `-d` flag, e.g., `-d 10` will use every 10th k-mer as input for the classifier. The `-n` flag indicates the prefix that is used for your output. The output consists of 2 files a `PREFIX_reads.txt` file and a `PREFIX_counts.txt` file. The `PREFIX_reads.txt` will give the results of the classifier per read(-pair):
```
@ERR2505816.7 HWI-H217:72:C5RKWACXX:4:1213:1068:60918 length=101 Escherichia_coli_A 10 14 accept
@ERR2505816.8 HWI-H217:72:C5RKWACXX:4:1114:3341:39942 length=101 Nannocystis_exedens 4 23 reject
```
The first column contains the name of the read(-pair), the second column the taxonomic classification, the third column the number of k-mers/minimers supporting this classification, the fourth column the total number of k-mers/minimers used as input for the classification and the fifth column indicates if this classification is rejected or accepted given the false positive probability associated wuth the organism and a p-value (default is 0.001). The `PREFIX_counts.txt` summarizes the total counts per taxon. I like to use `sort` and `head` to get the top hits:
```sort -grk2 PREFIX_counts.txt|head -10```

For a poultry associated metagenome (ERR2505816) and an index based on the GTDB (http://gtdb.ecogenomic.org/), this is the top 10:

```
reject 2574683
Anaerotignum_lactatifermentans 46565
Pseudoflavonifractor_capillosus 27420
Escherichia_coli_B 25555
Angelakisella_massiliensis 22917
Oscillibacter_sp6 22465
Escherichia_coli 20033
Fournierella_massiliensis 17941
Dorea_faecis 17337
Subdoligranulum_variabile 13333
```

### 3. Read filtering based on your classification

The `read_filter` subcommand can be used to create files that either consist of a single taxon, or which have a single taxon excluded:

```USAGE:
colorid read_filter [FLAGS] --classification <classification> --files <files> --prefix <prefix> --taxon <taxon>
FLAGS:
-e, --exclude If set('-e or --exclude'), reads for which the classification contains the taxon name will be
excluded
-h, --help Prints help information
-V, --version Prints version information
OPTIONS:
-c, --classification <classification> tab delimited read classification file generated with the read_id
subcommand
-f, --files <files> query file(-s)fastq.gz
-p, --prefix <prefix> prefix for output file(-s)
-t, --taxon <taxon> taxon to be in- or excluded from the read file(-s)
```
Here is an example:
```./target/release/colorid read_filter -c PREFIX_reads.txt -f your_reads_forward.fastq.gz your_reads_reverse.fastq.gz -p your_reads -t Dorea```

This will generate a set of paired-end files(`your_reads_Dorea_1.fq.gz, your_reads_Dorea_2.fq.gz`) containing reads with a classification containing `Dorea`. If the `-e` flag is added, the files will consist of all reads, except those that contain `Dorea` in the classification. The current version of the read_filter command does not take into account if a result has been statistically accepted or rejected, this option will be added in the near future.


## Acknowledgements
Lee Katz (https://github.com/lskatz) for his help with setting up Travis CI!

0 comments on commit f43128b

Please sign in to comment.