First write-up read_id and read_filter

hcdenbakker · Nov 12, 2018 · f43128b · f43128b
1 parent 029f6a8
commit f43128b
Showing 1 changed file with 59 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -134,6 +134,65 @@ OPTIONS:
     -t, --threads <threads>            number of threads to use, if not set the maximum available number threads will be
                                        used
 ```
+### 1. Create index
+
+Use a index you created previously based on k-mers:
+```./target/release/colorid build -r ref_file_example.txt -b test -k 31 -s 50000000 -n 4```
+Or build a much smaller (and thus more compute efficient) index based on minimizers:
+```./target/release/colorid build -r ref_file_example.txt -b test -k 27 -mv 21 -s 50000000 -n 4```
+
+### 2. Classify reads
+
+```./target/release/colorid read_id -b test.mxi -q your_reads_forward.fastq.gz your_reads_reverse.fastq.gz -n your_reads```
+
+This will classify your reads using the minimizer index (indicated by the .mxi extension) and default parameters. You can speed up the classifier by using less kmers/minimizers per read as input using the `-d` flag, e.g., `-d 10` will use every 10th k-mer as input for the classifier. The `-n` flag indicates the prefix that is used for your output. The output consists of 2 files a `PREFIX_reads.txt` file and a `PREFIX_counts.txt` file. The `PREFIX_reads.txt` will give the results of the classifier per read(-pair):
+```
+@ERR2505816.7 HWI-H217:72:C5RKWACXX:4:1213:1068:60918 length=101        Escherichia_coli_A      10      14      accept
+@ERR2505816.8 HWI-H217:72:C5RKWACXX:4:1114:3341:39942 length=101        Nannocystis_exedens     4       23      reject
+```
+The first column contains the name of the read(-pair), the second column the taxonomic classification, the third column the number of k-mers/minimers supporting this classification, the fourth column the total number of k-mers/minimers used as input for the classification and the fifth column indicates if this classification is rejected or accepted given the false positive probability associated wuth the organism and a p-value (default is 0.001). The `PREFIX_counts.txt` summarizes the total counts per taxon. I like to use `sort` and `head` to get the top hits:
+```sort -grk2 PREFIX_counts.txt|head -10```
+
+For a poultry associated metagenome (ERR2505816) and an index based on the GTDB (http://gtdb.ecogenomic.org/), this is the top 10: 
+
+```
+reject	2574683
+Anaerotignum_lactatifermentans	46565
+Pseudoflavonifractor_capillosus	27420
+Escherichia_coli_B	25555
+Angelakisella_massiliensis	22917
+Oscillibacter_sp6	22465
+Escherichia_coli	20033
+Fournierella_massiliensis	17941
+Dorea_faecis	17337
+Subdoligranulum_variabile	13333
+```
+
+### 3. Read filtering based on your classification
+
+The `read_filter` subcommand can be used to create files that either consist of a single taxon, or which have a single taxon excluded:
+
+```USAGE:
+    colorid read_filter [FLAGS] --classification <classification> --files <files> --prefix <prefix> --taxon <taxon>
+
+FLAGS:
+    -e, --exclude    If set('-e or --exclude'), reads for which the classification contains the taxon name will be
+                     excluded
+    -h, --help       Prints help information
+    -V, --version    Prints version information
+
+OPTIONS:
+    -c, --classification <classification>    tab delimited read classification file generated with the read_id
+                                             subcommand
+    -f, --files <files>                      query file(-s)fastq.gz
+    -p, --prefix <prefix>                    prefix for output file(-s)
+    -t, --taxon <taxon>                      taxon to be in- or excluded from the read file(-s)
+```
+Here is an example:
+```./target/release/colorid read_filter -c PREFIX_reads.txt -f your_reads_forward.fastq.gz your_reads_reverse.fastq.gz -p your_reads -t Dorea```
+
+This will generate a set of paired-end files(`your_reads_Dorea_1.fq.gz, your_reads_Dorea_2.fq.gz`) containing reads with a classification containing `Dorea`. If the `-e` flag is added, the files will consist of all reads, except those that contain `Dorea` in the classification. The current version of the read_filter command does not take into account if a result has been statistically accepted or rejected, this option will be added in the near future. 
+
 
 ## Acknowledgements
 Lee Katz (https://github.com/lskatz) for his help with setting up Travis CI!