caporaso-lab · gregcaporaso · Aug 25, 2016 · Jul 13, 2016 · lkursell · Jul 18, 2016
diff --git a/.coveragerc b/.coveragerc
@@ -0,0 +1,19 @@
+# this file is based on the examples provided on scikit-learn's .coveragerc
+# and is adapted from scikit-bio's .coveragerc
+
+[run]
+omit =
+    */tests*
+    */__init__.py
+source = sourcetracker
+branch = True
+include = */sourcetracker/*
+
+[report]
+exclude_lines =
+    pragma: no cover
+    raise NotImplementedError
+    if __name__ == .__main__.:
+omit =
+    */tests*
+    */__init__.py
diff --git a/.gitignore b/.gitignore
@@ -69,3 +69,4 @@ target/
 # example output
 data/tiny-test/mixing_proportions
 data/tiny-test/source_loo
+data/tiny-test/example*
diff --git a/.travis.yml b/.travis.yml
@@ -12,11 +12,13 @@ before_install:
 install:
   - conda create --yes -n st2 python=$PYTHON_VERSION nose numpy scipy h5py flake8 scikit-bio=0.4.0
   - source activate st2
+  - pip install coveralls
   - pip install .
 script:
-  - nosetests
+  - nosetests --with-coverage
   - flake8 sourcetracker setup.py
   - sourcetracker2 gibbs --help
-  - cd data/tiny-test/
-  - sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example1/
-  - sourcetracker2 gibbs -i otu_table.biom -m alt-map.txt -o example2/ --source_sink_column source-or-sink --source_column_value src --sink_column_value snk --source_category_column sample-type
+  - sourcetracker2 gibbs -i data/tiny-test/otu_table.biom -m data/tiny-test/map.txt -o example1/
+  - sourcetracker2 gibbs -i data/tiny-test/otu_table.biom -m data/tiny-test/alt-map.txt -o example2/ --source_sink_column source-or-sink --source_column_value src --sink_column_value snk --source_category_column sample-type
+after_success:
+  - coveralls
diff --git a/ChangeLog.md b/ChangeLog.md
@@ -2,6 +2,13 @@
 
 ## 2.0.1-dev (changes since 2.0.1 go here)
 
+ * A candidate public API has been created for both normal sink/source
+   prediction and leave-one-out (LOO) classification. These calls are 
+   ``_gibbs`` and ``_gibbs_loo``.
+ * The per-sink feature assignments are recorded for every run and written to
+   the output directory. They are named as X.contingency.txt where X is the
+   name of a sink.
+
 ## 2.0.1
 
   * Initial alpha release.

diff --git a/README.md b/README.md
@@ -1,20 +1,101 @@
-[![Build Status](https://travis-ci.org/biota/sourcetracker2.svg?branch=master)](https://travis-ci.org/biota/sourcetracker2)
-
 # SourceTracker2
+[![Build Status](https://travis-ci.org/biota/sourcetracker2.svg?branch=master)](https://travis-ci.org/biota/sourcetracker2) [![Coverage Status](https://coveralls.io/repos/github/biota/sourcetracker2/badge.svg)](https://coveralls.io/github/biota/sourcetracker2)
 
 SourceTracker was originally described in [Knights et al., 2011](http://www.ncbi.nlm.nih.gov/pubmed/21765408).
 If you use this package, please cite the original SourceTracker paper linked
 above pending publication of SourceTracker 2.
 
+# API vs. CLI
+
+There are two ways to access the SourceTracker 2 functionality: via the command
+line (CLI) or the Python API. Users seeking to replicate the functionality
+of SourceTracker 1 should use the command line functionality (`sourcetracker2 gibbs`).
+Programmatic users are encouraged to use the API (exposed via `gibbs` and `gibbs_loo`).
+
+The help documentation is broken down into sections with separate subsections for
+API and CLI usage. 
+
+# File Formats
+
+## Command Line 
+For descriptions of all file formats and options, please see the help
+documentation, available with the command `sourcetracker2 gibbs --help`. 
+
+This script requires a feature X sample contingency table (traditionally an OTU
+table) and sample metadata (traditionally a mapping file). 
+
+The feature X sample table is an `nXm` table (`n` rows, `m` columns) with feature IDs in the first column, and sample IDs in the first row. The values in each 'cell' of the table
+must be integer counts. 
+
+The sample metadata file is an `sXk` table (`s` rows, `k` columns) with sample IDs in
+the first column, and metadata headers in the first row. The values in each 'cell' of the
+table can be any type of data, and represent information about each sample.
+
+Any feature table that can be read by the `biom-format >= 2.1.4` package will be
+acceptable input. For example file formats please look at the test mapping file
+and feature (OTU) tables we have included [here](https://github.com/biota/sourcetracker2/tree/master/data/tiny-test).
+
+## API
+For descriptions of the requirements, please see documentation in the `gibbs`
+function. Very briefly, this function wraps the main workhorse function `_gibbs_sampler`
+and exposes all the parameters necessary to control the behavior of the Gibbs sampling,
+as well as the parallel functionality.
+
+A superficial but important difference from the CLI framework is that, internally, 
+SourceTracker 2 represents all tables as sample X feature (samples are rows,
+columns are features). This reflects choices in Dan's original code and
+eases metadata-based subsetting of tables. The API functions expect data in
+sample X feature format.
+
+# Preprocessing
+
+## Command line
+Input feature data should be counts. If non-count data (e.g. the count of
+feature i in sample j was 4.63) is passed, the 'ceiling' of the data will be
+taken. This means that each non-integer count will be rounded up to the nearest
+larger integer.
+
+Rarefaction is performed by default at 1000 seqs/sample for both sinks and
+sources. This is done to prevent samples with more counts from dominating the
+contributions. Rarefaction depth can be set (or entirely disabled) with the ``--source_rarefaction_depth`` and ``--sink_rarefaction_depth`` parameters. Source
+samples which are collapsed are rarefied after collapse.
+
+Samples which are not present in both the input feature table and the metadata
+are excluded from the analysis.
+
+Samples which come from the same source environment are 'collapsed', meaning
+their mean value for each feature is computed and used in the analysis. See the
+'Theory' section below for a discussion of this approach.
+
+## API
+The `gibbs` and `gibbs_loo` functions do minimal preprocessing on the input data.
+The source and sink dataframes are treated as final (no collapsing is done on them),
+i.e. each sink is treated independently. The data is **not** rarefied; the tables
+are expected to have the desired row sums.
+
+
+# Output
+
+## Command line
+There are two default output files, `mixing_proportions.txt` and
+`mixing_proportions_stds.txt`. `mixing_proportions.txt` is a tab-separated contingency table with sinks as rows and sources as columns. The values in the table are the
+mean fractional contributions of each source to each sink. `mixing_proportions_stds.txt`
+has the same format, but contains the standard deviation of each fractional contribution.
+
+Optionally, you can create per-sink feature X sample tables with the `--per_sink_feature_assignments` flag. The per-sink feature tables are labeled with
+the name of the sink. For example, if we have a sink called 'hand_sample3' the
+output feature table would be 'hand_sample3.feature_table.txt'. These tables record the
+origin source of each sink sequence (count of a feature).
+
+## API
+The outputs of the `gibbs` and `gibbs_loo` functions are identical to the command line
+outputs, just in dataframe form. 
+
+
 # Documentation
 
 This script replicates and extends the functionality of Dan Knights's
-SourceTracker R package.
-
-The `mapping file` which describes the `sources` and `sinks` must be
-formatted in the same way it was for the SourceTracker R package. Specifically,
-there must be a column `SourceSink` and a column `Env`. For an example, look
-at `sourcetracker2/data/tiny-test/`.
+SourceTracker R package. 
 
 A major improvement in this version of SourceTracker is the ability to run it in parallel.
 Currently, parallelization across a single machine is
@@ -35,7 +116,7 @@ SourceTracker2 is Python 3 software. The easiest way to install it is using Anac
 To install SourceTracker 2 using Anaconda, run the following commands:
 
 ```bash
-conda create -n st2 python=3.5 numpy scipy h5py hdf5 matplotlib
+conda create -n st2 python=3.5 numpy scipy scikit-bio=0.4.3 biom-format h5py hdf5
 source activate st2
 pip install sourcetracker
 ```
@@ -141,7 +222,7 @@ These usage examples expect that you are in the directory
 `sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example1/`
 
 **Calculate the proportion of each source in each sink using an alternate sample metadata mapping file where samples are described differently.**  
-`sourcetracker2 gibbs -i otu_table.biom -m alt-map.txt -o example2/ --source_sink_column source-or-sink --source_column_value src --sink_column_value snk --source_category_column sample-type`
+`sourcetracker2 gibbs -i otu_table.biom -m alt-map.txt -o example2/ --source_sink_column source-or-sink --source_column_value src --sink_column_value sink --source_category_column sample-type`
 
 **Calculate the class label (i.e. 'Env') of each source using a leave one out
 strategy**    
@@ -151,19 +232,21 @@ strategy**
 `sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example4/ --burnin 100`
 
 **Calculate the proportion of each source in each sink, using a sink
-rarefaction depth of 1500**    
-`sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example5/ --sink_rarefaction_depth 2500`
+rarefaction depth of 1700 and source rarefaction depth of 2000**    
+`sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example5/ --sink_rarefaction_depth 1700 --source_rarefaction_depth 2000`
 
 **Calculate the proportion of each source in each sink, using ipyparallel to run in parallel with 5 jobs**  
 `sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example6/ --jobs 5`
 
+**Calculate the proportion of each source in each sink, using ipyparallel to run in parallel with 5 jobs. Write the per sink feature tables (what SourceTracker 1 called
+'full output'). These are feature by sample tables indicating the origin source of each sequence (each count of a feature).**  
+`sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example7/ --jobs 5 --per_sink_feature_assignments`
+
+
 # Miscellaneous
 
 The current implementation of SourceTracker 2 does not contain functionality for
 visualization of results or auto-tuning of the parameters (`alpha1`, `alpha1`,
 etc.). For an example of how you might visualize the data, please see
 this [Juypter notebook](https://github.com/biota/SourceTracker2/blob/master/ipynb/Visualizing%20results.ipynb).
 For auto-tuning functionality, please see the original R code.
-
-Like the old SourceTracker, SourceTracker2 rarifies the source environments it
-collapses by default.
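
The README's Preprocessing section says the CLI takes the ceiling of non-integer counts, collapses samples from the same source environment to their per-feature mean, and rarefies, while the API functions expect ready-made sample X feature dataframes. A minimal sketch of those steps with pandas and NumPy follows. The helper names (`ceil_counts`, `collapse_sources`, `rarefy`) and the toy data are hypothetical and are not part of the `sourcetracker` package. Rarefaction is shown as plain subsampling without replacement, which may differ from what the package does internally.

```python
import numpy as np
import pandas as pd


def ceil_counts(table: pd.DataFrame) -> pd.DataFrame:
    """Round every non-integer count up to the next integer."""
    return np.ceil(table).astype(int)


def collapse_sources(table: pd.DataFrame, envs: pd.Series) -> pd.DataFrame:
    """Average the samples that share a source environment.

    `table` is sample X feature; `envs` maps sample ID -> environment.
    Samples missing from either input are dropped first.
    """
    shared = table.index.intersection(envs.index)
    return table.loc[shared].groupby(envs.loc[shared]).mean()


def rarefy(counts: np.ndarray, depth: int,
           rng: np.random.Generator) -> np.ndarray:
    """Subsample one sample's integer counts to `depth` without replacement."""
    # One pool entry per observed sequence, labelled by feature index.
    pool = np.repeat(np.arange(counts.size), counts)
    picked = rng.choice(pool, size=depth, replace=False)
    return np.bincount(picked, minlength=counts.size)


# Toy feature X sample table (features are rows), as the CLI expects.
feature_by_sample = pd.DataFrame(
    {"s1": [4.63, 0.0], "s2": [2.0, 6.0], "s3": [1.0, 1.0]},
    index=["otu1", "otu2"],
)
envs = pd.Series({"s1": "gut", "s2": "gut", "s3": "soil"})

# Transpose to the sample X feature orientation used by the API.
sample_by_feature = ceil_counts(feature_by_sample.T)
collapsed = collapse_sources(sample_by_feature, envs)

# Rarefy a single integer count vector to a depth of 4.
rarefied = rarefy(np.array([5, 0, 3]), depth=4, rng=np.random.default_rng(0))
```

Here `collapsed` has one row per environment (`gut`, `soil`) and one column per feature. Collapsing by mean can yield non-integer values, so this sketch applies `rarefy` only to an integer count vector.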